Data subsampling of Non-Negative Matrix Factorization


I made the model for NMF:

X ~ Normal(W*H, sigma=1)
W ~ Exponential(lambda_w=1)
H ~ Exponential(lambda_h=1)

The code in edward like this:

import numpy as np
import matplotlib.pyplot as plt
import edward as ed
import tensorflow as tf
from edward.models import Gamma, Poisson, Normal, PointMass, TransformedDistribution, Exponential

sample = 100
feature = 12

def dataset_generation(sample=100,feature=10,K=5):
    w = np.repeat(1, (0.75*sample)//K)
    h = np.repeat(1, feature//K)

    W = np.zeros((sample, K))
    H = np.zeros((K ,feature))

    for i in range(K):
    X = W @ H 
    XX =  X *  np.random.normal(6, 1, size= X.shape)
    XXX = XX.copy()

    # changed every point in datasets which has value 0 
    XXX[np.where(X==0)] = np.random.normal( 2,1, size= len(XXX[np.where(X==0)]))
    #change negative value to zero
    XXX[np.where(XXX<=0)] =  np.random.uniform( 0.0001,0.1, size= len(XXX[np.where(XXX<=0)]))
    return XXX
X = dataset_generation(sample=sample,feature=feature,K=K)


def lognormal_q(loc,scale):
    min_scale = 1e-5
    rv = TransformedDistribution(
        distribution=Normal(loc, tf.maximum(tf.nn.softplus(scale), min_scale)),
    return rv

X = X.astype(np.float32)
W = Exponential(rate=tf.ones([sample, K]))
H = Exponential(rate=tf.ones([K, feature])*0.1)

V = Normal(
    loc=tf.matmul(W, H),
    scale=tf.ones([sample, feature]))

qW_loc = tf.Variable(tf.random_normal([sample, K]))
qW_scale = tf.Variable(tf.random_normal([sample, K],stddev=0.1))
qH_loc = tf.Variable(tf.random_normal([K, feature]))
qH_scale = tf.Variable(tf.random_normal([K, feature],stddev=0.1))
qW = lognormal_q(loc=qW_loc,scale=qW_scale)
qH = lognormal_q(loc=qH_loc,scale=qH_scale)

inference = ed.KLqp({W: qW, H: qH}, data={V: X})
optimizer = tf.train.AdamOptimizer()
inference.initialize(n_iter = 30000,optimizer=optimizer)

loss = np.empty(inference.n_iter, dtype=np.float32)
for t in range(inference.n_iter):
    info_dict = inference.update()
    loss[t] = info_dict["loss"]


w = qW.sample(1000).eval()
h = qH.sample(1000).eval()

w= np.mean(w,axis=0)
h= np.mean(h,axis=0)

fig,axes = plt.subplots(1,3,figsize=(12,3))

It is working pretty well but the data that I have is massive enough for my computer. So I perform data subsampling based on the tutorial in edward site :

batch_sample = 50
def generator(arrays, batch_size):
    """Generate batches, one with respect to each array's first axis."""
    starts = [0] * len(arrays)  # pointers to where we are in iteration
    while True:
        batches = []
        for i, array in enumerate(arrays):
            start = starts[i]
            stop = start + batch_size
            diff = stop - array.shape[0]
            if diff <= 0:
                batch = array[start:stop]
                starts[i] += batch_size
                batch = np.concatenate((array[start:], array[:diff]))
                starts[i] = diff
        yield batches

batch_sample = 50
W = Exponential(rate=tf.ones([batch_sample, K]))
H = Exponential(rate=tf.ones([K, feature]) * 0.1)
X = X.astype(np.float32)
X_ph = tf.placeholder(tf.float32, [batch_sample, feature])

qW_loc = tf.Variable(tf.random_normal([batch_sample, K]))
qW_scale = tf.Variable(tf.random_normal([batch_sample, K],stddev=0.1))
qH_loc = tf.Variable(tf.random_normal([K, feature]))
qH_scale = tf.Variable(tf.random_normal([K, feature],stddev=0.1))
qW = lognormal_q(qW_loc,qW_scale)
qH = lognormal_q(qH_loc,qH_scale)

data = generator([X], batch_sample)
V = Normal(loc=tf.matmul(W, H), scale=tf.ones([batch_sample, feature]))
inference_local = ed.KLqp({W: qW}, data={V: X_ph, H: qH})
inference_global = ed.KLqp({H: qH}, data={V: X_ph, W: qW})

    V: float(sample) / batch_sample,
    W: float(sample) / batch_sample
    V: float(sample) / batch_sample,
    W: float(sample) / batch_sample

qw_init = tf.variables_initializer([qW_loc,qW_scale])
loss = np.empty(inference.n_iter, dtype=np.float32)

for t in range(2000):
    x_batch = next(data)[0]
    for _ in range(30):
    qw_init = tf.variables_initializer([qW_loc,qW_scale])

H_ = np.mean(qH.sample(5000).eval(),axis = 0)
W_ = np.mean(qW.sample(5000).eval(),axis = 0)

The result is not good. Is there something wrong with the subsampling model?



I’ve been working on developing a factor model in Edward for big data. Your query was quite timely because I’ve thought about how to assess the performance of data subsampling methods. I’ve taken Justin’s matrix factorization model and his batch learning example to develop a simple subsampling approach for a relatively large dataset based on Justin’s matrix factorization model. I have calculated the average root mean squared error across batch samples and degree of uncertainty. Let me know where to send the Jupyter notebook containing the code. I hope this helps you!



Thank you so much. I thought my question will be never answered.
If it is possible, I think it is better to post it here or post the link to your notebook here so many people can learn it.

But if it is not possible or rather confidential, you can send it to my email.


This is code from Mark that he sent me by email. I repost it here by his permission

# coding: utf-8

# In[1]:

import numpy as np
import matplotlib.pyplot as plt
import edward as ed
import tensorflow as tf

from sklearn.metrics import mean_squared_error
from math import sqrt
from edward.models import Normal

# In[2]:

def build_toy_dataset(U, V, N, M, noise_std=0.1):
    R =, V) + np.random.normal(0, noise_std, size=(N, M))
    return R

N = 10000 # number of observations
M = 12000 # number of features
D = 3     # number of factors
K = 128   # batch size during training


# true latent factors
U_true = np.random.randn(D, N)
V_true = np.random.randn(D, M)
mu_true =, V_true)

X_true = build_toy_dataset(U_true, V_true, N, M)

# In[3]:

fig,axes = plt.subplots(1, 3, figsize = (14, 3))


# In[4]:

def next_batch(x, K):
            N = x.shape[0]
            idx_batch = np.random.choice(N, K)
            return x[idx_batch, :], idx_batch

# In[5]:


U = Normal(loc=0.0, scale=1.0, sample_shape=[D, K])
V = Normal(loc=0.0, scale=1.0, sample_shape=[D, M])
X = Normal(loc=tf.matmul(tf.transpose(U), V), scale=tf.ones([K, M]))

# In[6]:

n_epoch = 1
n_batch = int(N / K)
n_iter = n_epoch * n_batch * 10

idx_ph = tf.placeholder(tf.int32, K)
x_ph = tf.placeholder(tf.float32, [K, M])

qU = Normal(loc=tf.get_variable("qU/loc", [D, K]),
                  tf.get_variable("qU/scale", [D, K])))

qV = Normal(loc=tf.get_variable("qV/loc", [D, M]),
                  tf.get_variable("qV/scale", [D, M])))

inference = ed.KLqp({U: qU, V: qV}, data={X: x_ph})
scale_factor = float(N) / K
inference.initialize(n_iter = n_iter, scale={X: scale_factor, U: scale_factor}, n_samples=5)

sess = ed.get_session()

rmse_array = np.array([])

for _ in range(n_iter // 10):
    X_batch, idx_batch = next_batch(X_true, K)
    for _ in range(10):
        info_dict = inference.update({x_ph: X_batch, idx_ph: idx_batch})
        t = info_dict['t']
        if t == 1 or t % inference.n_print == 0: 
            U_mean =, feed_dict={x_ph: X_true[:K, :], idx_ph: np.arange(K)})
            V_mean = 
            mu_est =, V_mean)
            rmse = sqrt(mean_squared_error(mu_true[:K, :], mu_est))
            rmse_array = np.append(rmse_array, rmse)

# In[7]:

print("Mean rmse across batches: {:0.2f} +/- {:0.2f}".format(rmse_array.mean(), rmse_array.std()))

# In[8]:

fig,axes = plt.subplots(1, 2, figsize = (12, 3))

axes[0].pcolormesh(np.asarray(mu_true[:K, :]),cmap="seismic")

Thank you Mark


Thank you Mark for your answer. I have some question regarding your answer,

in this code why you do feed_dict? Variable qU does neither have parameter x_ph nor idx_ph.

Why you perform scaling in both X and U? As I understand it, we just need to scale X.


Hi yusri,

Recall we are subsampling along the rows in mini batches of size K while keeping the columns fixed (that is, M is constant). The matrices U and X share the same number of rows, and so we have to scale both matrices.

I see your point regarding the first question! It is redundant. I was thinking of how to get the predicted full matrix U, which is N by M. I see why I couldn’t get it to work correctly. I needed to rewrite the sampling distribution qU such that you can keep track of the indices; this is necessary for the construction of the predicted full matrix U from the smaller predicted U matrices.

I hope that helps!



Ah, I understand now. So U actually is not a full matrix. So, If I want to get the full matrix must I collect it every loop? Do you have insight how to combine the U?