Dirichlet Process Mixture Models (DPMM) with hierarchical structure

Dear Edward users,

I would like to implement a Dirichlet Process Mixture Model (DPMM) and use different inference methods on it, like MCMC and VI. I would like to use the model to cluster 2d data points. First, I wanted to start with a rather simple Gaussian Mixture Model (GMM) and then extend the model to have some hierarchical structure (related to the HDP by Teh, but not exactly the same).

Now I am struggling with the question how to implement this my model in Edward correctly.

On the one hand, there is the DirichletProcess class, which sounds really promising, but I don’t know how to use it correctly. I could define the cluster means as the samples from a DirichletProcess-distributed variable. But then I still don’t know the weights for these clusters and do not know the assignments (usually called z). The assignments are crucial for my model, since I need to pass them to the next level of my model. Is there a way to use this DirichletProcess class and pass on the cluster assignments somehow?

On the other hand, I saw the implementation of a DPM for clustering here Unfortunately, when I try to use it, the cluster means change not as they should do and just stay at [0,0].

I am thankful for all tips, or hints what I might be doing wrong.

Thanks a lot!!!

I’m also interested in implementing Hierarchical Dirichlet Process models in Edward. I’m starting with Gaussian DPMM. I tried the example mentioned here, however, I get means that are very close to zero and the posterior predictive is concentrated near zero.

I went back to the unsupervised learning tutorial and added stick-breaking. It uses ParamMixture class and Gibbs sampling. Here’s the complete code:

import numpy as np
import pandas as pd

import edward as ed
import tensorflow as tf

import seaborn as sns
import matplotlib.cm as cm
import matplotlib.pyplot as plt

from scipy.stats import mode
from edward.models import Gamma, Beta
from edward.models import Categorical, Dirichlet, Empirical, Normal
from edward.models import InverseGamma, MultivariateNormalDiag, ParamMixture

def generate_data(N):
    pi = np.array([0.4, 0.6])
    mus = [[1,1], [-1,-1]]
    stds = [[0.1,0.1],[0.1,0.1]]
    x = np.zeros((N,2), dtype=np.float32)
    for n in range(N):
        k = np.argmax(np.random.multinomial(1, pi))
        x[n,:] = np.random.multivariate_normal(mus[k], np.diag(stds[k]))

    return x


N=500 #number of data points
K=8   #number of components
D=2   #dimension

x_train = generate_data(N)

with tf.name_scope("model"):
    alpha = Gamma(concentration=1.0, rate=1.0, name='alpha')
    beta = Beta(concentration1=1.0, concentration0=tf.ones(K)*alpha)
    pi = tf.concat([tf.reshape(beta[0],[1]), tf.reshape(tf.multiply(beta[1:],tf.cumprod(1 - beta[:-1])), [K-1])], 0)

    mu = Normal(loc=tf.zeros(D), scale=tf.ones(D), sample_shape=K)
    sigmasq = InverseGamma(concentration=tf.ones(D), rate=tf.ones(D), sample_shape=K)
    x = ParamMixture(pi, {'loc': mu, 'scale_diag': tf.sqrt(sigmasq)},
                     MultivariateNormalDiag, sample_shape=N)
    z = x.cat

with tf.name_scope("posterior"):
    T = 1000 #number of MCMC samples
    qmu = Empirical(tf.Variable(tf.zeros([T,K,D])))
    qsigmasq = Empirical(tf.Variable(tf.ones([T,K,D])))
    qz = Empirical(tf.Variable(tf.zeros([T,N], dtype=tf.int32)))

with tf.name_scope("inference"):
    inference = ed.Gibbs({mu: qmu, sigmasq: qsigmasq, z: qz},
                         data={x: x_train})

    sess = ed.get_session()

    t_ph = tf.placeholder(tf.int32, [])
    running_cluster_means = tf.reduce_mean(qmu.params[:t_ph], 0)

    for i in range(inference.n_iter):
        info_dict = inference.update()
        t = info_dict['t']
        if t % inference.n_print == 0:
            print "\ninferred cluster means: "
            print sess.run(running_cluster_means, {t_ph: t-1})
        #end if
    #end for

print "computing cluster assignments..."
#compute likelihood for each data point averaged over many posterior samples
#x_post has shape (N, 100, K, D)
post_num_samples = 500
mu_sample = qmu.sample(post_num_samples)
sigmasq_sample = qsigmasq.sample(post_num_samples)
x_post = Normal(loc=tf.ones([N, 1, 1, 1]) * mu_sample,
                scale=tf.ones([N, 1, 1, 1]) * tf.sqrt(sigmasq_sample))
x_broadcasted = tf.tile(tf.reshape(x_train, [N, 1, 1, D]), [1, post_num_samples, K, 1])

#sum over latent dimension, then average over posterior samples
#log_liks final shape is (N, K)
log_liks = x_post.log_prob(x_broadcasted)
log_liks = tf.reduce_mean(log_liks, 3)
log_liks = tf.reduce_mean(log_liks, 1)

#choose cluster with the highest likelihood
assignments = tf.argmax(log_liks, 1).eval()

#thin = 2 
#burnin = 100
#qz_trace = qz.params[burnin::thin].eval()
#assignments, _ = mode(qz_trace, axis=0) 
#assignments = qz.mean().eval() 
pi_dist, _ = np.histogram(assignments, bins=range(K+1))
pi_dist = pi_dist/np.sum(pi_dist, dtype=np.float32)

#compute posterior means
#qmu_mean = qmu.mean().eval()
qmu_mean_samples = tf.stack([qmu.sample() for _ in range(post_num_samples)])
qmu_mean_samples_avg = np.mean(qmu_mean_samples.eval(), axis=0)

#posterior predictive check
print "posterior predictive check..."
#x_post_samples = tf.reduce_mean(x_post, 2)
#x_post_samples = tf.reduce_mean(x_post_samples, 1)
#x_post_samples_avg = x_post_samples.eval()

x_post2 = ed.copy(x, {mu: qmu, sigmasq: qsigmasq, z: qz})
x_post2_samples = tf.stack([x_post2.sample() for _ in range(post_num_samples)])
x_post2_samples_avg = x_post2_samples.eval() 

#generate plots
plt.scatter(x_train[:,0], x_train[:,1], c=assignments, cmap=cm.bwr)
plt.axis([-3, 3, -3, 3])
plt.title('DP cluster assignments')

plt.scatter(x_post2_samples_avg[:,0], x_post2_samples_avg[:,1])
plt.scatter(qmu_mean_samples_avg[:,0], qmu_mean_samples_avg[:,1], s=50, marker='x', color='r')
#plt.axis([-3, 3, -3, 3])
plt.title("Posterior Predictive Check")

plt.bar(range(K), pi_dist, color='r', label='pi_k')
plt.title('mixture proportions')

And the resulting figures:

The assignments (left plot) somewhat make sense. Although there are multiple ways of computing the assignments: based on log_prob of posterior (used in this code as well as the tutorial), samples of the mean of qz (will give different results) and mode of qz trace after some burn in. Not sure which is the best way of computing the assignments. The mixture proportions (middle plot) is a normalized histogram of assignments, that basically indicates that out of K=8 initial clusters the maximum log_prob was assigned to only three. Finally, the posterior predictive check (right plot) shows data generated from the posterior mixture model. I would expect it to look close to the original data (left plot).

I appreciate any comments and suggestions on computing the assignments, posterior predictive check or perhaps using a different inference algorithm or mixture model class. Are there any ways in which the code could be improved?

1 Like

Hi @vsmolyakov

One of the problems in the example you mentioned is that the distributions are not well defined. Maybe an updated version of the code for a fixed K could help. It is based on the examples posted by @ecosang at Github. See also the discussion.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import edward as ed
import tensorflow as tf
from tensorflow.contrib.linalg import LinearOperatorTriL
ds = tf.contrib.distributions

import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
import matplotlib.cm as cm
import six


from edward.models import Dirichlet, Categorical, \
MultivariateNormalTriL, WishartCholesky, Mixture

def plot_point_cov(points, nstd=2, ax=None, **kwargs):
        A matplotlib ellipse artist
    pos = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    return plot_cov_ellipse(cov, pos, nstd, ax, **kwargs)

def plot_cov_ellipse(cov, pos, nstd=2, ax=None, **kwargs):
        A matplotlib ellipse artist
    def eigsorted(cov):
        vals, vecs = np.linalg.eigh(cov)
        order = vals.argsort()[::-1]
        return vals[order], vecs[:,order]
    if ax is None:
        ax = plt.gca()
    vals, vecs = eigsorted(cov)
    theta = np.degrees(np.arctan2(*vecs[:,0][::-1]))
    # Width and height are "full" widths, not radius
    width, height = 2 * nstd * np.sqrt(vals)
    ellip = Ellipse(xy=pos, width=width, height=height, angle=theta, **kwargs)
    return ellip

# -- Dataset --

N = 5000  # number of data points
K = 3  # number of components
D = 2  # dimensionality of data

true_pi = np.array([0.3, 0.6, 0.1])
true_mus = [[10, 1], [-10, -10], [-1, 20]]
true_stds = [[[3, -0.1], [-0.1, 2]], 
[[4, 0.0], [0.0, 1]], [[3, 0.2], [0.2, 4]]]
true_cholesky = tf.cholesky(true_stds)

def build_toy_dataset(N,pi,mus,stds):
  x = np.zeros((N, 2), dtype=np.float32)
  label = np.zeros((N,),dtype=np.float32)
  for n in range(N):
    k = np.argmax(np.random.multinomial(1, pi))
    x[n, :] = np.random.multivariate_normal(mus[k], stds[k])
    label[n] = k
  return x, label

xn_data, xn_label = build_toy_dataset(N,true_pi, true_mus, true_stds)

pi = Dirichlet(tf.ones(K))  # Weights
z = Categorical(probs=pi,sample_shape=N) # Assignments

nu0 = tf.Variable(D, dtype=tf.float32, trainable=False)
psi0 = tf.Variable(np.eye(D), dtype=tf.float32, trainable=False)
mu0 = tf.Variable(np.zeros(D), dtype=tf.float32, trainable=False)
k0 = tf.Variable(1., dtype=tf.float32, trainable=False)

sigma = WishartCholesky(df=nu0, scale=psi0,cholesky_input_output_matrices=True, sample_shape = K)
mu = MultivariateNormalTriL(mu0, k0*sigma)
components = [MultivariateNormalTriL(mu[k],sigma[k]) for k in range(K)]
xn = Mixture(cat=z,components=components,sample_shape=N)

qpsi0 = tf.Variable(tf.random_normal([K, D, D], dtype=tf.float32))
Ltril = LinearOperatorTriL(ds.matrix_diag_transform(qpsi0, transform=tf.nn.softplus)).to_dense()
qsigma = WishartCholesky(df=[100]*K,scale=Ltril) #,cholesky_input_output_matrices=True)

qmu0 = tf.Variable(tf.random_normal([K, D], dtype=tf.float32))
qR = tf.Variable(tf.random_normal([K, D, D], dtype=tf.float32))
qmu = MultivariateNormalTriL(qmu0, qR)

inference = ed.KLqp({mu: qmu, sigma: qsigma}, data={xn: xn_data})
inference.initialize(n_iter=1000, n_print=100, n_samples=30)
sess = ed.get_session()
init = tf.global_variables_initializer()

learning_curve = []
for _ in range(inference.n_iter):
    info_dict = inference.update()
    if _%30 == 0:


post_mu_mean = qmu.sample(100).eval().mean(axis=0)
post_sigma_mean= qsigma.sample(100).eval().mean(axis=0)

for i in range(K):
  plot_cov_ellipse(true_stds[i], true_mus[i], nstd=3, alpha=0.3, color = "green")
  plot_cov_ellipse(post_sigma_mean[i], post_mu_mean[i], nstd=3, alpha=0.3, color = "red")

plt.scatter(xn_data[:, 0], xn_data[:, 1])
1 Like

Hello. I’d like to ask if anyone managed to get a working version of the Hierarchical DPMM. Thank.s

1 Like

Sorry I’m relatively new to Bayesian statistics. I’m interested in using Edward for work related to the Teh 2006 paper (Pitman-Yor/hierarchical Dirichlet (Chinese Restaurant) process). It seems that Edward currently doesn’t have built-in support for those, right? I also wonder if anybody was able to successfully build such models, or would it make sense to include such models as built-in in the future just as the DirichletProcess one. Would they be too specialized so that they must be tailored to specific problems, and would the use cases be relatively few?