Gaussian mixture with covariance matrix via klqp method


#1

Hello, Thanks for Edward. This is awesome.

I am trying to implement Guassian mixture model (GMM) via klqp method. I saw some discussions that klqp is not ideal for GMM and Dirichlet distribution, but I am just trying for now…

Similar to Stan or HMC, categorical distribution doesn’t look like working with klqp or HMC. So I used ParamMixture and Mixture.
Here is my notebook for Mixture method.

link

As you can see the notebook, the data x ~ sum_k {N(mu[k],cov[k])}.

In the first trial, it looks like the model works.
inference = ed.KLqp({mu: qmu, sigma: qsigma}, data={x: x_train})

However, when I included pi:qpi (dirichlet distribution) in the klqp algorithm, the model fails. (see the last part of notebook).

To think of general working principle of variational inference, I guess pi:qpi should be included because the full joint probability of posterior will be p(pi,mu,sigma|x). (if the cluster assignment is marginalized in ed.model.Mixture).

I also applied ParamMixture for the same problem.
link

It looks like this model completely failed.
I think it might be better to use Gibbs sampler.
However, in the very large dataset or high dimensional problem, the full bayesian might fail, so this is the reason I am looking for something similar to variational inference. I didn’t try PyMC3.
My question is.

(1) why the model failed when I included pi:qpi in Mixture method.

(2) ParamMixture gave bad results. (compared to Mixture.

(3) What is the suggestion for the high-dimensional and large GMM problem? (I may use non-gaussian in the future…).

The notebook is written with edward 1.3.3 (dev. version) and tensorflow 1.2.1 (cpu).

Thanks


Dirichlet Process Mixture Models (DPMM) with hierarchical structure
#2

ed.KLqp is known not to work very well with Gamma or Dirichlet variational approximations. For high-dimensional and large GMM problems, you should generally stick with Gibbs sampling or coordinate ascent VI. These are the gold standards. (We haven’t implemented the latter yet.)


DPM model for clustering
#3

Hi again @dustin I wonder if it possible to use the Inverse-Wishart distribution in Edward in order to do Gibbs sampling for high-dimensional GMM problems. The code below uses KLqp, but I get an error concerning to WishartCholesky

InvalidArgumentError (see above for traceback): Input matrix is not invertible.
	 [[Node: inference/sample_9/qsigma/log_prob/OperatorPDCholesky/sqrt_solve/MatrixTriangularSolve = MatrixTriangularSolve[T=DT_FLOAT, adjoint=false, lower=true, _device="/job:localhost/replica:0/task:0/cpu:0"](inference/sample_9/LinearOperatorTriL/MatrixBandPart, inference/sample_9/qsigma/log_prob/Reshape)]]

The code is

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import edward as ed
import tensorflow as tf
from tensorflow.contrib.linalg import LinearOperatorTriL
ds = tf.contrib.distributions

import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
import matplotlib.cm as cm
import six
import time

plt.style.use('ggplot')

from edward.models import Gamma, Beta, Categorical, WishartCholesky, \
MultivariateNormalTriL, Mixture, MultivariateNormalDiag


# -- Dataset --

N = 5000  # number of data points
true_K = 3  # number of components
D = 2  # dimensionality of data

true_pi = np.array([0.3, 0.6, 0.1])
true_mus = [[10, 1], [-10, -10], [-1, 20]]
true_stds = [[[3, -0.1], [-0.1, 2]], 
[[4, 0.0], [0.0, 1]], [[3, 0.2], [0.2, 4]]]

def build_toy_dataset(N,pi,mus,stds):
  x = np.zeros((N, 2), dtype=np.float32)
  label = np.zeros((N,),dtype=np.float32)
  for n in range(N):
    k = np.argmax(np.random.multinomial(1, pi))
    x[n, :] = np.random.multivariate_normal(mus[k], stds[k])
    label[n] = k
  return x, label

xn_data, xn_label = build_toy_dataset(N,true_pi, true_mus, true_stds)

K = 6  # number of components to test

#with tf.name_scope("model"): # GMM model

# sess = ed.get_session()
# tf.global_variables_initializer().run()

# Stick-break process
alpha = Gamma(concentration=6.0, rate=1.0, name="alpha")
beta = Beta(concentration1=1.0, concentration0=tf.ones(K)*alpha, name="beta")
pi = tf.concat([tf.reshape(beta[0],[1]), tf.reshape(tf.multiply(beta[1:],tf.cumprod(1 - beta[:-1])), [K-1])], 0, name="pi")

z = Categorical(probs=pi,sample_shape=N) # Assignments
# Hyperparameters
nu0 = tf.Variable(D, dtype=tf.float32, trainable=False)
psi0 = tf.Variable(np.eye(D), dtype=tf.float32, trainable=False)
mu0 = tf.Variable(np.zeros(D), dtype=tf.float32, trainable=False)
k0 = tf.Variable(1., dtype=tf.float32, trainable=False)

# Parameters
sigma = WishartCholesky(df=nu0, scale=psi0, cholesky_input_output_matrices=True, sample_shape = K, name="sigma_distribution")
mu = MultivariateNormalTriL(mu0, k0*sigma, name="mu_distribution")
components = [MultivariateNormalTriL(mu[k], sigma[k], sample_shape=N) for k in range(K)]
xn = Mixture(cat=z,components=components,sample_shape=N)

#with tf.name_scope("posterior"):
qalpha = Gamma(concentration=tf.Variable(tf.constant(1.0),name="qalpha_concentration"), rate=tf.Variable(tf.constant(1.0), name="qalpha_rate"))
qbeta = Beta(concentration1=1.0, concentration0=tf.Variable(tf.ones([K]),name="qconcetrarion0"))

qpsi0 = tf.Variable(tf.random_normal([K, D, D], dtype=tf.float32))
Ltril = LinearOperatorTriL(ds.matrix_diag_transform(qpsi0, transform=tf.nn.softplus)).to_dense()
qsigma = WishartCholesky(df=tf.ones([K])*100, scale=Ltril, cholesky_input_output_matrices=True, name="qsigma")

qmu0 = tf.Variable(tf.random_normal([K, D], dtype=tf.float32))
qR = tf.nn.softplus(tf.Variable(tf.zeros([K, D], dtype=tf.float32)))
qmu = MultivariateNormalDiag(qmu0, qR)

inference = ed.KLqp({alpha:qalpha, beta:qbeta, mu: qmu, sigma: qsigma}, data={xn: xn_data})

learning_rate = 1e-3
optimizer = tf.train.AdamOptimizer(learning_rate)
n_iter=100000
n_samples=10

inference.initialize(n_iter=n_iter, n_print=0, n_samples=n_samples, optimizer = optimizer)

sess = ed.get_session()
init = tf.global_variables_initializer()
init.run()

learning_curve = []
for _ in range(inference.n_iter):
    info_dict = inference.update()
    if _%10000 == 0:
        print(info_dict)
    learning_curve.append(info_dict['loss'])


post_beta = qbeta.sample(500)
post_pi = tf.map_fn(lambda beta: tf.concat([tf.reshape(beta[0],[1]), tf.reshape(tf.multiply(beta[1:],tf.cumprod(1 - beta[:-1])), [K-1])], 0), post_beta)
post_pi_mean = post_pi.eval().mean(axis=0)

# Two subplots, unpack the axes array immediately
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 6))
plot_w = np.arange(K) + 1
ax1.bar(plot_w - 0.5, post_pi_mean, width=1., lw=0);
ax1.set_xlim(0.5, K);
ax1.set_xlabel('Component');
ax1.set_ylabel('Posterior expected mixture weight');
ax2.semilogy(learning_curve)
fig.savefig(time.strftime("%Y_%m_%d-%H_%M_%S")+'-var_K='+np.str(K)+'-lr='+np.str(learning_rate)+'-n_iter='+np.str(n_iter)+'-n_samples='+np.str(n_samples)+'.pdf')


#4

But it should be possible to do Variational Inference on a simple 2-D GMM. My losses are bouncing up and down, even when marginalizing out the mixture assignments. What’s wrong with KLqp? I’d be happy to help fixing it, but would prefer to be sure that it’s impossible to use it on a GMM.


#5

Hello, I’ve been away from this problem for a while.

GMM is known as solved by varaitional inference.
But, in many cases, the update of lower bound is given in a close form such as http://scikit-learn.org/stable/modules/dp-derivation.html . And I guess this is the reason it works.

Mixture problem might be difficult in KLQP approach since it is almost black box algorithm.
By the way, when I use ADVI in STAN, it gave a fairly better result. Guess you can improve this with the transformation of distribution? But, I am not very familiar with how this works in detail.