Black box alpha divergence minimization


#1

Hi

I am trying to use black box alpha divergence minimization (paper) for my research, and I would also like to gain familiarity with Edward and TensorFlow. I have been basing my code on klqp.py and jbr’s Renyi divergence inference methods.

The reparameterized BB-alpha ELBO is

$ \mathcal{L}_{\alpha}(q) \approx \text{KL}[q||p_0] - \frac{1}{\alpha} \sum_n \log E_q[f_n(\omega)^\alpha] $
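
For illustration, a Monte Carlo estimate of this energy can be sketched in plain Python. This is a minimal sketch, not Edward code: the names `log_mean_exp` and `bb_alpha_energy` are hypothetical, and it assumes the per-datapoint log likelihoods $ \log f_n(\omega_k) $ have already been evaluated for K samples $ \omega_k \sim q $ and that $ \text{KL}[q||p_0] $ is available as a number.

```python
import math

def log_mean_exp(values):
    """Numerically stable log(mean(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def bb_alpha_energy(log_lik, kl, alpha):
    """Estimate KL[q||p_0] - (1/alpha) * sum_n log E_q[f_n(omega)^alpha].

    log_lik[k][n] holds log f_n(omega_k) for sample k and datapoint n
    (hypothetical layout); kl is KL[q||p_0]; alpha must be nonzero.
    """
    if alpha == 0:
        raise ValueError("alpha = 0 is only defined as a limit")
    num_points = len(log_lik[0])
    total = 0.0
    for n in range(num_points):
        # E_q[f_n^alpha] is replaced by an average over the K samples.
        total += log_mean_exp([alpha * row[n] for row in log_lik])
    return kl - total / alpha

# One sample, two datapoints: the average over samples is trivial here.
energy = bb_alpha_energy([[-1.0, -2.0]], kl=0.5, alpha=0.5)
```

Working with $ \alpha \log f_n $ inside a log-mean-exp avoids forming $ f_n^\alpha $ directly, which can underflow for small likelihoods.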

I have not finished a working version of the code as I am having trouble coding the likelihood term $ f_n(\omega)^\alpha $. This will differ for different tasks, e.g. for a regression task the log likelihood term may be

$ \log p(\mathbf{y} \mid \mathbf{w}, b, \mathbf{X}) = \sum_n \log \text{Normal}(y_n \mid \mathbf{x}_n^\top\mathbf{w} + b, \sigma_y^2) $
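
As a rough sketch of the regression case, the per-datapoint terms $ \log f_n $ can be computed from the training data and one sample of the weights. The helper names `normal_log_pdf` and `per_point_log_lik` are hypothetical, and `w`, `b` are assumed to be a single draw from q rather than anything Edward provides.

```python
import math

def normal_log_pdf(y, mean, sigma):
    """Log density of Normal(mean, sigma^2) evaluated at y."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (y - mean) ** 2 / (2.0 * sigma ** 2)

def per_point_log_lik(X, y, w, b, sigma_y):
    """Return [log Normal(y_n | x_n^T w + b, sigma_y^2)] for every datapoint n."""
    out = []
    for x_n, y_n in zip(X, y):
        mean = sum(x_i * w_i for x_i, w_i in zip(x_n, w)) + b
        out.append(normal_log_pdf(y_n, mean, sigma_y))
    return out

# Toy data: one datapoint whose prediction matches the target exactly.
log_f = per_point_log_lik(X=[[1.0, 2.0]], y=[3.0], w=[1.0, 1.0], b=0.0, sigma_y=1.0)
```

Summing the returned list gives the full log likelihood in the equation above; keeping the terms separate is what the sum over n in the BB-alpha energy needs.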

How would I obtain this likelihood term? Would it be from the training data?

From what I understand, for now I may only need to change the loss function at line 650 in klqp.py. Another, possibly unrelated, question: why is the log likelihood in “build_reparam_kl_loss_and_gradients” obtained in exactly the same way as the log posterior in “build_reparam_loss_and_gradients”?

Thank you

wset2


#2

The “likelihood” variables are stored in the data arg to Inference. If it differs for different cases, I imagine you’ll have to try to detect that when checking the inputs and raise an error; or otherwise just note it in the docstrings.

Yes, that’s correct. One (“build_reparam_kl_loss_and_gradients”) builds - E [ log p(x | z) ] + KL (q(z) || p(z)) and the other (“build_reparam_loss_and_gradients”) builds - E [ log p(x | z) + log p(z) - log q(z) ].
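
As a rough numerical check that these two forms target the same quantity, here is a minimal sketch on a toy model. Everything in it is an assumption made for illustration: a standard normal prior, a unit-variance Gaussian likelihood, a Gaussian q(z), and the hypothetical helper `normal_log_pdf`. It does not use Edward.

```python
import math
import random

def normal_log_pdf(x, mean, sigma):
    """Log density of Normal(mean, sigma^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2.0 * sigma ** 2)

random.seed(0)
x_obs = 1.5                   # single toy observation, x | z ~ Normal(z, 1)
q_mean, q_sigma = 1.0, 0.5    # toy q(z); the prior is p(z) = Normal(0, 1)
num_samples = 50000

# Closed-form KL(q || p) for a univariate Gaussian q against Normal(0, 1).
kl_qp = -math.log(q_sigma) + 0.5 * (q_sigma ** 2 + q_mean ** 2) - 0.5

sum_lik = 0.0
sum_joint_minus_q = 0.0
for _ in range(num_samples):
    z = random.gauss(q_mean, q_sigma)
    log_lik = normal_log_pdf(x_obs, z, 1.0)
    log_prior = normal_log_pdf(z, 0.0, 1.0)
    log_q = normal_log_pdf(z, q_mean, q_sigma)
    sum_lik += log_lik
    sum_joint_minus_q += log_lik + log_prior - log_q

# Form 1: - E[log p(x|z)] + KL(q || p), with the KL term computed analytically.
loss_kl_form = -sum_lik / num_samples + kl_qp
# Form 2: - E[log p(x|z) + log p(z) - log q(z)], estimated entirely by sampling.
loss_joint_form = -sum_joint_minus_q / num_samples
```

The two losses differ only by Monte Carlo noise in the KL term, which is why the analytic-KL form usually has lower variance when the KL is available in closed form.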


#3

Alright, I understand that is the case for “build_reparam_kl_loss_and_gradients”.

But what I don’t understand is this: if p_log_prob = p_log_lik = log likelihood, then in “build_reparam_loss_and_gradients” the ELBO would be - E [ log p(x|z) - log q(z) ], and I’m not sure where the log p(z) term is built into the ELBO (line 650 in klqp.py).

Or is p_log_prob in “build_reparam_loss_and_gradients” actually log p(x,z)? In that case, would “build_reparam_kl_loss_and_gradients” instead be building - E [ log p(x,z) ] + KL (q(z) || p(z))?


#4

p_log_prob generally stands for the log joint and p_log_lik generally stands for the log-likelihood in that code. You can also see it in the for loop code.


#5

Thank you for your help @dustin. I would also like to add that the team is doing a great job on Edward.

I am attaching my first working version of the code here as I believe it works in the way I would want it to.

I also used jbr’s Renyi Divergence vae example and made an extremely similar version here.

The ELBO I have used is based on the reparameterized BB-alpha energy in section 4 of this paper.

Could someone please check the logic of my code?

To test that the code works properly: alpha close to 0 should perform similarly to KLqp (the energy divides by alpha, so alpha = 0 itself has to be taken as a limit), and alpha = 1 should perform similarly to KLpq. An arbitrary value of 0.5 was chosen in my example.
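
The small-alpha behaviour can be checked numerically on the likelihood term alone. This is a minimal sketch with made-up log-likelihood values and a hypothetical helper `alpha_term`. It only illustrates that $ -\frac{1}{\alpha} \log E_q[f^\alpha] $ approaches $ -E_q[\log f] $, the expected log-likelihood term used by KLqp, as alpha shrinks.

```python
import math

def log_mean_exp(values):
    """Numerically stable log(mean(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def alpha_term(log_f_samples, alpha):
    """-(1/alpha) * log mean_k f(omega_k)^alpha for one datapoint."""
    return -log_mean_exp([alpha * v for v in log_f_samples]) / alpha

# Made-up values of log f(omega_k) for three samples omega_k ~ q.
log_f_samples = [-1.0, -2.0, -3.0]

small_alpha_value = alpha_term(log_f_samples, 1e-4)
klqp_value = -sum(log_f_samples) / len(log_f_samples)  # -E_q[log f]
alpha_one_value = alpha_term(log_f_samples, 1.0)       # -log E_q[f]
```

At alpha = 1 the term becomes the negative log of the averaged likelihood, which by Jensen's inequality is never larger than the small-alpha value.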


#6

Would you be interested in submitting a PR? Unfortunately I don’t have time to do code review unless it also helps Edward.


#7

I noticed an error in my implementation. May I ask how the code below works? p_log_prob accumulates log p(z,x) and q_log_prob accumulates log q(z).

for z, qz in six.iteritems(inference.latent_vars):
  # Copy q(z) to obtain new set of posterior samples.
  qz_copy = copy(qz, scope=scope)
  dict_swap[z] = qz_copy.value()
  q_log_prob[s] += tf.reduce_sum(
      inference.scale.get(z, 1.0) *
      qz_copy.log_prob(tf.stop_gradient(dict_swap[z])))

for z in six.iterkeys(inference.latent_vars):
  z_copy = copy(z, dict_swap, scope=scope)
  p_log_prob[s] += tf.reduce_sum(
      inference.scale.get(z, 1.0) * z_copy.log_prob(dict_swap[z]))

for x in six.iterkeys(inference.data):
  if isinstance(x, RandomVariable):
    x_copy = copy(x, dict_swap, scope=scope)
    p_log_prob[s] += tf.reduce_sum(
        inference.scale.get(x, 1.0) * x_copy.log_prob(dict_swap[x]))

If I am interested only in the prior term log p(z), would I just keep this loop?

for z in six.iterkeys(inference.latent_vars):
  z_copy = copy(z, dict_swap, scope=scope)
  p_log_prob[s] += tf.reduce_sum(
      inference.scale.get(z, 1.0) * z_copy.log_prob(dict_swap[z]))
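
One way to picture the bookkeeping is to keep a separate accumulator per term. The sketch below is plain Python on a toy Gaussian model, not Edward code. The accumulator `p_log_prior` and the helper `normal_log_pdf` are hypothetical names introduced here, and the model (standard normal prior, unit-variance likelihood) is an assumption.

```python
import math
import random

def normal_log_pdf(x, mean, sigma):
    """Log density of Normal(mean, sigma^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2.0 * sigma ** 2)

random.seed(0)
x_obs = [0.8, 1.2, 1.0]        # toy data, x_n | z ~ Normal(z, 1)
q_mean, q_sigma = 0.9, 0.3     # toy q(z)
n_samples = 5

p_log_prior = [0.0] * n_samples  # hypothetical accumulator: log p(z) only
p_log_lik = [0.0] * n_samples    # log p(x | z) only
q_log_prob = [0.0] * n_samples   # log q(z)

for s in range(n_samples):
    z = random.gauss(q_mean, q_sigma)          # one posterior sample
    q_log_prob[s] += normal_log_pdf(z, q_mean, q_sigma)
    p_log_prior[s] += normal_log_pdf(z, 0.0, 1.0)
    for x in x_obs:                            # observed variables only
        p_log_lik[s] += normal_log_pdf(x, z, 1.0)

# The log joint is recovered by adding the two model-side accumulators.
p_log_prob = [a + b for a, b in zip(p_log_prior, p_log_lik)]
```

In this layout the loop over latent variables contributes only the prior term, the loop over data contributes only the likelihood term, and their sum plays the role of the log joint.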

#8

So I finally had time to come back to this. I am submitting a new pull request for black box alpha. The inference object is here and the unit test here. Please advise if I can make the code better or help out if you can.

Here is the link to the pull request.