Renyi divergence variational inference


In order to get a better understanding of Edward’s mechanics (and also because it will be useful for my research), I would like to implement the Renyi variational objective (paper) in Edward. A public implementation in TensorFlow is available (github), so I thought it would be quite an easy project to start with.

I’ve attached a first version of the code and an example on a VAE.
It runs and does something sensible, but I still think it’s not completely correct.
Can someone please check the logic?
Code: inference object, example

Here are some details:
To compute the Renyi ELBO they use 3 tricks:

  1. Reparametrization trick
  2. Stochastic approximation of the joint likelihood
  3. Monte Carlo approximation of the VR bound
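For readers following along, here is a minimal NumPy sketch of trick 3, assuming the bound has the form $\mathcal{L}_\alpha = \frac{1}{1-\alpha}\log \mathbb{E}_q\left[(p(x,z)/q(z))^{1-\alpha}\right]$ and is estimated from K samples of q. The function name `vr_bound_estimate`, the helper `log_normal`, and the toy Gaussian model are illustrative assumptions, not the attached Edward code.

```python
import numpy as np

def vr_bound_estimate(log_p, log_q, alpha):
    """Monte Carlo estimate of the VR bound from K samples z_k ~ q.

    log_p[k] = log p(x, z_k) and log_q[k] = log q(z_k), both of shape (K,).
    """
    log_w = np.asarray(log_p, dtype=float) - np.asarray(log_q, dtype=float)
    if np.isclose(alpha, 1.0):
        # The alpha -> 1 limit is the usual ELBO estimate.
        return log_w.mean()
    scaled = (1.0 - alpha) * log_w
    m = scaled.max()  # subtract the max before exponentiating, for stability
    return (m + np.log(np.mean(np.exp(scaled - m)))) / (1.0 - alpha)

def log_normal(v, mean, std):
    """Log-density of a univariate normal (hypothetical helper)."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

# Toy model (assumption): p(z) = N(0, 1), p(x|z) = N(z, 1), q(z) = N(0.5, 1).
rng = np.random.default_rng(0)
x = 1.0
z = 0.5 + 1.0 * rng.standard_normal(1000)  # reparameterized draws: mu + sigma * eps

log_p = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log joint
log_q = log_normal(z, 0.5, 1.0)

elbo = vr_bound_estimate(log_p, log_q, alpha=1.0)
vr_half = vr_bound_estimate(log_p, log_q, alpha=0.5)
vr_zero = vr_bound_estimate(log_p, log_q, alpha=0.0)
print(elbo, vr_half, vr_zero)
```

On a fixed set of samples the estimate is non-increasing in alpha, so `vr_zero >= vr_half >= elbo` is a cheap sanity check for an implementation.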

So I thought that klqp, and more specifically its reparameterized non-analytic version “build_reparam_loss_and_gradients”, would be a good template to start from.

If I have understood correctly, “p_log_prob” and “q_log_prob” hold an “n_samples” estimate of the joint likelihood and of the variational approximation.
From the docstring of “build_reparam_loss_and_gradients”:

Computed by sampling from $q(z;\lambda)$ and evaluating the expectation using Monte Carlo sampling.

So I think that covers tricks 2 and 3. Or am I completely wrong?

After looking at the klqp code, I’m not sure where the reparameterization trick is applied. But I think I’ve done everything the same way, so it should be used. Can someone confirm or help with that?


The code looks great. Some comments:

  • In, we don’t place the build_loss_and_gradients function as a method inside KLqp because we use it across many KLqp algorithms. Since your function is only used in one class, it’s recommended you write it as a method (c.f.,
  • What’s the justification for a default alpha=0.2?
  • What does a ‘min’ backward pass correspond to? I’m not sure if I recall a VR-min; does it correspond to alpha → \infty? I haven’t done the math.
  • Is the logF = tf.reshape(logF, [inference.n_samples, 1]) reshape necessary? Seems like you could just do logF = tf.stack(logF)
  • Since you only clip from below, you can change logF = tf.log(tf.clip_by_value(tf.reduce_mean(tf.exp(logF - logF_max), 0), 1e-9, np.inf)) to use tf.maximum(1e-9, *).
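The last two comments can be illustrated with a small NumPy sketch (NumPy stands in for the TensorFlow calls here, and `log_mean_exp` is a hypothetical helper name). It assumes the shifted maximum is added back after the log, which the quoted line does not show.

```python
import numpy as np

def log_mean_exp(log_f, floor=1e-9):
    """Stable log(mean(exp(log_f))) over axis 0, with the mean floored at `floor`."""
    log_f = np.asarray(log_f, dtype=float)
    log_f_max = log_f.max(axis=0)
    mean_shifted = np.mean(np.exp(log_f - log_f_max), axis=0)
    # np.maximum(floor, x) matches np.clip(x, floor, np.inf): only the lower bound matters.
    return np.log(np.maximum(floor, mean_shifted)) + log_f_max

# Per-sample values collected in a list, as in a loop over n_samples.
per_sample = [np.float64(-1000.0), np.float64(-1001.0), np.float64(-1002.0)]
log_f = np.stack(per_sample)  # shape (3,); no extra reshape needed

stable = log_mean_exp(log_f)
with np.errstate(divide="ignore"):
    naive = np.log(np.mean(np.exp(log_f)))  # exp underflows to 0 for every entry

v = np.array([0.0, 1e-12, 0.5])
same = np.array_equal(np.clip(v, 1e-9, np.inf), np.maximum(1e-9, v))
print(stable, naive, same)
```

Because the maximum is subtracted first, the shifted mean is at least 1/n_samples, so the floor mostly acts as a safeguard rather than changing the result.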

Would you be interested in submitting a PR? The algorithm would be a nice addition to Edward’s arsenal.

You’re correct. But it also covers trick 1: if the distributions have the property reparameterization_type == tf.contrib.distributions.FULLY_REPARAMETERIZED, then gradients with respect to distribution parameters backpropagate through the sampling. See also the discussion in “Gradient is incorrect for log pdf of Normal distribution” (Issue #7236, tensorflow/tensorflow on GitHub).
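What “backpropagate through the sampling” means can be sketched without TensorFlow. The example below is an assumption-laden illustration, not Edward’s internals: it writes a Normal sample as z = mu + sigma * eps and applies the chain rule by hand to f(z) = z**2, whose expectation mu**2 + sigma**2 has known gradients 2 * mu and 2 * sigma.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7

# Reparameterized sample: the randomness lives in eps, not in the parameters.
eps = rng.standard_normal(100_000)
z = mu + sigma * eps

# Pathwise gradients of E[f(z)] with f(z) = z**2, so df/dz = 2 * z.
# Chain rule: dz/dmu = 1 and dz/dsigma = eps.
grad_mu = np.mean(2.0 * z * 1.0)
grad_sigma = np.mean(2.0 * z * eps)

# Closed form for comparison: E[z**2] = mu**2 + sigma**2.
exact_grad_mu = 2.0 * mu
exact_grad_sigma = 2.0 * sigma
print(grad_mu, exact_grad_mu, grad_sigma, exact_grad_sigma)
```

An automatic differentiation framework performs the same chain rule automatically when the sample is built from the parameters in this way.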

Thanks for the very detailed reply.

Yes, that makes sense. I’ll change that.

None. It was just to avoid defaulting to a special case during testing, in case I was not feeding the parameter alpha correctly. If I were to release the code, I would set a default alpha=1. to get VI by default.

I need to do the math too, but yes, that’s my intuition on this. alpha → \infty corresponds to a zero-forcing behavior. In that case it would make sense to learn from the sample with the smallest loss (hand waving here; I’ll do the math).
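Pending the actual derivation, a quick numerical check of the limits is possible. The sketch below assumes the Monte Carlo estimator (1 / (1 - alpha)) * log mean exp((1 - alpha) * log_w) over log importance weights log_w = log p(x, z) - log q(z); the function name `vr_estimate` and the weight values are made up for illustration.

```python
import numpy as np

def vr_estimate(log_w, alpha):
    """VR bound estimate from log importance weights (alpha != 1)."""
    t = 1.0 - alpha
    scaled = t * np.asarray(log_w, dtype=float)
    m = scaled.max()  # stabilise the exponentials
    return (m + np.log(np.mean(np.exp(scaled - m)))) / t

# Hypothetical log weights for four samples.
log_w = np.array([-3.2, -1.1, -0.4, -2.5])

large_pos = vr_estimate(log_w, alpha=1e4)   # approaches min(log_w)
large_neg = vr_estimate(log_w, alpha=-1e4)  # approaches max(log_w)
print(large_pos, log_w.min(), large_neg, log_w.max())
```

Under this estimator, alpha → +\infty selects the sample with the smallest log weight and alpha → -\infty the one with the largest, which suggests the “min” backward pass matches alpha → +\infty if it is defined on log weights rather than on losses.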

True. I’ll remove that.

Again true.

Sure, why not. Let me clean the code a bit, do the math for VR-min, and I’ll open a pull request.

Informative pointer, thanks.