Rényi divergence variational inference



In order to get a better understanding of Edward's mechanics (and also because it will be useful for my research), I would like to implement the Rényi variational objective (paper) in Edward. A public implementation in TensorFlow is available (github), so I thought it would be quite an easy project to start with.

I’ve attached a first version of the code and an example on a VAE.
It runs and does something sensible, but I still think it’s not completely correct.
Can someone please check the logic?
Code: inference object, example

Here are some details:
To compute the Rényi ELBO, the paper uses three tricks (sketched in code right after this list):

  1. Reparameterization trick
  2. Stochastic approximation of the joint likelihood
  3. Monte Carlo approximation of the VR bound
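
For concreteness, here is a minimal sketch of how I understand the three tricks fitting together in one estimator (my own illustration, not the paper's code or Edward's; log_p, log_q, and alpha are placeholder names):

```python
import tensorflow as tf

def vr_bound_estimate(log_p, log_q, alpha):
  """Monte Carlo estimate of the VR bound L_alpha (assumes alpha != 1).

  log_p, log_q: [n_samples] tensors holding log p(x, z_s) and
  log q(z_s; lambda), evaluated at samples z_s ~ q drawn with the
  reparameterization trick (trick 1); log_p may itself be a stochastic
  minibatch estimate of the joint likelihood (trick 2).
  """
  log_w = (1.0 - alpha) * (log_p - log_q)  # (1 - alpha) * log importance ratios
  # Log-mean-exp over the samples (trick 3), stabilized by subtracting the max.
  log_w_max = tf.reduce_max(log_w)
  log_mean = log_w_max + tf.log(tf.reduce_mean(tf.exp(log_w - log_w_max)))
  return log_mean / (1.0 - alpha)
```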

So I thought that using klqp, and more specifically the reparameterized non-analytic version “build_reparam_loss_and_gradients”, as a template was a good start.

If I have understood correctly, “p_log_prob” and “q_log_prob” hold an “n_samples”-sample estimate of the joint likelihood and of the variational approximation.
From the docstring of “build_reparam_loss_and_gradients”:

Computed by sampling from $q(z;\lambda)$ and evaluating the expectation using Monte Carlo sampling.

So I think that covers tricks 2 and 3. Or am I completely wrong?
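
If it helps, the pattern I mean looks roughly like this (a simplification from memory, not the actual Edward source; the real code also handles variable scopes and scaling factors):

```python
import tensorflow as tf

# latent_vars: {prior random variable: variational random variable},
# data: {observed random variable: data tensor}, as in Edward inference.
p_log_prob = [0.0] * n_samples
q_log_prob = [0.0] * n_samples
for s in range(n_samples):
  for z, qz in latent_vars.items():
    z_sample = qz.sample()  # one reparameterized sample per latent variable
    q_log_prob[s] += tf.reduce_sum(qz.log_prob(z_sample))
    p_log_prob[s] += tf.reduce_sum(z.log_prob(z_sample))  # prior term
  for x, x_data in data.items():
    p_log_prob[s] += tf.reduce_sum(x.log_prob(x_data))  # likelihood term
```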

After looking at the code from klqp, I’m not sure where the reparameterization trick is applied. But I think I’ve done everything in the same way, so it should be used. Can someone confirm/help with that?



The code looks great. Some comments:

  • In Edward, we don’t place the build_loss_and_gradients function as a method inside KLqp because we use it across many KLqp algorithms. Since your function is only used in one class, it’s recommended you write it as a method.
  • What’s the justification for a default of alpha=0.2?
  • What does a ‘min’ backward pass correspond to? I’m not sure if I recall a VR-min; does it correspond to alpha -> \infty? I haven’t done the math.
  • Is the logF = tf.reshape(logF, [inference.n_samples, 1]) reshape necessary? It seems like you could just do logF = tf.stack(logF).
  • Since you only clip on the LHS, you can change logF = tf.log(tf.clip_by_value(tf.reduce_mean(tf.exp(logF - logF_max), 0), 1e-9, np.inf)) to use tf.maximum(1e-9, *); see the example after this list.
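
For example (illustrative, using the same variables as your code):

```python
mean_w = tf.reduce_mean(tf.exp(logF - logF_max), 0)
logF = tf.log(tf.maximum(1e-9, mean_w))  # clips only the lower end
```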

Would you be interested in submitting a PR? The algorithm would be a nice addition to Edward’s arsenal.

You’re correct. But it also covers trick 1: if the distributions have the property reparameterization_type == tf.contrib.distributions.FULLY_REPARAMETERIZED, then gradients with respect to the distribution parameters backpropagate through the sampling. See also the discussion in
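
For example, a minimal standalone check (my own snippet, assuming a Normal variational family):

```python
import tensorflow as tf
from tensorflow.contrib import distributions as ds

mu = tf.Variable(0.0)
sigma = tf.nn.softplus(tf.Variable(0.5))  # keep the scale positive
qz = ds.Normal(loc=mu, scale=sigma)
assert qz.reparameterization_type == ds.FULLY_REPARAMETERIZED
z = qz.sample()  # internally z = mu + sigma * eps with eps ~ N(0, 1)
# Any loss built from z is therefore differentiable w.r.t. mu and sigma.
grads = tf.gradients(tf.square(z), [mu, sigma])
```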


Thanks for the very detailed reply.

Yes, that makes sense. I’ll change that.

None. It was just to avoid being in a special case by default while testing, in case I was not feeding the parameter alpha correctly. If I were to release the code, I would set a default of alpha=1.0 to recover standard VI by default.
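
For the record, the reason alpha=1 gives back standard VI (a standard limit of the VR bound, e.g. by L’Hôpital):

$$\lim_{\alpha \to 1} \mathcal{L}_\alpha = \lim_{\alpha \to 1} \frac{1}{1-\alpha} \log \mathbb{E}_{q}\left[\left(\frac{p(x, z)}{q(z; \lambda)}\right)^{1-\alpha}\right] = \mathbb{E}_{q}\left[\log \frac{p(x, z)}{q(z; \lambda)}\right],$$

which is the usual ELBO.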

I need to do the math too, but yes, that’s my intuition. alpha -> \infty corresponds to a zero-forcing behavior; in that case it would make sense to learn from the sample with the smallest loss (hand-waving here; I’ll do the math).
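
Here is a first pass at that math (my own sketch; $w_k = p(x, z_k)/q(z_k; \lambda)$ are the importance weights over $K$ samples):

$$\hat{\mathcal{L}}_\alpha = \frac{1}{1-\alpha} \log \frac{1}{K} \sum_{k=1}^{K} w_k^{1-\alpha} \longrightarrow \min_k \log w_k \quad \text{as } \alpha \to \infty,$$

since with $\beta = 1-\alpha \to -\infty$ the log-mean-exp is dominated by the smallest $\log w_k$ and $\frac{1}{\beta} \log \frac{1}{K} \to 0$. So a ‘min’ backward pass trains on the sample with the smallest log importance ratio, mirroring VR-max at $\alpha \to -\infty$.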

True. I’ll remove that.

Again true.

Sure, why not. Let me clean up the code a bit, do the math for VR-min, and I’ll open a pull request.

Informative pointer, thanks.