In order to get a better understanding of Edward mechanic (and also because it will be useful for my research), I would like to implement the Renyi variational objective (paper) in Edward. A public implementation in Tensorflow is available (github), so I though it would be quite an easy project to start with.
I’ve attached a 1st version of the code and an example on VAE.
It’s running, does something sensible but I still think it’s not completely correct.
Can someone please check the logic?
Code: inference object — example
Here are some details:
To compute the Renyi ELBO they use 3 tricks:
Stochastic approximation of the joint likelihood
Monte Carlo approximation of the VR bound
So I thought using the klqp and more specifically the reparametrized non analytic version “build_reparam_loss_and_gradients” as a template was a good start.
If I have understand correctly, in “p_log_prob” and “q_log_prob” there’s a “n_samples” estimate of the joint likelihood and the variational approximate.
From the docstring of “build_reparam_loss_and_gradients”:
Computed by sampling from $q(z;\lambda)$ and evaluating the expectation using Monte Carlo sampling.
So I think that should be trick 2 and 3 covered. Are I’m completely wrong?
After looking at the code from klqp, I’m not sure where the reparametrization trick is applied. But I think I’ve done everything in the same way so it should be used. Can someone confirm/help with that?
In klqp.py, we don’t place the build_loss_and_gradients function as a method inside KLqp because we use it across many KLqp algorithms. Since your function is only used in one class, it’s recommended you write it as a method (c.f., klpq.py).
What’s justification for a default alpha=0.2?
What does a ‘min’ backward pass correspond to? I’m not sure if I recall a VR-min; does it correspond to alpha → \infty? I haven’t done the math.
Is the logF = tf.reshape(logF, [inference.n_samples, 1]) reshape necessary? Seems like you could just do logF = tf.stack(logF)
Since you only clip on the LHS, you can change logF = tf.log(tf.clip_by_value(tf.reduce_mean(tf.exp(logF - logF_max), 0), 1e-9, np.inf)) to use tf.maximum(1e-9, *).
Would you be interested in submitting a PR? The algorithm would be a nice addition to Edward’s arsenal.
None. It was just to avoid being by default in a special case for testing in case I was not using the correct way to feed the parameter alpha. If I was to release the code I would set a default alpha=1. to get VI by default.
I need to do the math to but yes that my intuition on this. alpha → \infty correspond to a zero-enforcing behavior. In that case it would make sense to learn from the sample with the smallest loss (hand waving here — I’ll do the math).
True. I’ll remove that
Sure why not. Let me clean the code a bit, do the math for VR-min and I’ll open a request.