The class of problems I’m working on (Bayesian structural time series) works pretty nicely with ADVI (such as PyMC3’s) out of the box, but with the existing inference algorithms in Edward, I find myself tweaking things more often than I’d like. In particular, when estimating scale parameters on time series uncertainty, I’m having trouble getting correct estimates at all. So I’m planning to implement ADVI in Edward.

Beyond following the Kucukelbir et al. (2016) paper and the existing implementations in Stan and PyMC3, do existing contributors have any wise words when it comes to subclassing ed.inferences.VariationalInference? Or perhaps useful supplementary material relating to ADVI specifically?

Cool! To implement ADVI, I recommend improving ed.KLqp instead of making a new inference algorithm. IMO, new algorithms should be implemented based on their fundamental contributions rather than implementing a wholly new algorithm for every new paper.

For example, there are a few ADVI features you could work on:

Automated transformations. You can imagine this working via an optional argument to inference.initialize which transforms the prior and posterior approximations onto the unconstrained space whenever their supports are mismatched. There was a WIP here that I never got around to finishing (https://github.com/blei-lab/edward/compare/feature/automated-transformations). The design is simple: there is a transform function, and inference.initialize repeatedly calls transform on any latent variables whose supports are mismatched; if the argument requesting automated transforms is False, we raise an error instead.
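As a rough sketch of what such a transform might do (the function names and support strings here are illustrative, not Edward’s actual API), the forward map sends each constrained support to the real line, and the inverse maps back:

```python
import math

# Hypothetical sketch of the transform logic behind ADVI: map a constrained
# support onto the real line. Names (to_unconstrained, support strings) are
# illustrative only.

def to_unconstrained(x, support):
    """Forward map T: constrained value -> real line."""
    if support == "real":
        return x                        # already unconstrained
    if support == "positive":
        return math.log(x)              # log maps (0, inf) -> R
    if support == "unit_interval":
        return math.log(x / (1.0 - x))  # logit maps (0, 1) -> R
    raise ValueError("no transform registered for support %r" % support)

def from_unconstrained(z, support):
    """Inverse map T^{-1}: real line -> constrained value."""
    if support == "real":
        return z
    if support == "positive":
        return math.exp(z)
    if support == "unit_interval":
        return 1.0 / (1.0 + math.exp(-z))  # sigmoid
    raise ValueError("no transform registered for support %r" % support)
```

In this picture inference.initialize would apply to_unconstrained whenever supports mismatch, and raise the ValueError branch for supports with no registered transform.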

Automatic choice of variational approximation. You can imagine this working similarly to how the MAP and MCMC algorithms don’t always require explicitly defining the PointMass and Empirical distributions.
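A hypothetical sketch of that auto-construction, mirroring how ed.MAP builds PointMass variables when the user passes only the priors (the dict-of-strings “Normal” spec here is a stand-in for instantiating ed.models.Normal with trainable loc/scale parameters):

```python
# Illustrative only: fill in a default mean-field normal for any prior the
# user did not supply an approximation for, leaving user-supplied ones alone.

def build_variational(priors, given=None):
    q = dict(given or {})  # keep any approximations the user supplied
    for prior in priors:
        if prior not in q:
            # default: fully factorized normal, one per latent variable,
            # defined on the unconstrained space
            q[prior] = {"family": "Normal",
                        "loc": prior + "_loc",
                        "scale": prior + "_scale"}
    return q
```

So build_variational(["mu", "sigma"]) fills in normals for both, while a user-supplied approximation for "mu" would pass through untouched.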

Initialization / hyperparameter adaptation and convergence diagnostics. These are likely useful across the board. ADVI currently doesn’t do either very well—this is a hugely important research challenge that I think people should work on.

Agreed that building on KLqp makes sense, since one can already do ‘ADVI’ with the current KLqp inference (although not so automatically).

That WIP squares with how I was thinking of doing automated transforms, and I totally agree about the API looking like the current automatic MAP and MCMC variational models. I’m a little surprised that tf.Distribution doesn’t have a support attribute, but implementing one shouldn’t be too troublesome.

Regarding convergence diagnostics, agreed. PyMC3 has an unconventional Adagrad variant that effectively windows the gradients. I think a preferable solution is to offer a range of standard SGD optimisers with defaults that are sensible for VI (and then the tooling the user needs to adapt them when those defaults fail).

For example, I’ve found that changing Adam’s (beta1, beta2) defaults of (0.9, 0.999) to (0.9, 0.99) can take certain problems from total failure to converge to convergence in seconds. If this holds for VI optimisation problems more broadly, it would be great for Edward to ship Adam defaults that differ from TensorFlow’s.
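To make the roles of beta1/beta2 concrete, here is a from-scratch Adam step on a toy quadratic (this is not Edward or TensorFlow code, and on a problem this simple either beta setting converges; the point is only where the betas enter the update):

```python
# Minimal Adam on f(x) = (x - 3)^2, whose gradient is 2(x - 3).
# beta2 controls how long a window of squared gradients the second-moment
# estimate averages over; a smaller beta2 (0.99 vs 0.999) forgets faster.

def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.99, eps=1e-8, steps=200):
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (v_hat ** 0.5 + eps)
    return x

x_star = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)  # ends near 3
```

Swapping the beta2 default in one place like this is all an Edward-specific optimiser preset would need to do.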

But to recap, it sounds like a minimum viable implementation could be broken up into these PRs:

1. Add a support attribute to the RV class.

2. Write a utility to automatically transform a given RV to the real line.

3. Add an option to KLqp to automatically build the variational model (by transforming the priors, then using a spherical multivariate normal).

The biggest design problem in the last of these is deciding how to store the transformed variables. One solution is to store the transformed priors and their variational models under RV._transformed_latent_vars, and to put the original priors and transformed variational models in RV.latent_vars. That way the whole transformation is nicely abstracted away.
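A toy sketch of that bookkeeping (class and attribute names follow the proposal above but are purely illustrative; the RV stand-ins carry only a name and a support string):

```python
from collections import namedtuple

# Toy stand-in for a random variable: just a name and a support string.
RV = namedtuple("RV", ["name", "support"])

class KLqpSketch:
    """Illustrative bookkeeping only, not Edward's actual KLqp."""

    def __init__(self, latent_vars):
        self.latent_vars = {}               # what the user sees and keys on
        self._transformed_latent_vars = {}  # transformation hidden in here
        for prior, q in latent_vars.items():
            if prior.support == "real":
                self.latent_vars[prior] = q
            else:
                # a real implementation would wrap prior and q in a
                # TransformedDistribution mapping the support to the real line
                t_prior = RV(prior.name + "_unconstrained", "real")
                t_q = RV("q_" + prior.name + "_unconstrained", "real")
                self._transformed_latent_vars[t_prior] = t_q
                # the caller still keys on the original, constrained prior
                self.latent_vars[prior] = t_q
```

For example, with sigma on a positive support and mu on the real line, only sigma’s transformed pair lands in the private dict; the user-facing dict still keys on both original priors.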

We have supports added to some classes in random_variables.py, so you can start immediately on the PR with the transform utility function plus unit tests (for random variables without the support attribute, raise an error for now). The next PR can then use the function within inference.initialize.

There are still design questions about the best way to add supports for all random variables; I think we’ll have a clearer idea as applications using them are implemented.

Following TransformedDistribution, they’re just another random variable, so no new methods are needed. See the WIP branch for details.
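For a concrete check of the pattern (plain Python, assuming only the standard change-of-variables formula that TransformedDistribution implements): if Y = log(X) with X ~ Exponential(rate=1), then the transformed log-density is the base log-density evaluated at exp(y) plus the log-Jacobian log|d exp(y)/dy| = y:

```python
import math

# Change-of-variables sketch: log p_Y(y) = log p_X(exp(y)) + y,
# where the +y term is the log absolute Jacobian of the inverse map.

def exp_log_prob(x, rate=1.0):
    # Exponential(rate) log-density on (0, inf)
    return math.log(rate) - rate * x

def transformed_log_prob(y, rate=1.0):
    x = math.exp(y)   # inverse transform back to the base support
    log_jac = y       # log|dx/dy| = log(exp(y)) = y
    return exp_log_prob(x, rate) + log_jac
```

The resulting density exp(transformed_log_prob(y)) integrates to one over the real line, which is exactly the property that lets the transformed variable behave as just another random variable.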