Agreed that building on KLqp makes sense, since one could already do ADVI with the current KLqp inference (although it wouldn't be so automatic).
That WIP squares with how I was thinking of doing automated transforms, and I totally agree that the API should look like the current automatic MAP and MCMC variational models. I'm a little surprised that tf.Distribution doesn't have a support attribute, but implementing one shouldn't be too troublesome.
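Something along these lines is what I have in mind — a rough pure-Python sketch, where `SUPPORTS` and `unconstraining_bijector` are hypothetical names rather than an existing API:

```python
# A rough sketch of support-driven transforms; `SUPPORTS` and
# `unconstraining_bijector` are hypothetical names, not an existing API.
import math

# Map each declared support to a (forward, inverse) pair, where
# forward sends the constrained space onto the whole real line.
SUPPORTS = {
    "real": (lambda x: x, lambda y: y),
    "positive": (math.log, math.exp),
    "unit_interval": (lambda x: math.log(x / (1.0 - x)),      # logit
                      lambda y: 1.0 / (1.0 + math.exp(-y))),  # sigmoid
}

def unconstraining_bijector(support):
    """Return the (forward, inverse) transform for a declared support."""
    if support not in SUPPORTS:
        raise NotImplementedError("no transform registered for %r" % support)
    return SUPPORTS[support]

# A Gamma latent variable, say, would declare support="positive":
forward, inverse = unconstraining_bijector("positive")
eta = forward(2.5)   # optimise eta freely over the reals
z = inverse(eta)     # map back to the positive half-line
```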
Regarding convergence diagnostics, agreed. pymc3 has an unconventional Adagrad variant that effectively windows the gradients. I think a preferable solution is to offer a range of standard SGD optimisers with defaults that are sensible for VI (and then the tooling users need to adapt them when those defaults fail).
For example, I've found that changing Adam's (beta1, beta2) from (0.99, 0.999) to (0.9, 0.99) can take certain problems from total failure to converge to converging in seconds. If this holds for VI optimisation problems more broadly, it would be great to ship Adam defaults in Edward that differ from TensorFlow's (0.9, 0.999).
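To make this concrete, here's a minimal sketch of passing a pre-configured optimiser to KLqp via the `optimizer` argument that Edward's `run`/`initialize` already accept; the toy model, the data, and the (0.9, 0.99) betas are illustrative, not proposed defaults:

```python
# A minimal sketch, assuming Edward's KLqp and its `optimizer` argument;
# the toy model and the (0.9, 0.99) betas are purely illustrative.
import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Normal

x_data = np.random.normal(0.5, 1.0, size=50).astype(np.float32)

# Toy model: unknown mean with a standard normal prior.
mu = Normal(loc=0.0, scale=1.0)
x = Normal(loc=mu, scale=1.0, sample_shape=50)

# Mean-field variational approximation.
qmu = Normal(loc=tf.Variable(0.0),
             scale=tf.nn.softplus(tf.Variable(0.0)))

# Override TensorFlow's (0.9, 0.999) betas with a shorter gradient memory.
optimizer = tf.train.AdamOptimizer(learning_rate=0.01, beta1=0.9, beta2=0.99)

inference = ed.KLqp({mu: qmu}, data={x: x_data})
inference.run(n_iter=1000, optimizer=optimizer)
```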