MDPs using Edward


Hi all,

just thought I’d start a thread for folks interested in using Edward to do variational inference/optimization for MDPs and RL/sequential decisions in general.

Here’s roughly the setup that I think is a good first attempt for most tasks, taken from Shakir Mohamed’s NIPS 2016 talk.
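
Roughly speaking, the objective is an entropy-regularised expected return; this is my own paraphrase of the setup rather than the exact slide:

$$
\mathcal{J}(\theta) \;=\; \mathbb{E}_{p_\theta(\tau)}\!\Big[\textstyle\sum_t r(s_t, a_t)\Big] \;+\; \alpha\, \mathbb{H}\big[\pi_\theta(a \mid s)\big]
$$

where $p_\theta(\tau)$ is the distribution over trajectories induced by the stochastic policy $\pi_\theta$, and $\alpha$ trades off expected reward against the entropy bonus. It’s that last entropy term I come back to below in the A3C context.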

I’m only working on very simple stuff as a proof of principle. For example, I’m interested in getting bandits or grid worlds like FrozenLake working, and then taking things from there.
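
In case it helps anyone starting from zero, here’s the sort of minimal environment loop I mean, using OpenAI Gym’s FrozenLake with a uniform random policy just to show the interface. The policy is obviously the part to replace with something learned; `choose_action` and the episode count are placeholders I made up:

```python
import gym
import numpy as np

# Minimal rollout loop on FrozenLake; only the environment plumbing,
# with a uniform random policy standing in for a learned one.
env = gym.make('FrozenLake-v0')

def choose_action(state, n_actions):
    # Placeholder policy: uniform random over the action space.
    return np.random.randint(n_actions)

n_episodes = 100
returns = []
for _ in range(n_episodes):
    state = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = choose_action(state, env.action_space.n)
        state, reward, done, _ = env.step(action)
        total_reward += reward
    returns.append(total_reward)

print('mean return over %d episodes: %.3f' % (n_episodes, np.mean(returns)))
```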

It’d be great to hear what other sequential decision tasks folks have managed to get working, or are interested in applying variational inference to.




I’m also a big fan of Bayesian policy search. Bayesian policy search is a simple method that’s easy to add onto current state-of-the-art policy gradient methods. And it’s easy to see where model-based RL/learning a dynamics model of the environment fits in.
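
To make that concrete, here’s a very rough sketch of what I mean by Bayesian policy search on a toy n-armed bandit, in Edward/TensorFlow style: keep a Gaussian variational posterior over the policy logits, and maximise a one-sample estimate of expected reward minus a Monte-Carlo KL to the prior. Everything here (`n_arms`, the reward placeholder, the learning rate) is made up for illustration, and I’ve avoided any library-specific KL helper:

```python
import tensorflow as tf
from edward.models import Normal

n_arms = 5  # toy bandit size, chosen arbitrarily

# Prior and variational posterior over the policy logits (one per arm).
w_prior = Normal(loc=tf.zeros(n_arms), scale=tf.ones(n_arms))
qw_loc = tf.Variable(tf.zeros(n_arms))
qw_scale = tf.nn.softplus(tf.Variable(tf.zeros(n_arms)))
qw = Normal(loc=qw_loc, scale=qw_scale)

w = qw.value()            # one reparameterised sample of the logits
pi = tf.nn.softmax(w)     # stochastic policy over arms

# Per-arm reward estimates gathered from rollouts (fed in from outside).
reward_estimates = tf.placeholder(tf.float32, [n_arms])
expected_reward = tf.reduce_sum(pi * reward_estimates)

# Monte-Carlo KL(q || p) using the same sample of the logits.
kl_mc = tf.reduce_sum(qw.log_prob(w) - w_prior.log_prob(w))

# Maximise expected reward minus the KL, i.e. minimise the negative.
loss = -(expected_reward - kl_mc)
train_op = tf.train.AdamOptimizer(0.05).minimize(loss)
```

A learned dynamics/reward model would slot in wherever `reward_estimates` comes from, which is where I see the model-based side fitting in.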


Thanks a lot for the thumbs up!

I still think I need to read up a bit more before I’m confident I know what I’m doing. I’m guessing I’ve been a bit too ambitious with my first attempt - I’m trying to modify this repo

which has a fairly nice/clean implementation of A3C applied to bandits and gridworlds. Because they’re simple tasks they train quite quickly, hours rather than days :slight_smile:

The A3C-LSTM RL algorithm is fairly friendly, as it already has something similar to the last entropy term in the equation above, and also a stochastic policy.
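
For reference, the bit I mean looks roughly like this in TensorFlow (my own paraphrase rather than the repo’s actual code; all the tensor names and the entropy coefficient are invented for illustration):

```python
import tensorflow as tf

# Sketch of an A3C-style policy loss with an entropy bonus, which plays
# the same role as the entropy term in the objective above.
n_actions = 4

policy_logits = tf.placeholder(tf.float32, [None, n_actions])  # e.g. from the LSTM head
actions = tf.placeholder(tf.int32, [None])                     # actions actually taken
advantages = tf.placeholder(tf.float32, [None])                # e.g. R_t - V(s_t)
entropy_coef = 0.01

log_pi = tf.nn.log_softmax(policy_logits)
pi = tf.nn.softmax(policy_logits)

# log pi(a_t | s_t) for the actions that were taken.
action_one_hot = tf.one_hot(actions, n_actions)
log_pi_taken = tf.reduce_sum(action_one_hot * log_pi, axis=1)

# REINFORCE-style surrogate weighted by the advantages.
policy_loss = -tf.reduce_mean(log_pi_taken * advantages)

# Entropy of the stochastic policy; maximising it discourages premature collapse.
entropy = -tf.reduce_mean(tf.reduce_sum(pi * log_pi, axis=1))

loss = policy_loss - entropy_coef * entropy
```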


One last thing about the above Meta-RL implementation: it seems to be conceptually very similar to the global optimization setup in this paper,

Learning to Learn without Gradient Descent by Gradient Descent (arXiv:1611.03824 [stat.ML])

That approach seems to be an improvement over previous Bayesian optimization methods (e.g. Spearmint) for hyper-parameter tuning.