Just thought I’d start a thread for folks interested in using Edward to do variational inference/optimization for MDPs, RL, and sequential decision problems in general.

Here’s roughly the setup that I think makes a good first attempt for most tasks.

I’m only working on very simple stuff as a proof of principle: for example, getting bandits or grid worlds like FrozenLake working, and then taking things from there.

It’d be great to hear what other sequential decision tasks folks have managed to get working, or are interested in applying variational inference to.

I’m also a big fan of Bayesian policy search: it’s a simple method that’s easy to add on top of current state-of-the-art policy gradient methods, and it’s easy to see where model-based RL (learning a dynamics model of the environment) fits in.
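To make the idea concrete, here's a minimal sketch of Bayesian policy search on a toy two-armed Bernoulli bandit, using plain NumPy rather than Edward. The arm reward probabilities, step sizes, and iteration count are all made up for illustration, and I've held the variational standard deviation fixed for simplicity; a fuller version would learn it too and include the entropy of q in the objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed Bernoulli bandit (arm 1 is the better arm).
p_reward = np.array([0.2, 0.8])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Variational distribution over the policy logits: q(theta) = N(mu, sigma^2 I),
# with sigma held fixed for simplicity.
mu = np.zeros(2)
sigma = 0.3
lr = 0.05
baseline = 0.0  # running average reward, used to reduce gradient variance

for t in range(5000):
    eps = rng.standard_normal(2)
    theta = mu + sigma * eps          # reparameterised sample of the policy
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = float(rng.random() < p_reward[a])

    # Score-function (REINFORCE) gradient of expected reward w.r.t. theta;
    # since d theta / d mu = 1, it is also the gradient w.r.t. mu.
    grad = (r - baseline) * ((np.arange(2) == a).astype(float) - probs)
    mu += lr * grad
    baseline += 0.01 * (r - baseline)

print(softmax(mu))  # mass should shift toward the better arm
```

Even this stripped-down version shows the basic recipe: sample a policy from q, run it, and use the reward-weighted score to move q's parameters.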

Still think I need to read up a bit more before I’m confident I know what I’m doing. I’m guessing I’ve been a bit too ambitious with my first attempt - I’m trying to modify this repo

which has a fairly nice, clean implementation of A3C applied to bandits and gridworlds. Because they’re simple tasks, they train quite quickly: hours rather than days.

The A3C-LSTM algorithm is fairly friendly, as it already has something similar to the last entropy term in the equation above, as well as a stochastic policy.
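For concreteness, here's roughly what that entropy term looks like in A3C's per-step policy loss, sketched for a discrete softmax policy. The entropy coefficient `beta` and the advantage value are placeholder numbers, not taken from any particular implementation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Sketch of the per-step A3C policy loss for a discrete softmax policy:
#   L = -log pi(a|s) * advantage - beta * H[pi(.|s)]
# The entropy bonus H discourages the policy from collapsing to a
# deterministic choice too early, which keeps exploration alive.
def policy_loss(logits, action, advantage, beta=0.01):
    probs = softmax(logits)
    entropy = -np.sum(probs * np.log(probs))
    return -np.log(probs[action]) * advantage - beta * entropy

loss = policy_loss(np.array([0.5, -0.2, 0.1]), action=0, advantage=1.3)
```

Since the entropy of a stochastic policy is strictly positive, turning the bonus on always lowers the loss a little relative to `beta = 0`, which is the same structural role the entropy term plays in the variational objective.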