MDPs using Edward

Hi all,

Just thought I’d start a thread for folks interested in using Edward to do variational inference/optimization for MDPs, and for RL/sequential decision-making in general.

Here’s roughly the setup I think is a good first attempt for most tasks

taken from Shakir Mohamed’s NIPS 2016 talk, https://www.youtube.com/watch?v=AggqBRdz6CQ&feature=youtu.be&t=9m53s

I’m only working on very simple stuff as proof of principle. So for example I’m interested in getting bandits, or grid worlds like Frozen-Lake working, and then taking things from there.

Be great to hear what other sequential decision tasks folks have managed to get working, or are interested in applying variational inference to.

Cheers,

Aj

I’m also a big fan of Bayesian policy search. It’s a simple method that’s easy to add onto current state-of-the-art policy gradient methods, and it’s easy to see where model-based RL (learning a dynamics model of the environment) fits in.

Thanks a lot for the thumbs up!

Still think I need to read up a bit more before I’m confident I know what I’m doing. I’m guessing I’ve been a bit too ambitious with my first attempt - I’m trying to modify this repo

which has a fairly nice/clean implementation of A3C applied to bandits and gridworlds. Because they’re simple tasks, they train quite quickly: hours rather than days.

The A3C-LSTM RL algorithm is fairly friendly, as it already has something similar to the last entropy term in the equation above, and also a stochastic policy.
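To make the entropy point concrete, here is a minimal sketch on a toy bandit. It is not taken from Edward or from the A3C repo. It assumes the entropy term is the usual policy-entropy bonus, so the objective is expected reward plus `beta` times the entropy of a softmax policy. The names `rewards`, `beta`, `lr` and `train` are made up for illustration, and the expected rewards are treated as known so the gradient can be computed exactly rather than sampled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def objective(logits, rewards, beta):
    """Expected reward plus beta times policy entropy (assumed form)."""
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return expected_reward + beta * entropy(probs)


def gradient(logits, rewards, beta):
    """Exact gradient of the objective with respect to the logits.

    With "soft" rewards q_a = r_a - beta * log(pi_a), the gradient for
    arm k is pi_k * (q_k - sum_a pi_a * q_a).
    """
    probs = softmax(logits)
    soft = [r - beta * math.log(p) for p, r in zip(probs, rewards)]
    baseline = sum(p * q for p, q in zip(probs, soft))
    return [p * (q - baseline) for p, q in zip(probs, soft)]


def train(rewards, beta=0.5, lr=0.5, steps=5000):
    """Plain gradient ascent on the logits, starting from a uniform policy."""
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        grad = gradient(logits, rewards, beta)
        logits = [x + lr * g for x, g in zip(logits, grad)]
    return logits


# Hypothetical 3-armed bandit with known expected rewards.
rewards = [0.2, 0.5, 0.8]
beta = 0.5
logits = train(rewards, beta)
policy = softmax(logits)

# The maximiser of this objective is the Boltzmann policy softmax(r / beta).
target = softmax([r / beta for r in rewards])
print([round(p, 4) for p in policy])
print([round(t, 4) for t in target])
```

The entropy bonus keeps the policy stochastic: instead of collapsing onto the best arm, it settles on a Boltzmann distribution over the rewards, with `beta` acting as a temperature. A sampled version would replace the exact gradient with a score-function estimate.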

One last thing about the above `Meta-RL` implementation: it seems to be conceptually very similar to the global optimization setup in this paper,

Learning to learn without gradient descent by gradient descent (arXiv:1611.03824v4 [stat.ML]), http://ift.tt/2g4zLK3

Which seems to be an improvement over previous Bayesian methods (e.g. Spearmint) for hyper-parameter tuning.