Hi all,
just thought I’d start a thread for folks interested in using Edward to do variational inference/optimization for MDPs and RL/sequential decisions in general.
Here’s roughly the setup I think makes a good first attempt for most tasks,
taken from Shakir Mohamed’s NIPS 2016 talk: https://www.youtube.com/watch?v=AggqBRdz6CQ&feature=youtu.be&t=9m53s
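To make that concrete, the objective from that part of the talk is (roughly) an entropy-regularised expected reward, J(θ) = E_{π_θ}[r] + β H(π_θ). Here’s a minimal NumPy sketch of it on a made-up 3-armed bandit - the arm means, β and step size are just placeholders, not anything from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.8])   # made-up arm rewards
k = len(true_means)
theta = np.zeros(k)                      # softmax policy logits
beta, lr = 0.1, 0.1                      # entropy weight, step size

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(5000):
    pi = softmax(theta)
    a = rng.choice(k, p=pi)
    r = rng.normal(true_means[a], 0.1)
    # REINFORCE estimate of the gradient of E_pi[r] w.r.t. the logits
    grad_return = r * ((np.arange(k) == a) - pi)
    # exact gradient of the entropy bonus H(pi) w.r.t. the logits
    grad_entropy = -pi * (np.log(pi) - np.sum(pi * np.log(pi)))
    theta += lr * (grad_return + beta * grad_entropy)

print("learned policy:", softmax(theta))   # should favour the best arm
```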
I’m only working on very simple stuff as a proof of principle. So for example I’m interested in getting bandits or grid worlds like FrozenLake working, and then taking things from there.
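On the bandit side, the first thing I want to try is just inferring a posterior over each arm’s mean reward with Edward and then doing Thompson-style action selection from it. A rough sketch, assuming Edward 1.x on TF 1.x - the data is simulated and the variable names are my own, so treat this as an outline rather than tested code:

```python
import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Normal

K, N = 3, 200
true_means = np.array([0.1, 0.5, 0.8], dtype=np.float32)    # simulated bandit
arms = np.random.randint(0, K, size=N)
rewards = np.random.normal(true_means[arms], 0.1).astype(np.float32)

mu = Normal(loc=tf.zeros(K), scale=tf.ones(K))        # prior over arm means
r = Normal(loc=tf.gather(mu, arms), scale=0.1)        # likelihood of observed rewards

# mean-field variational posterior over the arm means
qmu = Normal(loc=tf.Variable(tf.zeros(K)),
             scale=tf.nn.softplus(tf.Variable(tf.zeros(K))))

inference = ed.KLqp({mu: qmu}, data={r: rewards})
inference.run(n_iter=1000)

sess = ed.get_session()
print("posterior arm means:", sess.run(qmu.mean()))
# Thompson-style choice: sample arm means from the posterior, pick the argmax
print("sampled action:", sess.run(tf.argmax(qmu.sample())))
```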
It’d be great to hear what other sequential decision tasks folks have managed to get working, or are interested in applying variational inference to.
Cheers,
Aj
I’m also a big fan of Bayesian policy search. It’s a simple method that’s easy to add onto current state-of-the-art policy gradient methods, and it’s easy to see where model-based RL/learning a dynamics model of the environment fits in.
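To illustrate what I mean by “easy to add on”: keep the usual REINFORCE/policy-gradient estimator, but optimise a Gaussian q(θ) over the policy parameters with a KL penalty back to an N(0, I) prior. A toy NumPy sketch on a made-up bandit (the step size and KL weight are arbitrary, and this is just one way of reading “Bayesian policy search”):

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.1, 0.5, 0.8])   # made-up bandit
k = len(true_means)

m = np.zeros(k)            # variational mean over policy logits
rho = np.full(k, -1.0)     # variational scale via softplus(rho)
lr, kl_weight = 0.05, 0.01

def softplus(x): return np.log1p(np.exp(x))
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def softmax(x):
    z = np.exp(x - x.max()); return z / z.sum()

for step in range(5000):
    s = softplus(rho)
    eps = rng.standard_normal(k)
    theta = m + s * eps                  # reparameterised policy sample
    pi = softmax(theta)
    a = rng.choice(k, p=pi)
    r = rng.normal(true_means[a], 0.1)

    # REINFORCE estimate of d E[r] / d theta
    g_theta = r * ((np.arange(k) == a) - pi)
    # closed-form gradients of KL(q || N(0, I)) w.r.t. m and rho
    g_kl_m = m
    g_kl_rho = (s - 1.0 / s) * sigmoid(rho)

    m   += lr * (g_theta - kl_weight * g_kl_m)
    rho += lr * (g_theta * eps * sigmoid(rho) - kl_weight * g_kl_rho)

print("posterior mean policy:", softmax(m))   # should favour the best arm
```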
Thanks a lot for the thumbs up!
Still think I need to read up a bit more before I’m confident I know what I’m doing. I’m guessing I’ve been a bit too ambitious with my first attempt - I’m trying to modify this repo,
which has a fairly nice/clean implementation of A3C applied to bandits and gridworlds. Because they’re simple tasks they train quite quickly - hours rather than days.
The A3C-LSTM RL algorithm is fairly friendly, as it’s already got something similar to the last entropy term in the equation above, and also a stochastic policy.
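For anyone skimming the repo, the piece I mean looks roughly like this (TF 1.x style; the placeholder names below are mine, not the repo’s actual variables) - the entropy term at the end is the bit that lines up with the entropy term in the objective above:

```python
import tensorflow as tf

n_actions = 4
logits = tf.placeholder(tf.float32, [None, n_actions])   # policy head output
value = tf.placeholder(tf.float32, [None])                # value head output
actions = tf.placeholder(tf.int32, [None])                # actions taken
target_v = tf.placeholder(tf.float32, [None])             # discounted returns
advantages = tf.placeholder(tf.float32, [None])           # advantage estimates

policy = tf.nn.softmax(logits)
log_policy = tf.nn.log_softmax(logits)
log_pi_a = tf.reduce_sum(tf.one_hot(actions, n_actions) * log_policy, axis=1)

policy_loss = -tf.reduce_sum(log_pi_a * advantages)
value_loss = 0.5 * tf.reduce_sum(tf.square(target_v - value))
entropy = -tf.reduce_sum(policy * log_policy)   # the entropy bonus

beta = 0.01
total_loss = policy_loss + 0.5 * value_loss - beta * entropy
```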
One last thing about the above Meta-RL implementation: it seems to be conceptually very similar to the global optimization setup in this paper,
Learning to Learn without Gradient Descent by Gradient Descent (arXiv:1611.03824), http://ift.tt/2g4zLK3
It seems to be an improvement over previous Bayesian methods (e.g. Spearmint) for hyper-parameter tuning.