Fitting a Normal Distribution (comparison with stan, PyMC)

cshenton · August 25, 2017, 8:58am

I’ve written a super simple example trying to recover the scale and location of a normal distribution in edward, pymc3, and pystan. It’s for my own purposes, but I thought it might be interesting here to show the differences between the APIs.

My pymc3 and stan examples are working fine, but I’m getting some unexpected results from the edward implementation.

Estimated scale parameter mean varies a lot (around 0.1-0.2 for a scale of 1.5) from run to run for KLqp
Estimation fails entirely for HMC

The repository with each of the implementations is here, if anyone is interested. I made sure that the specification was the same across each example. At a high level, the advi implementation in pymc seems to be producing more stable results for a given runtime.

dustin · August 25, 2017, 11:52pm

Thanks for the comparison! That was interesting. Here are some tips:

You can write the model more simply as

y = NormalWithSoftplusScale(loc=mu, scale=inv_softplus_sigma, sample_shape=N)

ed.KLqp got to the answer in very few iterations and simply perturbed around the answer over more iterations. You can more stable scale parameters across runs using a smaller learning rate.
Never believe in automatic algorithms. They can give a false sense of security in toy problems and failure in real problems. In particular, ed.HMC here is not automatic: you need to tune the step size parameter and number of leapfrog steps.

In general, successes/failures on a normal-normal model are not illuminating because they may not be indicative of real successes/failures. For example, decide on a real-world task such as Bayesian neural network classification on MNIST. How easy is it in the PPL to specify the program, inference, and evaluation metric? Which is most automatic? Which is most flexible with respect to the model and/or inference? Which is fastest? Which had the best documentation to help with this experiment?

cshenton · August 26, 2017, 1:22am

Hey Dustin, thanks for the reply, Edward’s approach is by far my favourite, even if it does lack the bells and whistles of the more mature libraries.

That’s way cleaner, thanks.
Good to know, I tried to find where to set the learning rate in the docs but have come up short. Is it strictly part of edward’s API? Or is it one of the kwargs that gets passed down to the tensorflow optimiser?
Still having trouble with this one. No matter what value I’m setting for step_size and n_steps I’m getting a flat 0.000 acceptance rate on every run. Is this because HMC isn’t doing warm up?

That’s a fair point. My current use case is around time series, so correct estimation of errors is more important to me than dealing with high dimensional parameter spaces, like with bayesian neural networks. Also I’m a firm believer in recovering parameters from simulated data before using real data, and a static normal distribution is a baseline I’ll use to compare to more complex models, so it’s a toy example, but one that I need to work.

I’d like to understand the algorithms in Edward a little more though, could you point me to the paper for KLqp?

dustin · August 26, 2017, 3:11am

It’s one of the arguments to the optimizer. For example,

optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
inference.run(optimizer=optimizer)

You can find more doc in TensorFlow’s tf.train.Optimizer API.

Zero acceptance rate implies the sample is always being rejected. You should lower the step_size to avoid too large proposals. Tuning it alongside the number of leapfrog steps, the following gets good results. Note I also took the mean over the last 1000 samples and ignored the first 1000.

import edward as ed
import numpy as np
import tensorflow as tf

from edward.models import (
    Empirical,
    Normal,
    NormalWithSoftplusScale,
)

MU = 6.0
SIGMA = 1.5
N = 1000

# DATA
y_train = np.random.normal(MU, SIGMA, [N])

# MODEL
mu = Normal(loc=0.0, scale=5.0)
inv_softplus_sigma = Normal(loc=0.0, scale=1.0)
y = NormalWithSoftplusScale(loc=mu, scale=inv_softplus_sigma, sample_shape=N)

q_mu = Empirical(params=tf.Variable(tf.random_normal([2000])))
q_inv_softplus_sigma = Empirical(params=tf.Variable(tf.random_normal([2000])))

# INFERENCE
inference = ed.HMC({mu: q_mu, inv_softplus_sigma: q_inv_softplus_sigma},
                   {y: y_train})
inference.run(step_size=0.003, n_steps=5)

print(tf.reduce_mean(q_mu.params[1000:]).eval())
print(tf.nn.softplus(tf.reduce_mean(q_inv_softplus_sigma.params[1000:])).eval())

On my Laptop’s CPU, this returns

2000/2000 [100%] ██████████████████████████████ Elapsed: 8s | Acceptance Rate: 0.999
6.01639
1.50348

cshenton · August 27, 2017, 11:03am

Thanks Dustin, that makes things super clear, especially around how burn in works with MCMC in edward. Have updated things on my end and it’s all working!

Per your suggestion, will give edward a go on a real world problem now that I have the basics down.

Topic		Replies	Views
Parameter Learning with Simple Bayesian Network; PyMC3 vs. Edward; Edward posteriors not converging around correct parameter values	3	2862	March 27, 2018
A toy normal model failed (klqp) and why?	2	1689	July 25, 2017
Fitting a t-distribution (BEST)	5	1225	September 25, 2017
Fixed scales in regression example	6	1395	August 29, 2017
HMC and KLpq fail on edward while pymc3 works fine	0	924	April 16, 2018

Fitting a Normal Distribution (comparison with stan, PyMC)

Related topics