Fitting a Normal Distribution (comparison with stan, PyMC)


#1

I’ve written a super simple example trying to recover the scale and location of a normal distribution in edward, pymc3, and pystan. It’s for my own purposes, but I thought it might be interesting here to show the differences between the APIs.

My pymc3 and stan examples are working fine, but I’m getting some unexpected results from the edward implementation.

  • Estimated scale parameter mean varies a lot (around 0.1-0.2 for a scale of 1.5) from run to run for KLqp
  • Estimation fails entirely for HMC

The repository with each of the implementations is here, if anyone is interested. I made sure that the specification was the same across each example. At a high level, the advi implementation in pymc seems to be producing more stable results for a given runtime.


#2

Thanks for the comparison! That was interesting. Here are some tips:

  1. You can write the model more simply as
y = NormalWithSoftplusScale(loc=mu, scale=inv_softplus_sigma, sample_shape=N)
  1. ed.KLqp got to the answer in very few iterations and simply perturbed around the answer over more iterations. You can more stable scale parameters across runs using a smaller learning rate.
  2. Never believe in automatic algorithms. They can give a false sense of security in toy problems and failure in real problems. In particular, ed.HMC here is not automatic: you need to tune the step size parameter and number of leapfrog steps.

In general, successes/failures on a normal-normal model are not illuminating because they may not be indicative of real successes/failures. For example, decide on a real-world task such as Bayesian neural network classification on MNIST. How easy is it in the PPL to specify the program, inference, and evaluation metric? Which is most automatic? Which is most flexible with respect to the model and/or inference? Which is fastest? Which had the best documentation to help with this experiment?


#3

Hey Dustin, thanks for the reply, Edward’s approach is by far my favourite, even if it does lack the bells and whistles of the more mature libraries.

  1. That’s way cleaner, thanks.
  2. Good to know, I tried to find where to set the learning rate in the docs but have come up short. Is it strictly part of edward’s API? Or is it one of the kwargs that gets passed down to the tensorflow optimiser?
  3. Still having trouble with this one. No matter what value I’m setting for step_size and n_steps I’m getting a flat 0.000 acceptance rate on every run. Is this because HMC isn’t doing warm up?

That’s a fair point. My current use case is around time series, so correct estimation of errors is more important to me than dealing with high dimensional parameter spaces, like with bayesian neural networks. Also I’m a firm believer in recovering parameters from simulated data before using real data, and a static normal distribution is a baseline I’ll use to compare to more complex models, so it’s a toy example, but one that I need to work.

I’d like to understand the algorithms in Edward a little more though, could you point me to the paper for KLqp?


#4

It’s one of the arguments to the optimizer. For example,

optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
inference.run(optimizer=optimizer)

You can find more doc in TensorFlow’s tf.train.Optimizer API.

Zero acceptance rate implies the sample is always being rejected. You should lower the step_size to avoid too large proposals. Tuning it alongside the number of leapfrog steps, the following gets good results. Note I also took the mean over the last 1000 samples and ignored the first 1000.

import edward as ed
import numpy as np
import tensorflow as tf

from edward.models import (
    Empirical,
    Normal,
    NormalWithSoftplusScale,
)

MU = 6.0
SIGMA = 1.5
N = 1000

# DATA
y_train = np.random.normal(MU, SIGMA, [N])

# MODEL
mu = Normal(loc=0.0, scale=5.0)
inv_softplus_sigma = Normal(loc=0.0, scale=1.0)
y = NormalWithSoftplusScale(loc=mu, scale=inv_softplus_sigma, sample_shape=N)

q_mu = Empirical(params=tf.Variable(tf.random_normal([2000])))
q_inv_softplus_sigma = Empirical(params=tf.Variable(tf.random_normal([2000])))

# INFERENCE
inference = ed.HMC({mu: q_mu, inv_softplus_sigma: q_inv_softplus_sigma},
                   {y: y_train})
inference.run(step_size=0.003, n_steps=5)

print(tf.reduce_mean(q_mu.params[1000:]).eval())
print(tf.nn.softplus(tf.reduce_mean(q_inv_softplus_sigma.params[1000:])).eval())

On my Laptop’s CPU, this returns

2000/2000 [100%] ██████████████████████████████ Elapsed: 8s | Acceptance Rate: 0.999
6.01639
1.50348

#5

Thanks Dustin, that makes things super clear, especially around how burn in works with MCMC in edward. Have updated things on my end and it’s all working!

Per your suggestion, will give edward a go on a real world problem now that I have the basics down.