Fitting a t-distribution (BEST)


#1

Hello,

I am a new user of Edward and I am currently trying to implement the BEST Bayesian testing described here and implemented on top of PyMC in here.

The method relies on fitting a t-distribution to observed data, with a normal distributed prior for the mean, uniformly distributed prior for the scale and an exponentially distributed prior for the degrees of freedom.
I was able to implement this for a simplified basecase of fitting a normal distribution, but I get very unsatisfying results for the degrees of freedom when I extend this to the t-distribution, even in a toy example.

I hope this is the right place to post this, as I have found a multitude of examples for very basic cases involving the normal distribution online, but nothing for this usecase or similar, especially when relying on sampling and not variational inference.

The following code should reproduce my issue in jupyter.

import edward as ed
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import tensorflow as tf
import scipy.stats as stats
from edward.models import Empirical, Exponential, Normal, Uniform, StudentTWithAbsDfSoftplusScale
`# DATA`
y_train = stats.t.rvs(10, loc=2, scale=5, size=1000, random_state=None)
print(y_train.mean())
print(y_train.std())
`# MODEL`
mu = Normal(loc=0.0, scale=5.0)
std = Uniform(low=0.01, high=10.0)
nu_minus_one = Exponential(rate=1/29)
y = StudentTWithAbsDfSoftplusScale(loc=mu, scale=std, df=nu_minus_one, sample_shape=1000)
q_mu = Empirical(params=tf.Variable(tf.zeros(5000)))
q_std = Empirical(params=tf.Variable(tf.random_uniform([5000], minval=0.01, maxval=10.0)))
q_nu = Empirical(params=tf.Variable(tf.random_uniform([5000], minval=5.0, maxval=30.0)))
`# INFERENCE`
`# inference = ed.HMC({mu: q_mu, std: q_std, nu_minus_one: q_nu}, data={y: y_train})`
`# inference.run(step_size=0.005, n_steps=5)`
`# INFERENCE`
inference = ed.SGHMC({mu: q_mu, std: q_std, nu_minus_one: q_nu}, data={y: y_train})
inference.run(step_size=1e-2, friction=0.1)
print(tf.reduce_mean(q_mu.params[1000:]).eval())
`# 1.7592`
print(tf.reduce_mean(q_std.params[1000:]).eval())
`# 5.15432`
print(tf.reduce_mean(q_nu.params[1000:]).eval())
`# 19.1388`
f, ax = plt.subplots()
f.suptitle("Trace of mu")
plt.plot(q_mu.params[1000:].eval())
f, ax = plt.subplots()
f.suptitle("Trace of std")
plt.plot(q_std.params[1000:].eval())
f, ax = plt.subplots()
f.suptitle("Trace of nu")
plt.plot(q_nu.params[1000:].eval())
f, ax = plt.subplots()
f.suptitle('Distribution of mu')
sns.distplot(q_mu.params[1000:].eval())
f, ax = plt.subplots()
f.suptitle('Distribution of std')
sns.distplot(q_std.params[1000:].eval())
f, ax = plt.subplots()
f.suptitle('Distribution of nu')
sns.distplot(q_nu.params[1000:].eval())

While the trace of mu and std appear satisfying, the trace as well as the distribution of nu appear unsatisfying to me.

Am I missing something? What is the proper way to set the posterior Empirical variable, given an exponentially distributed prior? Or am I interpreting the results wrongly?

Kind regards,

Momo


#2

I played with the code. That exponential distribution seems difficult to fit. I’ve found just running the chain for a very long time gets good results (50-100k samples); or using more data points to better concentrate the posteriors.

Do you know how JAGS or PyMC get it to mix well? Do they also use only 1000 burn-in and 4000 samples?


#3

Hey Dustin,

thanks a lot for looking at my problem.For JAGS I am not sure what happens under the hood but for PYMC3 I assume the adaptive tuning of the HMC parameters through NUTS helps.

I continued to play around, eventually replacing the Uniform and Exponential priors with two (differently parametrized) Gammas and settling for an HMC approach with a larger n_steps parameter, which works in some cases.

In general it appears to me as if both HMC and SGHMC struggle if the scale of the different distributional parameters is very different. If I sample random data from a t-distribution with a smaller dof, say 5, the results are satisfying, however for larger values, like 30, the posterior distribution for the dof generally diverges.

I am trying to implement a rather generic Bayesian testing module, for which I do not wish to tune the MCMC hyperparameters for each application. Is this a feasible goal given the current stage of Edward?

Instead of HMC I tried using a Metropolis Hastings sampler, which is however, unlike the PYMC3 counterpart, not adaptive. I am wondering whether a naive approach, like always moving in normally distributed steps (with equal scaling per dimension) through the hyperparameter space for new posterior sample proposals, makes sense for distributional parameters with vastly different scales?

I’d be happy to contribute to Edward if there is something you believe make sense for my usecase which is not implemented yet.

Kind regards


#4

Could you elaborate on the “adaptive” MH sampler?

Following the original paper, it could be useful to try to use whatever JAGS is doing to fit the model. I’m sure that will involve some improved and new implementations, which would be a valuable contribution.


#5

Hey Dustin,

in the pymc3 implementation, if no proposal distribution is given, it defaults back to a zero mean multivariate normal-distribution. New proposal values are generated from the last accepted value + a Gaussian increment. While I could pass Gaussian proposal variables in Edward as far as I understand it these would not be used to sample increments from the current value, but rather proposal values directly.

Additionally every 100 steps the PYMC3 implementation tunes the variance parameter of the zero mean Gaussian by:

def tune(scale, acc_rate):
    """
    Tunes the scaling parameter for the proposal distribution
    according to the acceptance rate over the last tune_interval:
    Rate    Variance adaptation
    ----    -------------------
    <0.001        x 0.1
    <0.05         x 0.5
    <0.2          x 0.9
    >0.5          x 1.1
    >0.75         x 2
    >0.95         x 10
    """

Do you think that would be a useful addition to edward? I’ll have a look at JAGS over the next couple days aswell.


#6

That’s right. Our MH doesn’t do random walk MH. A contribution enabling this is definitely welcome in addition to the variance parameter turning.