Biased variance estimates when using KLqp for Bayesian Linear Regression


Dear Edward users,

I’m trying to use KLqp to estimate an adaptive shrinkage model and noticed that the variance estimate for the outcome was consistently being overestimated. It turns out that the same problem occurs if we fit a simple Bayesian Linear Regression Model where the signal to noise ratio is high.

The following script reproduces the problem:

import tensorflow as tf
import edward as ed
import numpy as np

from edward.models import Normal


def build_toy_dataset(N, w, noise_sd=0.1, data_sd=1):
    D = len(w)
    x = np.random.normal(0, data_sd, size=(N, D))
    y =, w) + np.random.normal(0, noise_sd, size=N)
    return x, y

### Generate the data
# Note that data_sd >> noise_sd
N = 1000
D = 5

w_true = np.random.randn(D)
noise_sd = 0.1
data_sd = 5
X_train, y_train = build_toy_dataset(N, w_true, noise_sd=noise_sd, data_sd=data_sd)

### Define the model
X = tf.placeholder(tf.float32, [N, D])
w = Normal(loc=tf.zeros(D), scale=tf.ones(D))
b = Normal(loc=tf.zeros(1), scale=tf.ones(1))
log_sd = Normal(loc=tf.zeros(1), scale=tf.ones(1))
y = Normal(, w) + b, scale=tf.exp(log_sd) * tf.ones(N))

qw = Normal(loc=tf.get_variable("qw/loc", [D]),
            scale=tf.nn.softplus(tf.get_variable("qw/scale", [D])))
qb = Normal(loc=tf.get_variable("qb/loc", [1]),
            scale=tf.nn.softplus(tf.get_variable("qb/scale", [1])))
qlog_sd = Normal(loc=tf.get_variable("qlog_sd/loc", [1]),
            scale=tf.nn.softplus(tf.get_variable("qlog_sd/scale", [1])))

### Variational Inference
# Many samples and iterations, just to go sure
inference = ed.KLqp({w: qw, b: qb, log_sd: qlog_sd}, data={X: X_train, y: y_train}), n_iter=5000)

### Print estimates
sess = ed.get_session()

# w is unbiased...
print("w hat: {}".format(
print("w true: {}".format(w_true))

# sd is overestimated..
print("sd hat: {}".format(
print("sd true: {}".format(noise_sd))

This script yields

w hat: [ 0.4973788  -0.13857569  0.6483919   1.52445376 -0.2352934 ]
w true: [ 0.49671415 -0.1382643   0.64768854  1.52302986 -0.23415337]

sd hat: [ 0.53719109]
sd true: 0.1

I’ve played around with different specifications but the problem is persistent: I used an InvGamma prior on the variance instead of the log-normal, I used softplus instead of exp to map the variance variable onto the positive real line, I’ve used score function gradients to fit the model, I’ve increased N, the number of samples, and the number of iterations… no luck. What does work is using MAP inference, but I’d like to use VI.

Also, interestingly, the problem vanishes if we decrease the variance of the predictors. Specifically, if we set data_sd = 1 in the above example the variance is estimated correctly.

Any ideas what’s happening here? Many thanks!


I think you can fix this by setting:

y = Normal(, w) + b, scale=tf.exp(log_sd))

i.e. deleting the final * tf.ones(N). Here’s the result I get after doing this:

5000/5000 [100%] β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ Elapsed: 89s | Loss: -830.284 
w hat: [ 0.49652234 -0.13839932  0.64960593  1.5230887  -0.23396583]
w true: [ 0.49671415 -0.1382643   0.64768854  1.52302986 -0.23415337]
sd hat: [0.10089288]
sd true: 0.1

Why this change was necessary, I can’t say at the moment. I just thought of deleting that part because I thought it wasn’t necessary.


Thanks for the fast reply! I’ll check out the fix asap and report back.

I should note that I used the * tf.ones(N) expression because throughout the examples, the scale parameter is always broadcast up to match the shape of the location parameter. See e.g. here.

If - for some reason - it is the case that such broadcasting affects the KLqp estimation procedure, it might be useful to document it somewhere, or correct the examples.


Oh, I see. I wasn’t aware of that. To be honest, I’m not exactly sure if it’s necessary or not. I deleted it because a scalar also works in terms of broadcasting, but I didn’t really consider it from any other angle. I need to do some more thinking about this :smile:


A quick update: Just using a tf.random_normal_initializer also seems to solve the problem, like this:

y = Normal(, w) + b, scale=tf.exp(log_sd) * tf.ones(N))

qw = Normal(loc=tf.get_variable("qw/loc", [D], initializer=tf.random_normal_initializer()),
            scale=tf.nn.softplus(tf.get_variable("qw/scale", [D], initializer=tf.random_normal_initializer())))
qb = Normal(loc=tf.get_variable("qb/loc", [1], initializer=tf.random_normal_initializer()),
            scale=tf.nn.softplus(tf.get_variable("qb/scale", [1], initializer=tf.random_normal_initializer())))
qlog_sd = Normal(loc=tf.get_variable("qlog_sd/loc", [1], initializer=tf.random_normal_initializer()),
            scale=tf.nn.softplus(tf.get_variable("qlog_sd/scale", [1], initializer=tf.random_normal_initializer())))

Leads to:

5000/5000 [100%] β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ Elapsed: 112s | Loss: -823.016 
w hat: [ 0.49653494 -0.1384867   0.6493899   1.5232401  -0.23401424]
w true: [ 0.49671415 -0.1382643   0.64768854  1.52302986 -0.23415337]
sd hat: [0.10197183]
sd true: 0.1

However if you both use tf.random_normal_initializer and delete * tf.ones(N), it no longer works:

5000/5000 [100%] β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ Elapsed: 85s | Loss: 7315.100     
w hat: [ 0.5368095  -0.17952083  0.62162423  1.5963566  -0.23201847]
w true: [ 0.49671415 -0.1382643   0.64768854  1.52302986 -0.23415337]
sd hat: [1.7548935]
sd true: 0.1

Also, reducing n_samples to 1 works too, without making any other changes. Unfortunately, I don’t know how to reason about any of this.


I did some testing on my end, and I think the inconsistent results are caused by random variations in the initial conditions for the (stochastic) optimization procedure. That is, using the random normal initializer and removing the broadcasting operation don’t have any substantively meaningful effects, but simply lead to different starting parameters and/or different random samples during optimization.

I tested this by rerunning the Klqp inference step in the above script 100 times (on the same data) and recording the resulting variance estimates, yielding a distribution like this:


Hence, we’re able to recover the true value about half of the time. The bottom line seems to be that the current KLqp implementation frequently gets stuck in local minima when trying to fit a simple BLR model.

Any ideas on how to get around this? Is there a simple way to tweak the tensorflow optimizer to make it more robust to local mimina?

Here’s some more evidence; the log (shifted) loss over all 100 iterations, each iteration having been run for 2000 steps. As you can see, most of the time the optimizer gets stuck at a value of around 8. Only sometimes does it find the global minimum.



You’re absolutely right, sorry about the nonsense earlier. I completely forgot about the fixed seed, so I was getting pretty consistent results.

Regarding the local minimum: I think one thing you could try is to set the initial scales of the approximate posteriors to be low. Like so:

qw = Normal(loc=tf.get_variable("qw/loc", [D]),
            scale=tf.nn.softplus(tf.Variable(tf.zeros(D)-10., name='qw/scale')) + 1e-8)
qb = Normal(loc=tf.get_variable("qb/loc", [1]),
            scale=tf.nn.softplus(tf.get_variable("qb/scale", [1])) + 1e-8)
qlog_sd = Normal(loc=tf.get_variable("qlog_sd/loc", [1]),
            scale=tf.nn.softplus(tf.Variable(-10., name='qlog_sd/scale')) + 1e-8)

I tried recreating your plot with the histogram of sd estimates in this setting with 100 runs, and here’s what I got:

EDIT: I should add that I used tf.train.AdamOptimizer(0.1) as the optimizer. In fact, here is the script that I used


Thanks for your help!

I did some further testing and it turns out that the only thing that determines whether KLqp finds the global minimum is the initial value for the qlog_sd scale, which needs to be very small. The ADAM optimizer and the initial value for the other variables don’t seem to matter.

I’m not quite sure what to make of this. One implication may be that it’s always a good idea to refit the same model with many different initial parameters and then choose the one with the lowest loss. It seems to me that this strategy kind of defeats the purpose of VI, though, which is supposed to be faster than other approaches.