Dear Edward users,

I'm trying to use KLqp to estimate an adaptive shrinkage model and noticed that the outcome variance was consistently overestimated. It turns out that the same problem occurs if we fit a simple Bayesian linear regression model where the signal-to-noise ratio is high.

The following script reproduces the problem:

```
import tensorflow as tf
import edward as ed
import numpy as np
from edward.models import Normal
ed.set_seed(42)
def build_toy_dataset(N, w, noise_sd=0.1, data_sd=1):
    D = len(w)
    x = np.random.normal(0, data_sd, size=(N, D))
    y = np.dot(x, w) + np.random.normal(0, noise_sd, size=N)
    return x, y
### Generate the data
# Note that data_sd >> noise_sd
N = 1000
D = 5
w_true = np.random.randn(D)
noise_sd = 0.1
data_sd = 5
X_train, y_train = build_toy_dataset(N, w_true, noise_sd=noise_sd, data_sd=data_sd)
### Define the model
X = tf.placeholder(tf.float32, [N, D])
w = Normal(loc=tf.zeros(D), scale=tf.ones(D))
b = Normal(loc=tf.zeros(1), scale=tf.ones(1))
log_sd = Normal(loc=tf.zeros(1), scale=tf.ones(1))
y = Normal(loc=ed.dot(X, w) + b, scale=tf.exp(log_sd) * tf.ones(N))
qw = Normal(loc=tf.get_variable("qw/loc", [D]),
            scale=tf.nn.softplus(tf.get_variable("qw/scale", [D])))
qb = Normal(loc=tf.get_variable("qb/loc", [1]),
            scale=tf.nn.softplus(tf.get_variable("qb/scale", [1])))
qlog_sd = Normal(loc=tf.get_variable("qlog_sd/loc", [1]),
                 scale=tf.nn.softplus(tf.get_variable("qlog_sd/scale", [1])))
### Variational Inference
# Many samples and iterations, just to be sure
inference = ed.KLqp({w: qw, b: qb, log_sd: qlog_sd}, data={X: X_train, y: y_train})
inference.run(n_samples=100, n_iter=5000)
### Print estimates
sess = ed.get_session()
# w is unbiased...
print("w hat: {}".format(sess.run(qw.mean())))
print("w true: {}".format(w_true))
# sd is overestimated..
print("sd hat: {}".format(sess.run(tf.exp(qlog_sd.mean()))))
print("sd true: {}".format(noise_sd))
```

This script yields:

```
w hat: [ 0.4973788 -0.13857569 0.6483919 1.52445376 -0.2352934 ]
w true: [ 0.49671415 -0.1382643 0.64768854 1.52302986 -0.23415337]
sd hat: [ 0.53719109]
sd true: 0.1
```
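For reference, an ordinary least-squares fit on data generated the same way recovers the noise sd almost exactly, so the data themselves pin down the variance; the issue seems specific to the inference. A quick NumPy sanity check (note: this draws a fresh dataset with its own seed rather than reusing the Edward one, so the numbers won't match the script above exactly):

```python
import numpy as np

np.random.seed(0)
N, D = 1000, 5
w_true = np.random.randn(D)
X = np.random.normal(0, 5, size=(N, D))            # data_sd = 5
y = X @ w_true + np.random.normal(0, 0.1, size=N)  # noise_sd = 0.1

# OLS with an intercept column; the residual sd is the MLE of the noise sd
A = np.column_stack([X, np.ones(N)])
coef, rss, *_ = np.linalg.lstsq(A, y, rcond=None)
sd_hat = np.sqrt(rss[0] / N)
print("OLS sd hat: {:.4f}".format(sd_hat))  # close to 0.1
```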

I've played around with different specifications, but the problem persists: I used an InvGamma prior on the variance instead of the log-normal, I used softplus instead of exp to map the variance variable onto the positive real line, I used score-function gradients to fit the model, and I increased N, the number of samples, and the number of iterations… no luck. What *does* work is MAP inference, but I'd like to use VI.

Also, interestingly, the problem vanishes if we decrease the variance of the predictors. Specifically, if we set `data_sd = 1` in the above example, the variance is estimated correctly.

Any ideas what's happening here? Many thanks!