KLqp disturbed by irrelevant distribution?

Hi everybody,

I’m trying to understand the behavior of KLqp, so I set up a minimal example illustrating the behavior that confuses me. The setup is basically a trivial regression problem in which the input and output data differ by a constant (plus normally distributed noise).
The only additional difficulty for the algorithm is to infer the scale of the noise. Here is the full code, which does what I expect it to do:

import tensorflow as tf
import edward as ed
from edward.models import Normal, InverseGamma
import numpy as np

X = tf.placeholder(tf.float32, [None, 1])

sigma = InverseGamma(concentration=tf.ones((1,)), rate=tf.ones((1,)))
delta = Normal(loc=tf.ones((1,)), scale=tf.ones((1,)))

output = Normal(loc=X + delta, scale=sigma)
#output = Normal(loc=X + delta, scale=0.1)

# create variational distributions
q_concentration = tf.Variable(tf.zeros((1,)))
q_rate = tf.Variable(tf.zeros((1,)))
# exponentiate the variables to ensure positivity
q_sigma = InverseGamma(concentration=tf.exp(q_concentration),
                       rate=tf.exp(q_rate))

q_loc = tf.Variable(tf.zeros((1,)))
q_scale = tf.Variable(tf.zeros((1,)))
# exponentiate the scale to ensure positivity
q_delta = Normal(loc=q_loc, scale=tf.exp(q_scale))

# create training data
X_train = np.random.randn(1000, 1)
Y_train = X_train + 1.25 + np.random.randn(1000, 1)*0.1

# start inference
inference = ed.KLqp({sigma: q_sigma,
                     delta: q_delta},
                    data={X: X_train,
                          output: Y_train})
inference.run(logdir="log", n_iter=10000)

# print results
print("delta: mean", q_delta.mean().eval(), "std", np.sqrt(q_delta.variance().eval()))
print("sigma: mean", q_sigma.mean().eval(), "std", np.sqrt(q_sigma.variance().eval()))
print("sigma: concentration", q_sigma.concentration.eval(), "rate" , q_sigma.rate.eval())

The result is spot-on, as indicated by the output:

delta: mean [1.2460464] std [0.00939288]
sigma: mean [0.1000579] std [0.02421493]
sigma: concentration [19.074036] rate [1.8084501]

The algorithm recovers the constant shift and the scale of the noise rather well.

Now the only change I make is from

output = Normal(loc=X + delta, scale=sigma)
#output = Normal(loc=X + delta, scale=0.1)

to

#output = Normal(loc=X + delta, scale=sigma)
output = Normal(loc=X + delta, scale=0.1)

So I set the scale of the noise to its true value. Presumably, this should make the algorithm's life easier: there is no interaction between sigma and delta anymore, and the posterior distribution of sigma just needs to recover the prior. But now I get the output

delta: mean [1.2473155] std [0.05659465]
sigma: mean [nan] std [nan]
sigma: concentration [1.] rate [1.]

So the standard deviation of the estimate of delta increased by a factor of about 6. Shouldn't it be lower, or at least of roughly the same magnitude? (I assume the nan values for sigma are simply because an InverseGamma with concentration 1 has no finite mean or variance, so that part doesn't worry me.) In a more complicated model this effect completely ruins the estimates of my latent variables, so I need to get to the bottom of it.
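
For reference, here is the posterior I would roughly expect for delta when the noise scale is fixed at 0.1. This is only a back-of-the-envelope conjugate normal-normal calculation under my reading of the model (prior delta ~ N(1, 1), 1000 observations with known noise scale 0.1), not part of the Edward code:

# back-of-the-envelope conjugate check (not part of the Edward model):
# prior delta ~ N(1, 1), likelihood y_i - x_i ~ N(delta, 0.1), N = 1000
prior_var = 1.0
noise_var = 0.1**2
N = 1000
post_std = (1.0 / (1.0 / prior_var + N / noise_var))**0.5
print(post_std)  # roughly 0.00316

So analytically I would expect the posterior standard deviation of delta to be around 0.003, i.e. certainly not larger than in the first run, which makes the 0.057 above even more puzzling.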

If I take sigma out of the inference by changing

# start inference
inference = ed.KLqp({sigma: q_sigma,
                     delta: q_delta},
                    data={X: X_train,
                          output: Y_train})

to

# start inference
inference = ed.KLqp(
                    {delta: q_delta},
                    data={X: X_train,
                          output: Y_train})

the results get better again, but are still slightly worse than in the original setup:

delta: mean [1.2501105] std [0.01766331]
sigma: mean [nan] std [nan]
sigma: concentration [1.] rate [1.]
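
One thing I still plan to try is increasing the number of samples used for the stochastic gradient estimate, in case the remaining spread is just Monte Carlo noise. If I understand the API correctly, that would be something like the following (I have not verified yet whether it actually helps):

# use more than one sample per iteration for the ELBO gradient estimate
inference.run(n_samples=10, n_iter=10000)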

I am using Edward 1.3.5 and TensorFlow 1.7.0 with Python 3.6.3 (WinPython distribution).

Any help is appreciated. Thanks