Rookie problem (KLqp gets obviously wrong result)

I just started looking at Edward. It looks very nice. To those of you who had a hand in creating it, thank you very much!

Unfortunately I am having a problem with making the inferencing work, even in a very simple case. I tried running the following code:

import tensorflow as tf
import edward as ed
from edward.models import Normal

print ed.__version__
print tf.__version__

# Generative model
A = Normal(0., 1., name='A')
B = Normal(A, 1., name='B')
C = Normal(A, 1., name='C')

# Variational model
mu = tf.Variable(0., name='mu')
sigma = tf.Variable(1., name='sigma')
qB = Normal(mu, sigma, name='qB')
with tf.Session() as sess:
    inference = ed.KLqp({B: qB}, {C: 100.})
    inference.run()
    print mu.eval(), sigma.eval()

The output is

1.3.3
1.3.0
1000/1000 [100%] ██████████████████████████████ Elapsed: 1s | Loss: 5024.725
0.0159369 1.0

Given that C is observed to be 100, the posterior of B should be centered at 50, isn’t it? Edward does not seem to get anywhere close to figuring this out. (I just posted this particular example but I also tried less extreme ones interactively and the inference result always seemed random.)

I am assuming that I am doing something basic wrong. Please advise. Thank you!

You wrote a DAG A -> {B, C}, which implies B and C are conditionally independent given A. If you’d like to see how C’s observations affect the distribution of B, you need to simultaneously infer A and B.

Thank you very much! I understand now what you meant by “Latent variables can be defined in the model without any posterior inference over them. They are implicitly marginalized out with a single sample” in your paper “Edward: A library for probabilistic modeling, inference, and criticism” :slight_smile:

This leaves me with the problem of how to express my inference. I would like to get an approximation that captures the correlation between A and B. (I think the terminology is that I would like “full-rank” rather than “mean-field” variational inference.) The obvious way to do that would be to create a bivariate normal qAB (thus, with a total of five inference parameters: two for the means and three for the covariance matrix). But I don’t know how to express that in the dictionary syntax where the two keys A and B have to map to qAB jointly.

This is of course only a model problem for the actual problems I would like to solve with Edward. I can analytically solve for the posterior of A and B given C in this toy example. However, I foresee wanting to use Edward in a number of situations where I’d like to get correlation information about unknowns that are naturally expressed as different variables.

Thank you!

For VI algorithms which require a parametric approximating family, you need to write the joint normal as factorized, either as q(A | B)q(B) or conversely. For inference algorithms such as MCMC, the drawn samples will always be correlated.

That makes sense, thank you! Now I think I understand it in theory.

In practice, there seems to be a problem still. For my original example, I implemented your original suggestion as follows:

import tensorflow as tf
import edward as ed
from edward.models import Normal

# Generative model
A = Normal(0., 1., name='A')
B = Normal(A, 1., name='B')
C = Normal(A, 1., name='C')

# Variational model
muA = tf.Variable(0., name='muA')
qA = Normal(muA, 1., name='qA')

muB = tf.Variable(0., name='muB')
qB = Normal(muB, 1., name='qB')

with tf.Session() as sess:
    inference = ed.KLqp({A: qA, B: qB}, {C: 4.})
    inference.run()
    print muA.eval(), muB.eval()

(To further clarify the example, I replaced all variances by 1, and reduced the C-observation from 100 to 4 to make sure that there are no problems caused by being in the extreme tail of the distribution.) Conditioning upon C = 4 I would expect the means of both A and B to be 2. The output I got (using edward 1.3.4 and tensorflow 1.3.0) was:

1000/1000 [100%] ██████████████████████████████ Elapsed: 2s | Loss: 13.069
1.9406 0.0122407

Thus it seems that the mean of A is correctly inferred to be more-or-less 2, but still it gives me 0 for B.

What’s going on here? How should I change this example to correctly infer the mean of B?

Thank you very much in advance!

Hi Iris,

I don’t believe you did anything wrong and that the issue is in the KLqp inference code. When both of the latent variables are Gaussian, the code tries to use the analytic form of the KL divergence, which does not work in this case. The ReparameterizationKLqp class skips this check and will work.

Try:

inference = ed.ReparameterizationKLqp({A: qA,B: qB}, {C: 4.})
inference.run()
print(muA.eval(), muB.eval())

The output I get:

1000/1000 [100%] ██████████████████████████████ Elapsed: 2s | Loss: 7.463
1.88677 2.21795

You can get more accurate results by asking for more samples to estimate the gradients:

inference = ed.ReparameterizationKLqp({A: qA,B: qB}, {C: 4.})
inference.initialize(n_samples=100)
inference.run()
print(muA.eval(), muB.eval())

which gives:

1000/1000 [100%] ██████████████████████████████ Elapsed: 18s | Loss: 5.434
2.0846 2.01117

Cheers,
Guy

Good catch! Yes, it looks like the issue is in this line. We should be calculating KL’s in expectation over multiple samples to account for dependent q’s.

That does it. Thank you very much!