# Bernoulli distribution parameterized by a mapped Gaussian

Hello,

I am trying to do inference on a hierarchical model. Each observation `h_ij` is assumed to follow a Bernoulli distribution parameterized by `exp(-x_ij ** 2)`, where each `x_ij` is assumed to come from a Normal centered at `A.dot(z)`, where `A` is a parameter that maps samples `z` from a K-dimensional space to a D-dimensional space. These latent `z`s are assumed to follow a Normal distribution. My goal is to infer `x`, `z`, and `A`. For this, I am using Variational Inference, but my losses quickly (typically after the first iteration) go to `nan` or `inf`.

What could be the cause of this problem? Below is the code I am using.

```python
import edward as ed
import tensorflow as tf
from edward.models import Bernoulli, Normal

A = Normal(loc=tf.zeros([D, K]), scale=2.0 * tf.ones([D, K]))
z = Normal(loc=tf.zeros([N, K]), scale=tf.ones([N, K]))
x = Normal(loc=tf.matmul(A, z, transpose_b=True), scale=tf.ones([D, N]))
h = Bernoulli(probs=tf.exp(-x**2))

sess = ed.get_session()
x_train, h_train, z_train, A_true = sess.run([x, h, z, A])

qA = Normal(loc=tf.Variable(tf.random_normal([D, K])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([D, K]))))
qz = Normal(loc=tf.Variable(tf.random_normal([N, K])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([N, K]))))
qx = Normal(loc=tf.Variable(tf.random_normal([D, N])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([D, N]))))

inference = ed.KLqp({A: qA, z: qz, x: qx}, data={h: h_train})
inference.run(n_iter=500, n_print=100, n_samples=10)
```

The scale can become exactly 0 numerically, because softplus underflows to 0 for very large negative inputs. A zero scale destroys the Gaussian density.

A workaround is to lower-bound the scale, e.g. set `scale=tf.nn.softplus(x) + 1e-8`.
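To see the underflow concretely, here is a minimal plain-Python sketch (using `math` instead of TensorFlow, with a naive `softplus` helper defined here for illustration) of the scale collapsing to exactly zero:

```python
import math

def softplus(v):
    # naive softplus: log(1 + e^v); exp underflows to 0.0 for very negative v
    return math.log1p(math.exp(v))

print(softplus(-800.0))         # exactly 0.0 -> a Normal scale of 0 breaks the density
print(softplus(-800.0) + 1e-8)  # lower-bounded: stays strictly positive
```

The same collapse happens in float32 TensorFlow at much less extreme inputs, which is why the `+ 1e-8` guard matters in practice.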

hth,
justin

Thanks, but the losses are still `nan` after adding that tiny constant to the approximating distribution's scales.

If I run the `inference` with `debug=True`, I get an exception saying `InvalidArgumentError: Bernoulli/logits/Log:0 : Tensor had Inf values`. How do I deal with this?

That error says Bernoulli's log prob is going to infinity.

Perhaps try writing the Bernoulli with its logits parameterization? You can also look into clipping the Bernoulli's parameter values with `tf.maximum`/`tf.minimum`.
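For reference, the probs-to-logits mapping for this model is logit(p) = log p - log(1 - p) with p = exp(-x²). A plain-Python sketch (the helper name is hypothetical) shows where the infinity comes from:

```python
import math

def bernoulli_logits(x):
    # p = exp(-x^2) is the model's Bernoulli prob; logit(p) = log(p / (1 - p))
    p = math.exp(-x * x)
    return math.log(p) - math.log1p(-p)

print(bernoulli_logits(1.0))  # finite for x != 0
# note: at x = 0, p = 1 and the logit diverges -- exactly the Inf the error reports
```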

Hey @dustin, thanks for the response. Using the logits reparameterization doesn't solve the problem, and truncating the (probability) parameters of the Bernoulli doesn't either.

Regarding the last solution, I added this line to the model specification part:

```python
TINY = 1e-4
p0 = tf.maximum(0.0 + TINY, tf.minimum(1.0 - TINY, tf.exp(-x**2)))
h = Bernoulli(probs=p0)
```

meaning the probabilities would never be 0.0 or 1.0. If I run with `debug=False`, the losses go up and down (no longer `nan`), which I suppose means that inference is not converging. If I set `debug=True`, I get the same exception as before!

I really have no idea what could be wrong here.

Also, perhaps this has something to do with it: when I run inference, I get this warning:

```
/edward-venv/lib/python3.5/site-packages/edward/util/random_variables.py:51: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  not np.issubdtype(value.dtype, np.int) and \
```

Perhaps this is unrelated to your problem, but it seems to me that `qA` and `qz` are disconnected from the rest of the model: they only receive a gradient from their respective KL terms with `A` and `z`. Here is why I think that is happening. In `ed.KLqp`, `h` will be copied so that `x` is replaced by `qx`. `qx` is not connected to `qA` or `qz`, so these will not receive any gradient from the data. The only other place where `qA` and `qz` could receive gradients (aside from their own KL terms) is KL[qx || x], but this term also isn't connected to `qA` or `qz`, since neither `qx` nor `x` is.

@bkayalibay Not sure what you mean. How would you take that into consideration in Edward code?

We can look at this code snippet from `edward/inferences/klqp.py`:

```python
for x in six.iterkeys(inference.data):
  if isinstance(x, RandomVariable):
    x_copy = copy(x, dict_swap, scope=scope)
    p_log_lik[s] += tf.reduce_sum(
        inference.scale.get(x, 1.0) * x_copy.log_prob(dict_swap[x]))
```

This is where log p(h | x, z, A) is calculated. `x_copy` isn't connected to `qz` or `qA` in any way, so the TensorFlow variables behind those random variables can't get any gradient from this term. You can verify this by manually making a copy of `h` and inspecting the TensorFlow variables it depends on:

```python
h_copy = ed.copy(h, {A: qA, z: qz, x: qx})
ed.get_variables(h_copy)
```

Next we can look at the point where the KL terms are calculated:

```python
kl_penalty = tf.reduce_sum([
    tf.reduce_sum(inference.kl_scaling.get(z, 1.0) * kl_divergence(qz, z))
    for z, qz in six.iteritems(inference.latent_vars)])
```

This is just a sum of three KL terms: KL[qA || A] + KL[qz || z] + KL[qx || x]. Here, neither `qx` nor `x` depends on `qA` and `qz`. In fact, KL[qA || A] + KL[qz || z] is the only component of the loss function that depends on `qA` and `qz`. In other words, `qA` and `qz` currently have no influence on the inference problem.

I'm not exactly familiar with the model you are working on, so I may be completely off the mark here, but I see two ways of solving this problem:

1. Explicitly copying `x` so that it depends on `qA` and `qz`, and using this copy during inference:
   ```python
   x_copy = ed.copy(x, {A: qA, z: qz})
   inference = ed.KLqp({A: qA, z: qz, x_copy: qx}, data={h: h_train})
   ```
2. Using the class `ed.ReparameterizationKLqp` instead of `ed.KLqp`. This class will internally do the copying that was done manually in option 1. Mind, however, that `ed.ReparameterizationKLqp` uses a Monte Carlo estimate of the KL divergence rather than the analytical solution, so this option might have noisier gradients than option 1.

I have to admit that I'm not sure whether these options are theoretically sound. They are just two ways I see of making `qA` and `qz` part of the inference process, from a purely computational perspective.

Thanks @bkayalibay. I get what you mean. However, while inference seems to converge now, the estimated `A` is always quite bad, using either of the solutions you presented.

Also, `h` has `x` as a parent, which in turn has `A` and `z` as parents, so `h` definitely is connected to `A` and `z`. I don't think `h` should have all of `x`, `A` and `z` as direct parents.

Hey,

I just tried out your example with `D = 2; K = 3; N = 5` and got the behaviour you described right away. I then changed

`h = Bernoulli(probs=tf.exp(-x**2))`

to

`h = Bernoulli(probs=tf.exp(-(x ** 2 + 1e-4)))`.

This makes sure that the prob never gets exactly 1. After that, the example ran for >5000 updates with no NaN issues.
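For what it's worth, the epsilon shift caps the probability strictly below 1 even in the worst case, as a quick plain-Python check shows:

```python
import math

# worst case is x = 0: without the shift, exp(-0) = 1.0 and logit(p) blows up
max_prob = math.exp(-(0.0 ** 2 + 1e-4))
print(max_prob)  # ~0.99990, strictly below 1, so the Bernoulli log-prob stays finite
```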
