Bernoulli distribution parameterized by a mapped Gaussian



I am trying to do inference in a hierarchical model. Each observation h_ij is assumed to follow a Bernoulli distribution parameterized by exp(-x_ij ** 2), where each x_ij is assumed to come from a Normal centered at (A z^T)_ij, where A is a parameter that maps samples z from a K-dimensional space to a D-dimensional space. The latent zs are assumed to follow a Normal distribution. My goal is to infer x, z and A. For this I am using variational inference, but my losses quickly (typically after the first iteration) go to nan or inf.

What could be the cause of this problem? Below is the code I am using.

A = Normal(loc=tf.zeros([D, K]), scale=2.0 * tf.ones([D, K]))
z = Normal(loc=tf.zeros([N, K]), scale=tf.ones([N, K]))
x = Normal(loc=tf.matmul(A, z, transpose_b=True), scale=tf.ones([D, N]))
h = Bernoulli(probs=tf.exp(-x**2))

sess = ed.get_session()
x_train, h_train, z_train, A_true =[x, h, z, A])

qA = Normal(loc=tf.Variable(tf.random_normal([D, K])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([D, K]))))
qz = Normal(loc=tf.Variable(tf.random_normal([N, K])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([N, K]))))
qx = Normal(loc=tf.Variable(tf.random_normal([D, N])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([D, N]))))

inference = ed.KLqp({A: qA, z: qz, x: qx}, data={h: h_train}), n_print=100, n_samples=10)


The scale can become 0 numerically, because softplus underflows to 0 for large negative inputs. This can destroy the Gaussian density.

A workaround is to lower-bound the scale, e.g. set scale=tf.nn.softplus(x) + 1e-8.
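To see the underflow concretely, here is a small sketch in plain NumPy (not TensorFlow, just to illustrate the numerics of the naive softplus versus the lower-bounded one):

```python
import numpy as np

def softplus(x):
    # Naive softplus, log(1 + exp(x)). For very negative x, exp(x)
    # underflows to exactly 0.0 in float32 and the result is exactly 0.
    return np.log1p(np.exp(x))

raw = np.float32(-200.0)           # e.g. an unlucky variational parameter
print(softplus(raw))               # 0.0 -> Normal(scale=0) has no valid density
print(softplus(raw) + 1e-8)        # 1e-8 -> lower-bounded, density stays finite
```

The same effect happens inside tf.nn.softplus, which is why the +1e-8 floor keeps the Gaussian log-density finite.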



Thanks, but the losses are still nan after adding that tiny constant to the approximating distribution’s scales.

If I run the inference with debug=True, I get an exception saying InvalidArgumentError: Bernoulli/logits/Log:0 : Tensor had Inf values. How do I deal with this?


That error says Bernoulli’s log prob is going to infinity.

Perhaps try writing Bernoulli with its logits parameterization? You can also look into truncating the parameter values of the Bernoulli with tf.maximum/tf.minimum.
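For reference, the algebra behind the logits parameterization can be sketched in plain NumPy (this is the math, not Edward code): with p = exp(-x**2), the logits are log(p / (1 - p)) = -x**2 - log1p(-exp(-x**2)). Note this still diverges at x = 0, where p = 1, so x**2 has to be bounded away from 0 either way.

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0])       # sample x values away from 0
p = np.exp(-x ** 2)

# logits = log(p / (1 - p)); log1p handles the 1 - p term.
# At x = 0 we would get p = 1 and logits = +inf, so x**2 must still be
# kept away from 0 (e.g. by an additive epsilon) even in this form.
logits = -x ** 2 - np.log1p(-np.exp(-x ** 2))

# Sanity check: the sigmoid of the logits recovers p.
recovered = 1.0 / (1.0 + np.exp(-logits))
print(np.allclose(recovered, p))    # True
```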


Hey @dustin, thanks for the response. Using the logits reparameterization doesn’t solve the problem, and truncating the (probability) parameters of the Bernoulli doesn’t either.

Regarding the last solution, I added this line to the model specification part:

TINY = 1e-4
p0 = tf.maximum(0. + TINY, tf.minimum(1.0 - TINY, tf.exp(-x**2)))
h = Bernoulli(probs=p0)

meaning the probabilities would never be 0.0 nor 1.0. If I run with debug=False, I see the losses go up and down (not nan anymore), which I suppose means that inference is not converging. If I set debug=True, I get the same exception as before!

I really have no idea what could be wrong here.


Also, perhaps this has something to do with it: when I run inference, I get this warning:

/edward-venv/lib/python3.5/site-packages/edward/util/ FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  not np.issubdtype(value.dtype, np.int) and \


Perhaps this is unrelated to your problem, but it seems to me that qA and qz are disconnected from the rest of the model: they only receive gradients from their respective KL terms with A and z. Here is why I think that is happening. In ed.KLqp, h will be copied so that x is replaced by qx. Since qx is not connected to qA or qz, these receive no gradient from the data. The only other place qA and qz could receive gradients (aside from their own KL terms) is KL[qx || x], but that term isn't connected to qA or qz either, since neither qx nor x is.


@bkayalibay Not sure what you mean. How would you take that into consideration in Edward code?


We can look at this code snippet from edward/inferences/

for x in six.iterkeys(
   if isinstance(x, RandomVariable):
      x_copy = copy(x, dict_swap, scope=scope)
      p_log_lik[s] += tf.reduce_sum(
         inference.scale.get(x, 1.0) * x_copy.log_prob(dict_swap[x]))

This is where log p(h | x, z, A) is calculated. x_copy isn't connected to qz or qA in any way, so the TensorFlow variables behind those random variables can't get any gradient from this term. You can verify this by manually making a copy of h and inspecting the TensorFlow variables it depends on:

h_copy = ed.copy(h, {A: qA, z: qz, x: qx})
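The reachability argument can be illustrated with a toy graph walk in plain Python (hypothetical node names mirroring the swapped Edward graph, not actual Edward calls): starting from h_copy, only qx is reachable.

```python
# Toy model of the computation graph after the swap {A: qA, z: qz, x: qx}:
# each node maps to its parents. h_copy depends only on qx, because the
# swap replaces x (and its parents A, z) with the free-standing qx.
graph = {
    "h_copy": ["qx"],
    "qx": [],            # qx's variables are free parameters
    "qA": [],            # never reached from h_copy
    "qz": [],            # never reached from h_copy
}

def ancestors(node, g):
    """Collect every node reachable from `node` by following parent edges."""
    seen = set()
    stack = [node]
    while stack:
        for parent in g[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(ancestors("h_copy", graph))   # {'qx'} -- qA and qz get no data gradient
```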

Next we can look at the point where the kl terms are calculated:

kl_penalty = tf.reduce_sum([
      tf.reduce_sum(inference.kl_scaling.get(z, 1.0) * kl_divergence(qz, z))
      for z, qz in six.iteritems(inference.latent_vars)])

This is just a sum of three KL terms: KL[qA || A] + KL[qz || z] + KL[qx || x]. Here, neither qx nor x depends on qA and qz. In fact, KL[qA || A] + KL[qz || z] is the only component of the loss function that depends on qA and qz. In other words, qA and qz currently have no influence on the inference problem.

I’m not exactly familiar with the model that you are working on so I may be completely off the mark here but I see two ways of solving this problem:

  1. Explicitly copying x so that it depends on qA and qz and using this copy during inference:
x_copy = ed.copy(x, {A: qA, z: qz})
inference = ed.KLqp({A: qA, z: qz, x_copy: qx}, data={h: h_train})
  2. Using the class ed.ReparameterizationKLqp instead of ed.KLqp. This class will internally do the copying that was done manually in option 1. Mind, however, that ed.ReparameterizationKLqp uses a Monte Carlo estimate of the KL divergence rather than the analytical solution, so this option might have noisier gradients than option 1.

I have to admit that I’m not sure if these options are theoretically sound. They are just two ways that I see of making qA and qz part of the inference process from a purely computational perspective.


Thanks @bkayalibay. I get what you mean. However, while inference seems to converge now, the estimated A is always quite bad, using either of the solutions you presented.


Also, h has x as a parent, which in turn has A and z as parents, so h definitely is connected to A and z. I don't think h should have all of x, A and z as direct parents.



I just tried out your example with D = 2, K = 3, N = 5 and got the behaviour you described right away. I then changed

h = Bernoulli(probs=tf.exp(-x**2))


h = Bernoulli(probs=tf.exp(-(x ** 2 + 1e-4)))

This makes sure that the prob never gets exactly 1. After that, the example ran for >5000 updates with no NaN issues.
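To confirm why the 1e-4 shift helps, here is the bound checked in plain NumPy: with p = exp(-(x**2 + eps)), even the worst case x = 0 gives p = exp(-eps) < 1, so both log p and log(1 - p) in the Bernoulli log-likelihood stay finite.

```python
import numpy as np

eps = 1e-4
x = np.float32(0.0)                   # worst case: x = 0
p = np.exp(-(x ** 2 + eps))           # bounded above by exp(-eps) < 1
print(p)                              # just below 1
print(np.log(p), np.log1p(-p))        # both finite: Bernoulli log-prob is safe
```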