Bernoulli distribution parameterized by a mapped Gaussian

pedrofale · January 18, 2018, 3:25pm

Hello,

I am trying to do inference on a hierarchical model. Each observation h_ij is assumed to follow a Bernoulli distribution parametrized by exp(-x_ij ** 2), where each x_ij is assumed to have come from a Normal centered at A.dot(z) where A is a parameter that maps samples z from a K-dimensional space to a D-dimensional space. These latent zs are assumed to follow a Normal distribution. My goal is to infer x, z and A. For this, I am using Variational Inference, but my losses quickly (after the first iteration typically) go to nan or inf.

What could be the cause of this problem? Below is the code I am using.

A = Normal(loc=tf.zeros([D, K]), scale=2.0 * tf.ones([D, K]))
z = Normal(loc=tf.zeros([N, K]), scale=tf.ones([N, K]))
x = Normal(loc=tf.matmul(A, z, transpose_b=True), scale=tf.ones([D, N]))
h = Bernoulli(probs=tf.exp(-x**2))

sess = ed.get_session()
x_train, h_train, z_train, A_true = sess.run([x, h, z, A])

qA = Normal(loc=tf.Variable(tf.random_normal([D, K])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([D, K]))))
qz = Normal(loc=tf.Variable(tf.random_normal([N, K])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([N, K]))))
qx = Normal(loc=tf.Variable(tf.random_normal([D, N])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([D, N]))))

inference = ed.KLqp({A: qA, z: qz, x: qx}, data={h: h_train})
inference.run(n_iter=500, n_print=100, n_samples=10)

bayerj · January 21, 2018, 10:02am

The scale can get 0 numerically, because the softplus is 0 for very large, negative numbers. This can destroy the Gaussian density.

Workaround is to lower bound the scale and e.g. set scale=softplus(x) + 1e-8.

hth,
justin

pedrofale · January 21, 2018, 6:43pm

Thanks, but the losses are still nan after adding that tiny constant to the approximating distribution’s scales.

If I run the inference with debug=True, I get an exception saying InvalidArgumentError: Bernoulli/logits/Log:0 : Tensor had Inf values. How do I deal with this?

dustin · January 22, 2018, 8:40am

That error says Bernoulli’s log prob is going to infinity.

Perhaps try writing Bernoulli with its logits parameterization? You can also look into truncating the parameter values of the Bernoulli with tf.maximum/tf.minimum.

pedrofale · January 22, 2018, 1:29pm

Hey @dustin, thanks for the response. Using the logits reparameterization doesn’t solve the problem, and truncating the (probability) parameters of the Bernoulli doesn’t either.

Regarding the last solution, I added this line to the model specification part:

TINY = 1e-4
p0 = tf.maximum(0. + TINY, tf.minimum(1.0 - TINY, tf.exp(-x**2)))
h = Bernoulli(probs=p0)

meaning the probabilities would never be 0.0 nor 1.0. If I run with debug=False, I see the losses go up and down (not nan anymore), which I suppose means that inference is not converging. If I set debug=True, I get the same exception as before!

I really have no idea what could be wrong here.

pedrofale · January 22, 2018, 2:58pm

Also, perhaps this has something to do with it: when I run inference, I get this warning:

/edward-venv/lib/python3.5/site-packages/edward/util/random_variables.py:51: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  not np.issubdtype(value.dtype, np.int) and \

bkayalibay · January 23, 2018, 2:59pm

Perhaps this is unrelated to your problem but it seems to me that qA and qz are disconnected from the rest of the model. What I mean is that they are only receiving a gradient from their respective kl-terms with A and z. Here is why I think that is happening: In ed.KLqp, h will be copied so that x is replaced by qx. qx is not connected to qA or qz so these will not receive any gradient from the data. The only other place where qA and qz could receive gradients (aside from their own kl-terms) is KL[qx || x], but this term also isn’t connected to qA or qz since neither qx nor x is.

pedrofale · January 23, 2018, 4:03pm

@bkayalibay Not sure what you mean. How would you take that into consideration in Edward code?

bkayalibay · January 23, 2018, 6:43pm

We can look at this code snippet from edward/inferences/klqp.py:

for x in six.iterkeys(inference.data):
   if isinstance(x, RandomVariable):
      x_copy = copy(x, dict_swap, scope=scope)
      p_log_lik[s] += tf.reduce_sum(
         inference.scale.get(x, 1.0) * x_copy.log_prob(dict_swap[x]))

This is where log[p(h | x, z, A)] is calculated. x_copy isn’t connected to qz or qA in any way, so the tensorflow variables related to these random variables also can’t get any gradient from this term. You can verify this by manually making a copy of h and looking at the tensorflow variables that it depends on:

h_copy = ed.copy(h, {A: qA, z: qz, x: qx})
ed.get_variables(h_copy)

Next we can look at the point where the kl terms are calculated:

kl_penalty = tf.reduce_sum([
      tf.reduce_sum(inference.kl_scaling.get(z, 1.0) * kl_divergence(qz, z))
      for z, qz in six.iteritems(inference.latent_vars)])

This is just a sum of three kl terms: KL[qA || A] + KL[qz || z] + KL[qx || x]. Here, neither qx nor x depend on qA and qz. In fact, the part " KL[qA || A] + KL[qz || z]" is the only component of the loss function that depends on qA and qz. In other words, qA and qz are having no influence on the inference problem right now.

I’m not exactly familiar with the model that you are working on so I may be completely off the mark here but I see two ways of solving this problem:

Explicitly copying x so that it depends on qA and qz and using this copy during inference:

x_copy = ed.copy(x, {A: qA, z: qz})
inference = ed.KLqp({A: qA, z: qz, x_copy: qx}, data={h: h_train})

Using the class ed.ReparameterizationKLqp instead of ed.KLqp. This class will internally do the copying that was done manually in option 1. Mind however that ed.ReparameterizationKLqp uses the monte carlo estimate of the kl divergence rather than the analytical solution, so this option might have more noisy gradients than option 1.

I have to admit that I’m not sure if these options are theoretically sound. They are just two ways that I see of making qA and qz part of the inference process from a purely computational perspective.

pedrofale · January 26, 2018, 11:25am

Thanks @bkayalibay. I get what you mean. However, while inference seems to converge now, the estimated A is always quite bad, using either of the solutions you presented.

pedrofale · January 26, 2018, 12:42pm

Also, h has x as parent, which has A and z as parents. So h definitely is connected to A and z. I don’t think h should have all x, A and z as parents.

bayerj · February 11, 2018, 8:32pm

Hey,

I just tried out your example with D = 2; K = 3; N = 5 and got the behaviour you described right away. I then changed

h = Bernoulli(probs=tf.exp(-x**2))

to

h = Bernoulli(probs=tf.exp(-(x ** 2 + 1e-4))).

This make sure that the prob never gets exactly 1. After that, the example ran for >5000 updates with on NaN issues.

Topic		Replies	Views
Simple Hierarchical Distribution	2	1550	October 12, 2017
Beta binomial example	5	1876	August 14, 2017
Combining Bernoulli variables	3	1133	July 20, 2017
Inf and nan loss in Stochastic Volatility Models	2	2192	July 30, 2017
KLqp with Beta/Bernoulli (Coin Flip example)	1	979	September 13, 2018

Bernoulli distribution parameterized by a mapped Gaussian

Related topics