In your model, the mean of the likelihood for x is given by a Poisson variable z, and that Poisson variable itself depends on the parameters you aim to optimize over. Gradient-based optimization therefore fails: it tries to apply the chain rule through the Poisson sample, and a discrete sample has no usable gradient with respect to its rate.

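To use MAP on such a model, you need to write x as a compound random variable that analytically marginalizes out the discrete variable, i.e., the density is given by $p(x \mid \theta) = \sum_{z=0}^{\infty} p(x \mid z, \theta)\, p(z \mid \theta)$, where $\theta$ denotes the parameters you optimize over.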
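As a rough illustration of what marginalizing out z involves, here is a minimal sketch in plain Python. It assumes a specific model that may not match yours exactly: z ~ Poisson(rate) and x | z ~ Normal(z, sigma). The function names and the truncation point `z_max` are my own choices, and the infinite sum is cut off at `z_max`, so this is a numerical approximation rather than a closed form. To optimize it in Edward you would have to express the same sum with TensorFlow ops.

```python
import math


def log_poisson_pmf(z, rate):
    # log Poisson(z; rate) = z * log(rate) - rate - log(z!)
    return z * math.log(rate) - rate - math.lgamma(z + 1)


def log_normal_pdf(x, mean, sigma):
    # log Normal(x; mean, sigma)
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((x - mean) / sigma) ** 2)


def log_marginal(x, rate, sigma, z_max=100):
    # Assumed model: z ~ Poisson(rate), x | z ~ Normal(z, sigma).
    # log p(x) = log sum_{z=0}^{z_max} Poisson(z; rate) * Normal(x; z, sigma),
    # evaluated with the log-sum-exp trick for numerical stability.
    terms = [log_poisson_pmf(z, rate) + log_normal_pdf(x, z, sigma)
             for z in range(z_max + 1)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

Because the result is a smooth function of rate and sigma, a gradient-based optimizer never has to differentiate through a discrete sample.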

I edited the post so that I am now doing inference, but it still doesn’t converge. Why does it run but fail?

Can you give me some guidance on what I should write to be able to do inference? There is no closed-form analytical solution for the sum you have just written!

Is the problem the same as in Stan with discrete random variables? (Stan can’t sample from them; here the problem is computing the gradient.)

This ultimately comes down to what you’re trying to model. Namely, is z observed or latent?

If z is observed as in the edited first post, then MAP works fine. You can double check that it converges to the true parameters using large enough simulated data. If it doesn’t, can you provide a minimal working example?
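The simulated-data check can be sketched in plain Python as follows. This assumes the simplified model z ~ Poisson(rate), x | z ~ Normal(z, sigma) with flat priors, so that the MAP estimate coincides with the maximum likelihood estimate and is available in closed form. The helper names are hypothetical and not part of Edward.

```python
import math
import random


def sample_poisson(rate, rng):
    # Knuth's multiplication method; adequate for small rates.
    limit = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= limit:
            return k - 1


def simulate(n, rate, sigma, rng):
    # Assumed model: z ~ Poisson(rate), x | z ~ Normal(z, sigma).
    z = [sample_poisson(rate, rng) for _ in range(n)]
    x = [rng.gauss(zi, sigma) for zi in z]
    return x, z


def fit_observed_z(x, z):
    # With z observed and flat priors, MAP equals the MLE:
    # rate is the mean of z, sigma is the RMS of the residuals x - z.
    rate_hat = sum(z) / len(z)
    sigma_hat = math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)) / len(x))
    return rate_hat, sigma_hat
```

Simulating, say, 20000 points with known rate and sigma and comparing them with the output of `fit_observed_z` gives a reference answer that an Edward MAP run on the same data should approach.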

If z is latent as in the unedited first post, then inference is a lot more difficult because of the intractable likelihood. You have to use ABC methods, which could be as crude as approximating the likelihood. Alternatively, you can perform posterior inference by inferring z as well; this makes the likelihood and priors all tractable but changes the modeling problem.
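To make the ABC option concrete, here is a minimal rejection-ABC sketch in plain Python. Everything specific in it is an assumption for illustration: the model z ~ Poisson(rate), x | z ~ Normal(z, sigma) with sigma treated as known, a Uniform(0, rate_max) prior on rate, the sample mean as the summary statistic, and the tolerance `tol`. It is deliberately crude and is not an Edward API.

```python
import math
import random


def sample_poisson(rate, rng):
    # Knuth's multiplication method; adequate for small rates.
    limit = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= limit:
            return k - 1


def simulate_mean(n, rate, sigma, rng):
    # Simulate n points from the assumed model and return their mean.
    total = 0.0
    for _ in range(n):
        total += rng.gauss(sample_poisson(rate, rng), sigma)
    return total / n


def abc_rejection(x_obs, sigma, n_draws, tol, rng, rate_max=10.0):
    # Draw rate from the prior, simulate a data set of the same size, and
    # keep the draw if the simulated mean is within tol of the observed mean.
    obs_mean = sum(x_obs) / len(x_obs)
    accepted = []
    for _ in range(n_draws):
        rate = rng.uniform(0.0, rate_max)
        if abs(simulate_mean(len(x_obs), rate, sigma, rng) - obs_mean) < tol:
            accepted.append(rate)
    return accepted
```

The accepted draws approximate the posterior over rate. A smaller `tol` gives a better approximation at the cost of fewer accepted samples.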

If z is observed, then Stan and Edward can both easily handle the problem. If z is latent, Stan doesn’t work as you note, but Edward does.

I can write the variational inference problem and compute update rules for (z, rate, mu, sigma). I am trying to write down different Edward algorithms and they do not seem to converge to the right solution (probably because I am new to Edward).

For example, by doing
qz = Empirical(tf.nn.softplus(tf.Variable(tf.random_normal([size, size, 1]))))
and
inference = ed.MAP({rate:qrate, mu:emu, sigma:qsigma}, data={x: x_train, z:qz})
I thought I was doing EM correctly.

Can you post any algorithm (MAP, VB, EM) that would perform inference in the problem?

That ed.MAP inference only specifies the M-step. You still need to specify an inference algorithm for the E-step, and then alternate between the two during training. One approach could be

where qz is a Poisson approximating family. If you want Monte Carlo for the E-step, you have to use a non-gradient-based MCMC algorithm such as Metropolis-Hastings, because z is discrete.