Maximum likelihood estimation


I apologize if this question is naive, but is it possible to perform maximum likelihood estimation over some set of parameters in a model? This would be equivalent to MAP estimation with an improper (flat) prior over the parameters. If so, could someone point me to an example or tutorial where this is demonstrated?

I have examined using Edward’s MAP inference for this task; however, this inference requires the parameters be encoded as a RandomVariable.



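MLE for the model x_i ~ Normal(mu, 1):

import edward as ed
import numpy as np
import tensorflow as tf

np.random.seed(0)  # assumed seed; it reproduces the value of mu printed below
N = 100
mu = np.random.normal()
x = np.random.normal(loc=mu, scale=1, size=(N, 1)).astype(np.float32)

# The mean is a plain tf.Variable (a point estimate); there are no latent variables.
px = ed.models.Normal(loc=tf.Variable(tf.zeros([1])) * tf.ones([N, 1]), scale=tf.ones([1]))
inf = ed.KLqp(data={px: x})
inf.run()  # run the optimization; without this the variable stays at its initial value
print(mu, x.mean(), ed.get_session().run(px.mean()[0, 0]))
1.764052345967664 1.82505 1.82505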

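This works because the objective function in ed.KLqp includes the log-likelihood. With no latent variables, the objective reduces to the log-likelihood itself, so optimizing it over the tf.Variable gives the maximum likelihood estimate.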

You’re probably better off using scipy.optimize to solve maximum likelihood problems. ed.KLqp must use gradient descent (in general its objective function is stochastic, so gradients have to be estimated via sampling), whereas MLEs can often be found more quickly using second-order methods.
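As a rough sketch of the scipy.optimize route for the same Normal(mu, 1) model: the code below minimizes the negative log-likelihood with BFGS, a quasi-Newton method that builds up approximate curvature information. The names neg_log_lik and grad are made up for illustration, and the choice of BFGS is an assumption, not a recommendation from the thread.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 100
mu_true = rng.normal()
x = rng.normal(loc=mu_true, scale=1.0, size=N)

def neg_log_lik(params):
    """Negative log-likelihood of Normal(mu, 1), dropping the additive constant."""
    mu = params[0]
    return 0.5 * np.sum((x - mu) ** 2)

def grad(params):
    """Analytic gradient of neg_log_lik with respect to mu."""
    mu = params[0]
    return np.array([-np.sum(x - mu)])

# BFGS is a quasi-Newton method: it approximates second-order information
# from successive gradients instead of using an exact Hessian.
result = minimize(neg_log_lik, x0=np.zeros(1), jac=grad, method="BFGS")
mu_hat = result.x[0]

# For this model the MLE is the sample mean, so the last two values should agree.
print(mu_true, x.mean(), mu_hat)
```

Because the objective is deterministic and quadratic here, the optimizer should need only a handful of iterations.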


Thank you very much for your response. This answers my question. I am interested in the case in which I would like to approximate a posterior for the latent variables and a point estimate for the parameters.

Upon further investigation, I see that when Edward builds gradients for the loss function, it computes gradients with respect to all variables upstream from the variational random variables in the computation graph.


Yes, that inference algorithm is referred to as VBEM in the docs, although it isn’t iterative like the original formulation (Beal 2003).

The end result should be the same (assuming convergence): a local optimum of the evidence lower bound with respect to the variational parameters and (typically) model hyperparameters.
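To make the joint optimization concrete, here is a minimal NumPy sketch on a toy model that is not from the thread: z_i ~ Normal(0, 1), x_i | z_i ~ Normal(theta + z_i, 1), with a variational family q(z_i) = Normal(m_i, s_i^2). It is plain gradient ascent using analytic ELBO gradients, not Edward code and not Edward's sampled gradient estimator. The variational parameters (m, rho = log s) and the model parameter theta are all updated in the same gradient step.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
theta_true = 1.5  # hypothetical ground-truth parameter for the toy data
x = theta_true + rng.normal(size=N) + rng.normal(size=N)

# Variational parameters (m, rho) and the model parameter theta.
theta = 0.0
m = np.zeros(N)      # variational means
rho = np.zeros(N)    # variational log standard deviations

# Per data point, up to an additive constant:
#   ELBO_i = -0.5*(x_i - theta - m_i)^2 - 0.5*m_i^2 - s_i^2 + log(s_i)
lr = 0.01
for step in range(3000):
    resid = x - theta - m
    grad_m = resid - m                        # d ELBO / d m_i
    grad_rho = 1.0 - 2.0 * np.exp(2.0 * rho)  # d ELBO / d rho_i
    grad_theta = resid.sum()                  # d ELBO / d theta
    # One simultaneous ascent step on every parameter.
    m += lr * grad_m
    rho += lr * grad_rho
    theta += lr * grad_theta

# Marginally x_i ~ Normal(theta, 2), so the MLE of theta is the sample mean.
print(theta, x.mean())
```

In this toy model the stationary point can be checked by hand: m_i = (x_i - theta) / 2, s_i^2 = 1/2, and theta equal to the sample mean of x.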


Ah right. Would it be correct to say that VBEM and this algorithm both minimize the same loss function, but VBEM uses coordinate descent whereas this algorithm performs gradient descent? Thanks again.


The fundamental idea of VBEM is that, like EM, it monotonically improves a lower bound to the objective function. (In the case of VBEM, the objective function is itself a lower bound.)

You could use any optimization algorithm for the VBE step (depending on whether you could write down the objective function analytically), and potentially a different algorithm for the VBM step.
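The alternating scheme can be sketched on a toy model (an assumption made for illustration, not taken from the thread): z_i ~ Normal(0, 1), x_i | z_i ~ Normal(theta + z_i, 1), with q(z_i) = Normal(m_i, s2). Both steps happen to have closed-form updates here, so no inner optimizer is needed. The recorded ELBO values should never decrease, which is the monotonic-improvement property described above.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = 1.5 + rng.normal(size=N) + rng.normal(size=N)  # hypothetical toy data

def elbo(theta, m, s2):
    """Analytic evidence lower bound for the toy model."""
    per_point = (-0.5 * np.log(2.0 * np.pi)
                 - 0.5 * (x - theta - m) ** 2
                 - 0.5 * m ** 2
                 - s2 + 0.5 + 0.5 * np.log(s2))
    return np.sum(per_point)

theta, m, s2 = 0.0, np.zeros(N), 1.0
history = [elbo(theta, m, s2)]

for iteration in range(50):
    # VBE step: maximize the ELBO over q(z) with theta held fixed.
    m = 0.5 * (x - theta)
    s2 = 0.5
    # VBM step: maximize the ELBO over theta with q(z) held fixed.
    theta = np.mean(x - m)
    history.append(elbo(theta, m, s2))

print(theta, x.mean())
print(history[0], history[-1])
```

Each step maximizes the bound over one block of parameters while the other is held fixed, which is why the bound cannot go down. Either closed-form update could be swapped for a numerical optimizer without changing the overall structure.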