Plan for Hierarchical Variational Models? What is Edward inferring when there is a hyperprior in linear regression?


Consider a simple linear regression model, where the goal is to infer w from observations X, Y:

y = wx + e

where we assume the likelihood
P(Y|X,W) = N(WX, I)

The prior on the model parameter W has its own parameter L, which in turn has a hyperprior:
P(W|L) = N(0, L)
where the hyperprior is
P(L) = InvGamma(0.1,0.1)
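
To make the generative story concrete, here is a minimal sketch that draws one sample from the model above using only the Python standard library. It rests on assumptions that the post does not state: w is treated as a scalar, L is read as the variance of the prior on w, and InvGamma(0.1, 0.1) is read as shape 0.1 and scale 0.1. The function name `sample_model` is made up for illustration.

```python
import random

def sample_model(xs, a=0.1, b=0.1, rng=None):
    """Draw (L, w, ys) once from the hierarchical model, with scalar w."""
    rng = rng or random.Random()
    # L ~ InvGamma(a, b): the reciprocal of a Gamma(shape=a, rate=b) draw.
    # random.gammavariate takes (shape, scale), so the scale is 1 / b.
    L = 1.0 / rng.gammavariate(a, 1.0 / b)
    # w | L ~ N(0, L), with L read as a variance (assumption).
    w = rng.gauss(0.0, L ** 0.5)
    # y | x, w ~ N(w * x, 1)
    ys = [rng.gauss(w * x, 1.0) for x in xs]
    return L, w, ys

L, w, ys = sample_model([0.5, 1.0, 1.5])
```

The sampling order (L first, then w given L, then y given w) is the same dependence chain that the question below is about.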

Now suppose I use Edward and create the following variational distributions:
qW = N(…,…)
qL = N(…,…)

Then I pass {W: qW, L: qL} to the inference, run it, and get results for q(W) and q(L).
Question: is q(W,L) = q(W)*q(L)? And are we minimizing the KL divergence between q(W,L) and P(W,L|X,Y)?
In the code above, since qW and qL are defined separately, q(W,L) = q(W)*q(L).

However, this decoupling in the variational part of the program (the definitions that start with q) neglects the dependence between W and L.

Naturally, we should have q(W,L) = q(W|L)*q(L), so that q(W) is obtained by marginalizing out L as a postprocessing step.
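
As a rough illustration of what q(W,L) = q(W|L)*q(L) would look like in practice, here is a plain-Python sketch of ancestral sampling from such a structured family. This is not Edward code. The log-normal form for q(L), the Gaussian form for q(W|L), and the names `sample_q_joint`, `cond_mean` and `cond_std` are all assumptions made for illustration.

```python
import math
import random

def sample_q_joint(n, cond_mean, cond_std, mu_L=0.0, sigma_L=1.0, seed=0):
    """Ancestral samples from q(W, L) = q(W | L) * q(L), with scalar W."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # q(L): log-normal, chosen here only because it keeps L positive.
        L = math.exp(rng.gauss(mu_L, sigma_L))
        # q(W | L): Gaussian whose mean and std are functions of L.
        w = rng.gauss(cond_mean(L), cond_std(L))
        draws.append((w, L))
    return draws

# Hypothetical conditional: the spread of W grows with L.
draws = sample_q_joint(5000,
                       cond_mean=lambda L: 0.0,
                       cond_std=lambda L: math.sqrt(L))

# Marginalizing out L amounts to discarding it: these are samples from q(W).
w_samples = [w for w, _ in draws]
```

With samples, the "postprocessing step" is trivial: dropping the L coordinate from each joint draw leaves draws from the marginal q(W).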

Any comments?

Is there any way to encode this dependence in the variational family?


After asking Dustin (thanks!) and reading a few other papers, I found the answer to my question.

In a hyperprior setting like this, with a parameter w and a parameter alpha of the prior on w, it is often assumed that

Q(w,alpha) = Q(w) Q(alpha)

This is effectively what you get in Edward when you implement VI the way many unofficial tutorials do, defining each q separately.

OK, so is this assumption exact? Consider a linear regression problem with ARD (automatic relevance determination).

Bayes' rule:
P(w, alpha | D) = P(D | w, alpha) P(w, alpha) / P(D)
= P(D | w) P(w | alpha) P(alpha) / P(D)

So we observe the following:

  1. the first term only depends on w, the third term only depends on alpha
  2. the second term is a function of both w and alpha

Thus the assumption is not exact: because of the second term, the posterior does not factorize into a function of w times a function of alpha.
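
A small numeric sketch makes the coupling visible. It assumes scalar w, unit noise variance, and a prior w ~ N(0, alpha) with alpha read as a variance. Under those assumptions the exact conditional p(w | alpha, D) is Gaussian with precision sum(x^2) + 1/alpha, which is the standard conjugate result. The function name `conditional_posterior` and the toy data are made up for illustration.

```python
def conditional_posterior(xs, ys, alpha):
    """Mean and variance of p(w | alpha, D) for y = w*x + e, e ~ N(0, 1), w ~ N(0, alpha)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Likelihood precision plus prior precision.
    precision = sxx + 1.0 / alpha
    return sxy / precision, 1.0 / precision

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
for alpha in (0.01, 1.0, 100.0):
    mean, var = conditional_posterior(xs, ys, alpha)
    print(alpha, mean, var)
```

Both the mean and the variance of w change with alpha, whereas a factorized Q(w) is a single distribution that cannot change with alpha. That gap is what the factorization assumption gives up.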

However, this assumption is not stated in the Edward/ADVI papers, even though they include examples of VI on problems with hyperpriors (ARD problems). The following paper [1] does state explicitly that it is an assumption.