Consider a simple **linear regression** model; the goal is to determine W from the observations X, Y:

y = Wx + e

where we **assume the likelihood** to be

P(Y|X,W) = N(WX, I)

And the prior for the model parameter W contains a scale parameter L that itself has a hyperprior:

P(W|L) = N(0, L)

where the hyperprior is

P(L) = InvGamma(0.1,0.1)
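To pin down the generative model, here is a plain-Python sampling sketch (no Edward; the dataset size and the 1-D x values are made-up illustrative choices, and L is treated as the variance in N(0, L)):

```python
import math
import random

random.seed(0)

def sample_dataset(n=100):
    # Hyperprior: L ~ InvGamma(0.1, 0.1).
    # If G ~ Gamma(shape=0.1, scale=10), then 1/G ~ InvGamma(shape=0.1, scale=0.1).
    L = 1.0 / random.gammavariate(0.1, 10.0)
    # Prior: W | L ~ N(0, L), taking L to be the variance.
    W = random.gauss(0.0, math.sqrt(L))
    # Likelihood: y = W*x + e with e ~ N(0, 1), matching P(Y|X,W) = N(WX, I).
    xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [W * x + random.gauss(0.0, 1.0) for x in xs]
    return L, W, xs, ys

L, W, xs, ys = sample_dataset()
print(L, W, len(xs), len(ys))
```

Inference then amounts to recovering the posterior over (W, L) from (xs, ys) alone.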

Now suppose I use Edward and create the following variational distributions:

qW = N(…,…)

qL = N(…,…)

Then I feed the inference with `{W: qW, L: qL}`. I run the inference and get results for q(W) and q(L).

**Question**: is q(W,L) = q(W)\*q(L)? And do we minimize the KL divergence between P(W,L|X,Y) and q(W,L)?

In the code above, since we define qW and qL **separately**, we have q(W,L) = q(W)\*q(L).

**However**, this decoupling in the variational part of the program (the definitions starting with q) neglects the dependence between W and L.

Naturally, we should have q(W,L) = q(W|L)\*q(L), so that q(W) is obtained by marginalising out L as a post-processing step.
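To quantify what the factorized family gives up, here is a small closed-form check (plain Python; the bivariate-Gaussian posterior and ρ = 0.8 are hypothetical stand-ins for p(W, L | X, Y)). For a zero-mean correlated Gaussian posterior with unit variances, the KL(q||p)-optimal factorized Gaussian has per-coordinate variance 1 − ρ² (the posterior's conditional variance), and the irreducible gap is −½·ln(1 − ρ²), which grows without bound as the W–L dependence strengthens:

```python
import math

def kl_gauss(mu_q, cov_q, mu_p, cov_p):
    # KL(q || p) for two 2-D Gaussians, from the standard closed form.
    k = 2
    det_q = cov_q[0][0] * cov_q[1][1] - cov_q[0][1] * cov_q[1][0]
    det_p = cov_p[0][0] * cov_p[1][1] - cov_p[0][1] * cov_p[1][0]
    inv_p = [[ cov_p[1][1] / det_p, -cov_p[0][1] / det_p],
             [-cov_p[1][0] / det_p,  cov_p[0][0] / det_p]]
    tr = sum(inv_p[i][j] * cov_q[j][i] for i in range(2) for j in range(2))
    d = [mu_p[0] - mu_q[0], mu_p[1] - mu_q[1]]
    quad = sum(d[i] * inv_p[i][j] * d[j] for i in range(2) for j in range(2))
    return 0.5 * (tr + quad - k + math.log(det_p / det_q))

rho = 0.8
post = [[1.0, rho], [rho, 1.0]]                  # stand-in correlated posterior
mf = [[1 - rho**2, 0.0], [0.0, 1 - rho**2]]      # best factorized (mean-field) q:
                                                 # matches the diagonal of the
                                                 # posterior precision matrix
gap = kl_gauss([0.0, 0.0], mf, [0.0, 0.0], post)
print(gap, -0.5 * math.log(1 - rho**2))
```

So even at its optimum, the factorized family cannot drive the KL to zero whenever ρ ≠ 0, which is exactly the dependence being neglected.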

Any comments?

==================

Is there any way to encode dependence in the variational family?