Consider a simple **linear regression** model; the goal is to determine W from the observations X, Y:

y = Wx + e

where we **assume the likelihood** to be

P(Y|X,W) = N(WX, I)

And the prior for the model parameter W contains a scale parameter L that itself has a hyperprior:

P(W|L) = N(0, L)

where the hyperprior is

P(L) = InvGamma(0.1,0.1)
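To pin down the generative model, here is a plain-Python sampling sketch (no Edward; the dataset size and the 1-D x values are made-up illustrative choices, and L is treated as the variance in N(0, L)):

```python
import math
import random

random.seed(0)

def sample_dataset(n=100):
    # Hyperprior: L ~ InvGamma(0.1, 0.1).
    # If G ~ Gamma(shape=0.1, scale=10), then 1/G ~ InvGamma(shape=0.1, scale=0.1).
    L = 1.0 / random.gammavariate(0.1, 10.0)
    # Prior: W | L ~ N(0, L), taking L to be the variance.
    W = random.gauss(0.0, math.sqrt(L))
    # Likelihood: y = W*x + e with e ~ N(0, 1), matching P(Y|X,W) = N(WX, I).
    xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [W * x + random.gauss(0.0, 1.0) for x in xs]
    return L, W, xs, ys

L, W, xs, ys = sample_dataset()
print(L, W, len(xs), len(ys))
```

Inference then amounts to recovering the posterior over (W, L) from (xs, ys) alone.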

Now suppose I use Edward and create the following variational distributions:

qW = N(…,…)

qL = N(…,…)

Then I feed the inference with `{W: qW, L: qL}`. I run the inference and get results for q(W) and q(L).

**Question**: is q(W,L) = q(W)\*q(L)? And do we minimize the KL divergence between P(W,L|X,Y) and q(W,L)?

In the code above, since we define qW and qL **separately**, we have q(W,L) = q(W)\*q(L).

**However**, this decoupling in the variational part of the program (the definitions starting with q) neglects the dependence between W and L.

Naturally, we should have q(W,L) = q(W|L)\*q(L), so that q(W) is obtained by marginalising out L as a post-processing step.
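To quantify what the factorized family gives up, here is a small closed-form check (plain Python; the bivariate-Gaussian posterior and ρ = 0.8 are hypothetical stand-ins for p(W, L | X, Y)). For a zero-mean correlated Gaussian posterior with unit variances, the KL(q||p)-optimal factorized Gaussian has per-coordinate variance 1 − ρ² (the posterior's conditional variance), and the irreducible gap is −½·ln(1 − ρ²), which grows without bound as the W–L dependence strengthens:

```python
import math

def kl_gauss(mu_q, cov_q, mu_p, cov_p):
    # KL(q || p) for two 2-D Gaussians, from the standard closed form.
    k = 2
    det_q = cov_q[0][0] * cov_q[1][1] - cov_q[0][1] * cov_q[1][0]
    det_p = cov_p[0][0] * cov_p[1][1] - cov_p[0][1] * cov_p[1][0]
    inv_p = [[ cov_p[1][1] / det_p, -cov_p[0][1] / det_p],
             [-cov_p[1][0] / det_p,  cov_p[0][0] / det_p]]
    tr = sum(inv_p[i][j] * cov_q[j][i] for i in range(2) for j in range(2))
    d = [mu_p[0] - mu_q[0], mu_p[1] - mu_q[1]]
    quad = sum(d[i] * inv_p[i][j] * d[j] for i in range(2) for j in range(2))
    return 0.5 * (tr + quad - k + math.log(det_p / det_q))

rho = 0.8
post = [[1.0, rho], [rho, 1.0]]                  # stand-in correlated posterior
mf = [[1 - rho**2, 0.0], [0.0, 1 - rho**2]]      # best factorized (mean-field) q:
                                                 # matches the diagonal of the
                                                 # posterior precision matrix
gap = kl_gauss([0.0, 0.0], mf, [0.0, 0.0], post)
print(gap, -0.5 * math.log(1 - rho**2))
```

So even at its optimum, the factorized family cannot drive the KL to zero whenever ρ ≠ 0, which is exactly the dependence being neglected.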

Any comments?

==================

Is there any way to encode dependence in the variational family?