Unrelated variable definitions affecting fit performance

Hello, this is my first post! I’m a newcomer to Edward and, to a lesser extent, TensorFlow, so I’m not sure whether this is a basic TensorFlow question or an Edward question.

In the supervised regression tutorial, typical MSE and MAE values on the test data are 0.03 and 0.123, respectively. The variational distributions for the latents are defined exactly as follows:

qw = Normal(loc=tf.get_variable("qw/loc", [D]),
            scale=tf.nn.softplus(tf.get_variable("qw/scale", [D])))
qb = Normal(loc=tf.get_variable("qb/loc", [1]),
            scale=tf.nn.softplus(tf.get_variable("qb/scale", [1])))

(This is different from what’s in the tutorial link. Rather, this definition comes from cell 4 here, which is an interactive version of the tutorial with minor changes; I think the main difference is that, without an explicit initializer, tf.get_variable falls back to the Glorot uniform initializer.)
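
For what it’s worth, here is a minimal sketch of how the initialization could be made explicit instead of relying on that default (the choice of tf.random_normal_initializer here is mine, not the tutorial’s):

qw = Normal(loc=tf.get_variable("qw/loc", [D],
                                initializer=tf.random_normal_initializer()),
            scale=tf.nn.softplus(tf.get_variable("qw/scale", [D],
                                                 initializer=tf.random_normal_initializer())))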

However, weirdly enough, you can improve the MSE and MAE to 0.00572143 and 0.0651105, respectively, by using the following cell instead of that one:

with tf.variable_scope("scope" 
                       ,reuse=tf.AUTO_REUSE
                      ):
    qwa = Normal(loc=tf.get_variable("qw/loc", [D]),
                scale=tf.nn.softplus(tf.get_variable("qw/scale", [D])))
    qba = Normal(loc=tf.get_variable("qb/loc", [1]),
                scale=tf.nn.softplus(tf.get_variable("qb/scale", [1])))
    qw = Normal(loc=tf.get_variable("qw/loc", [D]),
                scale=tf.square(tf.get_variable("qw/scale", [D])))
    qb = Normal(loc=tf.get_variable("qb/loc", [1]),
                scale=tf.square(tf.get_variable("qb/scale", [1])))
    
#improves mse for some reason

What changed is that I define some unrelated variables (qwa and qba) that use softplus for the scale, while the actual latents that inference takes as input (namely the dict {w: qw, b: qb}) use squaring for the scale. However, if you comment out the definitions of qwa and qba, this improvement does not occur! Something about the unrelated qwa and qba is doing something funky here.
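
To confirm the sharing that reuse=tf.AUTO_REUSE sets up, here is a minimal, self-contained sketch (TF 1.x graph mode; the loc_a/loc_b names are mine) showing that a second get_variable call with the same name returns the very same variable the first call created:

import tensorflow as tf

D = 10
with tf.variable_scope("scope", reuse=tf.AUTO_REUSE):
    loc_a = tf.get_variable("qw/loc", [D])  # created on first call
    loc_b = tf.get_variable("qw/loc", [D])  # reused: same underlying variable

print(loc_a is loc_b)  # True
print(loc_a.name)      # scope/qw/loc:0

So in the cell above, qw and qwa share their loc and scale variables; only the softplus/square transforms applied to them differ.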

You can also make the MSE and MAE worse by using this other cell:

with tf.variable_scope("scope" 
                       ,reuse=tf.AUTO_REUSE
                      ):
    qwa = Normal(loc=tf.get_variable("qw/loc", [D]),
                scale=tf.nn.softplus(tf.get_variable("qw/scale", [D])))
    qba = Normal(loc=tf.get_variable("qb/loc", [1]),
                scale=tf.nn.softplus(tf.get_variable("qb/scale", [1])))
    qw = Normal(loc=tf.get_variable("qw/loc2", [D]),
                scale=tf.square(tf.get_variable("qw/scale2", [D])))
    qb = Normal(loc=tf.get_variable("qb/loc2", [1]),
                scale=tf.square(tf.get_variable("qb/scale2", [1])))

which approximately doubles both the MSE and MAE. (The only change is the variable names passed to get_variable.)

What is happening? Is it something strange with the tensorflow graph that I’m not understanding?

Can you verify that the different graph definitions aren’t just shuffling the random seeds used to run the algorithm? Namely, I’m wondering how the MSE and MAE change if you ensure all scripts run to convergence.
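
Something like this is what I have in mind, as a sketch based on the tutorial’s setup (w, b, X, y, qw, qb, X_train, and y_train are the names from the tutorial; the seed value and iteration count are arbitrary choices of mine):

import numpy as np
import tensorflow as tf
import edward as ed

# Fix the NumPy and TensorFlow graph-level seeds. To pin the initializers of
# the q-variables, these calls would need to run before those variables are
# defined, i.e. at the top of the notebook.
np.random.seed(42)
tf.set_random_seed(42)

# Same inference call as the tutorial, but with many more iterations so that
# every variant has a chance to converge.
inference = ed.KLqp({w: qw, b: qb}, data={X: X_train, y: y_train})
inference.run(n_samples=5, n_iter=5000)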