Re-using models/inferences for several independent fits


#1

Hi, thanks for Edward. I came across Edward while trying to see whether I could speed up fitting many mixed effects models compared to statsmodels or rpy2->lme4. I am pretty new to TensorFlow, but proficient in Python.

What I am doing is fitting tens of thousands of models, where the “design matrix” of fixed and random effects stays the same, and the only thing that differs each time is the dependent variable. Basically, my problem is equivalent to the mixed effects tutorial, but I am looking for guidance on how to correctly re-use my model/variables when I plug in a new dependent variable, if that is possible.

What I have been doing is:

  1. Initializing my model variables as in the tutorial
  2. For each column in my matrix of dependent variables, Y:
  • set data[y] = np.array(Y[:,j])
  • initialize a new KLqp with data and latent and run inference

My questions are:

Will the variables be “contaminated” from previous loops, if I initialize a new Inference object each time?

Is there a way to re-use the KLqp by initializing it once and just adding my new dependent variable? I see that there is a feed_dict in Inference.update, but if I understand it correctly, it is for iteratively fitting a single model with a data stream, not for “resetting” data.

On that note, I see there is a session.run(inference.reset), but if I try this, I get errors about uninitialized variables, even if I pass my whole data dict into inference.update. But I am not entirely clear on what “initializing” a variable even means in this context (zeroing it? loading it onto the GPU if one is being used?).

If anyone could point me in the right direction, it’d be greatly appreciated. My concern is that more time seems to be spent (re)initializing the model each time than actually fitting it, and since we are getting in a batch of GPUs, it seems wasteful to shuttle the same design matrix back and forth to the GPU thousands of times.


#2

Will the variables be “contaminated” from previous loops, if I initialize a new Inference object each time?

The contract of inference.initialize() is that it only builds more nodes in the graph. It will never destroy information, such as by reinitializing parameters. This means that if you’d like to reinitialize the parameters to avoid “contamination”, you need to call tf.global_variables_initializer().run(). (See Inference's API (http://edwardlib.org/api/ed/Inference).)

Is there a way to re-use the KLqp by initializing it once and just adding my new dependent variable? I see that there is a feed_dict in Inference.update, but if I understand it correctly, it is for iteratively fitting a single model with a data stream, not for “resetting” data.

You’re getting at what I think is the best approach:

  1. Write a tf.placeholder for the data you condition on in the model. Build the model and initialize inference (inference.initialize()).
  2. Initialize parameters (tf.global_variables_initializer().run()) and update inference in a loop while feeding the placeholder a fixed matrix of dependent variables.
  3. Assess the fit.
  4. Reset inference (sess.run(inference.reset)) and go back to step 2 with a different matrix of dependent variables.

If you did this correctly, you should see a graph built for one model and one inference. (You can check this using TensorBoard.) The only values that can change in the graph are the TensorFlow variables in the model/approximate posterior, internal counters in inference, and the placeholders which you’ll feed in data.

I am not entirely clear on what “initializing” a variable even means in this context (zeroing it? loading it onto the GPU if one is being used?).

Initializing a variable tf.Variable(tf.zeros(5)) means setting its value to tf.zeros(5). If inference changes its value, reinitializing the variable sets it back to tf.zeros(5).
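In plain TensorFlow terms, that behavior looks like this (written against tf.compat.v1 so it also runs in graph mode under TF2; on late TF1 the compat calls are no-ops):

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()  # needed under TF2; harmless on late TF1

v = tf.Variable(tf.zeros(5))
init = tf.global_variables_initializer()

sess = tf.Session()
sess.run(init)                  # v takes its initializer value, all zeros
sess.run(v.assign(tf.ones(5)))  # inference would move the value, like this
after_fit = sess.run(v)
sess.run(init)                  # reinitializing sets v back to all zeros
after_reinit = sess.run(v)
```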