Handling missing values

kkmann · April 12, 2018, 11:12am

Hi,

I do have a rather general question regarding missing values in edward/edward2. In biostatistical applications data is often missing. A standard approach is (multiple) imputation of the missing values before moving to the actual analysis. This is a bit absurd when using edward, as the models used for imputation still need to fit in memory and are limited to a few standard implementations (e.g. MICE package in R). A more principled approach would be to specify a joint model of observed and missing data where the missing values are treated as latent variables. As far as I understand this is already possible but requires tedious splicing/combination of parameters and data (cf. Stan user manual on missing values). Even for low-dimensional models this quickly becomes almost intractable in practice. Is there currently any way of specifying distributions for ‘partially observed tensors’ (cf. discussion https://github.com/greta-dev/greta/issues/117) or are there plans for implementing something like it? I feel that this is extremely important in practice as it would be kind of sad to be forced to still use standard (multiple) imputation techniques before a final analysis with edward. After all, whenever missingness becomes a real issue, the ability to model it properly is extremely important!
It would also be interesting to see whether a fully conditional specification (https://www.tandfonline.com/doi/abs/10.1080/10629360600810434) would be possible with edward as this does not necessarily lead to a valid joint model of the data.

Best,

Kevin

deoxyribose · April 12, 2018, 12:48pm

Maybe you can use the second approach Dustin describes here: "How to handle missing values in Gaussian Matrix Factorization"

kkmann · April 12, 2018, 1:23pm

Thanks for the quick reply! Well, that’s possible in principle, but think about a 50-dimensional longitudinal dataset with missing values in all variables and it quickly gets extremely ugly to do so in practice. I just don’t see how this solution scales - or maybe I just missed the point.

deoxyribose · April 12, 2018, 1:38pm

If your model already has local latent variables, and you can do mini-batch SGD, it should scale about as well as anything.

kkmann · April 12, 2018, 1:56pm

Sorry for being imprecise: the issue will not be scaling computationally but programmatically. There will be huge amounts of boilerplate code if you need to do this manual splicing for more than 3 variables.
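For readers unfamiliar with the "splicing" being discussed, here is a minimal sketch in plain NumPy (not Edward code). It assumes missing entries are encoded as NaN and that the values standing in for the missing entries arrive as one flat vector; the helper name `splice` is hypothetical and not part of any library. In an actual model that vector would be a latent variable rather than fixed numbers.

```python
import numpy as np

def splice(data, missing_values):
    """Return a copy of `data` with its NaN entries replaced by `missing_values`.

    Hypothetical helper: NaN marks a missing entry, and `missing_values`
    holds one value per missing entry, in row-major order.
    """
    mask = np.isnan(data)
    n_missing = int(mask.sum())
    if missing_values.shape != (n_missing,):
        raise ValueError(f"expected {n_missing} values for the missing entries")
    full = data.copy()
    full[mask] = missing_values  # boolean-mask assignment fills in row-major order
    return full

# Toy data: 2 observations of 3 variables, with two missing entries.
data = np.array([[1.0, np.nan, 3.0],
                 [np.nan, 5.0, 6.0]])
latent = np.array([20.0, 40.0])  # stand-in for the latent "missing" values
full = splice(data, latent)
```

Written against a boolean mask over the whole array, the same helper applies regardless of how many columns contain missing values. Whether that removes the boilerplate concern in a real Edward model is not something this sketch settles.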

deoxyribose · April 12, 2018, 3:21pm

Ah, ok. Are you sure? The index variable I, in the example, is supposed to have the same shape as your training data. So if your data is N x 50, so is I: it will be zero everywhere except at the indices where data is missing, where it will be 1.
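A rough illustration of the indicator idea, in plain NumPy rather than Edward: the sketch below builds I with the convention described above (1 where data is missing, 0 elsewhere) and uses it to drop missing entries from an elementwise Gaussian log-likelihood. The function names are hypothetical, NaN is assumed to mark missing entries, and the exact form used in the linked post may differ.

```python
import numpy as np

def missing_indicator(data):
    """Indicator with the same shape as `data`: 1.0 where missing (NaN), else 0.0."""
    return np.isnan(data).astype(np.float64)

def masked_gaussian_loglik(data, mean, scale, indicator):
    """Sum the Gaussian log-density over observed entries only.

    Missing entries are first overwritten with the mean so that no NaN
    enters the arithmetic (NaN * 0 would still be NaN); their terms are
    then multiplied by zero and do not contribute.
    """
    filled = np.where(indicator == 1.0, mean, data)
    z = (filled - mean) / scale
    logp = -0.5 * z ** 2 - np.log(scale) - 0.5 * np.log(2.0 * np.pi)
    return float(np.sum((1.0 - indicator) * logp))

# Toy 2 x 2 data set; the same code applies unchanged to an N x 50 array.
data = np.array([[0.0, np.nan],
                 [np.nan, 1.0]])
I = missing_indicator(data)
ll = masked_gaussian_loglik(data, mean=0.0, scale=1.0, indicator=I)
```

Because the mask is computed from the data in one call, nothing here has to be written per variable.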
