Handling missing values



I have a rather general question regarding missing values in edward/edward2. In biostatistical applications data is often missing. A standard approach is (multiple) imputation of the missing values before moving to the actual analysis. This is a bit absurd when using edward, as the models used for imputation still need to fit in memory and are limited to a few standard implementations (e.g. the MICE package in R). A more principled approach would be to specify a joint model of observed and missing data, where the missing values are treated as latent variables. As far as I understand this is already possible, but it requires tedious splicing/combination of parameters and data (cf. the Stan user manual on missing values). Even for low-dimensional models this quickly becomes almost intractable in practice. Is there currently any way of specifying distributions for ‘partially observed tensors’ (cf. discussion https://github.com/greta-dev/greta/issues/117), or are there plans for implementing something like it? I feel that this is extremely important in practice, as it would be kind of sad to be forced to still use standard (multiple) imputation techniques before a final analysis with edward. After all, whenever missingness becomes a real issue, the ability to model it properly is extremely important!
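To illustrate the splicing I mean, here is a minimal sketch (plain numpy, not actual edward2 API, and the variable names are my own): treating the missing entries as free parameters that get scattered back into the data matrix for each likelihood evaluation.

```python
import numpy as np

# Hypothetical toy data: a 5 x 3 matrix with two missing cells.
rng = np.random.default_rng(0)
y = rng.normal(size=(5, 3))
y[1, 2] = np.nan
y[4, 0] = np.nan

missing = np.isnan(y)       # boolean mask of missing entries
n_missing = missing.sum()   # number of latent "parameters" to introduce

def fill(y, y_missing_params):
    """Combine observed data with the current latent values
    for the missing cells (the Stan-style splicing step)."""
    y_full = y.copy()
    y_full[missing] = y_missing_params
    return y_full

# e.g. initialise the latent missing values at zero
y_full = fill(y, np.zeros(n_missing))
assert not np.isnan(y_full).any()
```

With one matrix this is manageable; my worry is doing this by hand for dozens of variables with different missingness patterns.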
It would also be interesting to see whether a fully conditional specification (https://www.tandfonline.com/doi/abs/10.1080/10629360600810434) would be possible with edward, as this does not necessarily lead to a valid joint model of the data.




Maybe you can use the second approach Dustin describes here: How to handle missing values in Gaussian Matrix Factorization


Thanks for the quick reply! Well, that’s possible in principle, but think about a 50-dimensional longitudinal dataset with missing values in all variables and it quickly gets extremely ugly to do in practice. I just don’t see how this solution scales - or maybe I just missed the point :wink:


If your model already has local latent variables, and you can do mini-batch SGD, it should scale about as well as anything.


Sorry for being imprecise: the issue will not be scaling computationally but programmatically. There will be huge amounts of boilerplate code if you need to do this manual splicing for more than 3 variables.


Ah, ok. Are you sure? The index variable I, in the example, is supposed to have the same shape as your training data. So if your data is N x 50, so is I - it will be zero everywhere except at the indices where data is missing, where it will be 1.