I want to build a Gaussian Process classifier for fairly high-dimensional data. The data are originally ~150,000 SNPs (single-nucleotide polymorphisms), but I am reducing their dimension using smartpca in eigenstrat. smartpca is a PCA implementation used in population genetics that handles missing data, prunes SNPs to avoid correlation due to physical linkage on chromosomes, and normalizes on a per-SNP basis.
I have a reference panel of populations from around the world, classified into 6 regions overall; it currently contains ~1,000+ samples.
My aim is to train a GP classifier to distinguish among these 6 world regions, plus perhaps 1 or 2 additional classes of individuals that are of a different sort.
I’m wondering whether I should tackle this in Stan or Edward. The full PCA-reduced data has ~1,000 dimensions, but I am thinking of using on the order of ~20 PCs as input to the GP classifier.
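For concreteness, below is a rough sketch of the kind of model I have in mind on the Stan side: one latent GP per class with a squared-exponential kernel, combined through a softmax link. The priors, the shared length-scale, and the jitter value are just placeholders I would expect to revise.

```stan
data {
  int<lower=1> N;                    // reference-panel samples (~1,000+)
  int<lower=1> D;                    // number of PCs used as inputs (~20)
  int<lower=2> K;                    // number of classes (6 regions, maybe +1 or 2)
  vector[D] X[N];                    // PC scores from smartpca
  int<lower=1, upper=K> y[N];        // class labels
}
parameters {
  real<lower=0> alpha;               // GP marginal standard deviation
  real<lower=0> rho;                 // GP length-scale (shared across classes)
  matrix[N, K] eta;                  // standard normals for non-centred latent GPs
}
transformed parameters {
  matrix[N, K] f;                    // one latent function value per sample and class
  {
    matrix[N, N] K_cov = cov_exp_quad(X, alpha, rho);
    matrix[N, N] L_K;
    for (n in 1:N)
      K_cov[n, n] = K_cov[n, n] + 1e-9;   // jitter for numerical stability
    L_K = cholesky_decompose(K_cov);
    f = L_K * eta;
  }
}
model {
  alpha ~ normal(0, 1);              // placeholder priors
  rho ~ inv_gamma(5, 5);
  to_vector(eta) ~ normal(0, 1);
  for (n in 1:N)
    y[n] ~ categorical_logit(f[n]'); // softmax link over the K latent values
}
```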
Also, I’m fairly new to probabilistic programming (although I’ve been following Stan’s development for some time now), so perhaps Stan’s learning curve will be easier to take on, given that it currently has more tutorials, examples, and documentation?
Looking forward to your feedback!!