I want to build a Gaussian process classifier for fairly high-dimensional data. The data originally consist of ~150,000 SNPs (single-nucleotide variants), but I am reducing their dimension using smartpca in EIGENSTRAT. smartpca is a PCA implementation used in population genetics that handles missing data, prunes SNPs to avoid correlation due to physical linkage on chromosomes, and normalizes on a per-SNP basis.

I have a reference panel of populations from around the world, classified into 6 regions overall. The reference panel currently has ~1,000+ samples.

My aim is to train a GP classifier to distinguish among these 6 world regions, plus perhaps 1 or 2 additional classes of individuals that are of a different sort.

I’m wondering whether I should tackle this in Stan or Edward. The full PCA-reduced data has ~1,000 dimensions; I am thinking of using on the order of ~20 PCs as input to a GP classifier.

I know Edward is still at an early stage, but I’m wondering whether a GP classifier would scale better in Edward, since it’s built on TensorFlow and might let me use more PCs than Stan would, should the need arise.

Also, I’m fairly new to probabilistic programming (although I’ve been following Stan’s development for some time now), so perhaps the learning curve with Stan would be easier to take on, since at the moment it has more tutorials, examples, and documentation?

You should use neither. There are many software packages dedicated to GPs, such as GPy, GPstuff, and GPflow. I recommend one of these, as they provide a suite of kernels and model-specific algorithms for common GP models, including GP classification. (Of these, I recommend GPy for its maturity and abundant documentation.)
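To give a sense of how little code this takes: here is a minimal sketch of the PCA-scores-in, region-label-out workflow, on synthetic stand-in data (in practice you would load the smartpca PC scores). One caveat: GPy's `GPy.models.GPClassification` uses a Bernoulli likelihood, i.e. it is binary, so 6 classes would mean training one-vs-rest models; the sketch below uses scikit-learn's `GaussianProcessClassifier`, which builds exactly that one-vs-rest ensemble internally, just to keep the example self-contained.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Stand-in for the smartpca output: n_samples x 20 PC scores,
# with 6 well-separated "regions" (hypothetical synthetic data).
n_per_class, n_pcs, n_regions = 50, 20, 6
X = np.concatenate(
    [rng.normal(loc=3.0 * k, scale=1.0, size=(n_per_class, n_pcs))
     for k in range(n_regions)]
)
y = np.repeat(np.arange(n_regions), n_per_class)

# RBF kernel with a learned amplitude and length-scale; for >2 classes
# this fits one Laplace-approximated binary GP per class (one-vs-rest).
clf = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
clf.fit(X, y)

probs = clf.predict_proba(X[:1])  # posterior class probabilities
print(probs.shape)                # (1, 6)
```

The same model in GPy would use a kernel like `GPy.kern.RBF(input_dim=20, ARD=True)`, where the per-dimension (ARD) length-scales can also tell you which PCs carry the signal.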

I recommend a probabilistic programming framework over GP-specific software only if you intend to scaffold GPs into more complex models, or to incorporate GPs into a workflow more general than model-specific software supports.

Thanks a ton for your input! Yes, at the moment I’m not aiming to do anything more complex than a GP, so these libraries would be a great option.

I did think at one point that it could be advantageous to couple a sparse factor model (to replace the PCA from smartpca) to a GP classifier, in which case Stan or Edward would be advantageous. At first, though, I will try using the PCA output as input to a GP classifier.

The other potential wrinkle down the line is that each of the 6 world regions has sampled countries and localities within it, so it would be great to classify unlabeled data beyond world region, down to, say, country; in essence, the labels have a hierarchy beyond the 6 world regions. Alternatively, I could use GP regression instead of classification to predict the area of the world that new data/individuals come from. In either case (more complex/nested labels or regression), there is still the issue of reducing the ~150,000-SNP data to a dimensionality usable in a GP, which could be handled with a sparse factor model.
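For the regression variant, a sketch of what I have in mind (entirely hypothetical synthetic data; real inputs would be the PC scores, and the training targets would be the known sampling coordinates of the reference individuals):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)

# Hypothetical stand-in: 200 samples x 20 PC scores; targets are the
# (latitude, longitude) of each reference sample's known origin.
X = rng.normal(size=(200, 20))
coords = (X[:, :2] @ np.array([[10.0, 0.0], [0.0, 20.0]])
          + rng.normal(scale=0.5, size=(200, 2)))

# Multi-output GP regression: scikit-learn treats the two coordinate
# columns as independent targets sharing one kernel; WhiteKernel
# models the noise in the recorded coordinates.
gpr = GaussianProcessRegressor(
    kernel=1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.25)
)
gpr.fit(X, coords)

pred = gpr.predict(X[:3])  # predicted (lat, lon) for 3 individuals
print(pred.shape)          # (3, 2)
```

Treating latitude and longitude as independent outputs is a simplification (and ignores that longitude wraps around), but it would be a reasonable first pass before anything fancier.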

Anyhow, a GPy classifier using PCA input seems like a great start, and may accomplish all I really need. Thanks again for your help!!