I am learning probabilistic machine learning, and I like the idea of doing it in TensorFlow. I saw several ML algorithms in the Edward tutorials, but I am having trouble applying a Gaussian process classifier to text data (with lots of features).

I would greatly appreciate any help with it (a sample, a reference, etc.).

I assume you saw the GP classification tutorial? It’s directly applicable to text data.

I’m not sure whether anyone has actually published papers applying GPs to text. In practice, one difference between text and other sorts of feature inputs is that text is very sparse. You can use an embedding matrix (lookup table) to convert the sparse inputs into a set of dense features. Then you place a GP classifier over these features and train the GP and the embeddings jointly.

For example:

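import tensorflow as tf
from edward.models import Bernoulli, MultivariateNormalTriL
from edward.util import rbf

batch_size = 256  # batch size during training
vocabulary = 1000  # vocabulary size
embedding_dim = 100  # dimension of embeddings

# Bag-of-words counts, one row per document
X = tf.placeholder(tf.float32, [batch_size, vocabulary])
W = tf.Variable(tf.random_normal([vocabulary, embedding_dim]), name="embedding_matrix")
# X holds float counts, not integer word ids, so a matrix product maps it to dense features
Z = tf.matmul(X, W)  # shape [batch_size, embedding_dim]
# GP prior over the function values of the documents in the batch
f = MultivariateNormalTriL(loc=tf.zeros(batch_size), scale_tril=tf.cholesky(rbf(Z)))
y = Bernoulli(logits=f)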

The embedding idea comes from word2vec. My guess is this would do really well as a bag-of-words classifier. And it’s extensible in all the usual ways (e.g., sequence-based models, attention, going deeper, or dropping the GP altogether).
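
To make the bag-of-words step concrete, here is a minimal NumPy sketch (not Edward code) of how sparse counts become dense features and then a GP covariance matrix. The toy sizes, the random counts, and the rbf_kernel helper are assumptions made for illustration; the kernel is written to follow the usual squared-exponential form, and the diagonal jitter is an added numerical safeguard.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, vocabulary, embedding_dim = 4, 10, 3  # toy sizes (assumed)

# Hypothetical bag-of-words counts: row i counts each vocabulary word in document i.
X = rng.integers(0, 3, size=(batch_size, vocabulary)).astype(np.float64)
W = rng.normal(size=(vocabulary, embedding_dim))  # embedding matrix

# Dense features: a count-weighted sum of the embedding rows of each document's words.
Z = X @ W


def rbf_kernel(A, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between the rows of A."""
    sq_norms = np.sum(A ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * A @ A.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)


K = rbf_kernel(Z)  # covariance between documents, shape [batch_size, batch_size]
# A small diagonal jitter keeps the Cholesky factorization numerically stable.
L = np.linalg.cholesky(K + 1e-6 * np.eye(batch_size))
```

The covariance is computed between documents, not between words, so its size grows with the number of documents rather than with the vocabulary.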

I am using gensim.models.doc2vec to compress the features, but something is missing in inference = ed.KLqp({f: qf}, data={…, y: y_ph}). Without an additional parameter, the loss values during training are too large. Please look at this code…

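import tensorflow as tf
from edward.models import Bernoulli, MultivariateNormalTriL
from edward.util import rbf

N = X_train.shape[0]  # number of data points
M = 256  # batch size
D = X_train.shape[1]  # number of features

# generator is a batching helper defined elsewhere (not shown)
data = generator([X_train, y_train], M)

X = tf.placeholder(tf.float32, [M, D])
y_ph = tf.placeholder(tf.int32, [M])
# GP prior over the M function values of the current minibatch only
f = MultivariateNormalTriL(loc=tf.zeros(M), scale_tril=tf.cholesky(rbf(X)))
y = Bernoulli(logits=f)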

You’re trying to do stochastic variational inference with a Gaussian process. Each function output (a parameter in qf) is associated with a different data point. However, all data points depend on each other through f's covariance matrix, so the objective does not decompose over minibatches and your approach doesn’t apply. You either have to train the GP on the full data set or be a little more sophisticated and use inducing variables (see Hensman et al., 2013).
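
The dependence between data points can be checked numerically. The following is a rough NumPy sketch (not Edward code) under assumed toy settings: six random inputs, an RBF kernel with unit lengthscale, and two hand-picked minibatches. It compares the GP prior log density of all function values with the sum obtained by treating the two minibatches as independent. The helper names rbf_kernel and mvn_logpdf_zero_mean are hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale ** 2)


def mvn_logpdf_zero_mean(f, K):
    """Log density of N(f | 0, K), computed through a Cholesky factorization."""
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, f)
    return (-0.5 * alpha @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(f) * np.log(2.0 * np.pi))


rng = np.random.default_rng(0)
N, D = 6, 2  # toy data set size and feature dimension (assumed)
X = rng.normal(size=(N, D))
K = rbf_kernel(X, X) + 1e-6 * np.eye(N)  # jitter for numerical stability
f = np.linalg.cholesky(K) @ rng.normal(size=N)  # one draw from the GP prior

# Joint log density of all N function values under the full covariance.
full = mvn_logpdf_zero_mean(f, K)

# Sum of per-minibatch log densities, which ignores the cross-covariance block.
idx_a, idx_b = np.arange(0, 3), np.arange(3, 6)
split = (mvn_logpdf_zero_mean(f[idx_a], K[np.ix_(idx_a, idx_a)])
         + mvn_logpdf_zero_mean(f[idx_b], K[np.ix_(idx_b, idx_b)]))

cross_block = K[np.ix_(idx_a, idx_b)]  # covariance between the two minibatches
print(full, split, np.abs(cross_block).max())
```

The two log densities agree only when the cross-covariance block is zero, that is, when the minibatches really are independent under the prior. The Bernoulli likelihood, by contrast, is a sum over data points given f, which is why minibatching works for models without this coupling.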