Edward for NLP tasks

Hi All,

I am learning probabilistic machine learning, and I like the idea of doing it in TensorFlow. I saw several ML algorithms in the Edward tutorials, but I am having trouble applying a Gaussian process classifier to textual data (with lots of features).

I would greatly appreciate any help with this (a sample, a reference, etc.).

Thanks!

I assume you saw the GP classification tutorial? It’s directly applicable to text data.

I’m not sure whether people have actually published papers applying GPs to text before. In practice, one difference between text and other kinds of feature inputs is that text is very sparse. You can use an embedding matrix / lookup table to convert the sparse inputs into a set of dense features, then place a GP classifier over those features and train the GP and the embeddings jointly.

For example:

import tensorflow as tf
from edward.models import Bernoulli, MultivariateNormalTriL
from edward.util import rbf

batch_size = 256     # batch size during training
vocabulary = 1000    # vocabulary size
embedding_dim = 100  # dimension of embeddings

# Bag-of-words counts (sparse in practice, dense placeholder here).
X = tf.placeholder(tf.float32, [batch_size, vocabulary])
W = tf.Variable(tf.random_normal([vocabulary, embedding_dim]), name="embedding_matrix")
# Dense features: each document is the count-weighted sum of its word embeddings.
# (tf.nn.embedding_lookup expects integer ids; for count vectors a matmul does the lookup.)
Z = tf.matmul(X, W)

# GP prior over the latent function values at the embedded inputs;
# a small jitter term keeps the Cholesky factorization numerically stable.
K = rbf(Z) + 1e-6 * tf.eye(batch_size)
f = MultivariateNormalTriL(loc=tf.zeros(batch_size), scale_tril=tf.cholesky(K))
y = Bernoulli(logits=f)

The embedding idea comes from word2vec. My guess is this would do really well as a bag-of-words classifier. And it’s extendable in all the usual ways (e.g., sequence-based, attention, deeper, or just no GPs); a sequence-based sketch follows below.
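As a rough illustration of the sequence-based variant (my own sketch, not from the post above): the input becomes a padded matrix of word ids, the per-token embeddings are gathered with tf.nn.embedding_lookup and mean-pooled into one dense vector per document before the GP. The max_len value and the mean-pooling choice are assumptions.

import tensorflow as tf
from edward.models import Bernoulli, MultivariateNormalTriL
from edward.util import rbf

batch_size = 256
vocabulary = 1000
embedding_dim = 100
max_len = 50  # assumed maximum (padded) document length

# Padded word ids, one row per document.
X_ids = tf.placeholder(tf.int32, [batch_size, max_len])
W = tf.Variable(tf.random_normal([vocabulary, embedding_dim]), name="embedding_matrix")

# Look up per-token embeddings and mean-pool over the sequence dimension.
Z = tf.reduce_mean(tf.nn.embedding_lookup(W, X_ids), axis=1)  # [batch_size, embedding_dim]

# Same GP classifier as before, now over sequence-pooled features.
K = rbf(Z) + 1e-6 * tf.eye(batch_size)
f = MultivariateNormalTriL(loc=tf.zeros(batch_size), scale_tril=tf.cholesky(K))
y = Bernoulli(logits=f)

Any pooling that maps a variable-length sequence to a fixed-size vector (mean, max, an RNN state, attention) would slot in the same place.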

Thank you, Dustin! I am working on it…

Dustin,

I am using gensim.models.doc2vec to compress the features, but something seems to be missing in inference = ed.KLqp({f: qf}, data={…, y: y_ph}). Without an additional parameter, the loss values during training are too big. Please take a look at this code…

import edward as ed
import tensorflow as tf
from edward.models import Bernoulli, MultivariateNormalTriL, Normal
from edward.util import rbf

N = X_train.shape[0]  # number of data points (X_train: doc2vec features)
M = 256               # batch size
D = X_train.shape[1]  # number of features

data = generator([X_train, y_train], M)  # helper that yields minibatches of size M

X = tf.placeholder(tf.float32, [M, D])
y_ph = tf.placeholder(tf.int32, [M])

# GP prior over the latent function values for the current minibatch.
f = MultivariateNormalTriL(loc=tf.zeros(M), scale_tril=tf.cholesky(rbf(X)))
y = Bernoulli(logits=f)

# Mean-field variational approximation to f.
qf = Normal(loc=tf.Variable(tf.random_normal([M])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([M]))))

n_batch = int(N / M)
n_epoch = 5

inference = ed.KLqp({f: qf}, data={…, y: y_ph})
inference.initialize(n_iter=n_batch * n_epoch, n_samples=5, scale={y: N / M})
tf.global_variables_initializer().run()

for _ in range(inference.n_iter):
    X_batch, y_batch = next(data)
    info_dict = inference.update({X: X_batch, y_ph: y_batch})
    inference.print_progress(info_dict)

Thanks!

You’re trying to do stochastic variational inference with a Gaussian process. Each of the function outputs (a parameter in qf) is associated with a different data point. However, all data points depend on each other through f's covariance matrix, so minibatch subsampling doesn’t apply here. You either have to train the GP on the full data set or be a little more sophisticated and use inducing variables (see Hensman et al., 2013).
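To make the first option concrete, here is a minimal full-data sketch (my own, not code from this thread), reusing X_train and y_train from the code above; the jitter term and n_iter are illustrative assumptions.

import edward as ed
import tensorflow as tf
from edward.models import Bernoulli, MultivariateNormalTriL, Normal
from edward.util import rbf

N = X_train.shape[0]  # number of data points
D = X_train.shape[1]  # number of features

X = tf.placeholder(tf.float32, [N, D])

# GP prior over the latent function values for the *entire* data set,
# so the full covariance structure between data points is respected.
f = MultivariateNormalTriL(loc=tf.zeros(N),
                           scale_tril=tf.cholesky(rbf(X) + 1e-6 * tf.eye(N)))
y = Bernoulli(logits=f)

qf = Normal(loc=tf.Variable(tf.random_normal([N])),
            scale=tf.nn.softplus(tf.Variable(tf.random_normal([N]))))

# No minibatch scaling: every latent f_n is tied to its own observation.
inference = ed.KLqp({f: qf}, data={X: X_train, y: y_train})
inference.run(n_iter=500, n_samples=5)

This respects the full covariance, but the Cholesky factorization costs O(N^3), which is exactly why the inducing-variable construction of Hensman et al. (2013) exists for larger data sets.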