In my very naive attempt, I just copied the logistic example, and tried re-formatting it a bit. The not so successful attempt can be seen at https://gist.github.com/demodw/a5acafb4025d53f4d3f80f559dcce775. Naively, I tried to just create a vector, theta, which has the linear sum of coefficients (except for the reference group, which I set to index 0). However, this gives me the error,

Multinomial is a class. Similar to SciPy, I recommend calling multinomial which is an instantiated object of that class, e.g., multinomial.logpmf(). (It doesn’t actually matter in this example though.) It also requires specifying the number of trials. So you should do multinomial.logpmf(y, n=1, p=...).

You use D=5 to generate the data but set D=1 for the rest of the code.

You use tf.nn.softmax to convert from a real-vector to a probability vector. tf.nn.softmax requires a matrix as input, where the first dimension represents the batch size. For your example, you calculate the probability vectors one data point at a time. Therefore you should set theta = tf.expand_dims(theta, 0).

Alternatively, you can perform the vectorized operation of theta = tf.matmul(x, w) + b, and then call

For your variational distribution, you specify distributions over C-1 dimensions rather than C dimensions. That makes sense because it’s a simplex, so you only need C-1 dimensions to get the last one. However, that doesn’t play well with tf.nn.softmax which assumes the input is not only a matrix of [batch_size, probabilities], but that each row sums to 1. Therefore you have to input the final column representing 1 - tf.reduce_sum(...) before passing it into tf.nn.softmax.

Alternatively, you can use the ed.to_simplex() utility function. It will convert a real-valued matrix of [N, K-1] to a matrix of probabilities [N, K], where each row sums to 1.

Here’s what the final script looks like:

class HierarchicalSoftmax:
...
def log_prob(self, xs, zs):
...
theta = tf.matmul(x, w) + b
log_lik = tf.reduce_sum(multinomial.logpmf(y, n=1, p=ed.to_simplex(theta)))
return log_lik + log_prior
...
C = 3
N = 40 # num data points
D = 5 # num features
...
qw_mu = tf.Variable(tf.random_normal([D, C-1]))
qw_sigma = tf.nn.softplus(tf.Variable(tf.random_normal([D, C-1])))
qb_mu = tf.Variable(tf.random_normal([C-1]))
qb_sigma = tf.nn.softplus(tf.Variable(tf.random_normal([])))
...
inference.run(n_print=5)

I see. That actually works! My only issue now is that I still have to construct theta data-point by data-point. In my data I have about 200.000 groups (what I call D, combinations, in the Stan model), 1,3 million data points, and a few million parameters, so I do not think it is feasible to use a design matrix. Any tips?