HMM Implementation with Marginalized Latent States


#1

Hi everyone. I’ve been trying to build an implementation of a standard Hidden Markov Model in Edward, essentially trying to replicate this example in Stan: Stan HMM Semi-Sup, and inspired by this post here: Gelman’s Blog

The idea is that you simply marginalize out all the latent states using the forward algorithm, which is a dynamic programming algo that yields the likelihood of the data, given the model parameters. However, I’m quite new to Edward and I’m struggling to understand how exactly to structure the inference part of the model.

Consider a very simple HMM with 2 Gaussian emission states with unknown means and variances, and a known & fixed transmission matrix. You have observations of data, and you want to estimate the distributions of the unknown parameters, given your priors, and the data.

I’ve written a custom Edward distribution that that takes the parameters of the emission distributions, computes the emission lattice (the matrix that contains the log-likelihood of each observation, given your emission distributions), and then performs the forward pass to yield an overall HMM log-likelihood. But I’m stuck on how to wrap it all together now!

I’ve verified that individual parts of the code do the right thing by their inclusion in other tools, but there’s still probably bugs here, and I’m a bit hazy on the Edward syntax so if you see something strange, it’s probably because I’ve just copied it from an example…

Code:

from edward.models import RandomVariable
from tensorflow.contrib.distributions import Distribution
from tensorflow.contrib.distributions import FULLY_REPARAMETERIZED

class HMM(RandomVariable, Distribution):
    def __init__(self, transmat, init_p, states, name='', *args, **kwargs):
        self.transmat = transmat
        self.init_p = init_p
        self.states = states

        parameters = locals()
        parameters.pop("self")
        super(HMM, self).__init__(dtype=tf.float32,
                            reparameterization_type=FULLY_REPARAMETERIZED,
                            validate_args=False,
                            allow_nan_stats=False,
                            parameters=parameters,
                            name=name,*args, **kwargs)
    
    def _log_prob(self, value):
    #first we need to compute the (len(obs), len(states)) matrix that holds the
    #likelihood of each observation coming from each state
        B_list = []
        for j in range(0, self.transmat.shape[1]):
            state_probs = self.states[j].log_prob(value)
            B_list.append(state_probs)
        B = tf.squeeze(tf.stack(B_list,axis=1))

    #now we compute the forward pass to yield the log-likelihood of the data given the model. This is done in 
    #log-space to avoid overflow/underflow
      alpha_0 = self.init_p + B[0]
      logprob = tf.scan(lambda alpha, b: tf.reduce_logsumexp(tf.transpose(alpha) + self.transmat, axis=0, keepdims=True) + b,
            initializer=alpha_0, 
            elems=B[1:, :],
            back_prop=True,
            infer_shape=False)
           
   return tf.reduce_logsumexp(logprob[-1])

def _sample_n(self, n, seed=None):
     raise NotImplementedError("_sample_n is not implemented")

Now the part I’m very confused about:

n_emissions = 2
tf.reset_default_graph()
A = tf.Variable(np.log(A_), name="transmat",dtype=tf.float32, trainable=False)
B = tf.Variable(tf.zeros((N, A_.shape[1]), dtype=tf.float32), name="fwd_obs_mat",dtype=tf.float32, trainable=False)
X = tf.placeholder(shape=O.shape, name="data", dtype=tf.float32)
p_init = tf.Variable(np.log(initial_probabilities), name='p_init', dtype=tf.float32, trainable=False)


states = []
s1 = Normal(loc=tf.zeros(1), scale=tf.ones(1))
s2 = Normal(loc=tf.zeros(1), scale=tf.ones(1))
states.append(s1)
states.append(s2)

qw = Normal(loc=tf.get_variable("qw/loc", [1]),
            scale=tf.nn.softplus(tf.get_variable("qw/scale", [1])))
qb = Normal(loc=tf.get_variable("qb/loc", [1]),
            scale=tf.nn.softplus(tf.get_variable("qb/scale", [1])))
                    
loglikelihood = HMM(A_, initial_probabilities,states, value=tf.zeros_like([1], dtype=tf.float32))
                    


with tf.Session() as sess:
    init = tf.global_variables_initializer()
    inference = #what goes here?
    inference.run(n_samples=5, n_iter=250)

How do I complete this example? In Pymc3 I can wrap the function that returns the HMM log-likelhood in a pm.DensityDist or pm.Potential distribution, and then basically treat it like as normal, but I’m very confused as to how to proceed here. For what it’s worth, I’ve managed to create a fully functional version of this code in pymc3/theano, so I’m more or less sure it’s on the right track! Any feedback, including ‘nothing you’re doing makes any sense’, is welcome here.

Cheers!


#2

Ok, I’ve got things mostly working (at least for this very simple example). There were a number of things wrong with the above example:

PART A:

  • The custom RV was incorrectly constructed - after digging around for examples here in discourse and in github, I was able to create a functional example like this:

    class distribution_HMM(Distribution):
      def __init__(self, transmat, init_p, states, name='', *args, **kwargs):
          self.transmat = transmat
          self.init_p = init_p
          self.states = states
          parameters = locals()
          parameters.pop("self")
    
          super(distribution_HMM, self).__init__(dtype=tf.float32,
                                    reparameterization_type=FULLY_REPARAMETERIZED,
                                    validate_args=False,
                                    allow_nan_stats=False,
                                    parameters=parameters,
                                    name=name, *args, **kwargs)
    
      def normal_log_prob(self, X, mu, scale=1.):
          return -.5 * tf.log(2. * np.pi) - tf.log(scale) - (1. / (2. * scale ** 2)) * (X - mu) ** 2
    
      def _log_prob(self, value):
          # first we need to compute the (len(obs), len(states)) matrix that holds the
          # likelihood of each observation coming from each state
          B_list = []
          for j in range(0, self.transmat.shape[1]):
              state_probs = self.normal_log_prob(mu=self.states[j], X=value)
              B_list.append(state_probs)
          B = tf.squeeze(tf.stack(B_list, axis=1))
    
          alpha_0 = self.init_p + B[0]
          logprob = tf.scan(
              lambda alpha, b: tf.reduce_logsumexp(tf.transpose(alpha) + self.transmat, axis=0, keepdims=True) + b,
              initializer=alpha_0,
              elems=B[1:, :],
              back_prop=True,
              parallel_iterations=12,
              infer_shape=False)
          return tf.reduce_logsumexp(logprob[-1])
    
      def _sample_n(self, n, seed=None):
          raise NotImplementedError("_sample_n is not implemented")
    
    class HMM(RandomVariable, distribution_HMM):
      def __init__(self, *args, **kwargs):
          RandomVariable.__init__(self, *args, **kwargs)
    

Couple key points here:

  • Distribution object (distribution_HMM) doesn’t directly subclass RandomVariable like in the API documentation, it seems like it needs to be done separately. Otherwise it leads to really confusing issues where the init() call isn’t correctly passing your parameters through properly.
  • Fixes needed to be made to the math here: the computation of the observation/emission matrix is done by passing the values of the means drawn from the Normal variables (via states[]), then computing the log_probability manually - we’re not actually passing objects around like I originally thought
  • I’m being very lazy for the sake of the example and only worrying about the emission means (just fixing variances), but this is just for the convenience of the example.

PART B:

Just had to think about this one a bit more and read the docs again.

tf.reset_default_graph()
A = tf.placeholder(name="transmat",shape=A_.shape, dtype=tf.float32,)
init_p = tf.placeholder(name='p_in drait',shape=initial_probabilities.shape, dtype=tf.float32)

states = []
s1 = Normal(loc=tf.zeros(1), scale=tf.ones(1))
s2 = Normal(loc=tf.zeros(1), scale=tf.ones(1))
states.append(s1)
states.append(s2)

qw = Normal(loc=tf.get_variable("qw/loc", [1]),
            scale=tf.nn.softplus(tf.get_variable("qw/scale", [1])))
qb = Normal(loc=tf.get_variable("qb/loc", [1]),
            scale=tf.nn.softplus(tf.get_variable("qb/scale", [1])))

loglikelihood = HMM(init_p=init_p,
                    states=states,
                    transmat=A,
                    value=tf.zeros_like(np.zeros((N,1)), dtype=tf.float32))


inference = ed.KLqp({s1: qw, s2: qb}, data={loglikelihood: X,
                                            init_p:np.log(initial_probabilities),
                                            A:np.log(A_)})
inference.run(n_samples=10, n_iter=250)

print('Transmission matrix:', A_)

print('Inferred State 1 Emission Mean:', np.mean(qb.sample(1000).eval()),
      'Inferred State 2 Emission Mean:', np.mean(qw.sample(1000).eval()))

print('Actual State 1 Emission Mean:', m.means_[0],
      'Actual State 2 Emission Mean:', m.means_[1])
print(m.means_)

Result:

We can set up a test problem by starting with a randomly sampled transmission matrix A and uniform initial probabilities.

Then, we build our (true) emission distributions by randomly initialising the Gaussian emissions means from a N(0,10) distribution, and setting variance to 1.0. (So we have 2 emissions, each which is N(loc=np.random.normal(0,10), scale=1), then simulating 1000 samples from the HMM.

The solution successfully runs and produces the following:

 1/250 [  0%]                                ETA: 539s | Loss: 4743.423
  2/250 [  0%]                                ETA: 279s | Loss: 4447.661
  4/250 [  1%]                                ETA: 149s | Loss: 3827.863
  6/250 [  2%]                                ETA: 105s | Loss: 3882.442
  8/250 [  3%]                                ETA: 83s | Loss: 3586.098 
 10/250 [  4%] █                              ETA: 70s | Loss: 3279.219
 12/250 [  4%] █                              ETA: 61s | Loss: 2867.853
 14/250 [  5%] █                              ETA: 55s | Loss: 2830.235
 16/250 [  6%] █                              ETA: 50s | Loss: 2168.889
###################################### More stuff '###############################
246/250 [ 98%] █████████████████████████████  ETA: 0s | Loss: 215.862
248/250 [ 99%] █████████████████████████████  ETA: 0s | Loss: 214.466
250/250 [100%] ██████████████████████████████ Elapsed: 24s | Loss: 213.734

Transmission matrix: [[0.57159576 0.42840424]
 [0.4666764  0.5333236 ]]
Inferred State 1 Emission Mean: -10.578932 Inferred State 2 Emission Mean: 8.5765
Actual State 1 Emission Mean: [-10.69359473] Actual State 2 Emission Mean: [8.54639363]

Not bad! Tricky bit is getting over the Edward/TF syntax, really. I’ve extended this whole approach quite substantially in Pymc3 and found that minibatch ADVI (or better yet, NFVI) works pretty well even when you expand the problem dramatically, such as:

  • Having lots more states with more parameters
  • Not fixing the transmat, but putting priors over bits of it
  • Transforming the variables/putting bounds on them

However, there are some issues remaining. The biggest issue at the moment is that the performance in Edward/TF is abysmal compared to Pymc3/Theano,and the problem seems to come from the TF.scan vs theano.Scan because the tensorflow graph is growing each time the log_p is computed (anybody know how to fix this?). The actual implementation of the forward pass is almost identical (transliterated), but theano is about 2-3 orders of magnitude faster. Hopefully I can get that figured out, because it would be nice to use TF.

The other issue is that if you don’t constrain the problem, the multi-modality of the posterior will result in poor results with ADVI - at least mean-field. This can be somewhat alleviated by using clever normalising flows and priors, but it’s something to watch out for as HMM’s can be highly multimodal.

Thanks for reading!


#3

Cool stuff. Curious if you made more progress on this?

Re: performance, I know at some point theano was much faster for RNNs than TF. Have those issues been fixed now? If so maybe the performance difference might have decreased.

I’ve only used vanilla theano not with pymc3: do they play nice or are data/parameters having to passed to/from the GPU on every ADVI iteration?


#4

Hi David,

I ended up putting aside this toy example as I couldn’t satisfactorily resolve the aforementioned issues. Additionally, this was around the time TensorFlow Probability was announced, and I (correctly) assumed that Edward support would really drop off while the team ramped up on TFP. The next steps, if I revisit things, would be to build this example properly using that toolkit.

Not 100% sure what you mean regarding pymc3 &theano, but they do ‘play nice’ and generally the details (such as passing data to & from the gpu) are abstracted away from the user and taken care of in a optimal fashion. That’s kinda their raison d’etre!