I’m successfully using a Mixture Density Network, based on the MDN tutorial code. However, I’d like to modify the loss function to try to get different behavior from the network. For example, I’d like the model to prefer more Gaussians, with larger weights and smaller standard deviations, rather than a few Gaussians with larger standard deviations.

I think I could accomplish this by adding one or more regularization terms to the loss function. However, the existing maximum likelihood loss appears to be hidden by the MAP inference method.
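A rough sketch of what such a regularized loss could look like, written in plain Python rather than taken from the tutorial code. The penalty on the mean scale and its weight `lam` are assumptions chosen for illustration. Other terms, such as an entropy bonus on the mixture weights, would slot in the same way.

```python
import math

def normal_logpdf(y, loc, scale):
    """Log density of Normal(loc, scale) evaluated at y."""
    z = (y - loc) / scale
    return -0.5 * z * z - math.log(scale) - 0.5 * math.log(2.0 * math.pi)

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mdn_nll(y, logits, locs, scales):
    """Negative log-likelihood of one target y under a mixture of Normals."""
    log_norm = logsumexp(logits)
    log_weights = [l - log_norm for l in logits]  # log-softmax of the logits
    terms = [lw + normal_logpdf(y, m, s)
             for lw, m, s in zip(log_weights, locs, scales)]
    return -logsumexp(terms)

def regularized_loss(y, logits, locs, scales, lam=0.1):
    """Maximum likelihood loss plus a hypothetical penalty on wide components."""
    penalty = sum(scales) / len(scales)  # assumed regularizer: mean scale
    return mdn_nll(y, logits, locs, scales) + lam * penalty
```

Here `logits`, `locs` and `scales` stand for the network outputs for a single input. In a real model the same expression would be built from tensors so that gradients flow through it.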

Once you’ve decided on MAP (with gradient descent) as the inference algorithm, the only thing you can change is the model. In particular, you might try increasing the number of mixture components, constraining the minimum of the standard deviations, or writing the network manually so that you can place priors on its weights to penalize them.

How would I do something like this? The only thing that comes to mind is TensorFlow clipping operators, but I don’t think that would supply hard constraints.
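One common way to bound the standard deviations from below is to pass the raw network output through a softplus and add a fixed floor. This is a minimal sketch in plain Python. The floor value `MIN_SCALE` and the function names are assumptions, and in a TensorFlow model the same transform could be written with `tf.nn.softplus`.

```python
import math

MIN_SCALE = 0.05  # hypothetical floor for the standard deviation

def softplus(x):
    """Numerically stable log(1 + exp(x)); never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def constrained_scale(raw, min_scale=MIN_SCALE):
    """Map an unconstrained network output to a scale of at least min_scale."""
    return min_scale + softplus(raw)
```

Because softplus is smooth, the transform stays differentiable for every raw output, whereas a clipped value has zero gradient wherever the clip is active.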

Ok, can I somehow modify the network to provide a prior for the scales (or weights)? Would I need to use a different inference algorithm if I did this?

You would use the same algorithm. You need to rewrite the model so that it doesn’t rely on high-level wrappers for the neural net layers. Instead, you write the weights explicitly with priors, as in the getting started example.

Thanks for your response. I think I understand the idea of building a manual NN with priors on the weights.

However, I’d just like to put a prior on the outputs of that NN (e.g. the scales or logits). For example, I’d like to say that the scales come from some gamma distribution that has high density for small values.

Is it possible to do this, or would I still need to build a manual NN with priors on the weights?

(I’m finding this meshing of neural networks and probabilistic models very confusing… would it be proper to cast this MDN as a form of VAE, where the NN is the inference network, and the generation network is simply sampling from the mixture?)

In an MDN, all likelihood parameters are output by a neural network. What you’re describing is a neural network outputting, say, the location parameter but not the scale parameter of a normal distribution.

y = Normal(loc=neural_network(X), scale=scale)

You can specify scale with a LogNormal or Gamma prior.
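To make the effect of such a prior concrete, here is a minimal sketch of the objective that MAP would minimize for this model, written in plain Python rather than with the library's random variables. The Gamma parameters (`concentration=1.0`, `rate=10.0`) are assumptions picked so that the prior density is highest at small scales, and `locs` stands in for the values of `neural_network(X)`.

```python
import math

def normal_logpdf(y, loc, scale):
    """Log density of Normal(loc, scale) evaluated at y."""
    z = (y - loc) / scale
    return -0.5 * z * z - math.log(scale) - 0.5 * math.log(2.0 * math.pi)

def gamma_logpdf(x, concentration, rate):
    """Log density of Gamma(concentration, rate) evaluated at x > 0."""
    return (concentration * math.log(rate) - math.lgamma(concentration)
            + (concentration - 1.0) * math.log(x) - rate * x)

def neg_log_joint(ys, locs, scale, concentration=1.0, rate=10.0):
    """Negative log joint: Normal likelihood plus a Gamma prior on the scale."""
    log_lik = sum(normal_logpdf(y, m, scale) for y, m in zip(ys, locs))
    log_prior = gamma_logpdf(scale, concentration, rate)
    return -(log_lik + log_prior)
```

The prior contributes a single extra term to the loss, so a larger `rate` pulls the MAP estimate of the scale more strongly toward zero.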

Ok, thanks for the clarification. What you’re saying makes perfect sense!

Just thinking about the MDN as a VAE… could I simply replace the VAE’s single Normal distribution over z with a mixture of Normal distributions (parameterised like an MDN)? Would it just work?