product of experts: Nonlinear Function
Created: May 15, 2021
Modified: May 15, 2021


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
  • Introduced by Geoff Hinton (1999): Products of Experts.
  • Each expert produces a probability distribution. These are combined by multiplying the probabilities and renormalizing.
  • We might imagine that each expert provides a set of constraints, and the product must satisfy all of them. For example, one expert might require that the broad outline of a handwritten digit looks like a '1', while another might require that the local details of the pen strokes are plausible.
  • In general it is hard to normalize the product distribution, but by differentiating the log-normalizer, and assuming that each expert has its own parameters $\theta_m$, we get Hinton's expression for the gradient of the normalized log-density:
$$\frac{\partial \log p(d | \theta)}{\partial \theta_m} = \frac{\partial \log p_m(d | \theta_m)}{\partial \theta_m} - E_{c \sim p}\left[ \frac{\partial \log p_m(c | \theta_m)}{\partial \theta_m} \right]$$

which says that the gradient of the log-normalizer is the expected gradient of the unnormalized term under the model distribution. That expectation is intractable in general, but it can be approximated with samples from a few steps of Gibbs sampling started at the data; that is, we can train a product of experts jointly using contrastive divergence.
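To make the identity concrete, here is a minimal NumPy sketch (my own illustration, not from Hinton's paper; the names `softmax`, `poe_prob`, etc. are mine): two softmax experts over a tiny discrete domain, where the product can be normalized exactly, so the two-term gradient above can be checked against finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5                                              # size of the discrete domain
theta = [rng.normal(size=K), rng.normal(size=K)]   # one parameter vector per expert

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def poe_prob(theta):
    """Product of experts: multiply the experts' probabilities pointwise, then renormalize."""
    p = np.ones(K)
    for theta_m in theta:
        p *= softmax(theta_m)
    return p / p.sum()

d, m = 2, 0                      # an observed state d and the expert m we differentiate
p = poe_prob(theta)
p_m = softmax(theta[m])

# First term: for a softmax expert, d/dtheta_m log p_m(d | theta_m) = one_hot(d) - p_m.
pos = np.eye(K)[d] - p_m
# Second term: the same gradient averaged over c drawn from the *model* distribution p.
neg = sum(p[c] * (np.eye(K)[c] - p_m) for c in range(K))
analytic = pos - neg

# Finite-difference check of d log p(d | theta) / d theta_m.
eps = 1e-6
numeric = np.zeros(K)
for k in range(K):
    hi_ = [t.copy() for t in theta]; hi_[m][k] += eps
    lo_ = [t.copy() for t in theta]; lo_[m][k] -= eps
    numeric[k] = (np.log(poe_prob(hi_)[d]) - np.log(poe_prob(lo_)[d])) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))   # True
```

On a realistic domain the second term cannot be enumerated like this; that is exactly the part contrastive divergence approximates.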

  • One can also train the experts separately, each with a different initialization or architecture, and then use contrastive divergence to fine-tune the ensemble.
  • As Hinton points out, even if the second term is just random noise, each expert will still try to model the data on its own; the hard-to-estimate second term simply forces them to work together. (I think this requires the individual experts to be normalized, though, in tension with the next point.) A toy sketch of this sampled update appears after the list.
  • It is not important that the individual experts be normalized: their normalization constants can be pulled out of the product and pushed into the overall normalizing constant.
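Below is a toy sketch of that sampled update (again my own illustration, reusing the softmax experts from the sketch above). Because the domain is tiny, the negative-phase sample is drawn exactly from the model; real contrastive divergence would instead approximate it with a brief Gibbs chain started from the data point.

```python
import numpy as np

rng = np.random.default_rng(1)
K, M, lr = 5, 2, 0.1
theta = [rng.normal(size=K) for _ in range(M)]   # one softmax expert per parameter vector

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def poe_prob(theta):
    p = np.ones(K)
    for t in theta:
        p *= softmax(t)
    return p / p.sum()

# Toy "data": most of the mass sits on states 0 and 1.
data = rng.choice(K, size=2000, p=[0.45, 0.45, 0.04, 0.03, 0.03])

for step in range(2000):
    d = data[rng.integers(len(data))]        # positive phase: one data point
    c = rng.choice(K, p=poe_prob(theta))     # negative phase: one exact sample from the model
                                             # (CD would use a short Gibbs chain from d instead)
    for m in range(M):
        p_m = softmax(theta[m])
        pos = np.eye(K)[d] - p_m             # grad of log p_m(d | theta_m)
        neg = np.eye(K)[c] - p_m             # noisy one-sample estimate of the second term
        theta[m] += lr * (pos - neg)

print(np.round(poe_prob(theta), 2))          # most of the mass ends up on states 0 and 1
```

If the experts were parameterized without normalizing, the gradient of each expert's own log-normalizer would appear identically in `pos` and `neg` and cancel, which is one way to see the last bullet above: per-expert normalizing constants never need to be computed.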