contrastive divergence
Created: June 08, 2020
Modified: May 15, 2021

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

A method for fitting an unnormalized probability density (aka an energy-based model) to data. Note that this is a different and harder problem than standard MLE/MAP estimation, where the likelihood is unnormalized with respect to the parameters; here it's unnormalized with respect to the data.
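
To fix notation for what follows (the standard energy-based-model setup; the symbols $\tilde p$ and $Z$ are my labels): we can evaluate an unnormalized density $\tilde p$ pointwise, but not its normalizer,

$$p(x; w) = \frac{\tilde p(x; w)}{Z(w)}, \qquad Z(w) = \int \tilde p(x; w)\, dx.$$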

Using the gradient-of-the-log-normalizer trick, being able to sample from a distribution is sufficient to estimate parameter gradients of its normalized log density.
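
Spelled out (in the notation above), this is just differentiation under the integral sign, together with $\nabla_w \tilde p = \tilde p \, \nabla_w \log \tilde p$:

$$\nabla_w \log Z(w) = \frac{1}{Z(w)} \int \nabla_w \tilde p(x; w)\, dx = E_{x \sim p_w}\!\left[\nabla_w \log \tilde p(x; w)\right],$$

so a Monte Carlo average of $\nabla_w \log \tilde p$ over model samples is an unbiased estimate of the log normalizer's gradient.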

So we can use MCMC or other methods to draw samples from the unnormalized density. Naively, this would suck, because you'd be running an MCMC chain to convergence at every optimization step just to get a single stochastic gradient.

But maybe you don't need your MCMC to converge; you just need something. So: initialize the chain at the current data point, and then run just one or two MCMC steps. Hinton argues that this is actually better, because taking too many MCMC steps introduces variance that swamps the gradient estimate.
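
As a concrete sketch (a toy example of my own, not from Hinton's paper): fit the precision parameter of a one-dimensional Gaussian energy $\tilde p(x; w) = \exp(-w x^2)$ with CD-$k$, using a random-walk Metropolis kernel as the MCMC step. The names (`mh_step`, `cd_k_gradient`) and all tuning constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unnormalized model: ptilde(x; w) = exp(-w * x**2),
# i.e., a zero-mean Gaussian with variance 1 / (2 * w).

def grad_log_ptilde(x, w):
    # d/dw log ptilde(x; w) = -x**2
    return -(x ** 2)

def mh_step(x, w, step_size=0.5):
    """One random-walk Metropolis step targeting p(x; w), applied elementwise."""
    proposal = x + step_size * rng.standard_normal(x.shape)
    log_accept_ratio = -w * proposal ** 2 + w * x ** 2
    accept = np.log(rng.uniform(size=x.shape)) < log_accept_ratio
    return np.where(accept, proposal, x)

def cd_k_gradient(data, w, k=1):
    """CD-k gradient estimate: positive phase (data) minus negative phase
    (samples obtained by running k MCMC steps initialized at the data)."""
    x = data
    for _ in range(k):
        x = mh_step(x, w)
    return grad_log_ptilde(data, w).mean() - grad_log_ptilde(x, w).mean()

# Gradient ascent with CD-1; the data are drawn from the model with w = 0.5.
data = rng.normal(0.0, 1.0, size=1000)
w = 2.0
for _ in range(2000):
    w += 0.05 * cd_k_gradient(data, w, k=1)
print(w)  # drifts toward roughly 0.5
```

Note that the negative-phase chain is re-initialized at the data on every update; that's exactly what keeps the per-step cost low (and the estimate biased).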

Q: Why do we need any MCMC steps at all? Why can't we just use the empirical data distribution as a proxy for the model's distribution?

Because we're already doing that in the first term. Writing out the unnormalized term of the (average) full-data log-likelihood

$$\log p(Y, X, w) = \frac{1}{N} \sum_{j=1}^N \sum_i w_i^T f_i(x^{(j)}, y^{(j)})$$

we see that this is an expectation over the data distribution, and its gradient is just $E_{x, y \sim D}\left[\sum_i f_i(x, y)\right]$. Approximating the $Z$ term the same way, with the empirical distribution standing in for the model distribution, would yield the same expression, so the overall gradient would always be exactly zero. Running one or two MCMC steps gives you samples slightly closer to the model's distribution, so the estimate should tend to point noisily in the direction of the true gradient, although you'd need infinitely many steps to remove the bias.
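
Making the cancellation explicit: the full gradient is a positive-phase-minus-negative-phase difference,

$$\nabla_w \left[ \frac{1}{N} \sum_{j=1}^N \log p(x^{(j)}, y^{(j)}, w) \right] = E_{x, y \sim D}\!\left[\sum_i f_i(x, y)\right] - E_{x, y \sim p_w}\!\left[\sum_i f_i(x, y)\right],$$

and substituting $D$ for $p_w$ in the second expectation makes the two terms identical.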

What happens if you run CD on an energy that doesn't normalize? It'll still do something…