Modified: July 09, 2022
gradient of the log normalizer
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

For a normalized distribution $p_\theta(x) = \frac{e^{f_\theta(x)}}{Z(\theta)}$, constructed from an (unnormalized) energy $f_\theta(x)$ with normalizing constant $Z(\theta) = \int e^{f_\theta(x)}\,dx$ as a function of parameters $\theta$, in general we have

$$\nabla_\theta \log Z(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta f_\theta(x)\right].$$
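This follows from differentiating under the integral sign; spelled out in the notation above:

$$\nabla_\theta \log Z(\theta) = \frac{\nabla_\theta \int e^{f_\theta(x)}\,dx}{Z(\theta)} = \int \frac{e^{f_\theta(x)}}{Z(\theta)}\,\nabla_\theta f_\theta(x)\,dx = \mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta f_\theta(x)\right].$$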
That is, the gradient of the log-normalizer is just the expected gradient of the energy / unnormalized log-density. If the energy were itself a normalized log-density, this would be the expected score function; but of course such a distribution has a log-normalizer of zero, so this is one way to demonstrate that the expectation of the score function is zero. Conversely, this is a special case of using the score function to estimate the gradient of an expectation, where we view the normalizer as the trivial 'expectation' $\mathbb{E}_{x \sim p_\theta}[1] = 1$.
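A minimal numerical sketch of the identity (my own toy example, not canonical): assume the Gaussian-shaped energy $f_\theta(x) = \theta x - x^2/2$, whose log-normalizer $\log Z(\theta) = \theta^2/2 + \tfrac{1}{2}\log 2\pi$ is known in closed form, so the Monte Carlo estimate can be checked against autodiff of the exact $\log Z$:

```python
import jax
import jax.numpy as jnp

def energy(theta, x):
    # unnormalized log-density ("energy") of a unit-variance Gaussian with mean theta
    return theta * x - x**2 / 2

def log_Z(theta):
    # closed-form log-normalizer: log of int exp(theta*x - x^2/2) dx
    return theta**2 / 2 + 0.5 * jnp.log(2 * jnp.pi)

theta = 1.3
# exact samples from p_theta = N(theta, 1)
xs = theta + jax.random.normal(jax.random.PRNGKey(0), (100_000,))

# expected gradient of the energy wrt theta, estimated by Monte Carlo
grad_f = jax.vmap(jax.grad(energy), in_axes=(None, 0))(theta, xs)
print(grad_f.mean())           # ~ theta, up to Monte Carlo error
print(jax.grad(log_Z)(theta))  # exact: grad log Z(theta) = theta
```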
As a consequence, we can estimate gradients of the log-normalizer without ever directly computing it; we just need the ability to sample from the distribution and to evaluate the gradient of the energy (the unnormalized log-density).
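Sketching what that estimator looks like, continuing the example above (the function and variable names here are mine):

```python
import jax

def grad_log_Z_estimate(energy_fn, theta, samples):
    # average the parameter-gradient of the energy over (approximate) samples
    # from p_theta; this estimates grad log Z(theta) without ever computing Z
    grads = jax.vmap(jax.grad(energy_fn), in_axes=(None, 0))(theta, samples)
    return jax.tree_util.tree_map(lambda g: g.mean(axis=0), grads)

# with the Gaussian example above: again approx. theta
# print(grad_log_Z_estimate(energy, theta, xs))
```

Since `jax.grad` returns a pytree matching `theta`, the same estimator applies unchanged when the parameters are a whole network's weights, and the samples may come from an exact sampler or from MCMC.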
For exponential family distributions in particular, the unnormalized log-density is just $f_\eta(x) = \langle \eta, t(x) \rangle$, and its gradient wrt $\eta$ is just the sufficient statistic vector $t(x)$. So this result recovers the fact that the gradient of the log-normalizer at natural parameter $\eta$ is the expected sufficient statistic, or mean parameter, $\mu(\eta) = \mathbb{E}_{x \sim p_\eta}\left[t(x)\right]$. This gradient map $\eta \mapsto \mu(\eta)$, which is invertible by the strict convexity of $\log Z(\eta)$ in minimal families, functions as the 'mirror map' in mirror descent.
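A tiny worked instance, assuming a Bernoulli family (my choice of example) with $t(x) = x$ and $\log Z(\eta) = \log(1 + e^\eta)$: autodiffing the log-normalizer yields the mean parameter $\mu = \sigma(\eta)$, and the logit map inverts it back to $\eta$:

```python
import jax
import jax.numpy as jnp

def log_Z(eta):
    # Bernoulli log-normalizer: log(1 + exp(eta))
    return jnp.log1p(jnp.exp(eta))

eta = 0.7
mu = jax.grad(log_Z)(eta)       # mean parameter, equals sigmoid(eta)
print(mu, jax.nn.sigmoid(eta))  # these agree
print(jnp.log(mu / (1 - mu)))   # the inverse (logit) map recovers eta = 0.7
```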