gradient of the log normalizer: Nonlinear Function
Created: May 15, 2021
Modified: July 09, 2022

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

For a normalized distribution $p(x) = \frac{f(x; \lambda)}{Z(\lambda)}$, constructed from an unnormalized density $f(\cdot; \lambda)$ with normalizing constant $Z(\lambda) = \sum_y f(y; \lambda)$ as a function of parameters $\lambda$, in general we have

$$\begin{align*}
\nabla_\lambda \log Z(\lambda) &= \nabla_\lambda \log \sum_y f(y; \lambda)\\
&= \frac{1}{Z(\lambda)} \sum_y \nabla_\lambda f(y; \lambda)\\
&= \frac{1}{Z(\lambda)} \sum_y \frac{f(y; \lambda)}{f(y; \lambda)} \nabla_\lambda f(y; \lambda)\\
&= \sum_y \left(\frac{f(y; \lambda)}{Z(\lambda)}\right) \nabla_\lambda \log f(y; \lambda)\\
&= E_{y \sim p} \left[ \nabla_\lambda \log f(y; \lambda) \right]
\end{align*}$$

That is, the gradient of the log-normalizer is just the expected gradient of the energy / unnormalized log-density. For a normalized distribution this would be the expected score function, but of course such a distribution has a log-normalizer of zero, so this is one way to demonstrate that the expectation of the score function is zero. Conversely, this is a special case of using the score function to estimate the gradient of an expectation, where we view the normalizer as the trivial 'expectation' $\int_y f(y)\, dy$.
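As a quick sanity check, take $y \in \{0, 1\}$ with $f(y; \lambda) = e^{\lambda y}$, so that $Z(\lambda) = 1 + e^\lambda$ and $\nabla_\lambda \log f(y; \lambda) = y$. Then

$$\nabla_\lambda \log Z(\lambda) = \frac{e^\lambda}{1 + e^\lambda} = p(y = 1) = E_{y \sim p}[y] = E_{y \sim p}\left[\nabla_\lambda \log f(y; \lambda)\right],$$

exactly as the identity promises.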

As a consequence, we can estimate gradients of the log-normalizer without ever directly computing it; we just need the ability to sample from the distribution and to evaluate the gradient of the unnormalized log-density.
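Here's a minimal sketch of this estimator in JAX, on a toy discrete model where the exact gradient is also available for comparison (the model and all names are just my own illustration):

```python
import jax
import jax.numpy as jnp

def log_f(y, lam):
    # Unnormalized log-density of a toy model on the support {0, 1, 2}.
    return lam[0] * y + lam[1] * y ** 2

ys = jnp.arange(3.0)          # support {0., 1., 2.}
lam = jnp.array([0.3, -0.4])

# Exact gradient of the log-normalizer, for comparison (tractable here
# because the support is tiny).
def log_Z(lam):
    return jax.scipy.special.logsumexp(jax.vmap(log_f, (0, None))(ys, lam))

exact_grad = jax.grad(log_Z)(lam)

# Monte Carlo estimate: sample y ~ p, then average grad_lam log f(y; lam).
key = jax.random.PRNGKey(0)
logits = jax.vmap(log_f, (0, None))(ys, lam)
idx = jax.random.categorical(key, logits, shape=(10_000,))
grads = jax.vmap(jax.grad(log_f, argnums=1), (0, None))(ys[idx], lam)
mc_grad = grads.mean(axis=0)

print(exact_grad)  # exact gradient of log Z
print(mc_grad)     # should match up to Monte Carlo error
```

Note that the Monte Carlo side never touches $Z(\lambda)$: it only samples from $p$ and differentiates $\log f$.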

For exponential family distributions in particular, the unnormalized log-density is just $\lambda^T t(x)$, and its gradient wrt $\lambda$ is just the sufficient statistic vector $t(x)$. So this result recovers the fact that the gradient of the log-normalizer at natural parameter $\lambda$ is the expected sufficient statistic, or mean parameter. This gradient map, which is invertible by the strict convexity of $\log Z$ in minimal families, functions as the 'mirror map' in mirror descent.
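To see this numerically, a quick sketch with a categorical distribution, where $t(x)$ is the one-hot indicator of $x$ (again, an illustrative example of my own choosing):

```python
import jax
import jax.numpy as jnp

# Categorical in natural parameterization: t(x) is the one-hot vector for x,
# so lam^T t(x) = lam[x] and the log-normalizer is log Z(lam) = logsumexp(lam).
lam = jnp.array([0.5, -1.0, 2.0])

grad_log_Z = jax.grad(jax.scipy.special.logsumexp)(lam)
mean_param = jax.nn.softmax(lam)  # E[t(x)]: the vector of probabilities

print(jnp.allclose(grad_log_Z, mean_param, atol=1e-6))  # True
```

Differentiating `logsumexp` (the categorical log-normalizer) recovers the softmax probabilities, i.e., the mean parameters.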