gradient of the log normalizer: Nonlinear Function
Created: May 15, 2021
Modified: July 09, 2022

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

For a normalized distribution $p(x) = \frac{f(x; \lambda)}{Z(\lambda)}$, constructed from an unnormalized density $f(\cdot; \lambda)$ with normalizing constant $Z(\lambda) = \sum_y f(y; \lambda)$ as a function of parameters $\lambda$, in general we have

$$\begin{align*}
\nabla_\lambda \log Z(\lambda) &= \nabla_\lambda \log \sum_y f(y; \lambda)\\
&= \frac{1}{Z(\lambda)} \sum_y \nabla_\lambda f(y; \lambda)\\
&= \frac{1}{Z(\lambda)} \sum_y \frac{f(y; \lambda)}{f(y; \lambda)} \nabla_\lambda f(y; \lambda)\\
&= \sum_y \left(\frac{f(y; \lambda)}{Z(\lambda)}\right) \nabla_\lambda \log f(y; \lambda)\\
&= E_{y \sim p} \left[ \nabla_\lambda \log f(y; \lambda) \right]
\end{align*}$$

That is, the gradient of the log-normalizer is just the expected gradient of the energy / unnormalized log-density. For a normalized distribution this would be the expected score function, but of course such a distribution has a log-normalizer of zero, so this is one way to demonstrate that the expectation of the score function is zero. Conversely, this is a special case of using the score function to estimate the gradient of an expectation, where we view the normalizer as the trivial 'expectation' $\int_y f(y)\, dy$.
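As a quick sanity check, take $y \in \{0, 1\}$ with $f(y; \lambda) = e^{\lambda y}$, so that $Z(\lambda) = 1 + e^\lambda$ and $\nabla_\lambda \log f(y; \lambda) = y$. Then

$$\nabla_\lambda \log Z(\lambda) = \frac{e^\lambda}{1 + e^\lambda} = p(y = 1) = E_{y \sim p}[y] = E_{y \sim p}\left[\nabla_\lambda \log f(y; \lambda)\right],$$

exactly as the identity promises.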

As a consequence, we can estimate gradients of the log-normalizer without ever directly computing it; we just need the ability to sample from the distribution and to evaluate the gradient of the unnormalized log-density.
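Here's a minimal sketch of this estimator in JAX, on a toy discrete model where the exact gradient is also available for comparison (the model and all names are just my own illustration):

```python
import jax
import jax.numpy as jnp

def log_f(y, lam):
    # Unnormalized log-density of a toy model on the support {0, 1, 2}.
    return lam[0] * y + lam[1] * y ** 2

ys = jnp.arange(3.0)          # support {0., 1., 2.}
lam = jnp.array([0.3, -0.4])

# Exact gradient of the log-normalizer, for comparison (tractable here
# because the support is tiny).
def log_Z(lam):
    return jax.scipy.special.logsumexp(jax.vmap(log_f, (0, None))(ys, lam))

exact_grad = jax.grad(log_Z)(lam)

# Monte Carlo estimate: sample y ~ p, then average grad_lam log f(y; lam).
key = jax.random.PRNGKey(0)
logits = jax.vmap(log_f, (0, None))(ys, lam)
idx = jax.random.categorical(key, logits, shape=(10_000,))
grads = jax.vmap(jax.grad(log_f, argnums=1), (0, None))(ys[idx], lam)
mc_grad = grads.mean(axis=0)

print(exact_grad)  # exact gradient of log Z
print(mc_grad)     # should match up to Monte Carlo error
```

Note that the Monte Carlo side never touches $Z(\lambda)$: it only samples from $p$ and differentiates $\log f$.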

For exponential family distributions in particular, the unnormalized log-density is just $\lambda^T t(x)$, and its gradient wrt $\lambda$ is just the sufficient statistic vector $t(x)$. So this result recovers the fact that the gradient of the log-normalizer at natural parameter $\lambda$ is the expected sufficient statistic, or mean parameter. This gradient map, which is invertible by the strict convexity of $\log Z$ in minimal families, functions as the 'mirror map' in mirror descent.
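To see this numerically, a quick sketch with a categorical distribution, where $t(x)$ is the one-hot indicator of $x$ (again, an illustrative example of my own choosing):

```python
import jax
import jax.numpy as jnp

# Categorical in natural parameterization: t(x) is the one-hot vector for x,
# so lam^T t(x) = lam[x] and the log-normalizer is log Z(lam) = logsumexp(lam).
lam = jnp.array([0.5, -1.0, 2.0])

grad_log_Z = jax.grad(jax.scipy.special.logsumexp)(lam)
mean_param = jax.nn.softmax(lam)  # E[t(x)]: the vector of probabilities

print(jnp.allclose(grad_log_Z, mean_param, atol=1e-6))  # True
```

Differentiating `logsumexp` (the categorical log-normalizer) recovers the softmax probabilities, i.e., the mean parameters.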