This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
The score function is the gradient of a log-density with respect to its parameters:
s(x;θ)=∇θlogp(x;θ).
It is the direction that we would move the parameters in order to make the current point x more likely. Equivalently, it measures the sensitivity of the likelihood at x to each dimension of the parameter θ. It is used to define Fisher information, to estimate policy gradients, and generally comes up all over the place in statistical analysis.
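As a quick illustration (a sketch I've added, not from any particular source; the toy model and names are mine), the score of a unit-variance Gaussian in its mean parameter can be computed by autodiff and checked against the analytic value (x−μ)/σ²:

```python
# A sketch (not from the original notes): the score of a unit-variance Gaussian
# in its mean parameter, via JAX autodiff, checked against the analytic value.
import jax
import jax.scipy.stats as jstats

def log_p(mu, x):
    return jstats.norm.logpdf(x, loc=mu, scale=1.0)   # log p(x; mu) for N(mu, 1)

score = jax.grad(log_p, argnums=0)   # s(x; mu) = d/d(mu) log p(x; mu)

mu, x = 0.5, 2.0
print(score(mu, x))     # autodiff score: 1.5
print((x - mu) / 1.0)   # analytic score (x - mu) / sigma^2 = 1.5
```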
Note that ∇θlogp(x;θ) = ∇θp(x;θ)/p(x;θ) by basic properties of the derivative. Inside an expectation, the p(x;θ) in the denominator cancels with the p(x;θ) weighting of the expectation; this is the fundamental 'trick' of many score function derivations. For example,
∇θ E_{x∼pθ}[fθ(x)] = E_{x∼pθ}[∇θfθ(x) + fθ(x)∇θlogpθ(x)].
Note that we can replace fθ(x) in the above by any constant-shifted expression fθ(x)−β; this is because the score ∇θlogpθ(x) itself has expectation zero (E_{x∼pθ}[∇θlogpθ(x)] = ∫∇θpθ(x)dx = ∇θ∫pθ(x)dx = 0), so subtracting β∇θlogpθ(x) leaves the expectation unchanged. Here β is known as a 'baseline' or control variate.
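Here's a small Monte Carlo sketch of this (again my own toy example in JAX, not from any paper): estimating ∇θ E_{x∼N(θ,1)}[x²] = 2θ, and checking that an arbitrary baseline β doesn't bias the estimate:

```python
# A toy Monte Carlo check (my example): score-function estimate of
# d/d(theta) E_{x ~ N(theta,1)}[x^2], whose exact value is 2*theta since
# E[x^2] = theta^2 + 1. Any constant baseline beta leaves it unbiased.
import jax
import jax.numpy as jnp

def grad_estimate(theta, key, beta=0.0, n=500_000):
    x = theta + jax.random.normal(key, (n,))   # samples x ~ p_theta = N(theta, 1)
    f = x ** 2                                 # f(x); no direct theta-dependence here
    score = x - theta                          # analytic score of N(theta, 1) in theta
    return jnp.mean((f - beta) * score)        # mean of (f(x) - beta) * d/d(theta) log p_theta(x)

theta, key = 1.5, jax.random.PRNGKey(0)
print(grad_estimate(theta, key, beta=0.0))   # ~ 2 * theta = 3.0
print(grad_estimate(theta, key, beta=7.0))   # also ~ 3.0: the baseline doesn't bias the estimate
```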
Surrogate objectives
If we naively sample x∼pθ and then apply automatic differentiation to the expression inside the expectation fθ(x), we would get an incorrect result that ignores the dependence through the sampling distribution pθ (see limitations of autodiff). The reparameterization trick is one approach to fixing this. A more general approach is to write the objective as
E_{x∼pθ~}[ (pθ(x)/pθ~(x)) fθ(x) ]
where θ~=stop_gradient(θ) is a frozen copy of the parameters. This brings the dependence on θ explicitly into the objective, so that autodiff (even repeated autodiff for higher-order derivatives) does the right thing, while maintaining the objective's value. This trick is described in the DiCE paper, though it predates that paper and has been independently invented multiple times. If desired, we can introduce a baseline β as
E_{x∼pθ~}[ (fθ(x) − β) pθ(x)/pθ~(x) + β ]
so that the objective itself is again unchanged, but autodiff recovers the gradient estimator including β as a control variate (note that the second β term may be omitted if we only care about the surrogate objective's gradients rather than its value directly).
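A minimal JAX sketch of this surrogate (my reading of the trick, applied to the same Gaussian toy example; function and variable names are mine): the weight pθ(x)/pθ~(x) equals 1 in value, so the surrogate evaluates to the original objective, while autodiff of the surrogate produces the score-function gradient estimator:

```python
# A sketch of the surrogate-objective trick in JAX (my reading; toy Gaussian
# example): the weight p_theta(x) / p_theta~(x) is exactly 1 in value, so the
# surrogate evaluates to the original objective, while autodiff of the
# surrogate produces the score-function gradient estimator.
import jax
import jax.numpy as jnp
import jax.scipy.stats as jstats

def surrogate(theta, key, beta=0.0, n=500_000):
    theta_frozen = jax.lax.stop_gradient(theta)        # theta~, a frozen copy
    x = theta_frozen + jax.random.normal(key, (n,))    # x ~ p_theta~; no gradient flows through sampling
    f = x ** 2                                         # f_theta(x); happens not to use theta directly here
    log_w = jstats.norm.logpdf(x, loc=theta) - jstats.norm.logpdf(x, loc=theta_frozen)
    w = jnp.exp(log_w)                                 # = 1 in value, but carries d(log p_theta)/d(theta)
    return jnp.mean((f - beta) * w + beta)             # value: ~E[f(x)]; gradient: score-function estimator

theta, key = 1.5, jax.random.PRNGKey(0)
print(surrogate(theta, key))                                 # ~ E[x^2] = theta^2 + 1 = 3.25
print(jax.grad(surrogate)(theta, key))                       # ~ 2 * theta = 3.0, via plain autodiff
print(jax.grad(surrogate)(theta, key, beta=theta**2 + 1.0))  # same expectation, with the baseline as control variate
```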
Optimal baseline
Let gβ(x) denote the gradient estimator (the quantity inside the expectation) with constant baseline β,
gβ(x)=∇θfθ(x)+(fθ(x)−β)∇θlogpθ(x).
What choice for β minimizes the variance of this estimate? For simplicity, let's assume θ is scalar, so that gradients become simple derivatives. Then we have
Var[gβ(x)] = E_{x∼pθ}[gβ(x)²] − (E_{x∼pθ}[gβ(x)])²,
where we note that the second term is just the square of the derivative being estimated, which we can ignore because it does not depend on β. Expanding out the remaining expression, and sweeping aside the terms that don't involve β,
E[gβ(x)²] = β² E[(∇θlogpθ(x))²] − 2β ( E[fθ(x)(∇θlogpθ(x))²] + E[∇θfθ(x)∇θlogpθ(x)] ) + const,
which is minimized at
β∗ = ( E[fθ(x)(∇θlogpθ(x))²] + E[∇θfθ(x)∇θlogpθ(x)] ) / E[(∇θlogpθ(x))²].
This is a bit unwieldy, but it simplifies dramatically if we allow ourselves to assume that fθ(x) and pθ(x) are independent. Then we can factor the expectations, so that the term E[fθ(x)(∇θlogpθ(x))²] becomes E[fθ(x)]E[(∇θlogpθ(x))²], while the cross term factors into something containing E[∇θlogpθ(x)] = 0 and vanishes, leaving β∗ = E_{x∼pθ}[fθ(x)],
giving the intuitive result that the minimum-variance baseline β∗ is just the value of the expectation whose gradient we're trying to estimate. Note that this derivation relies on the assumption that pθ(x) and fθ(x) are independent, which may not hold in practice. For example, in reinforcement learning the probability of being in state x is hopefully not independent of the reward of state x, since the task is explicitly to find policies that spend more time in high-reward states!
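A quick numerical check of this (my addition, reusing the toy Gaussian example): every constant baseline leaves the mean unchanged, but because f(x)=x² and the score are not independent under N(θ,1), the variance-optimal baseline comes from the 'unwieldy' formula rather than from E[f(x)]:

```python
# A Monte Carlo check (my addition) of the baseline discussion, reusing the toy
# example f(x) = x^2 with p_theta = N(theta, 1). Every baseline gives roughly the
# same mean; the variances differ, and since f(x) and the score are *not*
# independent here, the best constant baseline comes from the unwieldy formula
# rather than from E[f(x)] = theta^2 + 1.
import jax
import jax.numpy as jnp

theta, key = 1.5, jax.random.PRNGKey(0)
x = theta + jax.random.normal(key, (1_000_000,))   # x ~ N(theta, 1)
f, score = x ** 2, x - theta                       # f(x) and the analytic score

beta_naive = theta ** 2 + 1.0                               # E[f(x)], the 'independence' answer
beta_opt = jnp.mean(f * score ** 2) / jnp.mean(score ** 2)  # general formula (the grad-f cross term is zero here)

for beta in [0.0, beta_naive, beta_opt]:
    g = (f - beta) * score                         # per-sample gradient estimate g_beta(x)
    print(f"beta={float(beta):.2f}  mean={float(jnp.mean(g)):.2f}  var={float(jnp.var(g)):.1f}")
# Roughly: mean ~ 3.0 in every row; var ~ 52 at beta=0, ~ 28 at beta=3.25,
# and ~ 24 at the variance-optimal beta ~ 5.25.
```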
Unnormalized densities
What if we have pθ(x) available only in unnormalized form, as
rθ(x)=Zθpθ(x)
where Zθ=∫Ωrθ(x)dx? Then it seems like we require the gradient of the log normalizer logZθ. That's not a big deal, because this can itself be estimated from samples:
∇θlogZθ = (1/Zθ)∫Ω∇θrθ(x)dx = ∫Ωpθ(x)∇θlogrθ(x)dx = E_{x∼pθ}[∇θlogrθ(x)],
so that the score of the normalized density is the centered quantity
∇θlogpθ(x) = ∇θlogrθ(x) − E_{x'∼pθ}[∇θlogrθ(x')],
which can then be plugged into the gradient estimators above.
One way to understand this is that the score function itself must have mean zero because the distribution is normalized. So if we have an unnormalized density, we just need to subtract the mean of its proto-score-function in order to get the mean-zero score function.
I suppose that for an unbiased estimate, we'd need to estimate the inner expectation with a different independent sample x∼p than we used for the outer expectation. This is essentially contrastive divergence, except that we probably don't need MCMC because we can sample explicitly from p.
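To make this concrete, here's a sketch (my example: rθ(x)=exp(θx−x²/2), whose normalized form is N(θ,1)) of estimating ∇θlogZθ from samples and recovering the centered score:

```python
# A sketch (my example): unnormalized density r_theta(x) = exp(theta*x - x^2/2),
# whose normalized form is N(theta, 1). The 'proto-score' d/d(theta) log r_theta(x) = x
# has mean theta under p_theta; that mean estimates d/d(theta) log Z_theta, and
# subtracting it recovers the true (mean-zero) score x - theta.
import jax
import jax.numpy as jnp

def log_r(theta, x):
    return theta * x - 0.5 * x ** 2      # unnormalized log-density

proto_score = jax.vmap(jax.grad(log_r, argnums=0), in_axes=(None, 0))

theta, key = 1.5, jax.random.PRNGKey(0)
x = theta + jax.random.normal(key, (100_000,))      # samples from the normalized p_theta = N(theta, 1)

grad_log_Z = jnp.mean(proto_score(theta, x))        # ~ d/d(theta) log Z_theta = theta
score = proto_score(theta, x) - grad_log_Z          # centered proto-score ~ true score x - theta

print(grad_log_Z)                                   # ~ 1.5
print(jnp.mean(score))                              # 0 by construction
print(jnp.max(jnp.abs(score - (x - theta))))        # small (just the Monte Carlo error in grad_log_Z)
```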