score function: Nonlinear Function
Created: January 16, 2022
Modified: July 21, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

The score function is the gradient of a log-density with respect to its parameters:

$$s(x; \theta) = \nabla_\theta \log p(x; \theta).$$

It is the direction that we would move the parameters in order to make the current point $x$ more likely. Equivalently, it measures the sensitivity of the likelihood at $x$ to each dimension of the parameter $\theta$. It is used to define Fisher information, to estimate policy gradients, and generally comes up all over the place in statistical analysis.

Note that $\nabla_\theta \log p(x; \theta) = \frac{\nabla_\theta p(x; \theta)}{p(x; \theta)}$ by basic properties of the derivative. Inside an expectation, the $p(x; \theta)$ in the denominator cancels with the $p(x; \theta)$ term of the expectation; this is the fundamental 'trick' of many score function derivations. For example,

Theorem: the score function has expectation 0:

$$\begin{align*} E_{p}[s(x; \theta)] &= \int_\Omega p(x; \theta)\, \nabla_\theta \log p(x; \theta)\, dx\\ &= \int_\Omega \frac{p(x; \theta)}{p(x; \theta)} \nabla_\theta p(x; \theta)\, dx \end{align*}$$

(applying the above 'trick'):

$$= \nabla_\theta \int_\Omega p(x; \theta)\, dx = \nabla_\theta 1 = 0,$$

assuming that the dominated convergence theorem holds, so that we can swap the gradient (which is a limit) with the integral.
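As a quick numerical sanity check, here's a minimal sketch (assuming JAX, with a univariate Gaussian of unknown mean as the toy density; all names here are my own) that estimates $E_p[s(x; \theta)]$ from samples and confirms it comes out near zero:

```python
import jax
import jax.numpy as jnp

# Toy density: univariate Gaussian with unknown mean theta and unit variance.
def log_p(x, theta):
    return -0.5 * (x - theta) ** 2 - 0.5 * jnp.log(2 * jnp.pi)

# Score: gradient of the log-density with respect to the parameter theta.
score = jax.grad(log_p, argnums=1)

theta = 1.5
key = jax.random.PRNGKey(0)
x = theta + jax.random.normal(key, (100_000,))  # samples from p(x; theta)

# Monte Carlo estimate of E_p[s(x; theta)]; should be close to zero.
print(float(jax.vmap(score, in_axes=(0, None))(x, theta).mean()))
```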

Gradients of expectations

The score function comes up as the general-purpose gradient estimator for expectations:

$$\begin{align*} \nabla_\theta E_{x\sim p_\theta}[f_\theta(x)] &= \nabla_\theta \int_{\Omega} p_\theta(x) f_\theta(x)\, dx\\ &= \int_{\Omega} \nabla_\theta \left[ p_\theta(x) f_\theta(x) \right] dx \quad \text{(assuming dominated convergence)}\\ &= \int_{\Omega} p_\theta(x) \nabla_\theta f_\theta(x)\, dx + \int_{\Omega} p_\theta(x) \left(\nabla_\theta \log p_\theta(x)\right) f_\theta(x)\, dx \end{align*}$$

where the last step applies the product rule and the score 'trick' from above. Writing this as an expectation,

$$= E_{x\sim p_\theta} \left[\nabla_\theta f_\theta(x) + f_\theta(x)\, \nabla_\theta \log p_\theta(x) \right].$$

See Mohamed et al. (2019) for much much more on this.

Note that we can replace $f_\theta(x)$ in the above by any constant-shifted expression $f_\theta(x) - \beta$; this follows from the above result that $\beta \nabla_\theta \log p_\theta(x)$ has expectation zero. Here $\beta$ is known as a 'baseline' or control variate.
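Here's a minimal sketch of this estimator (assuming JAX; the toy setup is my own, with $p_\theta$ a unit-variance Gaussian with mean $\theta$ and $f(x) = x^2$, so the true gradient $\nabla_\theta E[x^2] = \nabla_\theta(\theta^2 + 1) = 2\theta$ is known in closed form, and the $\nabla_\theta f_\theta$ term drops out because $f$ has no direct $\theta$ dependence):

```python
import jax
import jax.numpy as jnp

def log_p(x, theta):
    # N(theta, 1) log-density.
    return -0.5 * (x - theta) ** 2 - 0.5 * jnp.log(2 * jnp.pi)

def f(x):
    return x ** 2  # E_{x ~ N(theta, 1)}[x^2] = theta^2 + 1, so the true gradient is 2 * theta

theta = 0.7
key = jax.random.PRNGKey(0)
x = theta + jax.random.normal(key, (200_000,))

# Per-sample scores d/dtheta log p(x; theta).
score = jax.vmap(jax.grad(log_p, argnums=1), in_axes=(0, None))(x, theta)

beta = float(f(x).mean())  # baseline / control variate (estimated mean of f)
grad_estimate = ((f(x) - beta) * score).mean()
print(float(grad_estimate), 2 * theta)  # should be close
```

(Estimating $\beta$ from the same batch technically introduces a small $O(1/N)$ bias; a running average or held-out batch avoids this.)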

Surrogate objectives

If we naively sample $x \sim p_\theta$ and then apply automatic differentiation to the expression inside the expectation $f_\theta(x)$, we get an incorrect result that ignores the dependence through the sampling distribution $p_\theta$ (see limitations of autodiff). The reparameterization trick is one approach to fixing this. A more general approach is to write the objective as

$$\mathbb{E}_{x\sim p_{\tilde{\theta}}} \left[\frac{p_\theta(x)}{p_{\tilde{\theta}}(x)} f_\theta(x)\right]$$

where $\tilde{\theta} = \mathtt{stop\_gradient}(\theta)$ is a frozen copy of the parameters. This brings the dependence on $\theta$ explicitly into the objective, so that autodiff (even repeated autodiff for higher-order derivatives) does the right thing, while maintaining the objective's value. This trick is described in the DiCE paper, though it predates that paper and has been independently invented multiple times. If desired we can introduce a baseline $\beta$ as

$$\mathbb{E}_{x\sim p_{\tilde{\theta}}} \left[\frac{p_\theta(x)}{p_{\tilde{\theta}}(x)} (f_\theta(x) - \beta) + \beta\right]$$

so that the objective itself is again unchanged, but autodiff recovers the gradient estimator including $\beta$ as a control variate (note that the second $\beta$ term may be omitted if we only care about the surrogate objective's gradients rather than its value directly).
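Here's a minimal sketch of the surrogate (assuming JAX, and reusing the Gaussian / $f(x) = x^2$ toy example from above; `jax.lax.stop_gradient` plays the role of the frozen copy $\tilde{\theta}$):

```python
import jax
import jax.numpy as jnp

def log_p(x, theta):
    return -0.5 * (x - theta) ** 2 - 0.5 * jnp.log(2 * jnp.pi)

def f(x):
    return x ** 2

def surrogate(theta, x, beta=0.0):
    # Importance weight p_theta(x) / p_thetatilde(x), where thetatilde is a frozen
    # copy of theta. Its value is always 1, but its gradient carries the score.
    theta_tilde = jax.lax.stop_gradient(theta)
    w = jnp.exp(log_p(x, theta) - log_p(x, theta_tilde))
    return jnp.mean(w * (f(x) - beta) + beta)

theta = 0.7
key = jax.random.PRNGKey(0)
x = theta + jax.random.normal(key, (200_000,))

# Plain autodiff through the surrogate recovers the score-function gradient
# estimator (the true gradient is 2 * theta for this example).
print(float(jax.grad(surrogate)(theta, x)), 2 * theta)
```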

Optimal baseline

Let $g_\beta(x)$ denote the gradient estimator (the quantity inside the expectation) with constant baseline $\beta$,

$$g_\beta(x) = \nabla_\theta f_\theta(x) + \left(f_\theta(x) - \beta\right)\nabla_\theta \log p_\theta(x).$$

What choice for $\beta$ minimizes the variance of this estimate? For simplicity, let's assume $\theta$ is scalar, so that gradients become simple derivatives. Then we have

$$\begin{align*} \text{var}(g_\beta) &= \mathbb{E}_{x\sim p_\theta}\left[g_\beta(x)^2\right] - \mathbb{E}_{x\sim p_\theta}\left[g_\beta(x)\right]^2\\ &= \mathbb{E}_{x\sim p_\theta}\left[g_\beta(x)^2\right] + \ldots \end{align*}$$

where we note that the second term is just the square of the derivative being estimated, which we can ignore because it does not depend on $\beta$. Expanding out the remaining expression, and sweeping aside similar constant terms,

$$\begin{align*} \text{var}(g_\beta) &= \mathbb{E}_{x\sim p_\theta}\left[\left(\frac{\partial f_\theta(x)}{\partial \theta} + \frac{\partial \log p_\theta(x)}{\partial \theta} \left(f_\theta(x) - \beta\right)\right)^2\right] + \ldots\\ &= \mathbb{E}_{x\sim p_\theta}\left[ -2\beta \left(f_\theta(x)\left(\frac{\partial \log p_\theta(x)}{\partial \theta}\right)^2 + \frac{\partial f_\theta(x)}{\partial \theta}\frac{\partial \log p_\theta(x)}{\partial \theta}\right) + \beta^2 \left(\frac{\partial \log p_\theta(x)}{\partial \theta}\right)^2 \right] + \ldots \end{align*}$$

Setting the derivative with respect to $\beta$ to zero gives the optimality condition

$$\beta^* = \frac{\mathbb{E}\left[ f_\theta(x)\left(\frac{\partial \log p_\theta(x)}{\partial \theta}\right)^2\right] + \mathbb{E}\left[\frac{\partial f_\theta(x)}{\partial \theta}\frac{\partial \log p_\theta(x)}{\partial \theta}\right]}{\mathbb{E}\left[\left(\frac{\partial \log p_\theta(x)}{\partial \theta}\right)^2 \right]}$$

This is a bit unwieldy, but it simplifies dramatically if we allow ourselves to assume that $f_\theta(x)$ and $p_\theta(x)$ are independent. Then we can factor the expectations, so that the term

$$\mathbb{E}\left[\frac{\partial f_\theta(x)}{\partial \theta}\frac{\partial \log p_\theta(x)}{\partial \theta}\right] = \mathbb{E}\left[\frac{\partial f_\theta(x)}{\partial \theta}\right]\mathbb{E}\left[\frac{\partial \log p_\theta(x)}{\partial \theta}\right] = 0$$

vanishes, using the property that the score function has expectation zero, and we get cancellation in the remaining terms

$$\beta^* = \frac{\mathbb{E}\left[ f_\theta(x)\right]\mathbb{E}\left[\left(\frac{\partial \log p_\theta(x)}{\partial \theta}\right)^2\right]}{\mathbb{E}\left[\left(\frac{\partial \log p_\theta(x)}{\partial \theta}\right)^2 \right]} = \mathbb{E}\left[f_\theta(x)\right],$$

giving the intuitive result that the minimum-variance baseline $\beta^*$ is just the value of the expectation whose gradient we're trying to estimate. Note that this derivation relies on the assumption that $p_\theta(x)$ and $f_\theta(x)$ are independent, which may not hold in practice. For example, in reinforcement learning the probability of being in state $x$ is hopefully not independent of the reward of state $x$, since the task is explicitly to find policies that spend more time in high-reward states!
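A rough numerical check of the variance reduction (assuming JAX, continuing the same Gaussian / $f(x) = x^2$ toy example; note that $f(x)$ and the score are certainly not independent here, so $\beta = \mathbb{E}[f]$ need not be exactly optimal, but it still helps):

```python
import jax
import jax.numpy as jnp

def log_p(x, theta):
    return -0.5 * (x - theta) ** 2 - 0.5 * jnp.log(2 * jnp.pi)

def f(x):
    return x ** 2

theta = 0.7
key = jax.random.PRNGKey(0)
x = theta + jax.random.normal(key, (500_000,))
score = jax.vmap(jax.grad(log_p, argnums=1), in_axes=(0, None))(x, theta)

def g(beta):
    # Per-sample score-function gradient estimator with constant baseline beta.
    return (f(x) - beta) * score

# Both baselines give (approximately) the same mean, but beta = E[f] has lower variance.
for beta in (0.0, float(f(x).mean())):
    gb = g(beta)
    print(f"beta={beta:.2f}: mean={float(gb.mean()):.3f}, var={float(gb.var()):.2f}")
```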

Unnormalized densities

What if we have $p_\theta(x)$ available only in unnormalized form, as

$$r_\theta(x) = Z_\theta\, p_\theta(x)$$

where $Z_\theta = \int_\Omega r_\theta(x)\, dx$? Then it seems like we require the gradient of the log normalizer $\log Z_\theta$. That's not a big deal, because this can itself be estimated from samples:

$$\begin{align*} \nabla_\theta \log \int_\Omega r_\theta(x)\, dx &= \frac{1}{Z_\theta} E\left[ \frac{\nabla_\theta r_\theta(x)}{p_\theta(x)} \right]\\ &= E\left[ \frac{\nabla_\theta r_\theta(x)}{r_\theta(x)} \right]\\ &= E\left[\nabla_\theta \log r_\theta(x) \right] \end{align*}$$

leaving us with

$$\nabla_\theta \log p_\theta(x) = \nabla_\theta \log r_\theta(x) - E\left[\nabla_\theta \log r_\theta(x) \right].$$

One way to understand this is that the score function itself must have mean zero because the distribution is normalized. So if we have an unnormalized density, we just need to subtract the mean of its proto-score-function in order to get the mean-zero score function.

I suppose that for an unbiased estimate, we'd need to estimate the inner expectation with a different independent sample $x \sim p$ than we used for the outer expectation. This is essentially contrastive divergence, except that we probably don't need MCMC because we can sample explicitly from $p$.
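Here's a minimal sketch of this (assuming JAX; the example is my own, using the unnormalized exponential-family density $r_\theta(x) = \exp(\theta x - x^2/2)$, which is $N(\theta, 1)$ up to a $\theta$-dependent normalizer). It estimates the score as $\nabla_\theta \log r_\theta(x) - E[\nabla_\theta \log r_\theta(x)]$, using an independent batch for the inner expectation, and compares to the exact score:

```python
import jax
import jax.numpy as jnp

theta = 1.5

def log_r(x, theta):
    # Unnormalized log-density r_theta(x) = exp(theta * x - x^2 / 2), which is
    # N(theta, 1) up to the normalizer Z_theta = sqrt(2 * pi) * exp(theta^2 / 2).
    return theta * x - 0.5 * x ** 2

key, key_inner = jax.random.split(jax.random.PRNGKey(0))
x = theta + jax.random.normal(key, (200_000,))              # samples for the outer term
x_inner = theta + jax.random.normal(key_inner, (200_000,))  # independent batch for E[grad log r]

grad_log_r = jax.vmap(jax.grad(log_r, argnums=1), in_axes=(0, None))

# Estimated score: grad log r(x) minus the independently estimated mean of grad log r.
score_est = grad_log_r(x, theta) - grad_log_r(x_inner, theta).mean()

# Exact score of N(theta, 1) for comparison; the two should agree to ~1/sqrt(N).
score_exact = x - theta
print(float(jnp.abs(score_est - score_exact).max()))
```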