This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
The score function is the gradient of a log-density with respect to its parameters:
s(x;θ)=∇θlogp(x;θ).
It is the direction that we would move the parameters in order to make the current point x more likely. Equivalently, it measures the sensitivity of the likelihood at x to each dimension of the parameter θ. It is used to define Fisher information, to estimate policy gradients, and generally comes up all over the place in statistical analysis.
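As a quick illustration (a sketch I've added, not from any particular source; the toy model and names are mine), the score of a unit-variance Gaussian in its mean parameter can be computed by autodiff and checked against the analytic value (x−μ)/σ²:

```python
# A sketch (not from the original notes): the score of a unit-variance Gaussian
# in its mean parameter, via JAX autodiff, checked against the analytic value.
import jax
import jax.scipy.stats as jstats

def log_p(mu, x):
    return jstats.norm.logpdf(x, loc=mu, scale=1.0)   # log p(x; mu) for N(mu, 1)

score = jax.grad(log_p, argnums=0)   # s(x; mu) = d/d(mu) log p(x; mu)

mu, x = 0.5, 2.0
print(score(mu, x))     # autodiff score: 1.5
print((x - mu) / 1.0)   # analytic score (x - mu) / sigma^2 = 1.5
```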
Note that ∇θlogp(x;θ) = ∇θp(x;θ)/p(x;θ) by basic properties of the derivative. Inside an expectation, the p(x;θ) in the denominator cancels with the p(x;θ) weighting of the expectation; this is the fundamental 'trick' of many score function derivations. For example,
∇θ E_{x∼pθ}[fθ(x)] = E_{x∼pθ}[∇θfθ(x) + fθ(x)∇θlogpθ(x)].
Note that we can replace fθ(x) in the above by any constant-shifted expression fθ(x)−β; this is because the score ∇θlogpθ(x) itself has expectation zero (E_{x∼pθ}[∇θlogpθ(x)] = ∫∇θpθ(x)dx = ∇θ∫pθ(x)dx = 0), so subtracting β∇θlogpθ(x) leaves the expectation unchanged. Here β is known as a 'baseline' or control variate.
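Here's a small Monte Carlo sketch of this (again my own toy example in JAX, not from any paper): estimating ∇θ E_{x∼N(θ,1)}[x²] = 2θ, and checking that an arbitrary baseline β doesn't bias the estimate:

```python
# A toy Monte Carlo check (my example): score-function estimate of
# d/d(theta) E_{x ~ N(theta,1)}[x^2], whose exact value is 2*theta since
# E[x^2] = theta^2 + 1. Any constant baseline beta leaves it unbiased.
import jax
import jax.numpy as jnp

def grad_estimate(theta, key, beta=0.0, n=500_000):
    x = theta + jax.random.normal(key, (n,))   # samples x ~ p_theta = N(theta, 1)
    f = x ** 2                                 # f(x); no direct theta-dependence here
    score = x - theta                          # analytic score of N(theta, 1) in theta
    return jnp.mean((f - beta) * score)        # mean of (f(x) - beta) * d/d(theta) log p_theta(x)

theta, key = 1.5, jax.random.PRNGKey(0)
print(grad_estimate(theta, key, beta=0.0))   # ~ 2 * theta = 3.0
print(grad_estimate(theta, key, beta=7.0))   # also ~ 3.0: the baseline doesn't bias the estimate
```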
Surrogate objectives
If we naively sample x∼pθ and then apply automatic differentiation to the expression inside the expectation fθ(x), we would get an incorrect result that ignores the dependence through the sampling distribution pθ (see limitations of autodiff). The reparameterization trick is one approach to fixing this. A more general approach is to write the objective as
E_{x∼pθ~}[ (pθ(x)/pθ~(x)) fθ(x) ]
where θ~=stop_gradient(θ) is a frozen copy of the parameters. This brings the dependence on θ explicitly into the objective, so that autodiff (even repeated autodiff for higher-order derivatives) does the right thing, while maintaining the objective's value. This trick is described in the DiCE paper, though it predates that paper and has been independently invented multiple times. If desired, we can introduce a baseline β as
E_{x∼pθ~}[ (fθ(x) − β) pθ(x)/pθ~(x) + β ]
so that the objective itself is again unchanged, but autodiff recovers the gradient estimator including β as a control variate (note that the second β term may be omitted if we only care about the surrogate objective's gradients rather than its value directly).
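A minimal JAX sketch of this surrogate (my reading of the trick, applied to the same Gaussian toy example; function and variable names are mine): the weight pθ(x)/pθ~(x) equals 1 in value, so the surrogate evaluates to the original objective, while autodiff of the surrogate produces the score-function gradient estimator:

```python
# A sketch of the surrogate-objective trick in JAX (my reading; toy Gaussian
# example): the weight p_theta(x) / p_theta~(x) is exactly 1 in value, so the
# surrogate evaluates to the original objective, while autodiff of the
# surrogate produces the score-function gradient estimator.
import jax
import jax.numpy as jnp
import jax.scipy.stats as jstats

def surrogate(theta, key, beta=0.0, n=500_000):
    theta_frozen = jax.lax.stop_gradient(theta)        # theta~, a frozen copy
    x = theta_frozen + jax.random.normal(key, (n,))    # x ~ p_theta~; no gradient flows through sampling
    f = x ** 2                                         # f_theta(x); happens not to use theta directly here
    log_w = jstats.norm.logpdf(x, loc=theta) - jstats.norm.logpdf(x, loc=theta_frozen)
    w = jnp.exp(log_w)                                 # = 1 in value, but carries d(log p_theta)/d(theta)
    return jnp.mean((f - beta) * w + beta)             # value: ~E[f(x)]; gradient: score-function estimator

theta, key = 1.5, jax.random.PRNGKey(0)
print(surrogate(theta, key))                                 # ~ E[x^2] = theta^2 + 1 = 3.25
print(jax.grad(surrogate)(theta, key))                       # ~ 2 * theta = 3.0, via plain autodiff
print(jax.grad(surrogate)(theta, key, beta=theta**2 + 1.0))  # same expectation, with the baseline as control variate
```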
Optimal baseline
Let gβ(x) denote the gradient estimator (the quantity inside the expectation) with constant baseline β,
gβ(x)=∇θfθ(x)+(fθ(x)−β)∇θlogpθ(x).
What choice for β minimizes the variance of this estimate? For simplicity, let's assume θ is scalar, so that gradients become simple derivatives. Then we have
Var[gβ(x)] = E_{x∼pθ}[gβ(x)²] − (E_{x∼pθ}[gβ(x)])²,
where we note that the second term is just the square of the derivative being estimated, which we can ignore because it does not depend on β. Expanding out the remaining expression, and sweeping aside the terms that don't involve β,
E[gβ(x)²] = β² E[(∇θlogpθ(x))²] − 2β ( E[fθ(x)(∇θlogpθ(x))²] + E[∇θfθ(x)∇θlogpθ(x)] ) + const,
which is minimized at
β∗ = ( E[fθ(x)(∇θlogpθ(x))²] + E[∇θfθ(x)∇θlogpθ(x)] ) / E[(∇θlogpθ(x))²].
This is a bit unwieldy, but it simplifies dramatically if we allow ourselves to assume that fθ(x) and pθ(x) are independent. Then we can factor the expectations, so that the term E[fθ(x)(∇θlogpθ(x))²] becomes E[fθ(x)]E[(∇θlogpθ(x))²], while the cross term factors into something containing E[∇θlogpθ(x)] = 0 and vanishes, leaving β∗ = E_{x∼pθ}[fθ(x)],
giving the intuitive result that the minimum-variance baseline β∗ is just the value of the expectation whose gradient we're trying to estimate. Note that this derivation relies on the assumption that pθ(x) and fθ(x) are independent, which may not hold in practice. For example, in reinforcement learning the probability of being in state x is hopefully not independent of the reward of state x, since the task is explicitly to find policies that spend more time in high-reward states!
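A quick numerical check of this (my addition, reusing the toy Gaussian example): every constant baseline leaves the mean unchanged, but because f(x)=x² and the score are not independent under N(θ,1), the variance-optimal baseline comes from the 'unwieldy' formula rather than from E[f(x)]:

```python
# A Monte Carlo check (my addition) of the baseline discussion, reusing the toy
# example f(x) = x^2 with p_theta = N(theta, 1). Every baseline gives roughly the
# same mean; the variances differ, and since f(x) and the score are *not*
# independent here, the best constant baseline comes from the unwieldy formula
# rather than from E[f(x)] = theta^2 + 1.
import jax
import jax.numpy as jnp

theta, key = 1.5, jax.random.PRNGKey(0)
x = theta + jax.random.normal(key, (1_000_000,))   # x ~ N(theta, 1)
f, score = x ** 2, x - theta                       # f(x) and the analytic score

beta_naive = theta ** 2 + 1.0                               # E[f(x)], the 'independence' answer
beta_opt = jnp.mean(f * score ** 2) / jnp.mean(score ** 2)  # general formula (the grad-f cross term is zero here)

for beta in [0.0, beta_naive, beta_opt]:
    g = (f - beta) * score                         # per-sample gradient estimate g_beta(x)
    print(f"beta={float(beta):.2f}  mean={float(jnp.mean(g)):.2f}  var={float(jnp.var(g)):.1f}")
# Roughly: mean ~ 3.0 in every row; var ~ 52 at beta=0, ~ 28 at beta=3.25,
# and ~ 24 at the variance-optimal beta ~ 5.25.
```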
Unnormalized densities
What if we have pθ(x) available only in unnormalized form, as
rθ(x)=Zθpθ(x)
where Zθ=∫Ωrθ(x)dx? Then it seems like we require the gradient of the log normalizer logZθ. That's not a big deal, because this can itself be estimated from samples:
∇θlogZθ = (1/Zθ)∫Ω∇θrθ(x)dx = ∫Ωpθ(x)∇θlogrθ(x)dx = E_{x∼pθ}[∇θlogrθ(x)],
so that the score of the normalized density is the centered quantity
∇θlogpθ(x) = ∇θlogrθ(x) − E_{x'∼pθ}[∇θlogrθ(x')],
which can then be plugged into the gradient estimators above.
One way to understand this is that the score function itself must have mean zero because the distribution is normalized. So if we have an unnormalized density, we just need to subtract the mean of its proto-score-function in order to get the mean-zero score function.
I suppose that for an unbiased estimate, we'd need to estimate the inner expectation with a different independent sample x∼p than we used for the outer expectation. This is essentially contrastive divergence, except that we probably don't need MCMC because we can sample explicitly from p.
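To make this concrete, here's a sketch (my example: rθ(x)=exp(θx−x²/2), whose normalized form is N(θ,1)) of estimating ∇θlogZθ from samples and recovering the centered score:

```python
# A sketch (my example): unnormalized density r_theta(x) = exp(theta*x - x^2/2),
# whose normalized form is N(theta, 1). The 'proto-score' d/d(theta) log r_theta(x) = x
# has mean theta under p_theta; that mean estimates d/d(theta) log Z_theta, and
# subtracting it recovers the true (mean-zero) score x - theta.
import jax
import jax.numpy as jnp

def log_r(theta, x):
    return theta * x - 0.5 * x ** 2      # unnormalized log-density

proto_score = jax.vmap(jax.grad(log_r, argnums=0), in_axes=(None, 0))

theta, key = 1.5, jax.random.PRNGKey(0)
x = theta + jax.random.normal(key, (100_000,))      # samples from the normalized p_theta = N(theta, 1)

grad_log_Z = jnp.mean(proto_score(theta, x))        # ~ d/d(theta) log Z_theta = theta
score = proto_score(theta, x) - grad_log_Z          # centered proto-score ~ true score x - theta

print(grad_log_Z)                                   # ~ 1.5
print(jnp.mean(score))                              # 0 by construction
print(jnp.max(jnp.abs(score - (x - theta))))        # small (just the Monte Carlo error in grad_log_Z)
```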