This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
Let's say we have a parametric model $p(x\mid\theta)$ of some data. The Cramér-Rao bound
$$\operatorname{cov}(\hat\theta(x)) \succeq F_\theta^{-1}$$
is a lower bound on the (co)variance of any unbiased estimator $\hat\theta(x)$ of the parameter $\theta$, in terms of the Fisher information matrix
$$F_\theta = \mathbb{E}\!\left[\left(\nabla_\theta \log p(x\mid\theta)\right)\left(\nabla_\theta \log p(x\mid\theta)\right)^T\right],$$
defined as the covariance of the score function $\nabla_\theta \log p(x\mid\theta)$.
The score tells us how a given datapoint wants us to change $\theta$. The expectation of the score is zero, which makes sense: if we're drawing points from the actual distribution $p(x\mid\theta)$, then on average they shouldn't want to change $\theta$!
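To make this concrete, here is a minimal numerical sketch. The model is a toy assumption chosen just for illustration: $x \sim \mathcal{N}(\theta, \sigma^2)$ with known $\sigma$, for which the score is $(x-\theta)/\sigma^2$ and the Fisher information is $1/\sigma^2$. The sketch checks both that the score has mean zero and that its second moment matches the Fisher information:

```python
import numpy as np

# Toy model (an assumption for illustration): x ~ N(theta, sigma^2) with known sigma.
# Analytic score:              d/dtheta log p(x | theta) = (x - theta) / sigma^2
# Analytic Fisher information: F_theta = 1 / sigma^2

rng = np.random.default_rng(0)
theta, sigma = 2.0, 3.0
x = rng.normal(theta, sigma, size=1_000_000)  # samples from p(x | theta)

score = (x - theta) / sigma**2

print(np.mean(score))      # ~ 0: the score has mean zero under p(x | theta)
print(np.mean(score**2))   # ~ E[score^2], the Fisher information
print(1 / sigma**2)        # analytic value: 0.111...
```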
The core intuition of the bound is that an unbiased estimator must covary with the score function: when the score tells us that a particular $x$ wants a larger $\theta$, our unbiased estimator had better output a larger estimate.
One way to see this is to say that the best unbiased estimators (where the bound is tight) essentially are the score function, shifted to have mean $\theta$ rather than mean zero. A slightly subtle point is that they must also be invariant to parameterization, which the score function is not. For example: if we switch from measuring a parameter in meters to measuring it in kilometers, the sensitivity of the distribution to that parameter (as measured by the score function) will increase by a factor of 1000, but the actual value of the parameter will decrease by the same factor. Thus, we can't just take the score function as our unbiased estimator. Instead, the optimal unbiased estimator would in principle be a shifted and scaled version of the score function,
$$\theta^*(x) = \theta + F_\theta^{-1}\,\nabla_\theta \log p(x\mid\theta)$$
(this is clearly unbiased, because the expectation of the score function is zero). Of course, this is not a practical construction, since it assumes access to $\theta$, the very quantity we're trying to estimate! But it gets at what we're aiming for.
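For instance, in the toy Gaussian model from the sketch above ($x \sim \mathcal{N}(\theta, \sigma^2)$ with known $\sigma$, again just an illustrative assumption), this construction reduces to something familiar:
$$\theta^*(x) = \theta + \sigma^2 \cdot \frac{x - \theta}{\sigma^2} = x,$$
i.e. the observation itself, which is indeed unbiased with variance $\sigma^2 = F_\theta^{-1}$.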
Derivation of the bound
Consider the covariance of an estimator $\hat\theta(x)$ with the score function (note that both quantities are vector-valued random variables in $x$, so their covariance is meaningful). This is
$$\begin{aligned}
\operatorname{cov}\!\left(\hat\theta(x),\, \nabla_\theta \log p(x\mid\theta)\right)
&= \mathbb{E}\!\left[(\nabla_\theta \log p(x\mid\theta))\,(\hat\theta(x) - \theta)^T\right] \\
&= \mathbb{E}\!\left[(\nabla_\theta \log p(x\mid\theta))\,\hat\theta(x)^T\right]
\qquad \text{(because $\mathbb{E}[\nabla_\theta \log p(x\mid\theta)]\,\theta^T$ vanishes)} \\
&= \int \left(\nabla_\theta p(x\mid\theta)\right)\hat\theta(x)^T \, dx \\
&= \nabla_\theta \int p(x\mid\theta)\,\hat\theta(x)^T \, dx
\qquad \text{(note that $\hat\theta(x)$ is constant wrt $\theta$)} \\
&= \nabla_\theta\, \mathbb{E}\!\left[\hat\theta(x)^T\right] \\
&= \nabla_\theta\, \theta^T = I
\qquad \text{(for unbiased $\hat\theta$)}
\end{aligned}$$
(slight abuse of notation: gradients of vector-valued quantities in the last three lines should be interpreted as Jacobian matrices).
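Continuing the toy Gaussian sketch from earlier (again my illustrative assumption, now with the unbiased estimator $\hat\theta(x) = x$), we can check this identity numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma = 2.0, 3.0
x = rng.normal(theta, sigma, size=1_000_000)  # samples from p(x | theta)

theta_hat = x                   # unbiased estimator of theta in the toy model
score = (x - theta) / sigma**2  # score of the Gaussian model

# Sample covariance between the estimator and the score: should be ~ 1
# (the identity in this one-dimensional case), regardless of sigma.
print(np.cov(theta_hat, score)[0, 1])
```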
By Cauchy-Schwarz, we know $\langle u, v\rangle^2 \le \langle u, u\rangle \langle v, v\rangle$. Since covariance is an inner product, this implies that
$$\operatorname{cov}\!\left(\hat\theta(x),\, \nabla_\theta \log p(x\mid\theta)\right)^2 = I \preceq F_\theta\,\operatorname{cov}(\hat\theta(x)),$$
which implies the bound
$$\operatorname{cov}(\hat\theta(x)) \succeq F_\theta^{-1}.$$
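As a concrete instance (still the toy Gaussian assumption, now with $n$ i.i.d. observations $x_1, \dots, x_n$ and the sample mean as the estimator): the Fisher information of the full sample is $n/\sigma^2$, and the bound is attained with equality,
$$\operatorname{var}\!\left(\frac{1}{n}\sum_{i=1}^n x_i\right) = \frac{\sigma^2}{n} = F_\theta^{-1}.$$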
As a sanity check, we can verify that the estimator $\theta^*$ proposed above achieves this bound:
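a quick sketch, using only facts established above (the score has mean zero and covariance $F_\theta$). Since $\theta^*(x) - \theta = F_\theta^{-1}\,\nabla_\theta \log p(x\mid\theta)$,
$$\operatorname{cov}(\theta^*(x)) = F_\theta^{-1}\,\operatorname{cov}\!\left(\nabla_\theta \log p(x\mid\theta)\right) F_\theta^{-1} = F_\theta^{-1}\, F_\theta\, F_\theta^{-1} = F_\theta^{-1}.$$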