Cramer-Rao bound: Nonlinear Function
Created: October 02, 2020
Modified: July 05, 2022

Cramer-Rao bound

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Let's say we have a parametric model $p(x | \theta)$ of some data. The Cramer-Rao bound

$$\text{cov}(\hat\theta(x)) \succeq F^{-1}_\theta$$

is a lower bound on the (co)variance of any unbiased estimator $\hat{\theta}(x)$ of the parameter $\theta$, in terms of the Fisher information matrix

$$F_\theta = \mathbb{E}\left[\left(\nabla_\theta\log p(x | \theta)\right)\left(\nabla_\theta \log p(x | \theta)\right)^T\right]$$

defined as the covariance of the score function $\nabla_\theta \log p(x | \theta)$.

The score tells us how a given datapoint wants us to change $\theta$. The expectation of the score is zero, which makes sense: if we're drawing points from the actual distribution $p(x | \theta)$, then on average they shouldn't want to change $\theta$!
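To make this concrete, here's a quick numerical sketch (a toy check, using numpy) for the simplest case: a scalar Gaussian model $\mathcal{N}(\theta, \sigma^2)$ with known $\sigma^2$, where the score is $(x - \theta)/\sigma^2$ and the Fisher information should come out to $1/\sigma^2$.

```python
import numpy as np

# Toy illustration: scalar Gaussian model p(x | theta) = N(theta, sigma^2) with known sigma.
# Score: d/dtheta log p(x | theta) = (x - theta) / sigma^2; Fisher information: 1 / sigma^2.
rng = np.random.default_rng(0)
theta, sigma = 2.0, 3.0

x = rng.normal(theta, sigma, size=1_000_000)  # draws from the model at the true theta
score = (x - theta) / sigma**2

print(score.mean())        # ~0: the expected score vanishes at the true theta
print((score**2).mean())   # ~0.111: matches the Fisher information...
print(1 / sigma**2)        # ...1 / sigma^2
```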

The core intuition of the bound is that an unbiased estimator must covary with the score function: when the score function tells us that a particular $x$ wants a larger $\theta$, our unbiased estimator had better give us a larger $\theta$.

One way to see this is as saying that the best unbiased estimators (where the bound is tight) essentially are the score function, but shifted to have mean $\theta$ rather than mean zero. A slightly subtle point is that they must also be invariant to parameterization, which the score function is not. For example: if we switch from measuring a parameter in meters to measuring it in kilometers, the sensitivity of the distribution to that parameter (as measured by the score function) will increase by a factor of 1000, but the actual value of the parameter will decrease by the same factor. Thus, we can't just take the score function as our unbiased estimator. Instead, the optimal unbiased estimator would in principle be a shifted and scaled version of the score function,

$$\theta^*(x) = \theta + F_\theta^{-1} \nabla_\theta \log p(x | \theta)$$

(this is clearly unbiased because the expectation of the score function is zero). Of course, this is not a practical construction, since it assumes access to $\theta$, the quantity we're trying to estimate! But it gets at what we're aiming for.
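For example, in the scalar Gaussian model above ($x \sim \mathcal{N}(\theta, \sigma^2)$ with known $\sigma^2$), the score is $(x - \theta)/\sigma^2$ and $F_\theta = 1/\sigma^2$, so

$$\theta^*(x) = \theta + \sigma^2 \cdot \frac{x - \theta}{\sigma^2} = x,$$

and the dependence on the unknown $\theta$ happens to cancel: the observation itself is an unbiased estimator with variance $\sigma^2 = F_\theta^{-1}$, so the bound is attainable here. In general the cancellation doesn't happen and the bound need not be attained.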

Derivation of the bound

Consider the covariance of an estimator $\hat{\theta}(x)$ with the score function (note that both quantities are vector-valued random variables in $x$, so their covariance is meaningful). This is

$$\begin{align*}
\text{cov}\left(\hat\theta(x), \nabla_\theta \log p(x | \theta)\right) &= \mathbb{E}\left[\left(\nabla_\theta \log p(x | \theta)\right)\left(\hat\theta(x) - \theta\right)^T \right]\\
&= \mathbb{E}\left[\left(\nabla_\theta \log p(x | \theta)\right)\hat\theta(x)^T\right]\\
&\qquad\text{because $\mathbb{E}\left[\nabla_\theta \log p(x | \theta)\right]\theta^T$ vanishes}\\
&= \int \left(\nabla_\theta p(x | \theta)\right)\hat\theta(x)^T\, dx\\
&= \nabla_\theta \int p(x | \theta)\, \hat\theta(x)^T\, dx\\
&\qquad\text{(note that $\hat\theta(x)$ is constant wrt $\theta$)}\\
&= \nabla_\theta\, \mathbb{E}[\hat\theta(x)^T]\\
&= \nabla_\theta\, \theta^T = \mathcal{I}\text{ for unbiased }\hat{\theta}
\end{align*}$$

(slight abuse of notation: gradients of vector-valued quantities in the last three lines should be interpreted as Jacobian matrices).

By Cauchy-Schwarz, we know $\langle u, v\rangle^2 \le \langle u, u\rangle\langle v, v\rangle$. Since covariance is an inner product, this implies that

$$\text{cov}\left(\hat\theta(x), \nabla_\theta \log p(x | \theta)\right)^2 = \mathcal{I} \preceq F_\theta\, \text{cov}(\hat\theta(x))$$

which implies the bound

$$\text{cov}(\hat\theta(x)) \succeq F^{-1}_\theta$$
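A rough simulation (numpy again, with the same Gaussian toy model) shows the bound in action. For $n$ i.i.d. draws the Fisher information is $n/\sigma^2$, so any unbiased estimator of the mean must have variance at least $\sigma^2/n$; the sample mean attains this, while the sample median (also unbiased here, by symmetry) sits strictly above it.

```python
import numpy as np

# Toy check of the bound: n i.i.d. draws from N(theta, sigma^2), known sigma.
# Fisher information for n draws is n / sigma^2, so the CRB is sigma^2 / n.
rng = np.random.default_rng(0)
theta, sigma, n, trials = 1.0, 2.0, 25, 200_000

x = rng.normal(theta, sigma, size=(trials, n))

print(sigma**2 / n)                  # 0.16: the Cramer-Rao bound
print(x.mean(axis=1).var())          # ~0.16: the sample mean attains it
print(np.median(x, axis=1).var())    # ~0.25 (~ pi/2 * CRB): the median does not
```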

As a sanity check, we can verify that the estimator $\theta^*$ proposed above achieves this bound:

$$\begin{align*}
\text{cov}(\theta^*(x), \theta^*(x)) &= \mathbb{E}\left[\left(F_\theta^{-1} \nabla_\theta \log p(x | \theta)\right)\left(F_\theta^{-1} \nabla_\theta \log p(x | \theta)\right)^T\right]\\
&= F_\theta^{-1}\, \mathbb{E}\left[\left(\nabla_\theta \log p(x | \theta)\right)\left(\nabla_\theta \log p(x | \theta)\right)^T\right] \left(F_\theta^{-1}\right)^T\\
&= F_\theta^{-1} F_\theta F_\theta^{-1}\\
&= F_\theta^{-1}
\end{align*}$$
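The same check can be run numerically in a case where $\theta^*$ isn't something we'd have guessed. As a toy example, for a single draw from an exponential distribution with rate $\lambda$, the score is $1/\lambda - x$ and $F_\lambda = 1/\lambda^2$, so $\theta^*(x) = \lambda + \lambda^2(1/\lambda - x) = 2\lambda - \lambda^2 x$; a quick simulation confirms it is unbiased with variance $\lambda^2 = F_\lambda^{-1}$.

```python
import numpy as np

# Toy check that the oracle estimator theta_star attains the bound, for an exponential
# model p(x | lam) = lam * exp(-lam * x): score = 1/lam - x, F = 1/lam^2,
# so theta_star(x) = lam + lam^2 * (1/lam - x) = 2*lam - lam^2 * x.
rng = np.random.default_rng(0)
lam = 0.5

x = rng.exponential(1 / lam, size=1_000_000)  # numpy's exponential takes the scale 1/lam
theta_star = 2 * lam - lam**2 * x

print(theta_star.mean())   # ~0.5 = lam: unbiased
print(theta_star.var())    # ~0.25 = lam^2 = F^{-1}: attains the Cramer-Rao bound
```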