This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
Let's say we have a parametric model $p(x\mid\theta)$ of some data. The Cramér-Rao bound
$$\operatorname{cov}(\hat\theta(x)) \succeq F_\theta^{-1}$$
is a lower bound on the (co)variance of any unbiased estimator $\hat\theta(x)$ of the parameter $\theta$, in terms of the Fisher information matrix
$$F_\theta = \mathbb{E}\!\left[\left(\nabla_\theta \log p(x\mid\theta)\right)\left(\nabla_\theta \log p(x\mid\theta)\right)^T\right],$$
defined as the covariance of the score function $\nabla_\theta \log p(x\mid\theta)$.
The score tells us how a given datapoint wants us to change $\theta$. The expectation of the score is zero, which makes sense: if we're drawing points from the actual distribution $p(x\mid\theta)$, then on average they shouldn't want to change $\theta$!
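To make this concrete, here is a minimal numerical sketch. The model is a toy assumption chosen just for illustration: $x \sim \mathcal{N}(\theta, \sigma^2)$ with known $\sigma$, for which the score is $(x-\theta)/\sigma^2$ and the Fisher information is $1/\sigma^2$. The sketch checks both that the score has mean zero and that its second moment matches the Fisher information:

```python
import numpy as np

# Toy model (an assumption for illustration): x ~ N(theta, sigma^2) with known sigma.
# Analytic score:              d/dtheta log p(x | theta) = (x - theta) / sigma^2
# Analytic Fisher information: F_theta = 1 / sigma^2

rng = np.random.default_rng(0)
theta, sigma = 2.0, 3.0
x = rng.normal(theta, sigma, size=1_000_000)  # samples from p(x | theta)

score = (x - theta) / sigma**2

print(np.mean(score))      # ~ 0: the score has mean zero under p(x | theta)
print(np.mean(score**2))   # ~ E[score^2], the Fisher information
print(1 / sigma**2)        # analytic value: 0.111...
```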
The core intuition of the bound is that an unbiased estimator must covary with the score function: when the score tells us that a particular $x$ wants a larger $\theta$, our unbiased estimator had better output a larger estimate.
One way to see this is to say that the best unbiased estimators (where the bound is tight) essentially are the score function, shifted to have mean $\theta$ rather than mean zero. A slightly subtle point is that they must also be invariant to parameterization, which the score function is not. For example: if we switch from measuring a parameter in meters to measuring it in kilometers, the sensitivity of the distribution to that parameter (as measured by the score function) will increase by a factor of 1000, but the actual value of the parameter will decrease by the same factor. Thus, we can't just take the score function as our unbiased estimator. Instead, the optimal unbiased estimator would in principle be a shifted and scaled version of the score function,
$$\theta^*(x) = \theta + F_\theta^{-1}\,\nabla_\theta \log p(x\mid\theta)$$
(this is clearly unbiased, because the expectation of the score function is zero). Of course, this is not a practical construction, since it assumes access to $\theta$, the very quantity we're trying to estimate! But it gets at what we're aiming for.
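For instance, in the toy Gaussian model from the sketch above ($x \sim \mathcal{N}(\theta, \sigma^2)$ with known $\sigma$, again just an illustrative assumption), this construction reduces to something familiar:
$$\theta^*(x) = \theta + \sigma^2 \cdot \frac{x - \theta}{\sigma^2} = x,$$
i.e. the observation itself, which is indeed unbiased with variance $\sigma^2 = F_\theta^{-1}$.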
Derivation of the bound
Consider the covariance of an estimator $\hat\theta(x)$ with the score function (note that both quantities are vector-valued random variables in $x$, so their covariance is meaningful). This is
$$\begin{aligned}
\operatorname{cov}\!\left(\hat\theta(x),\, \nabla_\theta \log p(x\mid\theta)\right)
&= \mathbb{E}\!\left[(\nabla_\theta \log p(x\mid\theta))\,(\hat\theta(x) - \theta)^T\right] \\
&= \mathbb{E}\!\left[(\nabla_\theta \log p(x\mid\theta))\,\hat\theta(x)^T\right]
\qquad \text{(because $\mathbb{E}[\nabla_\theta \log p(x\mid\theta)]\,\theta^T$ vanishes)} \\
&= \int \left(\nabla_\theta p(x\mid\theta)\right)\hat\theta(x)^T \, dx \\
&= \nabla_\theta \int p(x\mid\theta)\,\hat\theta(x)^T \, dx
\qquad \text{(note that $\hat\theta(x)$ is constant wrt $\theta$)} \\
&= \nabla_\theta\, \mathbb{E}\!\left[\hat\theta(x)^T\right] \\
&= \nabla_\theta\, \theta^T = I
\qquad \text{(for unbiased $\hat\theta$)}
\end{aligned}$$
(slight abuse of notation: gradients of vector-valued quantities in the last three lines should be interpreted as Jacobian matrices).
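Continuing the toy Gaussian sketch from earlier (again my illustrative assumption, now with the unbiased estimator $\hat\theta(x) = x$), we can check this identity numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma = 2.0, 3.0
x = rng.normal(theta, sigma, size=1_000_000)  # samples from p(x | theta)

theta_hat = x                   # unbiased estimator of theta in the toy model
score = (x - theta) / sigma**2  # score of the Gaussian model

# Sample covariance between the estimator and the score: should be ~ 1
# (the identity in this one-dimensional case), regardless of sigma.
print(np.cov(theta_hat, score)[0, 1])
```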
By Cauchy-Schwarz, we know $\langle u, v\rangle^2 \le \langle u, u\rangle \langle v, v\rangle$. Since covariance is an inner product, this implies that
$$\operatorname{cov}\!\left(\hat\theta(x),\, \nabla_\theta \log p(x\mid\theta)\right)^2 = I \preceq F_\theta\,\operatorname{cov}(\hat\theta(x)),$$
which implies the bound
$$\operatorname{cov}(\hat\theta(x)) \succeq F_\theta^{-1}.$$
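As a concrete instance (still the toy Gaussian assumption, now with $n$ i.i.d. observations $x_1, \dots, x_n$ and the sample mean as the estimator): the Fisher information of the full sample is $n/\sigma^2$, and the bound is attained with equality,
$$\operatorname{var}\!\left(\frac{1}{n}\sum_{i=1}^n x_i\right) = \frac{\sigma^2}{n} = F_\theta^{-1}.$$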
As a sanity check, we can verify that the estimator $\theta^*$ proposed above achieves this bound:
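a quick sketch, using only facts established above (the score has mean zero and covariance $F_\theta$). Since $\theta^*(x) - \theta = F_\theta^{-1}\,\nabla_\theta \log p(x\mid\theta)$,
$$\operatorname{cov}(\theta^*(x)) = F_\theta^{-1}\,\operatorname{cov}\!\left(\nabla_\theta \log p(x\mid\theta)\right) F_\theta^{-1} = F_\theta^{-1}\, F_\theta\, F_\theta^{-1} = F_\theta^{-1}.$$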