Occam's razor: Nonlinear Function
Created: March 02, 2022
Modified: March 03, 2022

Occam's razor

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

If two hypotheses are equally consistent with the data, the simpler is more likely to be 'true'. Formally, it is more likely to generalize, as in the Occam generalization bound.
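One standard form of that bound (a sketch, stated for a finite hypothesis class $H$; see a formal treatment for the precise conditions): if a hypothesis $h \in H$ is consistent with $m$ i.i.d. training examples, then with probability at least $1 - \delta$ its true error satisfies

\begin{align*} \text{err}(h) \le \frac{\ln|H| + \ln(1/\delta)}{m} \end{align*}

so a smaller ('simpler') hypothesis class gives a tighter guarantee on generalization.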

Occam's Razor is the philosophical intuition behind the learning principle of minimum description length.

Bayesian Occam's Razor

Occam's razor is often invoked to motivate explicit complexity penalties. Notably, Bayesian model selection and model averaging will tend to prefer 'simpler' models even without any explicit complexity penalty. The requirement that probabilities normalize to 1 effectively serves as an automatic complexity penalty, since a model that can explain many different outcomes will be unable to give much mass to any individual outcome.

Concretely, say we've observed data $X$, and are choosing between models $A$ and $B$. Model $A$ has no parameters and gives high likelihood to the observed data: $p_A(X) \approx 1$. Model $B$, on the other hand, is more flexible. It has a Boolean parameter $\theta$, which can be set to explain our data well ($p_B(X | \theta=0) \approx 1$), but can also be set to predict a different set of observations $Y$ ($p_B(Y | \theta = 1) \approx 1 \implies p_B(X | \theta = 1) \approx 0$).

Bayesian model selection considers the marginal likelihood of each model, integrating over parameter values. Assuming (for simplicity) a uniform prior on $\theta$, we get

\begin{align*} p_B(X) &= p_B(X | \theta=0)\,p_B(\theta = 0) + p_B(X | \theta=1)\,p_B(\theta = 1)\\ &\approx 1 \cdot \frac{1}{2} + 0 \cdot \frac{1}{2}\\ &= \frac{1}{2} \end{align*}

which is a lower likelihood than $p_A(X) \approx 1$ given above. Because model B can explain data other than what we observed, it fundamentally cannot assign as much mass to those observations as model A does.
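The marginal-likelihood sum above is simple enough to sketch numerically. All numbers here are the illustrative values from the example (not real data), and the dictionaries are just a convenient way to write a sum over a Boolean parameter:

```python
# Model A: no parameters, assigns essentially all its mass to the observed data X.
p_A_of_X = 1.0

# Model B: one Boolean parameter theta with a uniform prior.
# theta=0 explains X well; theta=1 instead explains a different dataset Y,
# so it assigns (approximately) zero mass to X.
likelihood_B = {0: 1.0,  # p_B(X | theta=0)
                1: 0.0}  # p_B(X | theta=1)
prior_theta = {0: 0.5, 1: 0.5}

# Marginal likelihood: integrate (here, sum) the likelihood over parameter values.
p_B_of_X = sum(likelihood_B[t] * prior_theta[t] for t in prior_theta)

print(p_A_of_X)  # 1.0
print(p_B_of_X)  # 0.5 -- half of model A's, despite B containing an equally good setting
```

The flexible model pays for its flexibility automatically: the prior mass it spreads over the alternative setting $\theta = 1$ is mass it cannot give to the data we actually saw.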