Modified: March 03, 2022
Occam's razor
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

If two hypotheses are equally consistent with the data, the simpler is more likely to be 'true'. Formally, it is more likely to generalize, as in the Occam generalization bound.
Occam's Razor is the philosophical intuition behind the learning principle of minimum description length.
Bayesian Occam's Razor
Occam's razor is often invoked to motivate explicit complexity penalties. Notably, Bayesian model selection and model averaging will tend to prefer 'simpler' models even without any explicit complexity penalty. The requirement that probabilities normalize to 1 effectively serves as an automatic complexity penalty, since a model that can explain many different outcomes will be unable to give much mass to any individual outcome.
Concretely, say we've observed data $D$, and are choosing between models $A$ and $B$. Model $A$ has no parameters and gives high likelihood to the observed data, $P(D \mid A)$. Model $B$, on the other hand, is more flexible. It has a Boolean parameter $\theta$, which can be set to explain our data well ($P(D \mid \theta = 0, B) = P(D \mid A)$), but can also be set to predict a different set of observations $D' \ne D$ ($P(D \mid \theta = 1, B) \approx 0$).
Bayesian model selection considers the marginal likelihood of each model, integrating over parameter values. Assuming (for simplicity) a uniform prior on $\theta$, we get

$$P(D \mid B) = \tfrac{1}{2} P(D \mid \theta = 0, B) + \tfrac{1}{2} P(D \mid \theta = 1, B) \approx \tfrac{1}{2} P(D \mid A),$$

which is a lower likelihood than model $A$ assigns. Because model $B$ can explain data other than what we observed, it fundamentally cannot assign as much mass to those observations as model $A$ does.
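To make the arithmetic concrete, here is a minimal numerical sketch. The likelihood value 0.9 is an arbitrary stand-in (not from the original example); the point only depends on model $B$ having to split its predictive mass across $\theta$.

```python
import numpy as np

# Hypothetical likelihoods for illustration only.
p_D_given_A = 0.9            # P(D | A): model A has no parameters
p_D_given_B_theta0 = 0.9     # P(D | theta=0, B): explains D as well as A
p_D_given_B_theta1 = 0.0     # P(D | theta=1, B): predicts D' instead, not D

# Uniform prior over the Boolean parameter theta, as in the text.
prior_theta = np.array([0.5, 0.5])

# Marginal likelihood of model B: sum over parameter settings.
p_D_given_B = prior_theta @ np.array([p_D_given_B_theta0, p_D_given_B_theta1])

print(f"P(D | A) = {p_D_given_A:.2f}")        # 0.90
print(f"P(D | B) = {p_D_given_B:.2f}")        # 0.45 -- half of model A's

# With equal prior probability on the two models, the posterior odds
# favour the simpler model A by this Bayes factor.
print(f"Bayes factor (A vs B) = {p_D_given_A / p_D_given_B:.1f}")  # 2.0
```

No explicit penalty term appears anywhere; model $B$ loses simply because its flexibility forces it to spread probability over outcomes that never happened.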