Modified: December 14, 2022
free lunch theorem
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.'no free lunch theorem' arguments are misleading because they consider the space of all possible functions. In fact, we usually care about learning functions that have some nice form or that can be compactly represented, and it turns out that such a limitation is itself enough to guarantee generalization in a very general sense. David McAllester calls this fact the 'free lunch theorem':
- If your hypothesis class is the set of all computer programs in an environment that you've chosen ahead of time (e.g., Python with some set of standard libraries), and
- You find a hypothesis in this class that performs well on the training data, and can be encoded using significantly fewer bits than the training set,An example of the connection between learning and compression.
- Then this hypothesis will perform well on test data with high probability.
The basic observation is that because the hypothesis class is much smaller than the set of all possible hypotheses (arbitrary maps from training points to labels), overfitting is (in some sense) impossible: the hypothesis class is too small to fit arbitrary labels, so if it can fit the labels we see, that’s a sign that the labels we see aren’t arbitrary, so there's a good chance the hypothesis will generalize. See the Occam generalization bound for a formalization of this argument.
McAllester summarizes the no free lunch theorem as 'Structureless functions exist. We are not going to learn structureless functions.' That statement is almost tautological once you reflect on it, but it leaves a lot of room for learning to work even in even very general hypothesis classes like 'the class of all short Python programs'.