no free lunch theorem
Created: October 02, 2021
Modified: March 04, 2022

no free lunch theorem

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

The folklore no-free-lunch 'theorem' in machine learning says that, for any pair of learning algorithms, there exists some dataset on which the first does better, and another on which the second does better. Thus there is no 'best' learning algorithm across all possible datasets.

Put differently: the performance of a learning algorithm depends on assumptions about the data, which are by nature true for some datasets but false for others. These assumptions constitute the algorithm's 'bias', and all learning algorithms necessarily have some bias in this sense.

Fortunately it turns out that even very weak assumptions, like 'the data has some sort of structure', are enough to enable learning in very general situations, so free lunches do exist for practical purposes (see the free lunch theorem). For this reason I think that no-free-lunch arguments are best seen as a reminder to be aware of the assumptions our algorithms make, rather than as any sort of hard barrier.

Formalization for Boolean functions

There is no single 'no free lunch' theorem; versions of the argument can be formalized in various models of learning. Here's a simple one that captures the common flavor.

We want to learn a Boolean function on length-$d$ binary strings: $f : \{0, 1\}^d \to \{0, 1\}$. That is, we'll see a dataset consisting of pairs $(x_i, f(x_i))$, where each $x_i$ is a binary string and each $f(x_i) \in \{0, 1\}$ a Boolean label for that string (if you like, you can imagine that the strings represent image pixels, and $f(x_i)$ indicates whether the corresponding image contains a cat), and our job is to recover the function $f$. Formally we'll say that a learning algorithm is a map from a dataset to a function in the hypothesis space.

Counting, we see that there are $2^d$ possible input strings, each of which has two possible outputs, so there are exactly $2^{2^d}$ possible functions. How well should we expect learning algorithms to work? How much data do we need to learn the true underlying function?
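As a quick sanity check on this counting, here is a minimal sketch (the choice of Python and of $d = 2$ are mine, not part of the argument) that represents each hypothesis as a full truth table and enumerates them all:

```python
from itertools import product

d = 2  # small enough to enumerate every Boolean function exhaustively

# All 2^d possible input strings, each represented as a tuple of d bits.
inputs = list(product([0, 1], repeat=d))

# A hypothesis is a complete truth table: one output bit for every input string.
# Enumerating every possible assignment of outputs gives every Boolean function.
hypotheses = [dict(zip(inputs, outputs))
              for outputs in product([0, 1], repeat=len(inputs))]

print(len(inputs))      # 4  = 2^d
print(len(hypotheses))  # 16 = 2^(2^d)
```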

Observing the label of any new point cuts the hypothesis space in half, since we can rule out the hypotheses that gave the 'wrong' label to that point. You might think this would be effective, because when you keep cutting things in half they get small exponentially quickly. But the space of all Boolean functions was giant, of size $2^{2^d}$, so winnowing it down to a single hypothesis will require $\log_2 2^{2^d} = 2^d$ inputs. That's the size of the entire input space!
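Here's a small simulation of that halving, under the same toy setup as above (Python, $d = 2$, hypotheses as truth tables): pick a 'true' function and reveal its labels one input at a time, keeping only the hypotheses consistent with what has been seen so far.

```python
import random
from itertools import product

d = 2
inputs = list(product([0, 1], repeat=d))
hypotheses = [dict(zip(inputs, outs))
              for outs in product([0, 1], repeat=len(inputs))]

true_f = random.choice(hypotheses)  # the function we're trying to learn
consistent = hypotheses

for x in inputs:
    # Keep only the hypotheses that agree with the true label at x.
    consistent = [h for h in consistent if h[x] == true_f[x]]
    print(len(consistent))  # 8, 4, 2, 1: one halving per observed label
```

Only after all $2^d$ labels have been observed does a single hypothesis remain.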

We conclude that because our hypothesis space is so big, learning a function requires us to see its behavior on all possible inputs. For any given possible labeling of the first $2^d - 1$ inputs, there is a hypothesis consistent with that labeling that labels the final point as 0, and an otherwise identical hypothesis that labels it as 1; the previous labels do nothing to help us predict at the new input. In the absence of any further assumptions about the hypothesis space, learning is impossible in this setting.
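To see the last point concretely, label all but one input in some arbitrary way (the all-zeros labeling below is just an example) and look at which hypotheses survive: exactly two remain, and they disagree only on the held-out point. A minimal sketch under the same assumptions as the snippets above:

```python
from itertools import product

d = 2
inputs = list(product([0, 1], repeat=d))
hypotheses = [dict(zip(inputs, outs))
              for outs in product([0, 1], repeat=len(inputs))]

# Observe labels for every input except the last; the labeling itself is arbitrary.
observed = {x: 0 for x in inputs[:-1]}
held_out = inputs[-1]

survivors = [h for h in hypotheses
             if all(h[x] == y for x, y in observed.items())]

print(len(survivors))                    # 2
print([h[held_out] for h in survivors])  # [0, 1]: they disagree on the unseen input
```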