Occam generalization bound
Created: March 03, 2022
Modified: April 20, 2022

Occam generalization bound

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Here's a formal argument for Occam's razor, adapted from Theorem 2.2 of Kearns and Vazirani's "An Introduction to Computational Learning Theory" and Theorem 1 from David McAllester's Pac-Bayes tutorial (https://arxiv.org/abs/1307.2118).

Result

Let $L(h)$ denote the population-level mistake probability (aka generalization error) of a hypothesis $h\in\mathcal{H}$, and call a hypothesis 'bad' if $L(h) > \epsilon$. (This analysis applies to classification with 0/1 loss, where perfect fitting is possible. It can be generalized to other losses using concentration inequalities; see the McAllester tutorial for details.) A 'bad' hypothesis might still happen to label a given input correctly, but it will do this with probability at most $1 - \epsilon$, by definition. It follows that the probability that any given 'bad' hypothesis will perfectly fit $n$ independently selected training examples is at most $(1 - \epsilon)^n$.

Let $\mathcal{H}' \subseteq \mathcal{H}$ be the set of 'bad' hypotheses: by a union bound, the probability that some 'bad' hypothesis is consistent with the data is at most $|\mathcal{H}'|(1 - \epsilon)^n$, which is itself at most $|\mathcal{H}|(1 - \epsilon)^n$. If we require this to be at most $\delta$:

$$|\mathcal{H}|(1 - \epsilon)^n \le \delta$$

then it follows that (the approximation $\log \frac{1}{1 - \epsilon} \approx \epsilon$ used below comes from a first-order Taylor expansion of the natural logarithm)

$$n \ge \frac{\log |\mathcal{H}| + \log 1/\delta}{\log \frac{1}{1 - \epsilon}} \approx \frac{1}{\epsilon}\left(\log |\mathcal{H}| + \log 1/\delta\right)$$

is sufficient to guarantee with probability at least $1-\delta$ that a hypothesis consistent with the training data will have test loss of at most $\epsilon$.
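A minimal sketch of this calculation, assuming we just want to plug numbers into the bound (the function name and example values here are arbitrary, not from any particular source):

```python
import math

def occam_sample_size(num_hypotheses, eps, delta):
    """Sample size sufficient for the Occam bound, in exact and approximate forms."""
    log_term = math.log(num_hypotheses) + math.log(1 / delta)
    exact = log_term / math.log(1 / (1 - eps))  # solve |H| (1 - eps)^n <= delta for n
    approx = log_term / eps                     # using log(1/(1 - eps)) ~= eps
    return math.ceil(exact), math.ceil(approx)

# e.g. a hypothesis class indexed by 20 bits, 1% error, 99% confidence
print(occam_sample_size(num_hypotheses=2**20, eps=0.01, delta=0.01))  # -> (1838, 1847)
```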

Discussion

Intuitively, the empirical (training) loss of any hypothesis is a random variable, which by the law of large numbers converges to the population (test) loss as $n\to\infty$. Failures of generalization occur when a hypothesis 'gets lucky' on a given dataset by obtaining an empirical loss significantly better than its population loss. The fewer hypotheses we consider, the fewer chances there are for this to happen, so limiting ourselves to smaller hypothesis classes leads to more reliable generalization.

Importantly, these hypotheses don't have to be 'simple' in any recognizable sense of the term. The size of the class is all that matters for this bound: a class containing ten randomly-selected crazy hypotheses is just as good as one containing the ten 'simplest' possible hypotheses by some subjective criterion. The question of what gets to count as 'simple' is analogous to the choice of a compression algorithm, or the choice of programming environment that determines the constant term in Kolmogorov complexity.
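To make the 'getting lucky' intuition concrete, here's a small Monte Carlo sketch under artificial assumptions of my own: every hypothesis in the class is 'bad' with mistake probability exactly $\epsilon$, and whether each one fits the data is independent of the others (the union bound itself does not need independence; it just makes simulation easy). The chance that some bad hypothesis fits all $n$ training points grows with the size of the class:

```python
import numpy as np

rng = np.random.default_rng(0)

eps, n, trials = 0.05, 100, 10_000
p_fit = (1 - eps) ** n  # chance a single bad hypothesis fits all n points

for num_hypotheses in [1, 10, 100, 1000]:
    # each row is one simulated dataset; each column one bad hypothesis
    fits = rng.random((trials, num_hypotheses)) < p_fit
    observed = fits.any(axis=1).mean()  # fraction of trials where some bad hypothesis fit
    union_bound = min(1.0, num_hypotheses * p_fit)
    print(f"|H| = {num_hypotheses:4d}: observed {observed:.3f}, union bound {union_bound:.3f}")
```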

What does this tell us in practice? Say we want to guarantee with 99% probability ($\delta = 0.01$) that a hypothesis consistent with the training set will have test loss of less than $\epsilon = 0.01$. Under this bound, we'd need

$$n \ge 68.97 \cdot (\log_2 |\mathcal{H}| + 6.64)$$

where $\log_2 |\mathcal{H}|$ is the number of bits needed to specify a hypothesis in the class. (Here we've converted the logarithms to base 2 using $\log_2 x = \ln x / \ln 2$, which is where the factor $\ln 2 \approx 0.69$ comes from.) This implies a need for about 69 extra training points every time we double the size of the hypothesis space. For example, if our hypothesis space is all functions that can be expressed as Python programs of less than 1000 characters (8000 bits), then we'd expect to need on the order of 458 + 69 * 8000 ≈ 552,458 training examples.
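A tiny sketch to reproduce these constants (the variable names are mine; the 8000-bit case corresponds to the Python-programs example above):

```python
import math

delta, eps = 0.01, 0.01
coeff = math.log(2) / math.log(1 / (1 - eps))  # ~68.97
bits_for_delta = math.log2(1 / delta)          # ~6.64

for bits in [0, 1, 8000]:  # 8000 bits ~ Python programs under 1000 characters
    n = coeff * (bits + bits_for_delta)
    print(f"{bits:5d} bits of hypothesis space -> n >= {n:,.0f}")
```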

Note that this bound has no dependence on the size of the input space. Of course, if there are fewer than 552,458 possible inputs, the bound is not very interesting. But importantly, it works even if the input space is enormously large; say, the set of all 20-megapixel images represented with 32 bits per pixel.

Qualitatively, what this analysis shows is that if your hypothesis class is too big, then it becomes more likely that some hypothesis will 'get lucky' and fit the training data even though it performs poorly on test data. Choosing such a hypothesis is called overfitting. The way to prevent this is to constrain your hypothesis class, either explicitly or through regularization.

Failed analyses

To understand what the result is doing, it's instructive to understand some plausible analyses that don't quite work.

We don't need to pick the correct hypothesis. We might be tempted to consider the probability that we've picked out exactly the correct function $h^*$, but this is too strong a condition; it doesn't give a good bound. Assume for simplicity that the data distribution is the uniform distribution over the input space, so $p_\mathcal{D}(x) = 1 / |D|$ for all inputs $x$, where $|D|$ is the number of possible inputs. Since any incorrect hypothesis $h \ne h^*$ must disagree with $h^*$ on at least one input point, it follows that $h$ makes a mistake with probability $\ge 1 / |D|$. So the probability that a wrong hypothesis $h$ classifies all of our $n$ training samples correctly is $\le (1 - 1 / |D|)^n$. Using Bayes' rule, we have

$$p(h \ne h^* \mid h \text{ makes no mistakes on training set}) \le \frac{\left(1 - 1 / |D|\right)^n \cdot p(h \ne h^*)}{\left(1 - 1 / |D|\right)^n \cdot p(h \ne h^*) + 1 \cdot p(h = h^*)}$$

Under a uniform prior on hypotheses we have $p(h = h^*) = 1 / |\mathcal{H}|$ and $p(h \ne h^*) = \frac{|\mathcal{H}| - 1}{|\mathcal{H}|}$. Substituting this in and multiplying through, we have

$$p(h \ne h^* \mid h \text{ makes no mistakes on training set}) \le \frac{\left(1 - 1 / |D|\right)^n \left(|\mathcal{H}| - 1\right)}{\left(1 - 1 / |D|\right)^n \left(|\mathcal{H}| - 1\right) + 1} \le \frac{\left(1 - 1 / |D|\right)^n |\mathcal{H}|}{\left(1 - 1 / |D|\right)^n |\mathcal{H}| + 1}$$

Now suppose we want to bound the probability of choosing the wrong hypothesis (the left hand side) by some small value $\epsilon$. Then it suffices to require $\frac{X}{X + 1} \le \epsilon$, or equivalently $X \le \frac{\epsilon}{1 - \epsilon}$, for $X = \left(1 - 1 / |D|\right)^n |\mathcal{H}|$. We can already tell that this won't yield a useful bound, because to make this small we'd need $\left(1 - 1/|D|\right)^n$ small, which would require roughly $n > |D|$, i.e., we'd have to see enough points so that we're likely to see every possible point during training.
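To spell this out (a rough calculation, using the standard approximation $(1 - 1/|D|)^n \approx e^{-n/|D|}$, which is close for large $|D|$), the condition $X \le \frac{\epsilon}{1 - \epsilon}$ becomes

$$|\mathcal{H}|\, e^{-n/|D|} \le \frac{\epsilon}{1 - \epsilon} \quad\Longleftrightarrow\quad n \ge |D| \left( \log |\mathcal{H}| + \log \frac{1 - \epsilon}{\epsilon} \right),$$

so the required sample size scales linearly with the size of the input space, in contrast to the Occam bound above.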

The problem with the above analysis is that it tries to prove too much. Since other functions $h$ could differ from the true function $h^*$ on only a single input, we have to see every input to be sure of getting the true function. This is true even if $\mathcal{H}$ is very small; for example, even if it contains just two possible hypotheses, those hypotheses can still be very hard to distinguish.

Not a Bayesian analysis of hypotheses. We might identify two sources of uncertainty in the standard iid learning model:

  • We don't know which hypothesis generated the population.
  • We don't know which population members will be selected to form the training data.

The probabilities in the analyses above account only for the latter of these: uncertainty over the training data. The Occam bound shows that with high probability we will sample data such that all hypotheses end up within some small fudge margin of their expected loss. This requires no assumptions about the distribution of losses incurred by individual hypotheses, but at least in principle we should be able to get tighter bounds by making such assumptions.

Let $\hat{L}(h)$ denote the empirical/training loss of a hypothesis $h\in \mathcal{H}$, and $L(h)$ the population/test loss. Let $p(h)$ be a prior distribution over hypotheses (perhaps the uniform distribution). Suppose we see $n$ training examples and select a hypothesis $\hat{h}$ that works on all of them (so $\hat{L}(\hat{h}) = 0$). Conditioned on this criterion there is a unique posterior distribution over the test loss $L(\hat{h})$, and in particular, a posterior probability that $L(\hat{h}) > \epsilon$ for any error rate $\epsilon$. Indeed, by Bayes' rule, this probability is

$$p(L(h) > \epsilon \mid \hat{L}(h) = 0) = \frac{p(\hat{L}(h) = 0 \mid L(h) > \epsilon)\, p(L(h) > \epsilon)}{p(\hat{L}(h) = 0)}$$

Insofar as Bayesian reasoning gives precise answers given all conditioning information, this is in some sense the 'perfect' generalization bound. Unfortunately, it's not particularly useful, because it requires access to the prior distribution on population losses $p(L(h))$ implied by $p(h)$: in particular, the prior probability that a hypothesis has population loss greater (or less) than $\epsilon$, and the likelihood that such a hypothesis has training loss of zero. (This likelihood depends on the full distribution of population losses; just knowing that the loss is greater or less than $\epsilon$ is not sufficient to compute it. To see this, observe that a hypothesis space in which the population loss is concentrated at a point just barely above $\epsilon$ is different from one in which there's a large tail of losses well above $\epsilon$.) But in general we don't know the population loss of any of our hypotheses, let alone all of them! (If we did know the population loss of every hypothesis, the learning problem would be trivial, since we'd just choose the hypothesis with the lowest loss.)
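To illustrate what such an 'exact' calculation would look like if we did have that information, here's a toy sketch in which I simply assume a made-up discrete prior over population losses; the particular loss values and weights are arbitrary and purely illustrative:

```python
import numpy as np

# Hypothetical prior over population losses L(h), implied by some p(h) we don't actually have.
loss_values = np.array([0.0, 0.005, 0.02, 0.1, 0.5])  # possible population losses
prior = np.array([0.1, 0.2, 0.3, 0.3, 0.1])           # prior mass on each value

n, eps = 500, 0.01

# For 0/1 loss, a hypothesis with population loss L fits n iid examples
# perfectly with probability (1 - L)^n.
likelihood = (1.0 - loss_values) ** n

# Posterior over population losses given that the chosen hypothesis fit all n examples.
posterior = prior * likelihood
posterior /= posterior.sum()

# Posterior probability that the chosen hypothesis is 'bad', i.e. L(h) > eps.
p_bad = posterior[loss_values > eps].sum()
print(f"p(L(h) > {eps} | zero training loss) = {p_bad:.4f}")
```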