Modified: March 02, 2022
double descent
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Empirically, as model capacity increases past the memorization threshold (the point at which the model can just barely fit the training data exactly), generalization error starts decreasing again. 'Model capacity' here is measured in number of parameters, number of training steps, strength of regularization, or (equivalently, in a relative sense) smallness of the dataset.
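As a concrete toy illustration (my own sketch, not part of the original notes): fitting a small 1-D dataset with random ReLU features and a minimum-norm linear readout typically shows test error rising as the number of features approaches the number of training points, then falling again as the model becomes heavily overparameterized.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_features(x, W, b):
    """Random ReLU features: column j is max(0, W[j] * x + b[j])."""
    return np.maximum(0.0, np.outer(x, W) + b)

# Small 1-D regression problem with a bit of label noise.
n_train, n_test = 20, 500
x_train = rng.uniform(-1, 1, n_train)
x_test = rng.uniform(-1, 1, n_test)
target = lambda x: np.sin(4 * x)
y_train = target(x_train) + 0.1 * rng.standard_normal(n_train)
y_test = target(x_test)

for width in [5, 10, 15, 20, 25, 50, 200, 1000]:
    # Fresh random features of the given width.
    W = rng.standard_normal(width)
    b = rng.standard_normal(width)
    Phi_train = relu_features(x_train, W, b)
    Phi_test = relu_features(x_test, W, b)
    # lstsq returns the least-squares fit; when width > n_train it is the
    # minimum-norm interpolating solution.
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    print(f"width={width:5d}  test MSE={test_mse:.3f}")
```

The peak in test error near width = 20 (the number of training points) followed by the second drop is the double descent curve; the exact shape varies with the seed and noise level.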
An intuitive explanation is that the loss generally consists of a data-fit term and a regularization term. At first, the data-fit term dominates, forcing the model to come up with an explanation that fits the data, even if the explanation is somewhat convoluted (think adding epicycles). At the threshold (where the model can only just fit the data) there may be only one such explanation available. But given additional capacity, the model has a choice of multiple explanations, and guided by the regularization term it will prefer the simplest. These simpler explanations are more likely to generalize.
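Spelling out one concrete instance (squared-error fit with an L2 penalty; my notation, not from the original notes):

$$\mathcal{L}(\theta) \;=\; \underbrace{\sum_{i=1}^{n} \big(f_\theta(x_i) - y_i\big)^2}_{\text{data fit}} \;+\; \underbrace{\lambda \,\|\theta\|^2}_{\text{regularization}}.$$

Past the threshold, many settings of $\theta$ drive the first term to (near) zero, and the second term selects among them the one with smallest norm.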
Counterintuitively, an overparameterized model can contain simpler explanations than a less-parameterized model. Consider fitting a degree-$n$ polynomial to $n+1$ points, versus fitting a degree-$m$ polynomial for $m > n$. In the latter case, we have the option of just reusing the former fit, with zeros for the added coefficients, so the higher-degree polynomial will never require a larger-norm coefficient vector to fit the data. But we also have many other options, so in general we will obtain a lower-norm coefficient vector.
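A quick numerical check of this claim (a sketch I added; it relies on the fact that `np.linalg.lstsq` returns the minimum-norm solution when the system is underdetermined):

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 8
x = np.linspace(-1, 1, n_points)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(n_points)

def min_norm_poly_fit(x, y, degree):
    """Fit a polynomial of the given degree; for underdetermined systems
    np.linalg.lstsq returns the minimum-norm coefficient vector."""
    X = np.vander(x, degree + 1, increasing=True)  # columns: x^0, ..., x^degree
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# At the threshold: a degree n-1 polynomial through n points has exactly one fit.
c_exact = min_norm_poly_fit(x, y, n_points - 1)
# Overparameterized: infinitely many exact fits; lstsq picks the smallest-norm one.
c_over = min_norm_poly_fit(x, y, 30)

print("coefficient norm at the threshold:  ", np.linalg.norm(c_exact))
print("coefficient norm, overparameterized:", np.linalg.norm(c_over))
```

The overparameterized coefficient norm is guaranteed to be at most as large as the exact-fit norm, and in practice it is usually smaller, since the mass can be spread across many near-redundant basis functions.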
In fact, there are theoretical reasons to expect that overparameterized models allow smoother interpolations.
Surprisingly, double descent shows up empirically even when there is no explicit regularization term. Our current thinking is that this is because SGD is itself implementing a form of implicit regularization, but this is not well understood (at least by me).
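One small, well-understood piece of this picture (my example, using full-batch gradient descent on a linear model rather than SGD on a network): gradient descent on squared error, initialized at zero, converges to the minimum-norm interpolating solution, because the iterates never leave the row space of the data matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 50                       # more parameters than data points
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Full-batch gradient descent on 0.5 * ||Xw - y||^2, starting from zero.
w = np.zeros(p)
lr = 0.01
for _ in range(20000):
    w -= lr * (X.T @ (X @ w - y))

w_min_norm = np.linalg.pinv(X) @ y  # explicit minimum-norm interpolant

print("training residual:        ", np.linalg.norm(X @ w - y))
print("distance to min-norm soln:", np.linalg.norm(w - w_min_norm))
```

Both printed quantities come out near zero: the optimizer interpolates the data and, without any explicit penalty, lands on the smallest-norm solution.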
Q: What is the Bayesian relevance of double descent?

A: In some sense it isn't surprising. Integrating over a larger hypothesis space should, ultimately, always be better than integrating over a smaller one. Even if the small space contains hypotheses that explain the data, the larger space may contain hypotheses with higher prior probability (i.e., lower complexity).

The nature of the prior does matter for this effect. For example, a high-degree polynomial might in general have higher prior probability than a lower-degree polynomial, but not if we use a prior inversely proportional to the degree.
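To make the quantity being compared explicit (my gloss, not in the original notes): for a hypothesis space $\mathcal{H}$ with prior $p(h \mid \mathcal{H})$ over its hypotheses, the Bayesian score of $\mathcal{H}$ is the marginal likelihood

$$p(D \mid \mathcal{H}) \;=\; \sum_{h \in \mathcal{H}} p(D \mid h)\, p(h \mid \mathcal{H}),$$

so whether the larger space wins depends on how its prior spreads mass over hypotheses that explain $D$ well.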