Modified: March 03, 2022
A Universal Law of Robustness via Isoperimetry
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
Link: A Universal Law of Robustness via Isoperimetry | OpenReview
This paper purports to explain (and quantify) the observed fact that overparameterized deep networks seem to be more robust to adversarial examples.
Summary: We can generally use $n$ parameters to fit $n$ unknowns, but there's no guarantee what will happen at other points in the input space. It turns out that we actually need $nd$ parameters, where $d$ is the dimension of the input space, if we want to guarantee that the interpolating function is smooth, in the Lipschitz sense.
We might intuitively think of smoothness as adding an additional $d$ constraints for each input point, saying that moving slightly in any of the $d$ possible directions shouldn't change the function value by very much (though this isn't quite right, because a smooth function has to be smooth everywhere, not just at the training points).
It's easy to see that roughly $nd$ parameters are sufficient: we can interpolate any set of $n$ points using $n(d+1)$ parameters by putting a finite-width bump at each of the points, where each bump uses $d$ parameters for its location and 1 parameter for its height. This works as long as the bumps are small enough to not collide with each other. The expected distance between any two points from a $d$-dimensional Gaussian increases as $\sqrt{d}$, so in high dimensions we can get away with using wider (smoother) bumps.
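A minimal numerical sketch of this construction (my own illustration; the function and variable names are not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 50                       # number of training points, input dimension
X = rng.standard_normal((n, d))      # inputs drawn from a d-dimensional Gaussian
y = rng.choice([-1.0, 1.0], size=n)  # arbitrary +/-1 labels to memorize

# Pairwise distances between Gaussian points concentrate around sqrt(2d),
# so non-colliding bumps can have radius growing like sqrt(d).
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
r = dists[~np.eye(n, dtype=bool)].min() / 2.0   # bumps of this radius never overlap

def bump(t):
    """A 1-Lipschitz 'tent' profile: 1 at t = 0, identically 0 for t >= 1."""
    return np.clip(1.0 - t, 0.0, None)

def f(x):
    """Interpolant built from n bumps: n*d location parameters + n heights."""
    return float(np.sum(y * bump(np.linalg.norm(x - X, axis=-1) / r)))

assert all(abs(f(X[i]) - y[i]) < 1e-9 for i in range(n))  # exact interpolation
print(f"bump radius {r:.2f} => Lipschitz constant at most {1.0 / r:.2f}")
```

Note that this uses roughly $n(d+1)$ parameters ($d$ coordinates plus one height per bump), matching the count above, and the smoothness improves as the points spread out in higher dimension.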
The result requires that the distribution $\mu$ of input points satisfies isoperimetry: for any bounded $L$-Lipschitz function $f$, the random variable $f(x)$, $x \sim \mu$, must satisfy
$$\Pr\big[\,|f(x) - \mathbb{E}[f(x)]| \ge t\,\big] \;\le\; 2\,e^{-\frac{d t^2}{2 c L^2}}.$$
In other words, the distribution of inputs must be such that any $L$-Lipschitz random variable defined on them has sub-Gaussian tails, after rescaling by $L/\sqrt{d}$. How does this relate to the standard isoperimetric inequalities? This assumption turns out to do most of the work.
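As a quick sanity check (my own illustration, not from the paper), here is the concentration kicking in for a simple bounded 1-Lipschitz function, using $x \sim \mathcal{N}(0, I_d/d)$ as one normalization under which the constant $c$ is dimension-free:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_tail(d, t=0.2, n_samples=50_000):
    """Empirical P(|f(x) - E f(x)| >= t) for the bounded 1-Lipschitz
    function f(x) = min(||x||, 2), with x ~ N(0, I_d / d)."""
    x = rng.standard_normal((n_samples, d)) / np.sqrt(d)
    f = np.minimum(np.linalg.norm(x, axis=1), 2.0)
    return np.mean(np.abs(f - f.mean()) >= t)

for d in [2, 8, 32, 128]:
    bound = 2 * np.exp(-d * 0.2**2 / 2)   # 2 exp(-d t^2 / (2 c L^2)) with c = L = 1
    print(f"d={d:4d}  empirical tail={empirical_tail(d):.4f}  bound={bound:.4f}")
```

The empirical tail probability drops off with $d$ even faster than the (already shrinking) sub-Gaussian bound.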
Proof sketch: Let $f$ be an $L$-Lipschitz function with outputs in $[-1, 1]$, and without loss of generality suppose that $\mathbb{E}[f(x)] \le 0$ under the input distribution. Suppose for simplicity that we require $f$ to exactly fit the training points. By isoperimetry, the probability that $f(x_i) \ge 1$ at any single training point is bounded by something like $e^{-d/L^2}$, so the probability that $f$ fits all of the positive training points (roughly $n/2$ of them, say) is at most $e^{-\Omega(nd/L^2)}$. Repeating this argument for every function $f$ in our hypothesis class $\mathcal{F}$, by a union bound we get that the probability that any $f \in \mathcal{F}$ fits all of the positive points is at most
$$|\mathcal{F}| \cdot e^{-\Omega(nd/L^2)}.$$
So if $\log|\mathcal{F}| \ll nd/L^2$, it is unlikely that any $L$-Lipschitz function in $\mathcal{F}$ will be able to fit our data. For any smoothly parameterized family with $p$ parameters we expect $\log|\mathcal{F}| \approx p$ (up to log factors) by a standard discretization argument; it follows that $p$ should be on the order of $nd/L^2$ in order for our family to be big enough.
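Turning this around (my paraphrase, with constants and log factors suppressed): any $f \in \mathcal{F}$ that does manage to fit the data must satisfy
$$\mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{nd}{p}},$$
so achieving $\mathrm{Lip}(f) = O(1)$ forces $p \gtrsim nd$.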
The full proof in the paper accounts for label noise (imperfect fitting) and allows the input distribution $\mu$ to be a mixture of isoperimetric distributions, but doesn't give much more insight AFAICT.
The paper includes an extended analysis of the implications for deep networks, and speculates that we might get robust ImageNet networks just by making our models another one or two orders of magnitude larger.
Discussion
Q: Is the isoperimetry assumption realistic?
A: It holds for Gaussian distributions, and for uniform distributions on the high-dimensional sphere and hypercube, among others, so it doesn't seem crazy.
Q: How does this connect to double descent?
A: This paper argues that overparameterization is necessary to find a function that smoothly interpolates high-dimensional data. If we believe our target function is smooth, then only a high-capacity model is likely to capture it. Double descent can in principle be more general: it observes that larger families can contain simpler models according to arbitrary notions of simplicity, not just smoothness. But this result gives at least one notion of simplicity for which we concretely expect to need larger families.
Q: Does a smooth interpolation always exist?
A: A given set of labels may not admit an $L$-Lipschitz interpolation for a given $L$: for example, the pair of 1D points $(0, 0)$ and $(\epsilon, 1)$ does not admit an $L$-Lipschitz interpolation for any $L < 1/\epsilon$. For any dataset there is some minimum $L$ below which interpolation is not possible. So we can't get arbitrarily smooth interpolations, but we can hope to get within a constant factor of this minimal $L$.
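Concretely, the minimal $L$ for a dataset is just the steepest "slope" between any two training points, and any $L$ at least that large is achievable (e.g. via the McShane extension). A small sketch, in my own notation:

```python
import numpy as np
from itertools import combinations

def min_lipschitz(X, y):
    """Smallest L for which an L-Lipschitz interpolant of (X, y) exists:
    L_min = max_{i != j} |y_i - y_j| / ||x_i - x_j||.  Any L >= L_min is
    achieved e.g. by the McShane extension f(x) = min_i (y_i + L ||x - x_i||)."""
    return max(abs(y[i] - y[j]) / np.linalg.norm(X[i] - X[j])
               for i, j in combinations(range(len(y)), 2))

# Two nearby 1D points with very different labels force a large L.
print(min_lipschitz(np.array([[0.0], [0.01]]), np.array([0.0, 1.0])))  # 100.0
```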
Q: Do we really need $nd$ parameters to represent a smooth function?
A: Of course we can represent any given smooth function with zero parameters by simply choosing our function class to contain exactly that function and nothing else, so in a sense no. But such a class can't represent an arbitrary labeling of $n$ points.
Q: The finite-width bump construction used for intuition is a bit weird: it's literally memorizing the training data. Indeed, any model with $\sim nd$ parameters is capable of memorizing the training data. But memorizing the data is bad! Any model that generalizes will need to abstract away from and compress the raw training set, not memorize it. What gives?
A: The authors speculate that the bound should be interpreted in terms of the effective data dimension. If the data live on a $k$-dimensional manifold for $k \ll d$, then perhaps we need only $nk$ parameters to smoothly interpolate between them. The problem would reduce to finding the manifold corresponding to the optimal 'feature representation' of the data, and then memorizing labels in that space.
Q: Does this result mean that we'll need infinitely large models as $n \to \infty$?
A: The result is that we need $nd$ parameters to smoothly interpolate an arbitrary labeling of $n$ points. If we assume that the true data-generating function is smooth, then there is some $n$ at which we will have essentially tiled the space, so any new points we see will be 'close enough' to existing points that they effectively give no new information, because we can predict their values by smooth interpolation of the existing points. Heuristically: suppose we divide the unit cube $[0,1]^d$ into bins of side width $\epsilon$; then to fill all the bins we would expect to need $n \approx \epsilon^{-d}$ points (TODO: this isn't quite right since we should really be analyzing balls, not bins). Conversely, we can also observe that for any given $n$ there is a minimum $L$ at which we should expect to be able to fit an arbitrary label set.
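A back-of-envelope version of that tiling heuristic (my own illustration): a coupon-collector count of how many uniform samples it takes to land in every $\epsilon$-bin of the unit cube, which blows up exponentially in $d$:

```python
import math

def samples_to_fill_bins(d, eps=0.1):
    """Expected number of uniform samples from [0,1]^d needed to hit all
    (1/eps)^d bins of side eps at least once (coupon collector: ~ B ln B)."""
    bins = (1.0 / eps) ** d
    return bins * math.log(bins)

for d in [1, 2, 5, 10, 20]:
    print(f"d={d:2d}: ~{samples_to_fill_bins(d):.2e} samples")
```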
Q: Does this analysis require a random function, or a distribution over functions?
A: No. The only randomness is over the choice of the inputs $x_i$ (and, in the full analysis, the label noise).
Q: How does this relate to an Occam generalization bound, which implies that fewer parameters and smaller hypothesis classes are better?
A: The Occam bound shows that if a hypothesis in $\mathcal{F}$ has low training error, then it is likely to also have low generalization error, as long as $\mathcal{F}$ is not too large. This result shows that if we take $\mathcal{F}$ to be a class of smooth (Lipschitz) functions, we are unlikely to even get to the point of finding a hypothesis with low training error unless $\mathcal{F}$ is at least of size $e^{\Omega(nd)}$. Once we do get there, it says nothing about whether such a function will generalize. Indeed, if smoothness is the only structure we assume, then 'generalization' would require observing enough points to effectively tile the input space. This is a version of the no free lunch theorem, with the smoothness assumption effectively discretizing the space. That would require $\approx \epsilon^{-d}$ points, which as we saw above is exponentially large in $d$, so in general we'll need to make additional assumptions.
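For reference, the finite-class Occam bound I have in mind here is the standard one (not from this paper): with probability at least $1-\delta$ over the training sample, every $f \in \mathcal{F}$ satisfies
$$\mathrm{err}_{\mathrm{test}}(f) \;\le\; \mathrm{err}_{\mathrm{train}}(f) + \sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2n}},$$
which is vacuous until $n \gtrsim \log|\mathcal{F}|$, i.e. until $n$ is comparable to the parameter count.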