A Universal Law of Robustness via Isoperimetry: Nonlinear Function
Created: March 02, 2022
Modified: March 03, 2022

A Universal Law of Robustness via Isoperimetry

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Link: A Universal Law of Robustness via Isoperimetry | OpenReview

This paper purports to explain (and quantify) the observed fact that overparameterized deep networks seem to be more robust to adversarial examples.

Summary: We can generally use $n$ parameters to fit $n$ unknowns, but there's no guarantee what will happen at other points in the input space. It turns out that we actually need $nd$ parameters, where $d$ is the dimension of the input space, if we want to guarantee that the interpolating function is smooth, in the Lipschitz sense.

We might intuitively think of smoothness as adding an additional $d$ constraints for each input point, saying that moving slightly in any of the $d$ possible directions shouldn't change the function value by very much (though this isn't quite right because a smooth function is smooth everywhere).

It's easy to see that $O(nd)$ parameters are sufficient: we can interpolate any set of points using $n(d+1)$ parameters by putting a finite-width bump at each of the $n$ points, where each bump uses $d$ parameters for its location and 1 parameter for its height. This works as long as the bumps are small enough to not collide with each other. The expected distance between any two points from a $d$-dimensional Gaussian increases as $\sqrt{d}$, so in high dimensions we can get away with using wider (smoother) bumps.
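
As a sanity check, here's a minimal sketch of this bump construction (my own illustration, not code from the paper), with assumed Gaussian inputs and tent-shaped bumps:

```python
# Minimal sketch of the bump construction: memorize n points in R^d using
# n*(d+1) numbers (d per bump location, 1 per bump height). Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 20
X = rng.standard_normal((n, d))   # the n*d "location" parameters
y = rng.standard_normal(n)        # the n "height" parameters (arbitrary labels)

# Choose the bump radius small enough that no two bumps overlap.
dists = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
np.fill_diagonal(dists, np.inf)
radius = dists.min() / 2

def f(x):
    """Sum of tent-shaped bumps: equals y_i exactly at X_i, zero far away."""
    r = np.linalg.norm(X - x, axis=-1)
    return np.maximum(0.0, 1.0 - r / radius) @ y

assert np.allclose([f(xi) for xi in X], y)   # exact interpolation of all n points
```

The Lipschitz constant of this interpolant is roughly $\max_i |y_i| / \text{radius}$, which is why the $\sqrt{d}$ typical spacing matters: wider bumps mean a smoother function.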

The result requires that the distribution of input points $p(x)$ satisfies isoperimetry: for any bounded $L$-Lipschitz function $f$, the random variable $f(x)$ must satisfy

$$p\left(|f(x) - \mathbb{E}f(x)| > t\right) \le e^{-dt^2 / (2cL^2)}$$

In other words, the distribution of inputs must be such that any $L$-Lipschitz random variable defined on them has sub-Gaussian tails, after rescaling by $d/L^2$. (How does this relate to the standard isoperimetric inequalities?) This assumption turns out to do most of the work.
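
A quick Monte Carlo illustration (mine, not the paper's; it assumes the inputs are normalized as $x \sim \mathcal{N}(0, I_d/d)$ so that $\mathbb{E}\|x\|^2 = 1$): the 1-Lipschitz function $f(x) = \|x\|$ should have tails sitting below $e^{-dt^2/2}$, i.e. the bound with $c = L = 1$.

```python
# Monte Carlo check of the sub-Gaussian tail for a 1-Lipschitz function of
# normalized Gaussian inputs. Illustration only; assumes x ~ N(0, I_d / d).
import numpy as np

rng = np.random.default_rng(0)
d, samples = 512, 200_000
x = rng.standard_normal((samples, d)) / np.sqrt(d)
fx = np.linalg.norm(x, axis=1)          # f(x) = ||x|| is 1-Lipschitz
dev = np.abs(fx - fx.mean())
for t in (0.05, 0.10, 0.15):
    empirical = (dev > t).mean()
    bound = np.exp(-d * t**2 / 2)       # e^{-d t^2 / (2 c L^2)} with c = L = 1
    print(f"t={t:.2f}  empirical={empirical:.2e}  bound={bound:.2e}")
```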

Proof sketch: Let $f$ be an $L$-Lipschitz function with outputs in $\mathbb{R}$, and without loss of generality suppose that $\mathbb{E}f(x) \le 0$ under the input distribution. Suppose for simplicity that we require $f$ to exactly fit the training points. By isoperimetry, the probability that $f(x) = 1$ is bounded by something like $e^{-d/L^2}$, so the probability that $f$ fits all $O(n)$ positive training points is at most $e^{-nd/L^2}$. Repeating this argument for every function in our hypothesis class, by a union bound we get that the probability that any $f \in \mathcal{F}$ fits all of the positive points is at most

$$|\mathcal{F}|\,e^{-nd/L^2} = \exp\{\log |\mathcal{F}| - nd/L^2\}.$$

So if $L \ll \sqrt{nd / \log |\mathcal{F}|}$, it is unlikely that an $L$-Lipschitz function will be able to fit our data. For any smoothly parameterized family with $p$ parameters we expect $\log |\mathcal{F}| = \tilde{O}(p)$ by a standard discretization argument; it follows that $p$ should be on the order of $nd$ in order for our family to be big enough.
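
To get a feel for the scaling, here's a back-of-the-envelope plug-in (illustrative numbers, not from the paper) using $\log|\mathcal{F}| \approx p$:

```python
# With log|F| ~ p, fitting is unlikely whenever the Lipschitz constant is well
# below sqrt(n*d / p). The dataset sizes below are illustrative assumptions.
import math

n, d = 50_000, 3072                      # e.g. a CIFAR-10-sized dataset (32*32*3)
for p in (10**5, 10**7, 10**9):
    print(f"p={p:.0e}  Lipschitz threshold ~ {math.sqrt(n * d / p):.2f}")
```

Equivalently, to fit with Lipschitz constant $L$ we should expect to need $p \gtrsim nd/L^2$ parameters.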

The full proof in the paper accounts for label noise (imperfect fitting) and allows $p(x)$ to be a mixture of isoperimetric distributions, but doesn't give much more insight AFAICT.

The paper includes an extended analysis of the implications for deep networks, and speculates that we might get robust ImageNet networks just by making our models another one or two orders of magnitude larger.

Discussion

Q: Is the isoperimetry assumption realistic?
A: It holds for Gaussian distributions, and for uniform distributions on the high dimensional sphere and hypercube, among others, so it doesn't seem crazy.

Q: How does this connect to double descent?
A: This paper argues that overparameterization is necessary to find a function that smoothly interpolates high-dimensional data. If we believe our target function is smooth, then only a high-capacity model is likely to capture it. Double descent is in principle more general: it observes that larger families can contain simpler models under arbitrary notions of simplicity, not just smoothness. But this result gives at least one concrete notion of simplicity for which we expect to need larger families.

Q: Does a smooth interpolation always exist?
A: A given set of labels may not admit an $L$-Lipschitz interpolation for a given $L$: for example, the pair of 1D points $(f(0) = 0, f(1) = 1000)$ does not admit a $1$-Lipschitz interpolation. For any dataset there is some minimum $L$ below which interpolation is not possible. So we can't get arbitrarily smooth interpolations, but we can hope to get within a constant factor of this minimal $L$.
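
Concretely, this minimal $L$ is just the steepest pairwise slope in the dataset, and standard Lipschitz extension results say an interpolant achieving it exists. A small sketch (my own, for illustration):

```python
# The smallest Lipschitz constant any interpolant of (X, y) can have is the
# largest pairwise "slope" |y_i - y_j| / ||x_i - x_j||.
import numpy as np

def min_lipschitz(X, y):
    i, j = np.triu_indices(len(y), k=1)
    slopes = np.abs(y[i] - y[j]) / np.linalg.norm(X[i] - X[j], axis=-1)
    return slopes.max()

# The 1D example from above: f(0) = 0, f(1) = 1000 forces L >= 1000.
print(min_lipschitz(np.array([[0.0], [1.0]]), np.array([0.0, 1000.0])))  # 1000.0
```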

Q: Do we really need $nd$ parameters to represent a smooth function?
A: Of course we can represent any given smooth function with zero parameters by simply choosing our function class $\mathcal{F}$ to contain exactly that function and nothing else, so in a sense no. But such a class can't represent an arbitrary labeling of $n$ points.

Q: The finite-width bump construction used for intuition is a bit weird: it's literally memorizing the training data. Indeed, any model with $O(nd)$ parameters is capable of memorizing the training data. But memorizing the data is bad! Any model that generalizes will need to abstract away from and compress the raw training set, not memorize it. What gives?
A: The authors speculate that the bound should be interpreted in terms of the effective data dimension. If the data live on a $k$-dimensional manifold for $k < d$, then perhaps we need only $kn$ parameters to smoothly interpolate between them. The problem would reduce to finding the manifold corresponding to the optimal 'feature representation' of the data, and then memorizing labels in that space.

Q: Does this result mean that we'll need infinitely large models as $n\to\infty$?
A: The result is that we need $O(nd)$ parameters to smoothly interpolate an arbitrary function of $n$ points. If we assume that the true data-generating function is smooth, then there is some $n_\text{max}$ at which we will have essentially tiled the space, so any new points we see will be 'close enough' to existing points that they effectively give no new information, because we can predict their values by smooth interpolation of the existing points. Heuristically: if we divide the unit cube into bins of side width $1/L$, then to fill all the bins we would expect $n_\text{max} \approx O(L^d)$ (TODO: this isn't quite right since we should really be analyzing balls, not bins). Conversely, for any given $n$ there is a minimum $L$ at which we should expect to be able to fit an arbitrary label set.
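
For a sense of how fast that bin count blows up (purely illustrative numbers under the crude heuristic above):

```python
# Crude bin-count heuristic: ~L**d bins of side 1/L tile the unit cube, so
# n_max grows exponentially in the dimension d. Numbers are illustrative.
for L, d in [(10, 2), (10, 10), (10, 100)]:
    print(f"L={L}  d={d}  n_max ~ {float(L)**d:.1e}")
```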

Q: Does this analysis require a random function, or a distribution over functions?
A: No. The only randomness is over the choice of the inputs $x_i$ (and in the full analysis, the label noise $\epsilon_i$).

Q: How does this relate to an Occam generalization bound, which implies that fewer parameters and smaller hypothesis classes are better?
A: The Occam bound shows that if a hypothesis in $\mathcal{F}$ has low training error, then it is likely to also have low generalization error as long as $\mathcal{F}$ is not too large. This result shows that if we take $\mathcal{F}$ to be a class of smooth functions, we are unlikely to even get to the point of finding a hypothesis with low training error, unless $\mathcal{F}$ is at least of size $\exp\{O(nd)\}$. Once we do get there, it says nothing about whether such a function will generalize. Indeed, if smoothness is the only structure we assume, then 'generalization' would require observing enough points to effectively tile the input space: that would require $n_\text{max}$ points, which as we saw above is exponentially large, so in general we'll need to make additional assumptions. (This is a version of the no free lunch theorem, with the smoothness assumption effectively discretizing the space.)
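
For reference, the finite-class Occam/Hoeffding bound I have in mind is the standard one (not specific to this paper), for a loss bounded in $[0, 1]$: with probability at least $1 - \delta$ over the training sample, simultaneously for all $f \in \mathcal{F}$,

$$\mathrm{err}(f) \;\le\; \widehat{\mathrm{err}}_n(f) + \sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2n}}.$$

This only controls the gap between training and test error; it says nothing about whether any $f \in \mathcal{F}$ achieves low training error in the first place, which is exactly the part this paper addresses.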