classification is special: Nonlinear Function
Created: March 12, 2021
Modified: March 12, 2021


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

The distinction between classification and regression is, from one point of view, arbitrary: it's all just function approximation, and most of the machinery is the same.

But there are also real qualitative differences:

  • Discussions of concepts like overparameterization and double descent rely, in their simplest form, on being able to fit the training data perfectly. These concepts work 'natively' for classifiers (which really can achieve zero training error) and more awkwardly for regression.
  • If a classifier has zero training error, the scale of its weights is typically unidentifiable. In the linear case, if we have $w$ such that $f(x) = [\langle w, x \rangle > 0]$ is a perfect classifier, then the same is true for $cw$ for any $c > 0$. And a proper scoring rule will tend to push $c \to \infty$, since, empirically, the classifier is 100% accurate (there is a numeric sketch of this after the list).
  • Regression models are almost always mis-specified, because we can only parameterize a tiny subset of all continuous distributions. By contrast, classifiers can at least potentially capture any distribution over discrete outcomes.
  • It's common in applications to treat regression as binned classification. Essentially, this is about approximating continuous distributions by histograms, rather than by a more limiting parametric form. This allows us to represent multimodality (there is a sketch of this after the list too).
  • All of (autoregressive) language modeling is classification: each step predicts a token from a finite vocabulary. If we take the Turing test seriously, this means that classification is sufficient for general intelligence, so dealing with the idiosyncrasies of regression is not going to be on the critical path.
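
To make the scale-unidentifiability point concrete, here is a minimal numeric sketch. It assumes a tiny hand-made, linearly separable dataset and uses logistic loss as the stand-in proper scoring rule; the particular points, the direction w, and the scales c are all illustrative choices, not anything specific from these notes.

```python
# Minimal sketch, assuming a tiny linearly separable dataset, a linear
# classifier, and logistic loss standing in for "a proper scoring rule".
import numpy as np

# Four points, perfectly separable by the direction w = (1, 1).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 1.0])

def logistic_loss(w, X, y):
    """Mean logistic loss for labels in {-1, +1}."""
    margins = y * (X @ w)
    return np.mean(np.log1p(np.exp(-margins)))

# Accuracy is already 100% for w, and stays 100% for c * w with any c > 0 ...
assert np.all(np.sign(X @ w) == y)

# ... but the loss keeps shrinking as the scale c grows, so an optimizer
# will push c toward infinity: the data do not pin down the scale of w.
for c in [1, 10, 100, 1000]:
    print(f"c = {c:5d}   loss = {logistic_loss(c * w, X, y):.6f}")
```

This is exactly why separable logistic regression is usually regularized or early-stopped: the unpenalized maximum-likelihood solution has infinite norm, even though every positive rescaling of the same direction classifies the training data identically.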
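
And here is a sketch of regression-as-binned-classification, assuming scikit-learn is available; the toy bimodal target, the bin edges, and the bin count are illustrative choices. A squared-loss regressor on this data would predict roughly the conditional mean (about 0, a value the target never takes), while the classifier's predicted histogram over bins puts its mass near ±1.

```python
# A sketch of regression as binned classification, assuming scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data whose conditional distribution is bimodal: y is near +1 or -1
# with equal probability, independent of x.
n = 2000
x = rng.uniform(-1.0, 1.0, size=(n, 1))
sign = rng.choice([-1.0, 1.0], size=n)
y = sign + 0.1 * rng.normal(size=n)

# Discretize the target into equal-width bins; the class label is the bin index.
n_bins = 20
edges = np.linspace(-1.5, 1.5, n_bins + 1)
centers = 0.5 * (edges[:-1] + edges[1:])
labels = np.clip(np.digitize(y, edges) - 1, 0, n_bins - 1)

# Any multiclass classifier now outputs a histogram over y-bins.
clf = LogisticRegression(max_iter=1000).fit(x, labels)

# Predictive histogram at x = 0; columns follow clf.classes_, which omits
# bins that never occurred in training.
probs = clf.predict_proba(np.array([[0.0]]))[0]
for c, p in zip(clf.classes_, probs):
    print(f"bin center {centers[c]:+.2f}: prob {p:.3f}")
```

The histogram's resolution is limited by the bin width, so in practice people sometimes refine it (more bins, or per-bin offsets), but even this crude version expresses multimodality that a single point prediction cannot.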