computation is important
Created: January 16, 2022
Modified: September 13, 2022

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Arguably the core insight of deep learning / differentiable programming is that the shape and structure of the computations we do are so critical that they need to be the first-order focus of our design efforts.

Machine learning likes to come up with mathematically 'nice' abstraction layers for turning declarative representations of inductive bias into statistical conclusions:

  • Kernel methods: specify a kernel measuring the similarity between inputs, and let the RKHS machinery turn it into predictions.
  • Bayesian inference: specify a prior and likelihood (a generative process), and let posterior inference turn them into conclusions.

These have their uses, but the impulse is ultimately misguided from the perspective of engineering general intelligence. Inductive bias is important, but so is efficient computation, and abstraction layers that decouple these make a fatal error because they prevent you from exploiting synergies at the implementation level. Maintaining strict boundaries between Marr's levels may be useful for a theory of intelligence, but it's extremely costly as an engineering strategy.

Both kernels and Bayesian inference are computationally problematic. Exact evaluation requires $N^2$ time for the former, and likely-exponential time for the latter (exact inference in general graphical models is #P-hard). Committing to either framework means we are now dependent on a research agenda of 'scaling kernels' or 'approximate inference', respectively. These attempt to find computationally feasible approximations, but are inherently limited in doing so because they're working from a representation that (by design) ignored computational concerns.
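
To see where the $N^2$ comes from, here is a bare-bones exact kernel method (kernel ridge regression in plain numpy; the RBF kernel, lengthscale, and regularizer are arbitrary choices for illustration). The Gram matrix alone is N-by-N, the solve is cubic, and even prediction touches every training point:

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    # Pairwise squared distances form an (N, M) matrix, so time and memory scale as N * M.
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq_dists / (2 * lengthscale**2))

def kernel_ridge_fit(X, y, noise=0.1):
    # Building the N x N Gram matrix is O(N^2); solving the linear system is O(N^3).
    K = rbf_kernel(X, X)
    return np.linalg.solve(K + noise * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_test):
    # Even at test time we touch all N training points: cost grows with the dataset size.
    return rbf_kernel(X_test, X_train) @ alpha

X = np.random.randn(500, 3)
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(500)
alpha = kernel_ridge_fit(X, y)
preds = kernel_ridge_predict(X, alpha, np.random.randn(10, 3))
```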

Approximate Bayesian inference is a valuable exercise even so, because the associated modeling formalism --- specifying a generative process or probabilistic program --- can be truly natural and intuitive (and, uniquely, gives a principled treatment of uncertainty). I'm less enthusiastic about the kernel modeling formalism: it's not particularly natural or intuitive, so why should we go to such effort to salvage it?
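
To illustrate what 'specifying a generative process' looks like, here is a toy Bayesian linear regression written as plain Python sampling code (the priors and noise model are made up for illustration; a probabilistic programming system would let you condition this same program on observed data and infer the latent quantities):

```python
import numpy as np

def generate_data(X, rng=np.random.default_rng(0)):
    # The 'program' reads like the story of how the data came to be:
    # draw weights from a prior, then draw noisy observations given the weights.
    weights = rng.normal(loc=0.0, scale=1.0, size=X.shape[1])  # prior over weights
    noise_scale = rng.gamma(shape=2.0, scale=0.5)              # prior over noise level
    y = rng.normal(loc=X @ weights, scale=noise_scale)         # likelihood
    return weights, noise_scale, y

# Inference means inverting this story: given X and y, what were weights and noise_scale?
X = np.random.default_rng(1).normal(size=(100, 3))
weights, noise_scale, y = generate_data(X)
```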

In practice, computation is important: it's been much more effective to think about inductive bias in terms of the 'shape of the computation' that must be done. Deep learning architectures do this:

  • An MLP understands that its computation is a composition of simpler computations.
  • An RNN understands that its computation needs to move across a sequence.
  • A convnet understands that its input is a 2D grid, where each piece of the grid should be processed in a similar way (translation invariance), and that features must be computed at multiple scales.
  • An attention mechanism understands that the model needs to be able to select its own computations at runtime (see the sketch after this list).
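
To make the last point concrete, here is a minimal single-head attention sketch in plain numpy (no batching, masking, or learned projections; all of that is stripped out for illustration). The key property is that the mixing weights are computed from the input itself at runtime, rather than being fixed in advance by the architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scores depend on the input: the model decides, per query, which values to read.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)       # data-dependent mixing weights
    return weights @ V                       # (num_queries, d_value)

Q = np.random.randn(4, 8)   # 4 queries of dimension 8
K = np.random.randn(6, 8)   # 6 keys of dimension 8
V = np.random.randn(6, 16)  # 6 values of dimension 16
out = attention(Q, K, V)    # shape (4, 16)
```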

The broader SGD paradigm inhabited by deep learning methods also gets the meta-level shape of machine learning more right: training should be an 'anytime' process, with results getting progressively better as you see more data. In particular, getting more data shouldn't make the system slower: the cost of test-time evaluation should be independent of how much data you've seen. These considerations rule out direct application of most non-parametric methods. Indeed, one reason random features are such a great idea is that they allow kernel methods to fit into this paradigm.
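
As a sketch of how random features achieve this, here is the standard random Fourier features construction of Rahimi and Recht for an RBF kernel, in plain numpy (the number of features, lengthscale, and seed are arbitrary). The kernel is replaced by an explicit finite-dimensional feature map, after which you just have a linear model that can be trained with SGD and evaluated at test time in cost independent of N:

```python
import numpy as np

def make_rff_map(input_dim, num_features=256, lengthscale=1.0, seed=0):
    # Sample the random projection once and reuse it, so features are consistent across inputs.
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / lengthscale, size=(input_dim, num_features))
    b = rng.uniform(0, 2 * np.pi, size=num_features)

    def featurize(X):
        # z(x) = sqrt(2/D) * cos(W^T x + b), so that in expectation
        # z(x) . z(y) = exp(-||x - y||^2 / (2 * lengthscale^2)), the RBF kernel.
        return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

    return featurize

featurize = make_rff_map(input_dim=3)
X, Y = np.random.randn(500, 3), np.random.randn(400, 3)
approx_K = featurize(X) @ featurize(Y).T  # approximates the exact 500 x 400 RBF Gram matrix
# A linear model on featurize(X) can be trained with SGD; test-time cost depends on
# num_features, not on N, so seeing more data doesn't make prediction slower.
```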

Another related perspective is Tom Griffiths's hypothesis that intelligence is what it is due to its limitations. https://cocosci.princeton.edu/papers/griffithsunderstanding.pdf