Modified: February 25, 2022
kernel
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
Multiple senses:
- in machine learning: positive definite (Mercer) kernels
- in linear algebra: kernel (nullspace) of a linear map
- in CS systems: operating system kernel
Why I don't like kernels in machine learning: (blog posts to write)
They are a way to specify inductive bias, but it's not clear that they're a very useful one. They're mathematically pretty, but rarely something we can easily reason about. Kernel design has never caught on as a modeling formalism.
Kernels betray the general principle that computation is important: a kernel specifies what a model is as a pure mathematical function, and says nothing about how, or how cheaply, anything can actually be computed with it.
It's a pain to prove that an arbitrary function is positive definite, so in practice kernel design is limited to a few ways of combining a relatively small library of 'base' kernels, as in the sketch below.
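A minimal sketch of what that combination game looks like (plain NumPy, with illustrative names and numbers I'm making up here): sums and products of positive-definite kernels stay positive definite, so a 'new' kernel is usually just a sum or product of old ones.

```python
import numpy as np

def rbf(x, y, lengthscale=1.0):
    # Squared-exponential (RBF) base kernel.
    return np.exp(-0.5 * ((x - y) / lengthscale) ** 2)

def periodic(x, y, period=1.0, lengthscale=1.0):
    # Standard periodic base kernel.
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x - y) / period) ** 2 / lengthscale ** 2)

def quasi_periodic(x, y):
    # A "designed" kernel: the only easy move is to add or multiply known-PD
    # kernels, which is guaranteed to remain positive definite.
    return rbf(x, y, lengthscale=5.0) * periodic(x, y, period=1.0)

ts = np.linspace(0.0, 10.0, 50)
K = np.array([[quasi_periodic(a, b) for b in ts] for a in ts])
print(np.linalg.eigvalsh(K).min())  # >= 0 up to rounding: still a valid kernel
```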
Kernels take the idea that ML is about learning functions a bit too literally. They focus on 'mathy' properties of functions like continuity, smoothness, periodicity, spectra, lengthscale: things that show up when you plot the function. But most of these are 'low-dimensional' properties; they have no inherent leverage on the weirdness of high-dimensional spaces, where distance is a meaningless concept because everything is far from everything. Most 'kernels' defined on high-dimensional spaces are really a thin layer of kernel machinery on top of some other formalism that actually does the heavy lifting, e.g., using the feature representation from the last layer of a deep network.
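A small illustration of the 'everything is far from everything' point (plain NumPy, numbers chosen arbitrarily): as the dimension grows, pairwise distances between random points concentrate around a single value, so distance carries less and less information.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
for d in [2, 10, 100, 1000, 10000]:
    x = rng.normal(size=(n, d))                      # n random points in d dims
    sq = (x ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    dists = np.sqrt(d2[np.triu_indices(n, k=1)])     # all unique pairwise distances
    print(f"d={d:>5}  std/mean of distances = {dists.std() / dists.mean():.3f}")
```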
But what about Gaussian processes?
- I've worked on GPs. They're beautiful. I think they're the best argument for caring about kernels, and they're a useful technology in a few areas, such as Bayesian optimization or spatial statistics. But they're not the answer to machine learning.
- Kernels are still often not the best way to specify a GP. At Google I implemented a package for structural time series modeling, which builds Gaussian process models of time series, but we didn't call them that, because when you say 'GP', people immediately think of the covariance-function representation. And because that representation ignores computation, it's much less efficient than the equivalent state-space-model representation, which is what STS uses.
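To make the computational point concrete without getting into STS itself, here is a sketch using a local-level (random-walk-plus-noise) model as a stand-in; this is illustrative code, not the actual package. The covariance-function representation needs a Cholesky factor of an n x n matrix, O(n^3), while the equivalent state-space representation is filtered in O(n). Both give the same marginal likelihood.

```python
import numpy as np

n, q, r = 2000, 0.1, 0.5   # time steps, process-noise variance, obs-noise variance
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(scale=np.sqrt(q), size=n)) + rng.normal(scale=np.sqrt(r), size=n)

# Covariance-function view: a random walk has kernel k(s, t) = q * min(s, t),
# so the marginal likelihood requires factorizing an n x n matrix: O(n^3).
t = np.arange(1, n + 1)
K = q * np.minimum(t[:, None], t[None, :]) + r * np.eye(n)
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
ll_kernel = -0.5 * (y @ alpha + 2.0 * np.log(np.diag(L)).sum() + n * np.log(2 * np.pi))

# State-space view: the same model run through a scalar Kalman filter: O(n).
m, p, ll_ssm = 0.0, q, 0.0
for obs in y:
    s = p + r                                     # predictive variance of this obs
    ll_ssm += -0.5 * (np.log(2 * np.pi * s) + (obs - m) ** 2 / s)
    gain = p / s
    m, p = m + gain * (obs - m), (1.0 - gain) * p + q

print(ll_kernel, ll_ssm)                          # same number, very different cost
```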