Created: September 06, 2021
Modified: September 06, 2021
large models
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
- If you believe that neural nets basically just memorize the training data, then training larger and larger models is hopeless. The compositionality of the world means that exponentially many things can happen, and you'll never collect enough data to understand all of them. This is the argument for building world models that are themselves explicitly compositional, and reasoning in them.
- But the training objectives we use are fundamentally about compression: by maximizing a regularized likelihood, we're minimizing the number of bits needed to encode a training example. And when you think about compressing data that has lawful regularities, it's obvious that there's some transition point where you get enough data that you're better off explicitly encoding the regularities (the phase change hypothesis).
- It takes some large constant number of bits, call it $C$, to encode the lawful regularities.
- Once you've encoded the regularities, they can save you a small number of bits, call it $s$, on every future example.
- So if $n s < C$ over your $n$ training examples, you're better off just memorizing the data, but for large enough $n$ you'll have $n s > C$, at which point you're better off learning the regularities. There's a phase transition (see the sketch after this list).
- TODO: what does this say about compositionality? I want to say that we don't need an exponential amount of training data.
- In supervised ML, you're just trying to compress labels, each of which is itself only a very small number of bits. So the value of $s$ is inherently limited. You're most likely going to be in the memorization regime.
- Of course, it's not a binary choice: there are many regularities in the world, and each effectively has its own values of $C$ and $s$. However, things are not totally independent, because some regularities may be cheaper to encode jointly than the sum of their individual encodings (due to shared cognitive structure).
- There are lots of 'surface' regularities, like character frequencies in language, or things like edges and texture in vision, that have relatively high $s/C$ ratios (big savings for a small encoding cost) and so get encoded quite early on.
- What we've seen with GPT is that large language models are starting to get to the point where it's profitable for them to learn some of the deeper regularities.
- Enterprises like Tesla's self-driving are bets that they can get into the $n s > C$ regime for the regularities of interest.
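To make the $C$-versus-$ns$ arithmetic concrete, here is a toy sketch. All numbers are hypothetical: $m$ bits to memorize an example outright, $C$ bits to encode the regularity up front, $s$ bits saved per example once it's encoded. The crossover sits at $n = C/s$.

```python
# Toy model of the memorize-vs-learn-regularities phase transition.
# m, C, s are made-up numbers purely for illustration.

def bits_memorize(n: int, m: float) -> float:
    """Total description length if we just memorize all n examples."""
    return n * m

def bits_regularity(n: int, m: float, C: float, s: float) -> float:
    """Total description length if we pay C bits up front for the
    regularity, then spend m - s bits per example afterwards."""
    return C + n * (m - s)

if __name__ == "__main__":
    m, C, s = 100.0, 50_000.0, 30.0   # hypothetical values
    # Crossover where n * s > C, i.e. n > C / s ~= 1,667 examples here.
    for n in [100, 1_000, 1_667, 2_000, 10_000]:
        mem = bits_memorize(n, m)
        reg = bits_regularity(n, m, C, s)
        better = "regularity" if reg < mem else "memorize"
        print(f"n={n:>6}: memorize={mem:>9.0f} bits, "
              f"regularity={reg:>9.0f} bits -> {better}")
```

Below the crossover the up-front cost $C$ dominates and memorization wins; above it, the per-example savings $s$ accumulate and encoding the regularity wins, which is the phase transition described above.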