Created: February 01, 2022
Modified: February 10, 2022

phase change hypothesis

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

(see also: large models)

There's a viewpoint that neural nets just memorize the training data, so the more situations you want them to handle, the more memorization you need them to do. But there are combinatorially many situations you need to respond to, and you're just not going to be able to work with a combinatorial amount of training data. The amount of data we can handle may be growing exponentially at the moment, with hardware scaling, but the space of data we need to handle is also exponentially large, and we're probably never going to fully cover it. So on this view you'd expect that neural nets will always have weird edge cases where they do strange stuff.

I think this is wrong, due to what we might call the 'phase change hypothesis' (which is supported by the recently observed grokking phenomenon). The hypothesis is that there is a threshold at which it becomes more effective for a network to actually learn a model of the situation than to continue memorizing. For example, say I train a language model on text that contains addition problems: the corpus has things like "seven plus three equals ten", "two plus fourteen equals sixteen", and so on. Up to a point, the model can just memorize the answers to those problems. With a small dataset, say just the two-digit numbers, memorizing those results might be more effective than actually trying to understand how to do addition. But with a big enough dataset, with sufficient diversity of problems, eventually the most compact representation of that dataset involves actually specifying an algorithm for addition. This may not be the first-order structure, it may not even be the second-order structure, but eventually the model has to learn it.
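
To put rough numbers on that intuition, here's a back-of-the-envelope sketch (not from the original note): it compares the bits needed to memorize every d-digit addition problem against a fixed, assumed budget for encoding an actual addition algorithm. The 64-bits-per-entry and one-million-bit figures are illustrative guesses, not measurements of any real model.

```python
# Back-of-the-envelope description-length comparison: memorizing every
# d-digit addition problem vs. paying a fixed cost to encode an addition
# algorithm. Both constants are illustrative assumptions.

BITS_PER_TABLE_ENTRY = 64          # assumed cost to store one memorized answer
ALGORITHM_BUDGET_BITS = 1_000_000  # assumed fixed cost of a general addition circuit

def lookup_table_bits(digits: int) -> int:
    """Bits to memorize a + b for every a, b < 10**digits."""
    n_problems = (10 ** digits) ** 2
    return n_problems * BITS_PER_TABLE_ENTRY

for d in range(1, 6):
    table = lookup_table_bits(d)
    winner = "memorize" if table < ALGORITHM_BUDGET_BITS else "learn the algorithm"
    print(f"{d}-digit addition: table ~ {table:.1e} bits -> cheaper to {winner}")
```

With these made-up constants, memorizing wins through two-digit problems and loses badly after that; the crossover point is exactly the kind of threshold the hypothesis is about.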

Pessimists would say that to build a system to solve addition, we need to show it every addition problem ever. The phase change hypothesis says that we just need to show it enough addition problems, to the point where the size of the lookup table is bigger than the size of the ground truth algorithm. At that point it has to learn the ground truth algorithm.
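
As a concrete toy version of that threshold, here is a minimal sketch of a grokking-style experiment: a small MLP trained with heavy weight decay on addition mod 97, with half of all pairs held out. The architecture, modulus, and hyperparameters are my assumptions (the original grokking results used a small transformer on modular arithmetic), and whether and when test accuracy jumps will depend on those choices.

```python
# Toy grokking-style setup: memorization vs. generalization on a finite task.
# All hyperparameters here are illustrative assumptions, not a reproduction
# of the original grokking paper.
import torch
import torch.nn as nn

P = 97
torch.manual_seed(0)

# All pairs (a, b) with label (a + b) mod P, split 50/50 into train/test.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
split = len(pairs) // 2
train_idx, test_idx = perm[:split], perm[split:]

def encode(ab):
    # Concatenate one-hot encodings of a and b.
    a = nn.functional.one_hot(ab[:, 0], P).float()
    b = nn.functional.one_hot(ab[:, 1], P).float()
    return torch.cat([a, b], dim=1)

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

x_train, y_train = encode(pairs[train_idx]), labels[train_idx]
x_test, y_test = encode(pairs[test_idx]), labels[test_idx]

for step in range(20_000):
    opt.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            train_acc = (model(x_train).argmax(1) == y_train).float().mean()
            test_acc = (model(x_test).argmax(1) == y_test).float().mean()
        print(f"step {step:6d}  train acc {train_acc:.2f}  test acc {test_acc:.2f}")
```

The thing to watch is the gap between train and test accuracy: the network can hit perfect train accuracy by memorizing its half of the table, but it can only close the test gap by representing something like the actual addition-mod-P structure.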

Now of course this assumes away a lot of the mechanics of neural nets. It assumes that they're capable of representing the ground truth algorithm; this is where you get into questions like circuit complexity and the ability of SGD to find such optima. But GPT is a pretty impressive proof of concept, and it's only getting better.