Created: September 03, 2022
Modified: September 03, 2022
transformers with memory
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Incorporating explicit memory and retrieval seems pretty clearly like the next frontier in language modeling and AI more broadly. We have systems like GPT-3 that view the entire internet and store whatever they learn in some giant associative memory trained via SGD. But these are imperfect in several ways.
- Short context windows: can't generate long structured text (novels, etc.) or have long coherent conversations (see long-term context in Transformers).
- Hard to update: the only way to teach the model a new fact is to include it in a prompt (reducing to the previous problem of prompt length limitations) or to do lots of expensive fine-tuning.
- Uninterpretable: we have no way to explicitly query the model's "beliefs" or to attribute an answer to a particular belief (and indeed with appropriate prompting these models can espouse conflicting beliefs - this is 'humanlike' and not necessarily a bug as such, but problematic in some cases).
- Inefficient: we're storing and using all of those 175B parameters even for tasks/topics that implicate only a small subset of the training set.
- Unsatisfying distinction between training-time updates (SGD) and test-time 'updates' in the context window (not the ideal meta-level shape of machine learning).
- One-pass SGD training can fail to memorize the training set (a feature, but also a bug: if I show something to a computer, it should remember what it saw!)
One way to view this is as a code vs data distinction: current models are 'all code' (in the sense that a transformer is a giant differentiable program), but many information-processing systems have relatively small code 'cores' that interact with large databases.
Off-the-top-of-my-head impressions of what improved systems might look like:
- A system that can 'take notes' on a conversation and refer to these in future conversation. (could potentially do this with current models and appropriate prompt engineering).
- A system with 'long-term context' that in fact extends to the entire training set (!). Maybe something hierarchical, similar to a Merkle tree but differentiable. This would require radically rethinking the training procedure.
- Systems with latent 'lookup' actions allowing them to augment their context with the result of some database (or similar) query.
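As a toy Python sketch of that last idea: a generation loop in which the model can emit an explicit lookup action whose result is spliced back into its context. Everything here is hypothetical and only illustrates the control flow: the LOOKUP:/RESULT: token convention, the word-overlap retriever, and the stubbed-out model_step (a real system would sample actions from an LM and use dense retrieval).

```python
# Toy sketch: a generation loop with a latent "lookup" action.
# All names here (LOOKUP:/RESULT: tokens, model_step, retrieve) are
# hypothetical and only illustrate the control flow.

from typing import List

DATABASE = {
    "eiffel tower height": "The Eiffel Tower is about 330 metres tall.",
    "python release year": "Python was first released in 1991.",
}

def retrieve(query: str) -> str:
    """Toy retrieval: pick the database key with the most word overlap."""
    q_words = set(query.lower().split())
    best_key = max(DATABASE, key=lambda k: len(q_words & set(k.split())))
    return DATABASE[best_key]

def model_step(context: List[str]) -> str:
    """Stand-in for a language model step. A real system would sample the
    next action (text or a lookup query) from an LM conditioned on context."""
    if not any(line.startswith("RESULT:") for line in context):
        return "LOOKUP: eiffel tower height"
    return "ANSWER: it is roughly 330 metres tall."

def generate(prompt: str, max_steps: int = 4) -> List[str]:
    context = [prompt]
    for _ in range(max_steps):
        action = model_step(context)
        context.append(action)
        if action.startswith("LOOKUP:"):
            # Execute the lookup and splice the result back into the context.
            context.append("RESULT: " + retrieve(action[len("LOOKUP:"):]))
        else:
            break
    return context

print(generate("How tall is the Eiffel Tower?"))
```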
Papers of note (papers to read):
- Retrieval transformer, RETRO (DeepMind): https://jalammar.github.io/illustrated-retrieval-transformer/
- Database consists of sentence texts split into 'neighbor' and 'completion' chunks. A BERT sentence embedding from the neighbor chunk is the 'key'. A BERT embedding of the prompt context is the 'query'. The two nearest neighbors of the prompt are retrieved - both the neighbor and 'completion' chunks - and fed into the model.
- The key/query setup is just BERT embeddings, which are pretrained, so there's no issue with backpropping through a discrete lookup mechanism, and the keys for the entire training set can be precomputed. (A rough sketch of this retrieval step appears after this list.)
- Compressive transformer (DeepMind): https://www.deepmind.com/blog/a-new-model-and-dataset-for-long-range-memory
- Retrieval augmentation: https://arxiv.org/abs/2104.07567
- Large Memory Layers with Product Keys: https://arxiv.org/abs/1907.05242v1
- Training language models with memory augmentation: https://arxiv.org/abs/2205.12674
- Relational memory augmented language models: https://arxiv.org/abs/2201.09680
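For concreteness, here is a simplified Python sketch of the retrieval side of the RETRO-style setup described in the first item above: precomputed keys over 'neighbor' chunks, a query embedding of the prompt, and top-k nearest-neighbor lookup returning both the neighbor and 'completion' chunks. The embed function is a random stand-in for a frozen BERT encoder (so the similarities are not semantically meaningful), and the real model consumes the retrieved chunks via chunked cross-attention rather than simple concatenation.

```python
# Simplified sketch of RETRO-style retrieval: precomputed keys over
# "neighbor" chunks, a query embedding of the prompt, top-k lookup.
# embed() is a random stand-in for a frozen BERT encoder, so the
# similarity scores here are NOT semantically meaningful.

import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Deterministic random unit vector per string (BERT stand-in)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Database entries are (neighbor_chunk, completion_chunk) pairs; the key
# for each entry is the embedding of its neighbor chunk, precomputed once.
database = [
    ("The Eiffel Tower is in Paris.", "It is about 330 metres tall."),
    ("Python is a programming language.", "It was first released in 1991."),
    ("The Great Wall is in China.", "It stretches for thousands of kilometres."),
]
keys = np.stack([embed(neighbor) for neighbor, _ in database])  # shape (N, dim)

def retrieve(prompt_chunk: str, k: int = 2):
    """Return the k nearest (neighbor, completion) pairs for a prompt chunk."""
    query = embed(prompt_chunk)            # query embedding of the context
    scores = keys @ query                  # cosine similarity (unit vectors)
    return [database[i] for i in np.argsort(-scores)[:k]]

# In RETRO the retrieved neighbor + completion chunks are consumed via
# chunked cross-attention; here we just print them.
for neighbor, completion in retrieve("How tall is the Eiffel Tower?"):
    print(neighbor, "|", completion)
```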