Created: September 03, 2022
Modified: September 03, 2022
transformers with memory
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Incorporating explicit memory and retrieval seems pretty clearly like the next frontier in language modeling and AI more broadly. We have systems like GPT-3 that view the entire internet and store whatever they learn in some giant associative memory trained via SGD. But these are imperfect in several ways.
- Short context windows: can't generate long structured text (novels, etc.) or have long coherent conversations (see long-term context in Transformers).
- Hard to update: the only way to teach the model a new fact is to include it in a prompt (reducing to the previous problem of prompt length limitations) or to do lots of expensive fine-tuning.
- Uninterpretable: we have no way to explicitly query the model's "beliefs" or to attribute an answer to a particular belief (and indeed with appropriate prompting these models can espouse conflicting beliefs - this is 'humanlike' and not necessarily a bug as such, but problematic in some cases).
- Inefficient: we're storing and using all of those 175B parameters even for tasks/topics that implicate only a small subset of the training set.
- Unsatisfying distinction between training-time updates (SGD) and test-time 'updates' in the context window (not the ideal meta-level shape of machine learning).
- One-pass SGD training can fail to memorize the training set (a feature, but also a bug: if I show something to a computer, it should remember what it saw!)
One way to view this is as a code vs data distinction: current models are 'all code' (in the sense that a transformer is a giant differentiable program), but many information-processing systems have relatively small code 'cores' that interact with large databases.
Off-the-top-of-my-head impressions of what improved systems might look like:
- A system that can 'take notes' on a conversation and refer to these in future conversation. (could potentially do this with current models and appropriate prompt engineering).
- A system with 'long-term context' that in fact extends to the entire training set (!). Maybe something hierarchical, similar to a Merkle tree but differentiable. This would require radically rethinking the training procedure.
- Systems with latent 'lookup' actions allowing them to augment their context with the result of some database (or similar) query.
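As a toy Python sketch of that last idea: a generation loop in which the model can emit an explicit lookup action whose result is spliced back into its context. Everything here is hypothetical and only illustrates the control flow: the LOOKUP:/RESULT: token convention, the word-overlap retriever, and the stubbed-out model_step (a real system would sample actions from an LM and use dense retrieval).

```python
# Toy sketch: a generation loop with a latent "lookup" action.
# All names here (LOOKUP:/RESULT: tokens, model_step, retrieve) are
# hypothetical and only illustrate the control flow.

from typing import List

DATABASE = {
    "eiffel tower height": "The Eiffel Tower is about 330 metres tall.",
    "python release year": "Python was first released in 1991.",
}

def retrieve(query: str) -> str:
    """Toy retrieval: pick the database key with the most word overlap."""
    q_words = set(query.lower().split())
    best_key = max(DATABASE, key=lambda k: len(q_words & set(k.split())))
    return DATABASE[best_key]

def model_step(context: List[str]) -> str:
    """Stand-in for a language model step. A real system would sample the
    next action (text or a lookup query) from an LM conditioned on context."""
    if not any(line.startswith("RESULT:") for line in context):
        return "LOOKUP: eiffel tower height"
    return "ANSWER: it is roughly 330 metres tall."

def generate(prompt: str, max_steps: int = 4) -> List[str]:
    context = [prompt]
    for _ in range(max_steps):
        action = model_step(context)
        context.append(action)
        if action.startswith("LOOKUP:"):
            # Execute the lookup and splice the result back into the context.
            context.append("RESULT: " + retrieve(action[len("LOOKUP:"):]))
        else:
            break
    return context

print(generate("How tall is the Eiffel Tower?"))
```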
Papers of note (papers to read):
- Retrieval transformer, RETRO (DeepMind): https://jalammar.github.io/illustrated-retrieval-transformer/
- Database consists of sentence texts split into 'neighbor' and 'completion' chunks. A BERT sentence embedding from the neighbor chunk is the 'key'. A BERT embedding of the prompt context is the 'query'. The two nearest neighbors of the prompt are retrieved - both the neighbor and 'completion' chunks - and fed into the model.
- The key/query setup is just BERT embeddings, which are pretrained, so there's no issue with backpropping through a discrete lookup mechanism, and the keys for the entire training set can be precomputed. (A rough sketch of this retrieval step appears after this list.)
- Compressive transformer (DeepMind): https://www.deepmind.com/blog/a-new-model-and-dataset-for-long-range-memory
- Retrieval augmentation: https://arxiv.org/abs/2104.07567
- Large Memory Layers with Product Keys: https://arxiv.org/abs/1907.05242v1
- Training language models with memory augmentation: https://arxiv.org/abs/2205.12674
- Relational memory augmented language models: https://arxiv.org/abs/2201.09680
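For concreteness, here is a simplified Python sketch of the retrieval side of the RETRO-style setup described in the first item above: precomputed keys over 'neighbor' chunks, a query embedding of the prompt, and top-k nearest-neighbor lookup returning both the neighbor and 'completion' chunks. The embed function is a random stand-in for a frozen BERT encoder (so the similarities are not semantically meaningful), and the real model consumes the retrieved chunks via chunked cross-attention rather than simple concatenation.

```python
# Simplified sketch of RETRO-style retrieval: precomputed keys over
# "neighbor" chunks, a query embedding of the prompt, top-k lookup.
# embed() is a random stand-in for a frozen BERT encoder, so the
# similarity scores here are NOT semantically meaningful.

import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Deterministic random unit vector per string (BERT stand-in)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Database entries are (neighbor_chunk, completion_chunk) pairs; the key
# for each entry is the embedding of its neighbor chunk, precomputed once.
database = [
    ("The Eiffel Tower is in Paris.", "It is about 330 metres tall."),
    ("Python is a programming language.", "It was first released in 1991."),
    ("The Great Wall is in China.", "It stretches for thousands of kilometres."),
]
keys = np.stack([embed(neighbor) for neighbor, _ in database])  # shape (N, dim)

def retrieve(prompt_chunk: str, k: int = 2):
    """Return the k nearest (neighbor, completion) pairs for a prompt chunk."""
    query = embed(prompt_chunk)            # query embedding of the context
    scores = keys @ query                  # cosine similarity (unit vectors)
    return [database[i] for i in np.argsort(-scores)[:k]]

# In RETRO the retrieved neighbor + completion chunks are consumed via
# chunked cross-attention; here we just print them.
for neighbor, completion in retrieve("How tall is the Eiffel Tower?"):
    print(neighbor, "|", completion)
```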