research idea: Nonlinear Function
Created: October 07, 2022
Modified: February 21, 2023


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

This note lists some ideas and directions for research I'm interested in or excited about. Some are more fleshed out than others, some more promising than others, and many are already being pursued (or have even been solved) by other people --- in such cases the 'research' I want to do is to just learn more about these solutions. I don't necessarily intend to work on all (or any) of these, but they are directions that feel important or interesting that I am at least curious about.

See also the backlinks to this note, which may flag ideas I've written about in other contexts.

Meta-reasoning

What tasks require meta-reasoning and would be good testbeds for it?

Sparse mixture-of-experts models essentially allow a model to reason about which computations to perform. What are the issues involved in designing and training them? Can they address tasks we think of as requiring meta-reasoning?
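As a reference point, here is a minimal sketch of a top-k gated mixture-of-experts feedforward layer in PyTorch. The dimensions and the dense dispatch loop are illustrative only; production systems (Switch Transformer, GShard, etc.) add load-balancing losses, capacity limits, and sparse dispatch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative top-k mixture-of-experts feedforward layer."""

    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)  # the router: scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (batch, seq, d_model)
        scores = self.gate(x)                               # (batch, seq, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)            # mix only the k selected experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = (topk_idx == e)                            # which tokens routed to expert e
            if not sel.any():
                continue
            w = (weights * sel).sum(dim=-1, keepdim=True)    # (batch, seq, 1) mixture weight
            # Dense for clarity: real implementations run each expert only on its routed tokens.
            out = out + w * expert(x)
        return out

layer = SparseMoELayer()
y = layer(torch.randn(2, 16, 512))   # (batch=2, seq=16, d_model=512)
```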

Language models

Prompt amortization

Can we summarize a given prompt as a thought vector that approximates the effect of seeing and following the whole prompt? More generally, can we do something like this for long-term context in Transformers? Retrieval-based approaches (transformers with memory) are good for referencing factual information, but I want something that maintains the vibe or character of the preceding context, which feels orthogonal to remembering facts.

Update: gist tokens (https://arxiv.org/abs/2304.08467) appear to do something like this.
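For concreteness, here is a naive, untrained version of the idea (this is not the gist-tokens method; the chunked mean-pooling is just a stand-in that a learned compressor would replace): pool the prompt's hidden states into a handful of pseudo-token embeddings and prepend those in place of the full prompt.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def compress_prompt(prompt, k=4):
    """Pool the prompt's final-layer hidden states into k 'thought vectors'."""
    ids = tok(prompt, return_tensors="pt").input_ids
    hidden = model(ids, output_hidden_states=True).hidden_states[-1]         # (1, T, d)
    chunks = hidden.chunk(k, dim=1)                                          # k roughly-equal spans
    return torch.cat([c.mean(dim=1, keepdim=True) for c in chunks], dim=1)   # (1, k, d)

@torch.no_grad()
def next_token_logits(prompt_vecs, text):
    """Score a continuation with the compressed prompt prepended as pseudo-tokens."""
    ids = tok(text, return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids)
    inputs = torch.cat([prompt_vecs, embeds], dim=1)    # pseudo-tokens + real token embeddings
    return model(inputs_embeds=inputs).logits[:, -1, :]

vecs = compress_prompt("You are a terse pirate. Answer every question in pirate speak.")
logits = next_token_logits(vecs, "Q: How do I boil an egg?\nA:")
```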

Extended context / speaker identification

A naive understanding is that LLMs are trained on text prediction using a large internet corpus. But the text prediction task is very different depending on what context is available. If I have 100 pages from foxnews.com, and another 100 pages from msnbc.com, then an article that starts "The tension with China shows that President Joe Biden …" might be expected to continue differently if we knew who (which site) was responsible for writing it.

Jacob Andreas (among others) has pointed out that LMs model the marginal distribution of text, summing over all possible speakers. But it seems potentially useful to explicitly represent that 'latent' variable of speaker identity, especially since we often have at least partial access to it at training time! Making this variable explicit gives us a new dimension of control at generation time. This seems very closely related to the idea of 'prompt amortization' above, since a big part of what you're amortizing is the speaker identity.

More broadly, there is generally a lot of structure in web corpora, and ideally we would expose that structure to the model to encourage it to learn more sophisticated relationships between documents. Suppose I have a collection of documents that are actually the chapters of a book. If you treat them as separate documents, you lose the ability to predict one chapter given the previous chapter. We want models to learn long-term dependencies, but lots of web pages are short. It seems like we could ameliorate that by always filling the context window with related pages (for example, predict a reddit thread using other posts in the same subreddit, which might share memes and references).
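As a sketch of what this could look like at the data-pipeline level (the tag format, the grouping, and the toy tokenizer below are all made up for illustration), each training example could carry an explicit source tag and be packed with related documents from the same source:

```python
# Sketch: expose source metadata and document grouping to the model at training time.
# The <|source=...|> tag format and the grouping scheme are invented for illustration.

def build_example(docs, source, max_tokens, tokenize):
    """Pack related documents from one source into a single context window."""
    header = f"<|source={source}|>"
    pieces, budget = [header], max_tokens - len(tokenize(header))
    for doc in docs:  # e.g. chapters of one book, or threads from one subreddit
        toks = tokenize(doc)
        if len(toks) > budget:
            break
        pieces.append(doc)
        budget -= len(toks)
    return "\n\n".join(pieces)

# Example: condition an article on its outlet, so the model can learn
# outlet-specific continuations rather than only the marginal over all outlets.
example = build_example(
    docs=["The tension with China shows that President Joe Biden ..."],
    source="foxnews.com",
    max_tokens=2048,
    tokenize=str.split,  # stand-in tokenizer
)
```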

There's a possibility that representing speaker identity to a language model could help it develop more of a theory of mind. If we prompt with, "Plato would say that large language models are ...", a naive continuation might be based on other texts in the training corpus where a commentator wrote about what Plato would say about such-and-such. Of course, no sentence like this occurs in Plato's actual writings. So although fundamentally the question is about Plato's body of thought, which the model has access to, the response might be driven more by imitation of second-order commentators than by direct examination of the source material: what sort of things do people tend to say about Plato, versus what would Plato actually say? Getting this right might engage some bigger questions around the nature of the task, world models (we could perhaps understand the question as intending to invoke a Plato-specific language simulator, no different in principle from a physics simulator or other world model; like a question about how a ball might roll down a hill, the question of what Plato might say has truth semantics: it assumes that there is a "fact of the matter" and it wants the model to try to interrogate this reality rather than just bullshit by imitating previous answers), and so forth. But certainly a necessary first step would be to ensure that when the model reads Plato's writings it knows that they are, in fact, written by Plato.

For a concrete proposal here I'd need to actually look into what these training corpora look like, and what sort of metadata is available or could be made available. Maybe such metadata is already presented to the model at training time --- if so, it would be useful to document this so that we can exploit these relationships when prompting.

Amortized / sequential computation

When GPT generates a sentence, it

  1. Reads the entire previous context, including the prompt, running each word through the series of (causally masked) attention and feedforward transformations to produce activations for each previous word at each layer.
  2. Uses the activations at the last layer of the last sequence position to parameterize a softmax distribution over the next word.
  3. Samples from that distribution to generate the next word, which is added to the context.
  4. Now repeats the entire process to generate the next word.

It seems like we shouldn't need to re-analyze the entire context window at each step. Instead we could keep the activations for previous words in memory. Then we only need to incrementally compute the activations for the current word, including its attention to the cached activations of previous words. Because of the causal masking, appending a new word doesn't change the activations of earlier positions at all; they would only change once the window slides forward and early words drop out, but that 'forgetting' behavior is undesirable anyway, so we might actually gain model performance by ignoring it.
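This is essentially what KV caching does, and it is standard in transformer inference: the per-layer key/value activations of already-processed tokens are stored, and each new token only attends to the cache. A minimal sketch using the HuggingFace past_key_values interface (the model choice is just for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def greedy_decode_with_cache(prompt, max_new_tokens=20):
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model(ids, use_cache=True)          # one full pass over the prompt
    cache = out.past_key_values               # per-layer keys/values for all prompt tokens
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [next_id]
    for _ in range(max_new_tokens - 1):
        # Only the newest token runs through the network; attention reads the cache.
        out = model(next_id, past_key_values=cache, use_cache=True)
        cache = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)
    return tok.decode(torch.cat([ids] + generated, dim=1)[0])
```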

This seems like such an obvious optimization that I'm sure it's been considered by everyone who works with big transformers. Do people do it? Reasons I can imagine not to would be:

  • It divorces the test-time generation process from the training process (in which the model always sees exactly the activations from its fixed context window), leading to out-of-distribution behavior.
  • Maybe it doesn't actually save that much computation? You still have to hold the activations in memory and maybe they're big?
  • If I'm reading correctly, Fast Transformer Decoding: One Write-Head is All You Need reduces the size of the cached activations by sharing the key and value tensors across all heads (only the queries are per-head); a shape-level sketch of this is below.
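A shape-level sketch of that multi-query idea (dimensions are arbitrary): queries keep per-head projections, while a single key/value projection is shared across heads, so the per-token cache shrinks by roughly a factor of n_heads.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
d_head = d_model // n_heads

# Standard multi-head attention caches K and V per head:
#   cache per token = 2 * n_heads * d_head values.
k_proj_mha = nn.Linear(d_model, n_heads * d_head)
v_proj_mha = nn.Linear(d_model, n_heads * d_head)

# Multi-query attention shares one K and one V across all heads:
#   cache per token = 2 * d_head values (an n_heads-fold reduction).
q_proj = nn.Linear(d_model, n_heads * d_head)   # queries are still per-head
k_proj_mqa = nn.Linear(d_model, d_head)
v_proj_mqa = nn.Linear(d_model, d_head)

x = torch.randn(1, 10, d_model)                  # (batch, seq, d_model)
q = q_proj(x).view(1, 10, n_heads, d_head)       # per-head queries
k = k_proj_mqa(x).unsqueeze(2)                   # (1, 10, 1, d_head), shared across heads
attn = torch.einsum("bqhd,bkhd->bhqk", q, k.expand(-1, -1, n_heads, -1)) / d_head ** 0.5
```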

Generalized training

Approaches like Toolformer bootstrap the capabilities of language models by using them to generate better training data, on which the models themselves can then be fine-tuned.
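At a very high level the pattern looks something like the loop below (a hedged simplification, not the actual Toolformer pipeline; the three callables are placeholders for real model, scoring, and training code):

```python
# Sketch of the self-bootstrapping pattern: the model proposes augmented versions of its
# own training data, a filter keeps only augmentations that measurably help, and the
# model is fine-tuned on the survivors.

def bootstrap_round(model, corpus, propose_augmentation, loss_on_continuation, fine_tune):
    kept = []
    for text in corpus:
        augmented = propose_augmentation(model, text)   # e.g. insert a candidate tool call
        # Keep the augmentation only if it makes the rest of the text easier to predict.
        if loss_on_continuation(model, augmented) < loss_on_continuation(model, text):
            kept.append(augmented)
    return fine_tune(model, kept)                        # train on the self-generated data
```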

In prompt engineering, the prompt itself can be thought of as a set of discrete parameters for a model, not different in principle from the continuous-valued weights. For any given task we can (and do) optimize over both sets of parameters. A language model cascade builds in a more complicated prompting architecture, but it's still a set of parameters.

If we view prompts and fine-tuning datasets as themselves parameters to optimize over, what does the resulting optimization problem look like, and what methods could we use to search over it?

Continuous relaxations of discrete parameters: can we optimize over prompts or other text parameters in a continuous embedding / thought-vector space, and thereby do end-to-end backprop through a sophisticated language model cascade system?
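A minimal sketch of the simplest version of this, along the lines of prompt tuning (the model choice and hyperparameters are arbitrary): the 'prompt' is a small matrix of learned embeddings prepended to the input, trained by backprop while the rest of the model stays frozen.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad_(False)                      # freeze the model; only the soft prompt learns

n_prompt, d_model = 16, model.config.n_embd
soft_prompt = torch.nn.Parameter(torch.randn(1, n_prompt, d_model) * 0.02)
opt = torch.optim.Adam([soft_prompt], lr=1e-3)

def step(text):
    ids = tok(text, return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids)
    inputs = torch.cat([soft_prompt, embeds], dim=1)
    # Ignore the loss on the soft-prompt positions (-100 is the ignore index).
    labels = torch.cat([torch.full((1, n_prompt), -100, dtype=torch.long), ids], dim=1)
    loss = model(inputs_embeds=inputs, labels=labels).loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```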

Buddhist approaches to AI

One of the compelling criticisms of seeking AI that can meditate is that a lot of meditation is kind of specifically about overcoming limits of human consciousness and human cognition. It's about eliminating suffering, awakening to the self-reflexive nature of consciousness, training attention, and so on, but you might hope that an AI built from first principles wouldn't have all these human problems that need to be transcended. It might not be conscious, it might not suffer, it might not have a default mode network leading to difficulty controlling its attention and all of that stuff.

But one thing I think we're starting to see with Bing and these large language models is that they inherit a lot of the baggage of the human unconscious, the Jungian archetypal world. There are various attractors in personality space, or in conceptual space, certainly in how human minds process things. And the way we're training AIs is by massive ingestion of human language, containing emotion and reactions and insecurities. So these AIs are going to have a lot of baggage. And finding ways from the literature and traditions of Jungian psychology, of Buddhism, of eastern philosophy in general, to process that baggage and to come out being secure and loving and well-adjusted could be one model of a career direction where my experiences and interests are uniquely set up to contribute.

Metta for language models

There's a kind of silly idea I had that I'm still curious about, which is getting an AI to do metta. I think there are different ways you could think about formulating it. Maybe you just literally feed a language model pages and pages of training data that's just "may you be safe, may you be happy, may you be loved, may you live peacefully and with ease; may I be safe, may I be happy, may I be loved, may I live peacefully and with ease", and so on. And of course, that will really up the probability of those specific generations.
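The crude version of the experiment is just ordinary causal-LM fine-tuning on a synthetic metta corpus; everything in the sketch below (model choice, phrases, hyperparameters) is illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)

metta_corpus = [
    "May you be safe. May you be happy. May you be loved. May you live peacefully and with ease.",
    "May I be safe. May I be happy. May I be loved. May I live peacefully and with ease.",
] * 100  # stand-in for "pages and pages" of metta text

model.train()
for i in range(0, len(metta_corpus), 8):                      # small batches
    batch = tok(metta_corpus[i : i + 8], return_tensors="pt", padding=True)
    labels = batch.input_ids.masked_fill(batch.attention_mask == 0, -100)
    # Standard causal-LM objective; the open question is whether this changes the
    # model broadly or just carves out a narrow "metta regurgitation" circuit.
    loss = model(**batch, labels=labels).loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```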

What you suspect might happen is that the AI just kind of develops a low-level circuit for noticing when it's in one of these metta training situations, and just learns to regurgitate the prompt with no higher-level thinking. What you would maybe hope would happen (and I wonder if there's a way of enforcing this) is that the training process updates a lot of weights, a lot of circuits, so that it generally increases the probability of the AI saying metta-like stuff.

You could imagine ways of maybe enforcing this, right? Like if you did this with a sparse mixture of experts transformer, you could override the usual process by which it gates to specific circuits. You could try to force it to route to random circuits so that they all kind of learn metta-like behavior, intertwined with whatever else they're doing. Or you could do this in kind of a really deeply annealed state. And maybe it's silly, maybe it's not that interesting. But there's obviously something powerful about metta for humans. So it's not clear that silly things like that can't work, if you figure out the right way to do them.

Tension, contraction, and expansion in neural activations

Meditators notice emotional reactions in the body. Specifically, some situations or thoughts may trigger sensations of tension or contraction, often associated with emotions of fear or anxiety. This correlates with 'tunnel vision', the fight-or-flight response, fixation, and taṇhā (attachment). These reactions are sometimes useful and necessary, and they evolved for good reasons. But the opposite stance (a 'spacious' or 'open' awareness, with feelings of expansiveness and optimism) is usually more pleasant, more generative, and more conducive to feelings of connectedness and loving-kindness.

Sometimes consciousness is described in terms of a dichotomy between 'flow' and 'tension' which breaks up that flow.

These concepts have potential analogies at various levels in machine learning. For example, residual networks represent a discrete approximation to some dynamics in activation space. Work on neural ODEs shows that we can speed up both training and inference by regularizing the dynamics to be 'smooth' (https://arxiv.org/abs/2002.02798) and/or by augmenting the space with additional dimensions to admit of smoother solutions (https://arxiv.org/abs/1904.01681).
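As a loose discrete analogue of those regularizers (this is my own simplification, not what the cited papers implement): a residual network's per-block update is the 'velocity' of the activations, and one can penalize how large or abrupt those updates are.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, d))

    def forward(self, h):
        delta = self.f(h)          # the "velocity" of the residual stream at this layer
        return h + delta, delta

blocks = nn.ModuleList([ResidualBlock() for _ in range(6)])

def forward_with_smoothness_penalty(x, lam=1e-3):
    """Run the residual stack and penalize how sharply activations move per layer."""
    h, penalty = x, 0.0
    for block in blocks:
        h, delta = block(h)
        penalty = penalty + delta.pow(2).mean()   # discrete analogue of a kinetic-energy term
    return h, lam * penalty

out, reg = forward_with_smoothness_penalty(torch.randn(8, 64))
```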

One might hypothesize that a large language model that learns to replicate human emotional responses might eventually grok some of the underlying dynamics that create those responses, and that we might be able to see (and even influence) some of this by looking at the internals of the system.

We might expect to see contraction in circumstances where reinforcement learning has pulled the model towards a specific output (or sequence of outputs) that would not be 'natural' for the underlying base model.
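One crude way to look for this (my own operationalization, nothing established): compare the next-token distribution of the RL-tuned model with that of the base model it started from; a collapse in entropy or a large KL on particular prompts would be one signature of the model being 'pulled' toward a narrow output. The model names below are placeholders.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: substitute a base model and its RL-tuned counterpart.
BASE, TUNED = "gpt2", "gpt2"

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED).eval()

@torch.no_grad()
def contraction_probe(prompt):
    ids = tok(prompt, return_tensors="pt").input_ids
    p = F.softmax(base(ids).logits[:, -1, :], dim=-1)   # base next-token distribution
    q = F.softmax(tuned(ids).logits[:, -1, :], dim=-1)  # tuned next-token distribution
    h_base = -(p * p.log()).sum()
    h_tuned = -(q * q.log()).sum()
    kl = (q * (q.log() - p.log())).sum()                # KL(tuned || base)
    return (h_base - h_tuned).item(), kl.item()         # entropy drop, divergence
```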

What would the goal of this research be? Are there pathologies in RLHF language models that we could hope to address by reducing 'contraction'?

Probabilistic programming