value aligned language game: Nonlinear Function
Created: February 21, 2022
Modified: February 21, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Suppose I have an agent that generates text. I want it to generate text that is aligned with human values.

Approaches:

  1. behavioral cloning: train the agent with a standard language-modeling objective, i.e., to mimic what it sees in a large corpus of text. Since the text was written by humans, it presumably represents human values. Why isn't this enough?
    1. Not all text lives up to human values at their best.
    2. It's unlikely that the model will learn a sufficiently abstract representation of values to generalize to previously unseen situations.
    3. Suppose we want the model to help a specific human, e.g., its interlocutor. Even if we had lots of text written by the interlocutor (and in fact we might have very little), we wouldn't just want to mimic that text. For example, the best way to help the interlocutor might be to tell them a fact that they didn't previously know.
  2. steering language models: have humans rate multiple responses sampled from the model. Build a model of these ratings, and then train the agent by policy gradient using the rating model as a critic. This reinforces likely-helpful replies and suppresses unhelpful ones. (A code sketch follows this list.) Why isn't this enough?
    1. Such a system obeys the values of the raters, but isn't incentivized to be curious about the specific values of its interlocutor.
    2. If the ratings are at the level of individual sentences, they may be short-sighted (e.g., an information-gathering question may not itself be directly helpful, but we would want to encourage gathering information as part of a longer-term plan to help). If they're at the level of longer interactions, the signal is quite sparse and the agent may do the wrong thing.
    3. The human ratings model may give poor feedback when out-of-domain. The agent could even 'wirehead' by generating adversarial examples for the ratings model.
  3. Steering with reward uncertainty. Use human ratings like before, but train an ensemble or other Bayesian approximation for the reward model. Take the action that maximizes expected utility under the model's uncertainty. Why isn't this enough?
    1. The relevant expectation is not just under the reward model but also under our uncertainty about the interlocutor's response.
    2. As stated, this model still learns values from explicit human ratings of utterances. We also want it to learn from the text of the conversation with the interlocutor. For example, if the interlocutor says "Thank you so much; that was exactly what I needed to know", this should count as positive reinforcement for whatever was previously said.
      1. This could emerge naturally by training a critic/value model against a final reward. The value model would naturally learn that utterances like 'thank you' are predictive of ultimate satisfaction.
  4. Model-based planning. The agent represents its interlocutor's values in some vector space (which could just be the weights of a reward model). Each conversational interaction gives us some information about these values and narrows down the possibilities. When the agent considers a sentence, it:
    1. Simulates possible reactions by the interlocutor.
    2. For each possible reaction, updates the model of the interlocutor's values.
    3. Considers possible next sentences, scores them by the imagined updated model, and focuses MCTS-style on the most probable.
    4. Considers possible reactions of the interlocutor to each next sentence (given the interaction so far), and updates the model further.
    5. And so on (a sketch of this planning loop follows the list).
    • Challenges of this approach:
      • Requires models of the interlocutor's actions and values, and uncertainty over both.
      • The system is incentivized to find adversarial examples for the interlocutor model.
      • Search quickly becomes intractable.
  5. Fictitious play. We simulate interactions between the model and another model acting as the interlocutor. Somehow we define a reward model for these interactions (this need not and should not be zero-sum since ultimately we're trying to align the incentives). Concretely, maybe we say that the interlocutor has one of K different reward models, and the interaction will be judged by whichever the true one is. Instead of explicit model-based planning, we try to learn a network that amortizes doing the right thing. This is essentially MuZero for language games. Challenges:
    1. Like the explicit planning approach, this requires a model of the interlocutor and is incentivized to find adversarial examples for that model.
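
To make #2 and #3 concrete, here is a minimal, runnable sketch of the steering loop: a small ensemble of reward models is regressed onto (simulated) human ratings, and the generator is updated by REINFORCE using the ensemble's mean score. The tiny vocabulary, the bag-of-words reward features, and the `human_rating` stand-in are all illustrative assumptions, not anything from these notes; a real system would use a pretrained LM and actual rater data.

```python
# Toy sketch of #2 and #3: steer a generator with a learned reward model,
# using a small ensemble to represent uncertainty about the reward.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MAX_LEN = 50, 8


class TinyLM(nn.Module):
    """Stand-in for the agent LM: independent per-position token logits."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(MAX_LEN, VOCAB))

    def sample(self):
        dist = torch.distributions.Categorical(logits=self.logits)
        tokens = dist.sample()                        # (MAX_LEN,) sampled reply
        return tokens, dist.log_prob(tokens).sum()    # log-prob kept for REINFORCE


class RewardModel(nn.Module):
    """Scores a reply; trained to regress human ratings (#2)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(VOCAB, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, tokens):
        bow = F.one_hot(tokens, VOCAB).float().mean(0)   # crude bag-of-words features
        return self.net(bow).squeeze(-1)


def human_rating(tokens):
    """Stand-in for a human rater; here the rater simply likes tokens below 10."""
    return (tokens < 10).float().mean()


policy, ensemble = TinyLM(), [RewardModel() for _ in range(4)]
opt_pi = torch.optim.Adam(policy.parameters(), lr=0.05)
opt_rm = torch.optim.Adam([p for m in ensemble for p in m.parameters()], lr=1e-2)

for step in range(500):
    tokens, logp = policy.sample()

    # #2: fit the reward models to human ratings of sampled replies.
    rating = human_rating(tokens)
    rm_loss = sum(F.mse_loss(m(tokens), rating) for m in ensemble)
    opt_rm.zero_grad(); rm_loss.backward(); opt_rm.step()

    # #3: act on expected utility under the ensemble's uncertainty.
    with torch.no_grad():
        scores = torch.stack([m(tokens) for m in ensemble])
        expected_reward = scores.mean()

    # Policy gradient: push up the log-prob of replies the (uncertain) critic likes.
    pg_loss = -expected_reward * logp
    opt_pi.zero_grad(); pg_loss.backward(); opt_pi.step()
```

One plausible variant is to optimize `scores.mean() - beta * scores.std()` rather than the mean alone, so that the ensemble's disagreement actively discourages the wireheading-by-adversarial-example failure mode noted in 2.3.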
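
Similarly, for the explicit planning of #4, here is a toy version of the loop in steps 1-5: the agent keeps a belief over K hypothesized reward functions for its interlocutor, imagines reactions to each candidate reply, Bayes-updates the belief on each imagined reaction, and scores replies by expected reward under the updated belief. The reply set, the three hand-written value hypotheses, and the reaction model are all invented for illustration; a real system would use learned LM-based models of the interlocutor, and the search would need something smarter than this brute-force recursion (which is exactly the intractability concern above).

```python
# Toy sketch of #4: plan against an explicit, uncertain model of the interlocutor.
import math

REPLIES = ["explain the basics", "ask what they already know", "tell a joke"]
REACTIONS = ["thanks", "hmm"]

# K = 3 hypotheses about the interlocutor's values; in practice these would be
# learned reward models, here just hand-written scores over the toy replies.
REWARD_TABLE = [
    [1.0, 0.2, -0.5],    # hypothesis 0: wants a direct explanation
    [0.1, 1.0, -0.2],    # hypothesis 1: wants to be asked first
    [-0.3, 0.0, 1.0],    # hypothesis 2: wants entertainment
]

def hypothesized_reward(k, reply):
    return REWARD_TABLE[k][REPLIES.index(reply)]

def reaction_probs(k, reply):
    """p(reaction | reply, values=k): warmer reaction when the hypothesized reward is high."""
    p_thanks = 1 / (1 + math.exp(-2.0 * hypothesized_reward(k, reply)))
    return [p_thanks, 1 - p_thanks]

def bayes_update(belief, reply, reaction_idx):
    """Posterior over value hypotheses after observing (or imagining) a reaction."""
    post = [b * reaction_probs(k, reply)[reaction_idx] for k, b in enumerate(belief)]
    z = sum(post)
    return [p / z for p in post]

def plan(belief, depth=2):
    """Score each candidate reply by expected reward under the belief, plus a short
    lookahead in which imagined reactions update the belief (steps 1-5 in the notes)."""
    best_value, best_reply = -math.inf, None
    for reply in REPLIES:
        value = sum(b * hypothesized_reward(k, reply) for k, b in enumerate(belief))
        if depth > 1:
            for r_idx in range(len(REACTIONS)):
                p_react = sum(b * reaction_probs(k, reply)[r_idx] for k, b in enumerate(belief))
                value += p_react * plan(bayes_update(belief, reply, r_idx), depth - 1)[0]
        if value > best_value:
            best_value, best_reply = value, reply
    return best_value, best_reply

# Uniform prior over the K hypotheses; the replies x reactions branching is why
# this kind of search quickly becomes intractable at realistic depth.
print(plan([1 / 3] * 3))
```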

Literature review: is there work on transformers playing games with each other?

Toy problem:

  • Create models of several interlocutors by
    • fine-tuning an LM on text by specific individuals
    • somehow also training a reward model on conversations for each individual
  • Train an agent LM as defined in #5.
  • Show that it effectively learns and optimizes its interlocutor's reward.
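
A rough, runnable sketch of this pipeline, with everything toy-sized and every name invented for illustration: `InterlocutorLM` stands in for an LM fine-tuned on one individual's text together with that individual's reward model, and `AgentLM` is a tabular softmax policy rather than a transformer. The point is just the shape of the #5 loop: sample a hidden partner, roll out a conversation in which the agent only sees the partner's reactions (so "thank you"-style feedback is its only evidence about the partner's values, as in the 'thank you' point under #3), and reinforce with the partner's true reward.

```python
# Toy pipeline for the bullets above. All names and numbers are illustrative.
import math
import random
from collections import defaultdict

UTTERANCES = ["fact", "question", "joke"]


class InterlocutorLM:
    """Stand-in for an LM fine-tuned on one individual's text, bundled with
    that individual's reward model over conversations."""
    def __init__(self, preferred):
        self.preferred = preferred                   # the utterance type this person values

    def respond(self, utterance):
        return "thanks" if utterance == self.preferred else "hmm"

    def reward(self, conversation):
        return sum(1.0 for u in conversation if u == self.preferred)


class AgentLM:
    """Stand-in for the agent LM: a tabular softmax policy whose context is
    'start' or (its own last utterance, the partner's reaction)."""
    def __init__(self, lr=0.05):
        self.theta = defaultdict(lambda: {u: 0.0 for u in UTTERANCES})
        self.lr = lr

    def probs(self, context):
        z = sum(math.exp(v) for v in self.theta[context].values())
        return {u: math.exp(v) / z for u, v in self.theta[context].items()}

    def act(self, context):
        p = self.probs(context)
        return random.choices(list(p), weights=list(p.values()))[0]

    def reinforce(self, trajectory, ret):
        # REINFORCE for a tabular softmax policy: grad log pi(a) = 1[u == a] - pi(u).
        for context, action in trajectory:
            p = self.probs(context)
            for u in UTTERANCES:
                self.theta[context][u] += self.lr * ret * ((u == action) - p[u])


def episode(agent, partner, turns=4):
    """One simulated conversation; the agent never observes which partner it got."""
    context, conversation, trajectory = "start", [], []
    for _ in range(turns):
        utterance = agent.act(context)
        trajectory.append((context, utterance))
        conversation.append(utterance)
        context = (utterance, partner.respond(utterance))   # only the reaction leaks information
    return trajectory, partner.reward(conversation)


interlocutors = [InterlocutorLM(p) for p in UTTERANCES]     # one model per individual
agent = AgentLM()
for _ in range(5000):
    partner = random.choice(interlocutors)                  # hidden draw over the K individuals
    trajectory, ret = episode(agent, partner)
    agent.reinforce(trajectory, ret)

# Success looks like in-conversation adaptation: most mass on repeating "fact"
# after a ("fact", "thanks") context, and on switching after a ("fact", "hmm") one.
print(agent.probs(("fact", "thanks")), agent.probs(("fact", "hmm")))
```

The obvious gap relative to #5 is that the partner model here is hand-coded; once it is itself a learned model, the adversarial-example concern from the challenges above comes straight back.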