Created: July 17, 2021
Modified: July 18, 2021
steering language models
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
- Getting language models to align their output with human preferences would be highly useful for computational life coaching. What's the state of the art?
- Learning to summarize from human feedback (OpenAI): starting with a pretrained transformer language model, they try to get it to produce summaries (tl;dr's) of Reddit posts. They consider prompting the pretrained model directly (with a couple of examples of good summaries in the context), supervised fine-tuning on a dataset of summaries, and then RL against a 'reward model'. The reward model is trained on human comparisons to predict which of two candidate summaries the labeler preferred; it outputs a scalar reward, and the difference between the two summaries' rewards is the logit of that preference classifier. The language model is then trained, via PPO, to maximize reward minus a KL penalty against the supervised fine-tuned policy (to keep it from drifting too far). Rough sketch of the objective below.
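If I'm remembering the paper's objective right, the per-summary reward is roughly R(x, y) = r_phi(x, y) - beta * (log pi_RL(y|x) - log pi_SFT(y|x)), i.e. the reward model's score minus a penalty for drifting from the supervised (SFT) policy. A minimal sketch of that computation; the names, the beta value, and the sampled-token KL estimate are my shorthand, not the paper's code:

```python
import torch

def kl_penalized_reward(rm_score, rl_logprobs, sft_logprobs, beta=0.1):
    """KL-penalized reward for a sampled summary y given post x:
    R = r_phi(x, y) - beta * sum_t [log pi_RL(y_t | ...) - log pi_SFT(y_t | ...)],
    i.e. the reward model's score minus a sampled-token estimate of the KL
    between the RL policy and the supervised (SFT) policy."""
    kl_estimate = (rl_logprobs - sft_logprobs).sum(dim=-1)  # (batch,)
    return rm_score - beta * kl_estimate

# toy usage: 2 sampled summaries, 5 tokens each, made-up numbers
rm_score = torch.tensor([1.3, -0.2])   # scalar scores from the reward model
rl_logprobs = -torch.rand(2, 5)        # per-token log-probs under the RL policy
sft_logprobs = -torch.rand(2, 5)       # per-token log-probs under the SFT policy
print(kl_penalized_reward(rm_score, rl_logprobs, sft_logprobs))
```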
- Could we do the reward maximization differentiably?
- Words are discrete, of course. But presumably you could tie the LM and the reward model to use the same word embeddings at the last layer, and then differentiate through them.
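A toy version of that idea: treat the LM's output distribution as 'soft tokens', take their expected embedding under the shared embedding table, and feed that into a stand-in reward model. All shapes and modules here are made up:

```python
import torch
import torch.nn.functional as F

V, D, T = 100, 16, 7                  # made-up vocab size, embedding dim, summary length
emb = torch.nn.Embedding(V, D)        # embedding table shared by the LM and the reward model
reward_model = torch.nn.Linear(D, 1)  # stand-in for the reward model on top of embeddings

lm_logits = torch.randn(1, T, V, requires_grad=True)  # pretend LM outputs for T positions

# Sampling discrete tokens would break the gradient. Instead, feed the reward model
# the expected embedding under the LM's output distribution at each position.
probs = F.softmax(lm_logits, dim=-1)                          # (1, T, V)
soft_embeddings = probs @ emb.weight                          # (1, T, D)
reward = reward_model(soft_embeddings.mean(dim=1)).squeeze()  # toy pooled scalar "reward"

reward.backward()            # gradients flow all the way back into the LM's logits
print(lm_logits.grad.shape)  # torch.Size([1, 7, 100])
```

The named versions of this kind of trick are things like Gumbel-softmax or straight-through estimators; the open question is whether the reward model's scores mean anything for these off-distribution 'soft' inputs.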
- Would it help to have uncertainty in the reward model?
- We're maximizing expected reward. So if the reward model had epistemic uncertainty, we would want to integrate over that. This should help prevent overfitting to the reward model.
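One concrete way to represent that epistemic uncertainty would be a small ensemble of reward models, maximizing the ensemble mean as a Monte Carlo estimate of the expected reward. A sketch with made-up pieces:

```python
import torch

K, D = 4, 16                                          # made-up ensemble size and feature dim
ensemble = [torch.nn.Linear(D, 1) for _ in range(K)]  # stand-in reward models

def expected_reward(summary_features):
    """Monte Carlo estimate of E_theta[r_theta(x, y)] over the ensemble; the spread
    (std) across members could also be subtracted to be pessimistic when uncertain."""
    rewards = torch.stack([rm(summary_features).squeeze(-1) for rm in ensemble])  # (K, batch)
    return rewards.mean(dim=0)

features = torch.randn(2, D)  # pretend features for 2 candidate summaries
print(expected_reward(features))
```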
- How does the PPO training work?
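My rough recollection, not the paper's code: PPO maximizes a clipped surrogate objective over samples from the current policy, with the advantage coming from the KL-penalized reward above minus a learned value baseline. A generic sketch of the clipped loss:

```python
import torch

def ppo_clipped_loss(logprob_new, logprob_old, advantage, eps=0.2):
    """Generic PPO clipped surrogate (to minimize):
    -E[min(r * A, clip(r, 1-eps, 1+eps) * A)], with r = exp(logp_new - logp_old).
    eps=0.2 is the usual default, not necessarily what the paper used."""
    ratio = torch.exp(logprob_new - logprob_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()

# toy usage with made-up numbers
lp_new = torch.tensor([-1.0, -0.5], requires_grad=True)
lp_old = torch.tensor([-1.2, -0.4])
adv = torch.tensor([0.8, -0.3])  # e.g. KL-penalized reward minus a value baseline
print(ppo_clipped_loss(lp_new, lp_old, adv))
```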