reinforcement learning from human feedback
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

see: steering language models, direct preference optimization
We are given a bunch of pairwise preference evaluations, of the form

$$y_w \succ y_l \mid x,$$

which we take to be generated by some underlying reward function $r^*(x, y)$. In particular, we typically use the Bradley-Terry model, which is exactly the simplest thing you'd invent (perhaps after discarding a few even simpler rules, like "always prefer the higher-reward outcome", that can't model noisy/inconsistent preferences; this is equivalent to moving from binary classification with 0-1 loss to logistic regression, and if you're comfortable with that move then the Bradley-Terry model is just the 'obvious' application in this domain): the log-odds of a preference are just the difference in rewards,

$$\log \frac{p(y_w \succ y_l \mid x)}{p(y_l \succ y_w \mid x)} = r^*(x, y_w) - r^*(x, y_l),$$

or equivalently $p(y_w \succ y_l \mid x) = \sigma\big(r^*(x, y_w) - r^*(x, y_l)\big)$ for the logistic sigmoid $\sigma$.
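Concretely, fitting a reward model under Bradley-Terry is just logistic regression on reward differences. A minimal sketch in PyTorch, assuming a hypothetical `reward_model` callable that maps batches of (prompt, completion) pairs to scalar rewards:

```python
import torch.nn.functional as F

def bradley_terry_loss(reward_model, prompts, chosen, rejected):
    """Negative log-likelihood of observed preferences under Bradley-Terry.

    reward_model(prompts, completions) -> tensor of scalar rewards, one per pair.
    (Hypothetical interface; a real implementation would score tokenized text.)
    """
    r_chosen = reward_model(prompts, chosen)      # r_phi(x, y_w)
    r_rejected = reward_model(prompts, rejected)  # r_phi(x, y_l)
    # p(y_w > y_l | x) = sigmoid(r_w - r_l), so the NLL is -log sigmoid(r_w - r_l).
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```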
Classically we would then parameterize a reward model $r_\phi(x, y)$ and optimize for the $\phi$ that maximizes the likelihood of the observed preferences. Note that the reward is only learnable up to a (prompt-specific) constant, because the likelihood depends only on the difference in rewards at a given prompt $x$. Once we have the reward, we then use a reinforcement learning algorithm (like proximal policy optimization) to maximize a regularized objective

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ r_\phi(x, y) \big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big].$$
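In practice the KL term is estimated from samples and folded into the reward that the RL algorithm sees. A minimal sketch, using a sequence-level penalty (real implementations often apply it per token, and the tensor shapes here are assumptions):

```python
import torch

def kl_regularized_reward(reward: torch.Tensor,
                          logprobs_policy: torch.Tensor,
                          logprobs_ref: torch.Tensor,
                          beta: float = 0.1) -> torch.Tensor:
    """Fold the KL penalty into the reward handed to the RL algorithm.

    reward:          r_phi(x, y) for each sampled completion, shape (batch,)
    logprobs_policy: log pi_theta(y | x), summed over tokens, shape (batch,)
    logprobs_ref:    log pi_ref(y | x), summed over tokens, shape (batch,)

    E_{y ~ pi_theta}[log pi_theta - log pi_ref] is the KL term, so maximizing
    this shaped reward in expectation recovers the regularized objective above.
    """
    return reward - beta * (logprobs_policy - logprobs_ref)
```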
This objective tries to maximize reward without diverging too far from a reference policy $\pi_{\mathrm{ref}}$ (typically the 'supervised fine-tuning', i.e. behavioral cloning, policy). The KL regularization is closely related to the entropy regularization in maximum-entropy reinforcement learning, so the same math and algorithms go through.
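One way to see the connection: the regularized objective has a closed-form optimum of the same softmax form as in maximum-entropy RL, just with $\pi_{\mathrm{ref}}$ in place of the uniform distribution (this is also the starting point for direct preference optimization):

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x) \exp\!\big( r_\phi(x, y) / \beta \big),$$

where $Z(x)$ is a normalizing constant.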