
reinforcement learning from human feedback

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

see: steering language models, direct preference optimization

We are given a bunch of pairwise preference evaluations, of the form

y_1 \succ y_2 | x,

which we take to be generated by some underlying reward function $r$. In particular, we typically use the Bradley-Terry model, which is exactly the simplest thing you'd invent (perhaps after discarding a few even simpler things, like "always prefer the higher-reward outcome", that can't model noisy/inconsistent preferences; this is equivalent to moving from binary classification with 0-1 loss to logistic regression, and if you're comfortable with that move then the Bradley-Terry model is just the 'obvious' application in this domain): the log-odds of a preference are just the difference in rewards,

\begin{align*} p(y_1 \succ y_2 | x) &= \sigma\left(r(y_1 | x) - r(y_2 | x)\right)\\ &= \frac{\exp\left(r(y_1 | x)\right)}{\exp\left(r(y_1 | x)\right) + \exp\left(r(y_2 | x)\right)}. \end{align*}
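
For intuition, a minimal numeric sketch of this preference probability (plain Python; the reward values are made-up placeholders):

```python
import math

def bradley_terry_prob(r1: float, r2: float) -> float:
    """P(y_1 preferred over y_2) under Bradley-Terry: sigma(r1 - r2)."""
    return 1.0 / (1.0 + math.exp(-(r1 - r2)))

# A reward gap of 1.0 corresponds to roughly a 73% preference probability.
print(bradley_terry_prob(2.0, 1.0))   # ~0.731
# Only the gap matters: shifting both rewards by a constant changes nothing.
print(bradley_terry_prob(12.0, 11.0)) # ~0.731
```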

Classically we would then parameterize a reward model $r_\phi$ and optimize for the $\phi$ that maximizes the likelihood of the observed preferences. Note that the reward is only learnable up to a (prompt-specific) constant, because the likelihood depends only on the difference in rewards at a given prompt $x$.
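
A minimal sketch of this maximum-likelihood fit, assuming PyTorch (the `reward_model_loss` helper and the toy reward scores are purely illustrative, not from any particular RLHF library):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the observed preferences under Bradley-Terry:
    -log sigma(r_phi(y_1 | x) - r_phi(y_2 | x)), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: scalar rewards a model assigned to the preferred ("chosen") and
# dispreferred ("rejected") responses for a batch of three prompts. In real
# training these come from a reward head on a language model, and the loss is
# backpropagated into its parameters phi.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.5, -1.0])
print(reward_model_loss(r_chosen, r_rejected))
```

Because the loss sees only the difference in rewards, shifting both by a per-prompt constant leaves it unchanged, which is exactly the non-identifiability noted above.

Once we have the reward, we then use a reinforcement learning algorithm (like proximal policy optimization) to maximize a regularized objective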

\mathbb{E}_{x, y \sim \pi_\theta} \left[r_\phi(y, x) - \beta \mathcal{D}_{KL} \left(\pi_\theta(y|x) \| \pi_\text{ref}(y | x)\right)\right].

This objective tries to maximize reward without diverging too far from a reference policy $\pi_\text{ref}$ (typically the 'supervised fine-tuning', i.e. behavioral cloning, policy). The KL regularization is closely related to the entropy regularization in maximum-entropy reinforcement learning, so the same math and algorithms go through.
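
A minimal sketch of the regularized reward that a PPO-style trainer would maximize, assuming PyTorch (the per-sequence log-probabilities and reward scores below are placeholders for values you'd get from the policy, the reference model, and the reward model; the single-sample estimate $\log \pi_\theta(y|x) - \log \pi_\text{ref}(y|x)$ stands in for the KL term):

```python
import torch

def kl_regularized_reward(r_phi: torch.Tensor,
                          logp_theta: torch.Tensor,
                          logp_ref: torch.Tensor,
                          beta: float = 0.1) -> torch.Tensor:
    """Per-sample estimate of r_phi(y, x) - beta * KL(pi_theta || pi_ref),
    for responses y sampled from the current policy pi_theta at prompts x."""
    return r_phi - beta * (logp_theta - logp_ref)

# Toy usage with made-up values for a batch of two sampled responses.
r = torch.tensor([1.5, -0.2])              # reward model scores r_phi(y, x)
logp_theta = torch.tensor([-42.0, -30.0])  # log pi_theta(y | x)
logp_ref = torch.tensor([-45.0, -29.0])    # log pi_ref(y | x)
print(kl_regularized_reward(r, logp_theta, logp_ref, beta=0.1))
```

In practice implementations often spread the KL penalty over individual tokens rather than whole sequences, but the objective being estimated is the same.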