direct preference optimization
Created: May 31, 2023
Modified: May 31, 2023

direct preference optimization

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

References:

This seems like a compelling reframing of reinforcement learning from human feedback. Instead of first training a reward model on preference data, then using RL methods to maximize the reward, we can derive an objective that implicitly maximizes the reward function without ever learning or representing it explicitly. In a sense it marginalizes out the reward function from the whole setup. This should be easier to tune and much simpler than RL training.

Derivation

Direct preference optimization

We can avoid this two-step process (fit a reward model, then run RL against it) by rewriting the math to remove the reward function from the equation entirely! Generalizing the math from soft Q-learning to account for the KL regularization (TODO: go back and rewrite that page in the more general form), we see that the regularized objective is optimized by a policy $\pi^*$ that 'just' re-weights the reference policy by the exponentiated reward,

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_\text{ref}(y \mid x) \exp\left(\frac{1}{\beta}r_\phi(y, x)\right).$$
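For reference, the objective being optimized here is the standard KL-regularized reward maximization (writing it out since it's only alluded to above):

$$\max_{\pi}\; \mathbb{E}_{x}\,\mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[r_\phi(y, x)\right] - \beta\, D_\text{KL}\!\left(\pi(\cdot \mid x)\,\big\|\,\pi_\text{ref}(\cdot \mid x)\right),$$

where $Z(x) = \sum_y \pi_\text{ref}(y \mid x)\exp\left(\frac{1}{\beta} r_\phi(y, x)\right)$ is the partition function that normalizes the closed-form solution.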

Now we have an equation that relates the optimal policy $\pi^*$, the reference policy $\pi_\text{ref}$, and the reward function $r_\phi$. The clever move is that we can now rearrange this to represent the reward implicitly in terms of the ratio of the two policies!

$$r_\phi(y, x) = \beta \log \frac{\pi^*(y \mid x)}{\pi_\text{ref}(y \mid x)} + \beta\log Z(x).$$

Here the reward is determined only up to an intractable normalizing constant, but fortunately this cancels out in the Bradley-Terry model! Plugging this in, we get a new likelihood function in terms of a parameterized policy $\pi_\theta$,

$$\begin{align*} \mathcal{L}_\text{DPO}(\pi_\theta, \pi_\text{ref}) &= \mathbb{E}_{x,\, y_w \succ y_l} \left[\log p(y_w \succ y_l)\right]\\ &= \mathbb{E}_{x,\, y_w \succ y_l} \left[\log \sigma\left(\beta \left(\log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)\right)\right] \end{align*}$$

This is way better than the original formulation: it's just a maximum-likelihood objective on the preference data, with no explicit reward model and no RL loop.
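As a sanity check on what this objective actually computes, here's a minimal sketch in PyTorch (the function and argument names are mine, and it assumes per-token log-probs have already been summed into per-sequence log-probs for the chosen and rejected responses under both the policy and the reference model):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO loss sketch (negative of the log-likelihood above).

    Each argument is a tensor of shape (batch,) holding the total
    log-probability of a full response under the corresponding model.
    """
    # Implicit rewards beta * log(pi_theta / pi_ref); the beta * log Z(x)
    # term would appear in both and cancels in the difference below.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Bradley-Terry log-likelihood of the observed preference, negated so
    # that minimizing the loss maximizes the likelihood.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```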

Thoughts

The key insight is that, in KL-regularized RL, there is a direct correspondence between reward functions and policies. Every policy $\pi$ is the optimal policy for the reward function $\beta \log \frac{\pi}{\pi_\text{ref}}$ (up to an additive constant), so we can optimize in policy space instead of optimizing in reward space.
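Sanity check: plugging $r(y, x) = \beta \log \frac{\pi(y \mid x)}{\pi_\text{ref}(y \mid x)}$ into the closed-form solution above recovers $\pi$ itself,

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_\text{ref}(y \mid x)\exp\left(\log\frac{\pi(y \mid x)}{\pi_\text{ref}(y \mid x)}\right) = \frac{\pi(y \mid x)}{Z(x)} = \pi(y \mid x),$$

since in this case $Z(x) = \sum_y \pi_\text{ref}(y \mid x)\, \frac{\pi(y \mid x)}{\pi_\text{ref}(y \mid x)} = \sum_y \pi(y \mid x) = 1$.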

Why was this not obvious?

It seems like we're no longer doing 'discrete' RL. We're not treating generation as a sequential decision problem: at this level we don't need to decompose the generation into individual steps and learn a value function at each step, the way we would for actor-critic RL. We're just trying to upweight good (entire) sequences and downweight bad ones. And this makes sense because the preference data we have is at the sequence level; we never observe a granular per-step reward.

Is it limiting that we train a policy only at the $y$ that we have preference data for? That will in general be a very small fraction of the space, but we're relying on this to generalize to other text we might generate. I guess the point is that this is all that was ever going to be present in the reward signal anyway. But in the RLHF setup, it's the architecture of the reward model that controls generalization (if the reward model is tabular, then preference learning will give us a reward function that is only nonzero at the observed points, so that RL will only update at those points anyway), whereas here it's the architecture of the policy model (which in turn determines the form of the implicit reward model).

Do the preferences need to be sampled from the reference model or can they be arbitrary?

How does this generalize to non-LM contexts? It seems like the basic paradigm would be applicable any time we want to train a reinforcement learning system to do a task:

  • First use behavioral cloning on human data to learn a reference policy
  • Allow that policy to act, and have humans rank its attempts
  • Fine-tune the policy on those rankings with the DPO objective, using the cloned policy as the reference

Are we limited to pairwise preference feedback? Apparently the approach generalizes to multi-way preferences under the Plackett-Luce model, which generalizes Bradley-Terry. What about binary (thumbs-up / thumbs-down) feedback, or star ratings, etc.? I guess this wouldn't carry over directly, since it's the relative nature of the feedback that allowed us to cancel the normalizing constant in the reward expression. But presumably there's stuff you could do, e.g., algorithms that try to learn the normalizing constant, which might still be easier than learning the full reward function?
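For my own reference, the Plackett-Luce version (as I understand it) models a full ranking $\tau$ over $K$ responses by repeatedly applying the softmax-over-remaining-items rule, and the implicit-reward substitution goes through the same way because the $\beta \log Z(x)$ term appears in both the numerator and the denominator of each factor:

$$p(\tau \mid y_1, \dots, y_K, x) = \prod_{k=1}^{K} \frac{\exp\left(\beta \log \frac{\pi_\theta(y_{\tau(k)} \mid x)}{\pi_\text{ref}(y_{\tau(k)} \mid x)}\right)}{\sum_{j=k}^{K} \exp\left(\beta \log \frac{\pi_\theta(y_{\tau(j)} \mid x)}{\pi_\text{ref}(y_{\tau(j)} \mid x)}\right)}.$$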

What if we have finer-grained preferences? If two responses are both good up to a point, where one of them makes a very clear mistake, we should be able to annotate where the mistake occurs. Would this algorithm generalize to accept preference data at different levels of granularity (e.g., response A is better than B, but also step 3 of response B is better than step 3 of response A)? It seems like it should, if you just treat the fragments as totally separate generations.

Does this work with AI feedback instead of human feedback (constitutional AI)? It should: all you need are preference ratings; where they come from is immaterial.

Reward uncertainty

Since we only observe preferences on a limited set of prompts, we can never hope to learn the reward function (or the roughly-equivalent "preference function", if you prefer) exactly. A potential issue with naive RLHF is that it doesn't preserve epistemic uncertainty about the reward function. Even if there isn't enough preference data to pin down the reward function, the two-step process will choose some reward function and then hyper-optimize it. That will lead to extreme and perhaps unaligned behavior, since what is optimal under the chosen reward might be terrible under other plausible rewards.

We would rather be conservative and find behavior that is good for all, or at least most, of the reward functions consistent with the observed preferences. A natural way to frame this is to optimize expected reward, where the expectation is taken over a posterior distribution $p(r \mid \succ)$ on reward functions $r$ consistent with our preferences $\succ$. Put differently, by taking the expectation we are marginalizing out the reward function.

$$\mathcal{L}_\text{marginalized}(\theta \mid x, \succ) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\,\mathbb{E}_{r \sim p(r \mid x, \succ)}\left[r(y \mid x)\right].$$
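Just to make the double expectation concrete, a toy Monte-Carlo sketch (everything here is hypothetical; in particular, a sampler for the reward posterior is exactly the hard part):

```python
import torch

def marginalized_objective_estimate(y_samples, reward_fn_samples):
    """Monte-Carlo estimate of E_{y ~ pi_theta(.|x)} E_{r ~ p(r|x, prefs)} [r(y|x)].

    y_samples: responses sampled from the current policy pi_theta(.|x).
    reward_fn_samples: callables drawn from the reward posterior; each maps
    a response y to a scalar tensor r(y|x).
    """
    # Average the sampled rewards over both sources of randomness.
    values = [r_fn(y) for r_fn in reward_fn_samples for y in y_samples]
    return torch.stack(values).mean()
```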

Does DPO do this, or something like this?

The posterior $p(r \mid x, \succ)$ would fall out from a prior on rewards and the likelihood given by the Bradley-Terry model:

$$\begin{align*} p(r \mid x, \succ) &\propto p(r) \cdot p(\succ \mid r, x) \\ &= p(r) \cdot \prod_{i=1}^n p(y_{w_i} \succ y_{l_i} \mid r, x)\\ &= p(r) \cdot \prod_{i=1}^n \frac{\exp(r(y_{w_i} \mid x))}{\exp(r(y_{w_i} \mid x)) + \exp(r(y_{l_i} \mid x))} \end{align*}$$
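A minimal sketch of the corresponding unnormalized log-posterior for one candidate reward function, assuming we're handed its values on the observed pairs and a log-prior from somewhere (e.g. implicitly from weight decay on a reward model):

```python
import torch
import torch.nn.functional as F

def bt_log_posterior(reward_chosen, reward_rejected, log_prior):
    """Unnormalized log p(r | x, preferences) for one candidate reward r.

    reward_chosen, reward_rejected: tensors of shape (n,) holding r(y_w|x)
    and r(y_l|x) on the n observed preference pairs.
    log_prior: scalar log p(r) for this candidate.
    """
    # Bradley-Terry: p(y_w > y_l) = exp(r_w) / (exp(r_w) + exp(r_l))
    # = sigmoid(r_w - r_l), so the log-likelihood is a sum of
    # log-sigmoids of reward differences.
    log_lik = F.logsigmoid(reward_chosen - reward_rejected).sum()
    return log_prior + log_lik
```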

In RLHF there's no explicit prior on rewards, but there is an implicit prior induced by the reward model architecture and by the reward-learning procedure.

Although there's no explicit prior on rewards, the entropy-regularized RL objective does in a sense put a prior on the optimal policy, namely, that it should be close to the reference policy $\pi_\text{ref}$.

Does it make sense to map this into reward space? Instead of an objective to maximize reward subject to being close to the reference policy, can we just have an objective to maximize a modified reward, where we add a term that rewards whatever the reference policy would do?

I think a distinction is that rewards are framed in terms of actions, while the regularization is framed in terms of distributions. If the reference policy is a uniform distribution over actions, then we're not rewarding any particular action (so there's no action-specific modification we can make to the reward), but we still penalize certainty.
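Writing out the KL term makes this concrete: the part of the regularizer that behaves like a reward is $\beta \log \pi_\text{ref}$, and what's left over is an entropy bonus with no per-action analogue,

$$\mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[r(y, x)\right] - \beta\, D_\text{KL}\!\left(\pi(\cdot \mid x)\,\big\|\,\pi_\text{ref}(\cdot \mid x)\right) = \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[r(y, x) + \beta \log \pi_\text{ref}(y \mid x)\right] + \beta\, \mathcal{H}\!\left(\pi(\cdot \mid x)\right).$$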

So we treat the entropy regularization as a distribution-level penalty on the policy, rather than something that can be folded into a per-action reward.