Modified: July 21, 2022
proximal policy optimization
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

references:
- paper: https://arxiv.org/abs/1707.06347
- great blog post on implementation details: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
Related to trust region policy optimization, but simpler, and possibly works better. PPO updates the policy using the autodiff gradient of the surrogate objective

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$

where

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

are importance weights. Why is this reasonable? First note that the gradient of the importance weight introduces a score function term,

$$\nabla_\theta\, r_t(\theta) = \frac{\nabla_\theta\, \pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} = r_t(\theta)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t),$$
so at the first step, where $\theta = \theta_{\text{old}}$ (thus we have $r_t(\theta) = 1$ and the clipping is moot), we simply recover the usual policy gradient,

$$\nabla_\theta\, L^{\text{CLIP}}(\theta)\Big|_{\theta = \theta_{\text{old}}} = \mathbb{E}_t\left[\hat{A}_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right].$$
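A minimal sketch of this surrogate objective in PyTorch (all names here are my own: `logp_new`, `logp_old`, `advantages`, `eps`); the old log-probabilities are assumed to be stored constants, so only the current policy carries gradients, and the sign is flipped so a standard optimizer can minimize it:

```python
import torch

def ppo_surrogate_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective, negated so an optimizer can minimize it."""
    ratio = torch.exp(logp_new - logp_old)  # importance weights r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Take the min of the clipped and unclipped terms, averaged over the batch.
    return -torch.min(unclipped, clipped).mean()
```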
On subsequent steps, where $\theta \neq \theta_{\text{old}}$, we still end up with the (importance-weighted) policy gradient objective as long as the importance weights don't leave the clipping region $[1-\epsilon, 1+\epsilon]$, i.e., as long as we haven't changed the probability of the current action by too much. When the action probability does change by so much that the importance weights get clipped, there are two possibilities:
- The change could be advantageous: it's making the objective go up, by reinforcing a positive-advantage action or suppressing a negative-advantage action.
- The change could be disadvantageous, making the objective go down. This case is less likely, because gradient ascent will 'try' to produce advantageous changes, but a parameter update that improves the policy on average may still worsen it in specific cases.
The clipping in PPO is a restriction on advantageous changes: once each action probability has been updated by a relative factor of $1+\epsilon$ (or $1-\epsilon$) in the appropriate direction, the objective has no incentive to make any further updates (in particular, clipping means the gradients will literally be zero). On the other hand, disadvantageous updates are still fully penalized, because we take the minimum of the clipped and unclipped objectives (using only the clipped objective, it would be impossible to recover from a step that accidentally made things worse, since past that point the gradients would disappear).
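A quick sanity check of that asymmetry, as a toy calculation of my own (not from the paper): with a positive advantage, a ratio already pushed past $1+\epsilon$ gets zero gradient, while a ratio pushed the wrong way, below $1-\epsilon$, still gets the full gradient:

```python
import torch

eps, adv = 0.2, torch.tensor(1.0)  # positive-advantage action

def surrogate_grad(r0):
    # Gradient of min(r*A, clip(r, 1-eps, 1+eps)*A) with respect to the ratio.
    ratio = torch.tensor(r0, requires_grad=True)
    obj = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    obj.backward()
    return ratio.grad.item()

print(surrogate_grad(1.5))  # 0.0 -- advantageous change past 1+eps: gradient switched off
print(surrogate_grad(0.5))  # 1.0 -- disadvantageous change below 1-eps: gradient survives
```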
The clipping is morally similar to a constraint that no action's probability can change by more than a relative factor of $1 \pm \epsilon$ (which would be a literal trust region), but instead of enforcing a hard constraint, PPO simply switches off the gradients that would tend to cause the constraint to be violated. Note that it's possible for PPO to update a probability by more than this factor, either by overshooting with a large step size, or as an accidental consequence of parameter updates aimed at other terms, but in the limit of an infinite-capacity policy updated with infinitesimally small steps it would recover the behavior of the hard constraint.
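To make the overshooting point concrete, here's a toy two-action example of my own construction: the gradient only switches off once the ratio has already left the clipping region, so a single large enough step can carry it well past $1+\epsilon$:

```python
import torch

eps = 0.2
logit = torch.zeros(2, requires_grad=True)               # two-action policy, initially uniform
logp_old = torch.log_softmax(torch.zeros(2), dim=0).detach()
action, advantage = 0, torch.tensor(1.0)                 # a positive-advantage action

logp_new = torch.log_softmax(logit, dim=0)[action]
ratio = torch.exp(logp_new - logp_old[action])           # = 1 at the first step
surrogate = torch.min(ratio * advantage,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)
surrogate.backward()

with torch.no_grad():
    logit += 10.0 * logit.grad                           # one deliberately huge ascent step

new_ratio = torch.exp(torch.log_softmax(logit, dim=0)[action] - logp_old[action])
print(new_ratio.item())  # ~2.0, well past 1 + eps = 1.2, despite the clipping
```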