Modified: July 21, 2022
proximal policy optimization
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

references:
- paper: https://arxiv.org/abs/1707.06347
- great blog post on implementation details: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
Related to trust region policy optimization, but simpler, and possibly works better. PPO updates the policy using the autodiff gradient of the surrogate objective

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$

where

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

are importance weights. Why is this reasonable? First note that the gradient of the importance weight introduces a score function term,

$$\nabla_\theta\, r_t(\theta) = \frac{\nabla_\theta\, \pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} = r_t(\theta)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t),$$
so at the first step, where $\theta = \theta_{\text{old}}$ (thus we have $r_t(\theta) = 1$ and the clipping is moot), we simply recover the usual policy gradient,

$$\nabla_\theta\, L^{\text{CLIP}}(\theta)\Big|_{\theta = \theta_{\text{old}}} = \mathbb{E}_t\left[\hat{A}_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right].$$
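A minimal sketch of this surrogate objective in PyTorch (all names here are my own: `logp_new`, `logp_old`, `advantages`, `eps`); the old log-probabilities are assumed to be stored constants, so only the current policy carries gradients, and the sign is flipped so a standard optimizer can minimize it:

```python
import torch

def ppo_surrogate_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective, negated so an optimizer can minimize it."""
    ratio = torch.exp(logp_new - logp_old)  # importance weights r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Take the min of the clipped and unclipped terms, averaged over the batch.
    return -torch.min(unclipped, clipped).mean()
```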
On subsequent steps, where $\theta \neq \theta_{\text{old}}$, we still end up with the (importance-weighted) policy gradient objective as long as the importance weights don't leave the clipping region $[1-\epsilon, 1+\epsilon]$, i.e., as long as we haven't changed the probability of the current action by too much. When the action probability does change by so much that the importance weights get clipped, there are two possibilities:
- The change could be advantageous: it's making the objective go up, by reinforcing a positive-advantage action or suppressing a negative-advantage action.
- The change could be disadvantageous, making the objective go down. This case is less likely, because gradient ascent will 'try' to produce advantageous changes, but a parameter update that improves the policy on average may still worsen it in specific cases.
The clipping in PPO is a restriction on advantageous changes: once each action probability has been updated by a relative factor of $1+\epsilon$ (or $1-\epsilon$) in the appropriate direction, the objective has no incentive to make any further updates (in particular, clipping means the gradients will literally be zero). On the other hand, disadvantageous updates are still fully penalized, because we take the minimum of the clipped and unclipped objectives (using only the clipped objective, it would be impossible to recover from a step that accidentally made things worse, since past that point the gradients would disappear).
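A quick sanity check of that asymmetry, as a toy calculation of my own (not from the paper): with a positive advantage, a ratio already pushed past $1+\epsilon$ gets zero gradient, while a ratio pushed the wrong way, below $1-\epsilon$, still gets the full gradient:

```python
import torch

eps, adv = 0.2, torch.tensor(1.0)  # positive-advantage action

def surrogate_grad(r0):
    # Gradient of min(r*A, clip(r, 1-eps, 1+eps)*A) with respect to the ratio.
    ratio = torch.tensor(r0, requires_grad=True)
    obj = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    obj.backward()
    return ratio.grad.item()

print(surrogate_grad(1.5))  # 0.0 -- advantageous change past 1+eps: gradient switched off
print(surrogate_grad(0.5))  # 1.0 -- disadvantageous change below 1-eps: gradient survives
```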
The clipping is morally similar to a constraint that no action's probability can change by more than a relative factor of $1 \pm \epsilon$ (which would be a literal trust region), but instead of enforcing a hard constraint, PPO simply switches off the gradients that would tend to cause the constraint to be violated. Note that it's possible for PPO to update a probability by more than this factor, either by overshooting with a large step size, or as an accidental consequence of parameter updates aimed at other terms, but in the limit of an infinite-capacity policy updated with infinitesimally small steps it would recover the behavior of the hard constraint.
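To make the overshooting point concrete, here's a toy two-action example of my own construction: the gradient only switches off once the ratio has already left the clipping region, so a single large enough step can carry it well past $1+\epsilon$:

```python
import torch

eps = 0.2
logit = torch.zeros(2, requires_grad=True)               # two-action policy, initially uniform
logp_old = torch.log_softmax(torch.zeros(2), dim=0).detach()
action, advantage = 0, torch.tensor(1.0)                 # a positive-advantage action

logp_new = torch.log_softmax(logit, dim=0)[action]
ratio = torch.exp(logp_new - logp_old[action])           # = 1 at the first step
surrogate = torch.min(ratio * advantage,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)
surrogate.backward()

with torch.no_grad():
    logit += 10.0 * logit.grad                           # one deliberately huge ascent step

new_ratio = torch.exp(torch.log_softmax(logit, dim=0)[action] - logp_old[action])
print(new_ratio.item())  # ~2.0, well past 1 + eps = 1.2, despite the clipping
```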