Created: July 06, 2022
Modified: July 21, 2022

proximal policy optimization

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.


Related to trust region policy optimization, but simpler, and it possibly works better in practice. PPO updates the policy using the autodiff gradient of the surrogate objective

\begin{align*}
J_\text{PPO}(\theta) = \mathbb{E}_{s, a\sim \pi_{\theta_\text{old}}}\min&\left( w_\theta A^{\pi_{\theta_\text{old}}}(s, a),\right.\\
&\left.\text{clip}\left(w_\theta, 1-\epsilon, 1+\epsilon\right)A^{\pi_{\theta_\text{old}}}(s, a)\right)
\end{align*}

where

w_\theta = \frac{\pi_{\theta}(a | s)}{\pi_{\theta_\text{old}}(a | s)}

are importance weights. Why is this reasonable? First note that the gradient of the importance weight introduces a score function term,

\nabla_\theta w_\theta = w_\theta \nabla_\theta \log \pi_\theta(a | s),

so at the first step, where $\theta = \theta_\text{old}$ (thus we have $w_\theta = 1$ and the clipping is moot), we simply recover the usual policy gradient,

\begin{align*}
\nabla_\theta J_\text{PPO}(\theta) &= \mathbb{E}_{s, a\sim \pi_{\theta_\text{old}}} \nabla_\theta\left(w_\theta A^{\pi_{\theta_\text{old}}}(s, a)\right)\\
&= \mathbb{E}_{s, a\sim \pi_{\theta_\text{old}}} \left(\nabla_\theta \log \pi_\theta(a | s)\right) A^{\pi_{\theta_\text{old}}}(s, a).
\end{align*}
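To make this concrete, here's a minimal sketch in JAX; the tabular softmax policy, the specific logits and advantages, and all names are made up for illustration. It evaluates the clipped surrogate for a single sampled action and checks that its gradient at $\theta = \theta_\text{old}$ matches the vanilla score-function policy gradient:

```python
import jax
import jax.numpy as jnp

# Toy setup (illustrative): one state, three actions, tabular softmax policy.
theta_old = jnp.array([0.2, -0.1, 0.5])   # behavior-policy logits
advantages = jnp.array([1.0, -0.5, 0.3])  # A^{pi_old}(s, a), treated as constants
epsilon = 0.2
a = 0                                     # a sampled action

def log_prob(theta, a):
    return jax.nn.log_softmax(theta)[a]

def ppo_objective(theta):
    """Clipped surrogate for the single sample (s, a)."""
    w = jnp.exp(log_prob(theta, a) - log_prob(theta_old, a))  # importance weight
    adv = advantages[a]
    return jnp.minimum(w * adv, jnp.clip(w, 1 - epsilon, 1 + epsilon) * adv)

def pg_objective(theta):
    """Vanilla policy-gradient surrogate: log pi_theta(a|s) * A, advantage held fixed."""
    return log_prob(theta, a) * advantages[a]

g_ppo = jax.grad(ppo_objective)(theta_old)
g_pg = jax.grad(pg_objective)(theta_old)
print(g_ppo, g_pg)                # identical at theta = theta_old
print(jnp.allclose(g_ppo, g_pg))  # True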

On subsequent steps where $\theta \ne \theta_\text{old}$, we still end up with the (importance-weighted) policy gradient objective as long as the importance weights don't leave the clipping region $[1-\epsilon, 1+\epsilon]$, i.e., as long as we haven't changed the probability of the current action $a$ by too much. When the action probability does change by so much that the importance weight gets clipped, there are two possibilities:

  1. The change could be advantageous: it's making the objective go up, by reinforcing a positive-advantage action or suppressing a negative-advantage action.
  2. The change could be disadvantageous, making the objective go down. This case is less likely because gradient ascent will 'try' to produce advantageous changes, but a parameter update that improves the policy overall may still worsen it in specific cases.

The clipping in PPO is a restriction on advantageous changes: once each action probability is updated by a relative factor of $\epsilon$ in the appropriate direction, the objective has no incentive to make any further updates (in particular, clipping means the gradients will literally be zero). On the other hand, disadvantageous updates are still fully penalized because we take the minimum of the clipped and unclipped objectives (using only the clipped objective, it would be impossible to recover from a step that accidentally made things $\epsilon$ worse, since past that point the gradients would disappear).
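As a quick sanity check on that claim, the illustrative sketch below treats the importance weight $w$ itself as the variable and differentiates the per-sample clipped surrogate with respect to it: advantageous moves past the clip boundary get zero gradient, while disadvantageous moves past the boundary still get the full gradient pushing back.

```python
import jax
import jax.numpy as jnp

epsilon = 0.2

def per_sample_objective(w, adv):
    """PPO's clipped surrogate, viewed as a function of the importance weight w."""
    return jnp.minimum(w * adv, jnp.clip(w, 1.0 - epsilon, 1.0 + epsilon) * adv)

grad_w = jax.grad(per_sample_objective, argnums=0)

# Advantageous changes pushed past the clip boundary: the gradient switches off.
print(grad_w(1.5, 1.0))   # positive advantage, w > 1 + eps  ->  zero gradient
print(grad_w(0.5, -1.0))  # negative advantage, w < 1 - eps  ->  zero gradient
# Disadvantageous changes past the boundary: the min keeps the unclipped branch,
# so the gradient still pushes back toward the clipping region.
print(grad_w(1.5, -1.0))  # negative advantage, w > 1 + eps  -> -1.0
print(grad_w(0.5, 1.0))   # positive advantage, w < 1 - eps  ->  1.0
```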

The clipping is morally similar to a constraint that no action's probability can change by more than $\epsilon$ (which would be a literal trust region), but instead of enforcing a hard constraint, PPO simply switches off the gradients that would tend to cause the constraint to be violated. Note that it's possible for PPO to update a probability by more than $\epsilon$, either by overshooting with a large step size, or as an accidental consequence of parameter updates aimed at other terms, but in the limit of an infinite-capacity policy updated with infinitesimally small steps it would recover the behavior of the hard constraint.
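Here's an illustrative overshoot in the same spirit (a made-up two-action toy policy, one gradient step): at $\theta_\text{old}$ the clipped objective's gradient is just the vanilla policy gradient, so a single large step can land the importance ratio well outside $[1-\epsilon, 1+\epsilon]$, even though the gradient would have vanished past the boundary.

```python
import jax
import jax.numpy as jnp

epsilon = 0.2
theta_old = jnp.array([0.0, 0.0])  # two-action softmax policy, logits
a, adv = 0, 1.0                    # sampled action and its (made-up) advantage

def objective(theta):
    w = jnp.exp(jax.nn.log_softmax(theta)[a] - jax.nn.log_softmax(theta_old)[a])
    return jnp.minimum(w * adv, jnp.clip(w, 1 - epsilon, 1 + epsilon) * adv)

for step_size in (0.05, 2.0):      # a modest step vs. an overly large one
    theta_new = theta_old + step_size * jax.grad(objective)(theta_old)
    ratio = jnp.exp(jax.nn.log_softmax(theta_new)[a]
                    - jax.nn.log_softmax(theta_old)[a])
    print(step_size, float(ratio))  # 0.05 -> ~1.025, 2.0 -> ~1.76 (past 1 + epsilon)
```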