trust region policy optimization
Created: July 05, 2022 Modified: July 06, 2022
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
The basic setup is that we want to optimize the RL objective $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$ by following Monte Carlo estimates of the policy gradient. Given a sampled trajectory $\tau \sim \pi_\theta$, we can define an unbiased estimate
$$\Delta_\tau(\theta) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau).$$
Since the gradient estimate is unbiased, $\mathbb{E}_{\tau \sim \pi_\theta}[\Delta_\tau(\theta)] = \nabla_\theta J(\theta)$, this defines a valid SGD procedure that will (under appropriate conditions) converge to a local maximum in policy space.
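As a concrete illustration, here is a minimal NumPy sketch of this estimate for a hypothetical discrete-action, linear-softmax policy; the policy class, helper names, and trajectory format are just assumptions for the example:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a|s) for a linear-softmax policy.

    theta: (state_dim, n_actions) weight matrix; the logits are theta.T @ s.
    Returns an array with the same shape as theta.
    """
    probs = softmax(theta.T @ s)
    onehot = np.zeros(theta.shape[1])
    onehot[a] = 1.0
    return np.outer(s, onehot - probs)

def mc_policy_gradient(theta, trajectory):
    """Monte Carlo estimate Delta_tau(theta) = sum_t grad log pi(a_t|s_t) * R(tau).

    trajectory: list of (state, action, reward) tuples collected under pi_theta.
    """
    total_return = sum(r for (_, _, r) in trajectory)
    return sum(grad_log_pi(theta, s, a) for (s, a, _) in trajectory) * total_return

# Hypothetical usage: a single stochastic gradient (ascent) step.
# theta = theta + alpha * mc_policy_gradient(theta, trajectory)
```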
But after going to all the effort of collecting that experience, it seems wasteful to throw it away after a single gradient step $\theta' \leftarrow \theta + \alpha\, \Delta_\tau(\theta)$. We are tempted to re-use the same experience to estimate a new 'gradient' update (or a sequence of such updates),
$$\theta'' \leftarrow \theta' + \alpha\, \Delta_\tau(\theta'),$$
but this does not work in general! We're no longer in the on-policy regime --- the policy $\pi_{\theta'}$ we're trying to update is no longer the same policy that collected our experience $\tau$ --- so the policy gradient theorem no longer applies. The step $\Delta_\tau(\theta')$ may make our policy worse.
When can we re-use experience?
Using the advantage formulation of the RL objective, we can write the improvement in value from new parameters $\theta'$ as an expectation under the new policy,
$$J(\theta') - J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta'}}\!\left[\sum_t \gamma^t A^{\pi_\theta}(s_t, a_t)\right],$$
in which we can use importance sampling over actions to correct for the fact that our samples come from the old policy's action distribution rather than the new one's. Unfortunately this trick doesn't work for the state visitation distribution, to which we have only empirical access through the environment. So if we want to optimize this objective with respect to $\theta'$, we have to hope that the state distribution under $\pi_{\theta'}$ isn't too different from that under $\pi_\theta$.
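As a sketch (with the batch format and helper name assumed purely for illustration), the resulting surrogate objective just reweights advantages computed under $\pi_\theta$ by the action-probability ratio, while the states still come from the old policy's visitation distribution:

```python
import numpy as np

def surrogate_objective(new_probs, old_probs, advantages):
    """Importance-sampled surrogate objective.

    new_probs:  pi_theta'(a_i | s_i) for each sampled pair, shape (N,)
    old_probs:  pi_theta(a_i | s_i)  for the same samples,  shape (N,)
    advantages: advantage estimates A^{pi_theta}(s_i, a_i), shape (N,)

    Only the action distribution is corrected by the ratio; the states s_i
    are still drawn from pi_theta's visitation distribution.
    """
    ratios = new_probs / old_probs
    return np.mean(ratios * advantages)
```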
Bounding the error in the objective
Suppose we knew that our new policy is close to the previous policy in the sense of total variation distance,
$$\left|\pi_{\theta'}(a \mid s) - \pi_\theta(a \mid s)\right| \le \epsilon \quad \forall s, a.$$
It follows that there exists a joint distribution on the actions of the two policies under which the new policy takes the same action as the old policy with probability at least $1-\epsilon$ at each step. In particular, we can decompose the state occupancy probability at time $t$ into the case where our $1-\epsilon$ coinflip works out at every step so far (and we definitely do the same thing as $\pi_\theta$) and the case where it doesn't (where we suppose anything can happen):
$$\pi_{\theta'}(s_t) = (1-\epsilon)^t\, \pi_\theta(s_t) + \left(1 - (1-\epsilon)^t\right) p_{\text{mistake}}(s_t),$$
implying a total variation bound on the state occupancies:
$$\sum_{s_t} \left|\pi_{\theta'}(s_t) - \pi_\theta(s_t)\right| \le 2\left(1 - (1-\epsilon)^t\right) \le 2\epsilon t.$$
Plugging this bound into the advantage formulation above shows that the importance-sampled surrogate we can actually compute differs from the true improvement $J(\theta') - J(\theta)$ by an error term of order $\epsilon$ times the horizon and the largest per-step reward, so the surrogate (up to that error) lower-bounds the quantity we actually want to maximize. We can therefore proceed by maximizing the lower bound! But we need to remember that the bound is valid only for policies $\pi_{\theta'}$ that are within a total variation distance $\epsilon$ of the collection policy $\pi_\theta$:
$$\max_a \left|\pi_{\theta'}(a \mid s) - \pi_\theta(a \mid s)\right| \le \epsilon \quad \forall s.$$
Thus we are left with a constrained optimization problem.
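Schematically, combining the surrogate with the closeness requirement, the problem is:
$$\max_{\theta'} \;\mathbb{E}_{s \sim \pi_\theta,\, a \sim \pi_\theta(\cdot \mid s)}\!\left[\frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)}\, A^{\pi_\theta}(s, a)\right] \quad \text{s.t.} \quad \max_a \left|\pi_{\theta'}(a \mid s) - \pi_\theta(a \mid s)\right| \le \epsilon \;\; \forall s.$$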
Optimization
We can't easily compute the total variation distance between two policies directly, but we can bound it by the KL divergence, which we typically can compute:
$$\max_x \left|p(x) - q(x)\right| \le \sqrt{\tfrac{1}{2} D_{KL}(p \,\|\, q)},$$
so that a policy $\pi_{\theta'}$ satisfying the KL constraint
$$D_{KL}\!\left(\pi_{\theta'}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\right) \le \delta \quad \forall s,$$
with $\delta \le 2\epsilon^2$, also satisfies the total variation constraint above. To optimize the surrogate subject to this constraint we can form the Lagrangian with a multiplier $\lambda$ on the KL term and follow a dual gradient ascent strategy, alternating steps in $\lambda$ and in $\theta'$. Alternatively, we can just fix $\lambda$ (rather than trying to adapt it to chase a fixed $\epsilon$) and use the preceding theory to justify following the KL-regularized objective.
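A rough sketch of one round of that dual gradient ascent, with the gradient and KL helpers as hypothetical stand-ins for whatever autodiff machinery is actually used:

```python
def dual_gradient_ascent_step(theta, lam, grad_surrogate, grad_kl, mean_kl,
                              epsilon, lr_theta=1e-2, lr_lambda=1e-2):
    """One round of dual gradient ascent on the Lagrangian
        L(theta', lambda) = surrogate(theta') - lambda * (KL(theta', theta) - epsilon).

    grad_surrogate(theta') -> gradient of the importance-sampled surrogate
    grad_kl(theta')        -> gradient of the mean KL to the collection policy
    mean_kl(theta')        -> mean KL divergence to the collection policy
    """
    # Primal step: ascend the Lagrangian in the policy parameters.
    theta = theta + lr_theta * (grad_surrogate(theta) - lam * grad_kl(theta))
    # Dual step: increase lambda when the KL constraint is violated, shrink it otherwise.
    lam = max(0.0, lam + lr_lambda * (mean_kl(theta) - epsilon))
    return theta, lam

# Fixing lam instead of updating it recovers the KL-regularized objective.
```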
Interpretation as natural gradient
Recall that natural gradient descent chooses the descent direction that minimizes the objective within a KL-ball of the current point; the constrained update above has exactly this form.
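To make the connection concrete: the KL term's value and gradient vanish at $\theta' = \theta$, so expanding it to second order and the objective to first order gives
$$D_{KL}\!\left(\pi_{\theta'} \,\|\, \pi_\theta\right) \approx \tfrac{1}{2}(\theta' - \theta)^\top F\, (\theta' - \theta), \qquad F = \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta\, \nabla_\theta \log \pi_\theta^\top\right],$$
and maximizing the linearized objective subject to this quadratic constraint yields a step along the natural gradient direction $F^{-1} \nabla_\theta J(\theta)$.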
Limitations
Because we need to compute the KL divergence, the policy distribution must be available in closed form. This precludes using dropout or other stochastic sampling inside the policy network.
We also can't share parameters between the policy and value networks. Why?