trust region policy optimization
Created: July 05, 2022
Modified: July 06, 2022

trust region policy optimization

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

(notes loosely based on the Berkeley deep RL course lecture)

Setup: RL with policy gradients

The basic setup is that we want to optimize the RL objective $J(\theta) = \mathbb{E}_{\tau\sim \pi_\theta} R(\tau)$ by following Monte Carlo estimates of the policy gradient. Given a sampled trajectory $\tau \sim \pi_\theta$, we can define an unbiased estimate

$$\Delta_\tau(\theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t | s_t) \sum_{t'=t}^{T-1} \gamma^{t'} r_{t'+1}$$

giving the update

$$\theta' \leftarrow \theta + \alpha \Delta_\tau(\theta)$$

Since the gradient estimate is unbiased, $\mathbb{E}_{\tau\sim\pi_\theta} \Delta_\tau(\theta) = \nabla_\theta J(\theta)$, this defines a valid SGD procedure that will (under appropriate conditions) converge to a local maximum in policy space.
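
As a minimal sketch of this update, here is roughly what one on-policy gradient step looks like in PyTorch. The names are assumptions for illustration: `policy_net` maps a batch of states to action logits for a discrete action space, `optimizer` is a torch optimizer over its parameters, and `trajectory` is a list of `(state, action, reward)` tuples collected by acting with the current policy.

```python
import torch

def policy_gradient_step(policy_net, optimizer, trajectory, gamma=0.99):
    """One on-policy gradient step from a single sampled trajectory."""
    states, actions, rewards = zip(*trajectory)
    T = len(rewards)

    # Discounted reward-to-go for each step t (gamma^{t'-t} convention; the
    # formula above uses gamma^{t'}, which differs by an overall gamma^t factor).
    returns = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running

    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in states])
    actions = torch.as_tensor(actions)

    logits = policy_net(states)  # assumed shape (T, num_actions)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)

    # Monte Carlo estimate Delta_tau(theta); minimize its negation to ascend J.
    loss = -(log_probs * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```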

But after going to all the effort to collect that experience it seems inefficient to then just throw it away after a single gradient step. We are tempted to try to re-use the same experience to estimate a new 'gradient' update (or sequence of such updates)

$$\theta'' \leftarrow \theta' + \alpha \Delta_\tau(\theta')$$

but this does not work in general! We're no longer in the on-policy regime --- the policy $\pi_{\theta'}$ we're trying to update is no longer the same policy that collected our experience $\tau$ --- so the policy gradient theorem no longer applies. The step $\Delta_\tau(\theta')$ may make our policy worse.

When can we re-use experience?

Using the advantage formulation of the RL objective we can write the improvement in value from new parameters $\theta'$ as the expectation

$$
\begin{align*}
J(\theta') - J(\theta) &\propto \mathbb{E}_{s_t \sim d_{\pi_{\theta'}, \gamma}}\left[\mathbb{E}_{a_t \sim \pi_{\theta'}(\cdot | s_t)}\left[A^{\pi_\theta}(s_t, a_t)\right]\right]\\
&= \mathbb{E}_{s_t \sim d_{\pi_{\theta'}, \gamma}}\left[\mathbb{E}_{a_t \sim \pi_\theta(\cdot | s_t)}\left[\frac{\pi_{\theta'}(a_t | s_t)}{\pi_\theta(a_t | s_t)}A^{\pi_\theta}(s_t, a_t)\right]\right]
\end{align*}
$$

in which we use importance sampling to correct for the inner expectation being taken under the old policy's action distribution. Unfortunately this trick doesn't work for the state visitation distribution, which we have only empirical access to through the environment. So if we want to optimize this objective with respect to $\theta'$, we have to hope that the state distribution under $\pi_{\theta'}$ isn't too different from that under $\pi_\theta$.
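
A sketch of the resulting surrogate objective in PyTorch (the names are placeholders: `new_log_probs` are $\log \pi_{\theta'}(a_t|s_t)$ for the sampled actions, `old_log_probs` are $\log \pi_\theta(a_t|s_t)$ recorded at collection time, and `advantages` are estimates of $A^{\pi_\theta}$):

```python
import torch

def surrogate_objective(new_log_probs, old_log_probs, advantages):
    """Importance-sampled surrogate: E_{s,a ~ pi_theta}[(pi_theta' / pi_theta) * A^{pi_theta}].

    The states are still sampled from the *old* policy's visitation distribution;
    only the action distribution is corrected by the likelihood ratio.
    """
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    return (ratio * advantages.detach()).mean()
```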

Bounding the error in the objective

Suppose we knew that our new policy is close to the previous policy in the sense of total variation distance,

$$|\pi_{\theta'}(a | s) - \pi_{\theta}(a | s)| \le \epsilon \;\;\forall s, a.$$

It follows that there exists a joint distribution on the actions of the two policies in which the new policy takes the same action as the old policy with probability at least $1 - \epsilon$ at each step. In particular, we can decompose the occupancy probability at time $t$ into the cases where our $1-\epsilon$ coinflip works out (and we definitely do the same thing as $\pi_\theta$) and the cases where it doesn't (where we suppose anything can happen):

$$\pi_{\theta'}(s_t) = (1 - \epsilon)^t \pi_\theta(s_t) + \left(1 - (1-\epsilon)^t\right)p_\text{mistake}(s_t)$$

implying a total variation bound on the state occupancies

$$
\begin{align*}
|\pi_{\theta'}(s_t) - \pi_\theta(s_t)| &\le \left(1 - (1-\epsilon)^t\right)|p_\text{mistake}(s_t) - \pi_\theta(s_t)|\\
&\le 1 - (1-\epsilon)^t\\
&\le \epsilon t
\end{align*}
$$

where the last step uses Bernoulli's inequality, $(1-\epsilon)^t \ge 1 - \epsilon t$.

Now we use a general bound for expectations under total variation distance,

$$
\begin{align*}
\mathbb{E}_{\pi_{\theta'}}f(s) &= \int \pi_{\theta'}(s) f(s)\,ds\\
&= \int \pi_{\theta}(s)f(s)\,ds + \int \left(\pi_{\theta'}(s) - \pi_{\theta}(s)\right) f(s)\,ds\\
&\ge \int \pi_{\theta}(s)f(s)\,ds - \int \left|\pi_{\theta'}(s) - \pi_{\theta}(s)\right| |f(s)|\,ds\\
&\ge \int \pi_{\theta}(s)f(s)\,ds - 2\left|\pi_{\theta'}(s) - \pi_{\theta}(s)\right| \cdot \max_s |f(s)|
\end{align*}
$$

In the specific case of interest, we're taking an expectation over advantages $A$, which as sums of rewards are bounded by $T\cdot r_\text{max}$, so we have

$$
\begin{align*}
\sum_{t=0}^{T-1}&\mathbb{E}_{s_t \sim \pi_{\theta'}}\mathbb{E}_{a_t \sim \pi_\theta}\left[\frac{\pi_{\theta'}(a_t | s_t)}{\pi_\theta(a_t | s_t)} A^{\pi_\theta}(s_t, a_t)\right]\\
&\ge \sum_{t=0}^{T-1}\mathbb{E}_{s_t, a_t \sim \pi_{\theta}}\left[\frac{\pi_{\theta'}(a_t | s_t)}{\pi_\theta(a_t | s_t)} A^{\pi_\theta}(s_t, a_t)\right] - \sum_{t=0}^{T-1} 2 \epsilon t\, T r_\text{max}
\end{align*}
$$

The quantity on the right, which we can compute, is a lower bound for the quantity on the left, which we want to maximize. So we can proceed by maximizing the lower bound! But we need to remember that the bound is valid only for policies $\pi_{\theta'}$ that are within a total variation distance $\epsilon$ of the collection policy $\pi_\theta$:

$$\max_a |\pi_{\theta'}(a | s) - \pi_{\theta}(a | s)| \le \epsilon \;\;\forall s.$$

Thus we are left with a constrained optimization problem.

Optimization

We can't easily compute the total variation distance between two policies directly, but we can bound it by the KL divergence, which we typically can compute:

$$\max_x |p(x) - q(x)| \le \sqrt{\frac{1}{2}\mathcal{D}_\text{KL}(p\|q)}$$

so that a policy $\pi_{\theta'}$ satisfying the KL constraint

$$\sqrt{\frac{1}{2}\mathcal{D}_\text{KL}(\pi_{\theta'}(\cdot | s) \| \pi_{\theta}(\cdot | s))} \le \epsilon \;\;\forall s$$

is sufficient for our bound to hold.
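
As a quick numeric sanity check of the inequality, here is a small NumPy snippet comparing the two sides for a pair of example categorical action distributions (nothing here is specific to TRPO; it just evaluates the bound as stated):

```python
import numpy as np

def pinsker_check(p, q):
    """Check max_x |p(x) - q(x)| <= sqrt(0.5 * KL(p || q)) for categorical p, q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    max_diff = np.max(np.abs(p - q))
    kl = np.sum(p * np.log(p / q))
    return max_diff, np.sqrt(0.5 * kl)

# Two action distributions over three actions for the same state.
max_diff, bound = pinsker_check([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
print(max_diff, bound, max_diff <= bound)  # 0.1  ~0.116  True
```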

Now we can construct a Lagrangian

$$\mathcal{L}(\theta', \lambda) = \sum_{t=0}^{T-1}\mathbb{E}_{s_t, a_t \sim \pi_{\theta}}\left[\frac{\pi_{\theta'}(a_t | s_t)}{\pi_\theta(a_t | s_t)} A^{\pi_\theta}(s_t, a_t)\right] - \lambda\left(\mathcal{D}_\text{KL}(\pi_{\theta'}(\cdot | s) \| \pi_{\theta}(\cdot | s)) - 2 \epsilon^2\right)$$

and follow a dual gradient ascent strategy, alternating steps in $\lambda$ and in $\theta'$. Alternatively, we can just fix $\lambda$ (rather than trying to adapt it to chase a fixed $\epsilon$) and use the preceding theory to justify following the KL-regularized objective.
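
A rough sketch of one round of that alternation in PyTorch, under the same assumptions as above (discrete actions, a hypothetical `policy_net` producing logits, `old_logits` recorded and detached at collection time). This illustrates the dual-gradient idea, not a faithful TRPO implementation:

```python
import torch
from torch.distributions import Categorical, kl_divergence

def dual_ascent_step(policy_net, policy_opt, old_logits, states, actions,
                     advantages, lam, epsilon, lam_lr=1e-2):
    """One round of dual gradient ascent on the Lagrangian L(theta', lambda).

    Primal step: gradient step on theta' with lambda held fixed.
    Dual step:   adjust lambda, raising the penalty when the constraint
                 D_KL <= 2*epsilon^2 is violated and relaxing it otherwise.
    """
    old_dist = Categorical(logits=old_logits)     # logits from collection time (detached)
    old_log_probs = old_dist.log_prob(actions)

    new_dist = Categorical(logits=policy_net(states))
    new_log_probs = new_dist.log_prob(actions)

    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    surrogate = (ratio * advantages.detach()).mean()

    # Closed-form per-state D_KL(pi_theta'(.|s) || pi_theta(.|s)), averaged over the batch.
    kl = kl_divergence(new_dist, old_dist).mean()

    lagrangian = surrogate - lam * (kl - 2 * epsilon**2)

    policy_opt.zero_grad()
    (-lagrangian).backward()   # optimizers minimize, so negate to ascend
    policy_opt.step()

    # Dual update, projected to keep lambda nonnegative.
    lam = max(0.0, lam + lam_lr * (kl.item() - 2 * epsilon**2))
    return lam
```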

Interpretation as natural gradient

Recall that natural gradient ascent chooses the update direction that most improves the objective within a small KL-ball around the current point, which is exactly the structure of the constrained problem above.
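
Concretely (standard reasoning, filling in a step not spelled out above): expanding the KL constraint to second order in $\theta' - \theta$ gives a quadratic form in the Fisher information matrix,

$$
\mathcal{D}_\text{KL}(\pi_{\theta'} \| \pi_\theta) \approx \frac{1}{2}(\theta' - \theta)^\top F (\theta' - \theta), \qquad F = \mathbb{E}_{s, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)^\top\right],
$$

so maximizing the linearized surrogate $\nabla_\theta J(\theta)^\top (\theta' - \theta)$ subject to this quadratic constraint yields an update direction proportional to $F^{-1}\nabla_\theta J(\theta)$, i.e. the natural gradient.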

Limitations

Because we need to compute the KL divergence, the policy distribution must be available in closed form. This precludes using dropout or other sampling inside the policy network.
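
For example, with a categorical or diagonal-Gaussian policy head the KL term is available analytically; the snippet below is just an illustration of that point (the distribution parameters are made up):

```python
import torch
from torch.distributions import Normal, kl_divergence

# Closed-form policies: the network outputs distribution parameters, not samples,
# so D_KL(pi_theta'(.|s) || pi_theta(.|s)) is available analytically and is differentiable.
old_policy = Normal(loc=torch.tensor([0.0]), scale=torch.tensor([1.0]))
new_policy = Normal(loc=torch.tensor([0.1]), scale=torch.tensor([0.9]))
print(kl_divergence(new_policy, old_policy))

# With dropout (or any other sampling) inside the policy network, pi_theta(a|s) becomes
# a mixture over dropout masks: we can still sample actions, but the density -- and hence
# the KL term in the constraint -- no longer has a closed form.
```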

We also can't share parameters between the policy and value networks. Why?