trust region policy optimization
Created: July 05, 2022 Modified: July 06, 2022
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
The basic setup is that we want to optimize the RL objective $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$ by following Monte Carlo estimates of the policy gradient. Given a sampled trajectory $\tau \sim \pi_\theta$, we can define an unbiased estimate
$$\Delta_\tau(\theta) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau).$$
Since the gradient estimate is unbiased, $\mathbb{E}_{\tau \sim \pi_\theta}[\Delta_\tau(\theta)] = \nabla_\theta J(\theta)$, this defines a valid SGD procedure that will (under appropriate conditions) converge to a local maximum in policy space.
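As a concrete illustration, here is a minimal NumPy sketch of this estimate for a hypothetical discrete-action, linear-softmax policy; the policy class, helper names, and trajectory format are just assumptions for the example:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a|s) for a linear-softmax policy.

    theta: (state_dim, n_actions) weight matrix; the logits are theta.T @ s.
    Returns an array with the same shape as theta.
    """
    probs = softmax(theta.T @ s)
    onehot = np.zeros(theta.shape[1])
    onehot[a] = 1.0
    return np.outer(s, onehot - probs)

def mc_policy_gradient(theta, trajectory):
    """Monte Carlo estimate Delta_tau(theta) = sum_t grad log pi(a_t|s_t) * R(tau).

    trajectory: list of (state, action, reward) tuples collected under pi_theta.
    """
    total_return = sum(r for (_, _, r) in trajectory)
    return sum(grad_log_pi(theta, s, a) for (s, a, _) in trajectory) * total_return

# Hypothetical usage: a single stochastic gradient (ascent) step.
# theta = theta + alpha * mc_policy_gradient(theta, trajectory)
```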
But after going to all the effort of collecting that experience, it seems wasteful to throw it away after a single gradient step $\theta' \leftarrow \theta + \alpha\, \Delta_\tau(\theta)$. We are tempted to re-use the same experience to estimate a new 'gradient' update (or a sequence of such updates),
$$\theta'' \leftarrow \theta' + \alpha\, \Delta_\tau(\theta'),$$
but this does not work in general! We're no longer in the on-policy regime --- the policy $\pi_{\theta'}$ we're trying to update is no longer the same policy that collected our experience $\tau$ --- so the policy gradient theorem no longer applies. The step $\Delta_\tau(\theta')$ may make our policy worse.
When can we re-use experience?
Using the advantage formulation of the RL objective, we can write the improvement in value from new parameters $\theta'$ as an expectation under the new policy,
$$J(\theta') - J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta'}}\!\left[\sum_t \gamma^t A^{\pi_\theta}(s_t, a_t)\right],$$
in which we can use importance sampling over actions to correct for the fact that our samples come from the old policy's action distribution rather than the new one's. Unfortunately this trick doesn't work for the state visitation distribution, to which we have only empirical access through the environment. So if we want to optimize this objective with respect to $\theta'$, we have to hope that the state distribution under $\pi_{\theta'}$ isn't too different from that under $\pi_\theta$.
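As a sketch (with the batch format and helper name assumed purely for illustration), the resulting surrogate objective just reweights advantages computed under $\pi_\theta$ by the action-probability ratio, while the states still come from the old policy's visitation distribution:

```python
import numpy as np

def surrogate_objective(new_probs, old_probs, advantages):
    """Importance-sampled surrogate objective.

    new_probs:  pi_theta'(a_i | s_i) for each sampled pair, shape (N,)
    old_probs:  pi_theta(a_i | s_i)  for the same samples,  shape (N,)
    advantages: advantage estimates A^{pi_theta}(s_i, a_i), shape (N,)

    Only the action distribution is corrected by the ratio; the states s_i
    are still drawn from pi_theta's visitation distribution.
    """
    ratios = new_probs / old_probs
    return np.mean(ratios * advantages)
```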
Bounding the error in the objective
Suppose we knew that our new policy is close to the previous policy in the sense of total variation distance,
$$\left|\pi_{\theta'}(a \mid s) - \pi_\theta(a \mid s)\right| \le \epsilon \quad \forall s, a.$$
It follows that there exists a joint distribution on the actions of the two policies under which the new policy takes the same action as the old policy with probability at least $1-\epsilon$ at each step. In particular, we can decompose the state occupancy probability at time $t$ into the case where our $1-\epsilon$ coinflip works out at every step so far (and we definitely do the same thing as $\pi_\theta$) and the case where it doesn't (where we suppose anything can happen):
$$\pi_{\theta'}(s_t) = (1-\epsilon)^t\, \pi_\theta(s_t) + \left(1 - (1-\epsilon)^t\right) p_{\text{mistake}}(s_t),$$
implying a total variation bound on the state occupancies:
$$\sum_{s_t} \left|\pi_{\theta'}(s_t) - \pi_\theta(s_t)\right| \le 2\left(1 - (1-\epsilon)^t\right) \le 2\epsilon t.$$
Plugging this bound into the advantage formulation above shows that the importance-sampled surrogate we can actually compute differs from the true improvement $J(\theta') - J(\theta)$ by an error term of order $\epsilon$ times the horizon and the largest per-step reward, so the surrogate (up to that error) lower-bounds the quantity we actually want to maximize. We can therefore proceed by maximizing the lower bound! But we need to remember that the bound is valid only for policies $\pi_{\theta'}$ that are within a total variation distance $\epsilon$ of the collection policy $\pi_\theta$:
$$\max_a \left|\pi_{\theta'}(a \mid s) - \pi_\theta(a \mid s)\right| \le \epsilon \quad \forall s.$$
Thus we are left with a constrained optimization problem.
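Schematically, combining the surrogate with the closeness requirement, the problem is:
$$\max_{\theta'} \;\mathbb{E}_{s \sim \pi_\theta,\, a \sim \pi_\theta(\cdot \mid s)}\!\left[\frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)}\, A^{\pi_\theta}(s, a)\right] \quad \text{s.t.} \quad \max_a \left|\pi_{\theta'}(a \mid s) - \pi_\theta(a \mid s)\right| \le \epsilon \;\; \forall s.$$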
Optimization
We can't easily compute the total variation distance between two policies directly, but we can bound it by the KL divergence, which we typically can compute:
$$\max_x \left|p(x) - q(x)\right| \le \sqrt{\tfrac{1}{2} D_{KL}(p \,\|\, q)},$$
so that a policy $\pi_{\theta'}$ satisfying the KL constraint
$$D_{KL}\!\left(\pi_{\theta'}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\right) \le \delta \quad \forall s,$$
with $\delta \le 2\epsilon^2$, also satisfies the total variation constraint above. To optimize the surrogate subject to this constraint we can form the Lagrangian with a multiplier $\lambda$ on the KL term and follow a dual gradient ascent strategy, alternating steps in $\lambda$ and in $\theta'$. Alternatively, we can just fix $\lambda$ (rather than trying to adapt it to chase a fixed $\epsilon$) and use the preceding theory to justify following the KL-regularized objective.
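A rough sketch of one round of that dual gradient ascent, with the gradient and KL helpers as hypothetical stand-ins for whatever autodiff machinery is actually used:

```python
def dual_gradient_ascent_step(theta, lam, grad_surrogate, grad_kl, mean_kl,
                              epsilon, lr_theta=1e-2, lr_lambda=1e-2):
    """One round of dual gradient ascent on the Lagrangian
        L(theta', lambda) = surrogate(theta') - lambda * (KL(theta', theta) - epsilon).

    grad_surrogate(theta') -> gradient of the importance-sampled surrogate
    grad_kl(theta')        -> gradient of the mean KL to the collection policy
    mean_kl(theta')        -> mean KL divergence to the collection policy
    """
    # Primal step: ascend the Lagrangian in the policy parameters.
    theta = theta + lr_theta * (grad_surrogate(theta) - lam * grad_kl(theta))
    # Dual step: increase lambda when the KL constraint is violated, shrink it otherwise.
    lam = max(0.0, lam + lr_lambda * (mean_kl(theta) - epsilon))
    return theta, lam

# Fixing lam instead of updating it recovers the KL-regularized objective.
```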
Interpretation as natural gradient
Recall that natural gradient descent chooses the descent direction that minimizes the objective within a KL-ball of the current point; the constrained update above has exactly this form.
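To make the connection concrete: the KL term's value and gradient vanish at $\theta' = \theta$, so expanding it to second order and the objective to first order gives
$$D_{KL}\!\left(\pi_{\theta'} \,\|\, \pi_\theta\right) \approx \tfrac{1}{2}(\theta' - \theta)^\top F\, (\theta' - \theta), \qquad F = \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta\, \nabla_\theta \log \pi_\theta^\top\right],$$
and maximizing the linearized objective subject to this quadratic constraint yields a step along the natural gradient direction $F^{-1} \nabla_\theta J(\theta)$.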
Limitations
Because we need to compute the KL divergence, the policy distribution must be available in closed form. This precludes using dropout or other stochastic sampling inside the policy network.
We also can't share parameters between the policy and value networks. Why?