policy gradient: Nonlinear Function
Created: March 29, 2022
Modified: January 06, 2023

policy gradient

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

(see also my deep RL notes from John Schulman's class several years ago, which cover much of the same material)

We can approach reinforcement learning as learning a policy \pi_\theta by following the gradient of its value V^{\pi_\theta}. For simplicity we'll assume a fixed initial state s_0 and fixed-length finite trajectories, but these results can be generalized to discounted-reward or average-reward notions of value in the continuing setting.

The policy gradient theorem says that

\begin{align*} \nabla_\theta V^{\pi_\theta} \propto \mathbb{E}_{s\sim\pi_\theta}\left[\sum_a \left(\nabla_\theta \pi_\theta(a | s)\right) Q^{\pi_\theta}(s, a)\right] \end{align*}

See the proof of the policy gradient theorem. Generally the score function trick is used to refine this into an expectation

\nabla_\theta V^{\pi_\theta} \propto \mathbb{E}_{s, a\sim\pi_\theta}\left[\left(\nabla_\theta \log \pi_\theta(a | s)\right) Q^{\pi_\theta}(s, a)\right]
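As a sanity check, the two forms agree in a toy single-state example. Below is a minimal numerical sketch (a softmax policy over three actions, with made-up Q values) comparing the exact sum over actions against a Monte Carlo estimate of the score-function form:

```python
# Single state, softmax policy: compare sum_a (grad pi(a)) Q(a) with a Monte
# Carlo estimate of E_{a~pi}[grad log pi(a) Q(a)]. Q values are made up.
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, -0.5, 0.1])   # logits for 3 actions
Q = np.array([1.0, 2.5, -0.5])       # hypothetical action values

pi = np.exp(theta - theta.max())
pi /= pi.sum()                       # softmax probabilities

# For a softmax policy, grad_theta log pi(a) = onehot(a) - pi.
scores = np.eye(3) - pi              # row a is the score vector for action a

# Exact gradient: sum_a grad pi(a) Q(a) = sum_a pi(a) * score(a) * Q(a).
exact = (pi[:, None] * scores * Q[:, None]).sum(axis=0)

# Sampled score-function estimator, averaged over many draws.
actions = rng.choice(3, size=200_000, p=pi)
mc = (scores[actions] * Q[actions, None]).mean(axis=0)

print(exact)  # the two printed vectors should be close
print(mc)
```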

The simplest possible policy gradient algorithm, called REINFORCE, computes a naive unbiased estimate of this gradient using a sampled trajectory \tau \sim \pi_{\theta}:

\theta \leftarrow \theta + \alpha \left[\sum_{t=0}^T v_t \cdot \nabla_\theta \log \pi_\theta(a_t | s_t)\right].

where v_t = \sum_{k=t}^T r_k is the empirical return at step t. Although unbiased, this update is exceedingly high-variance and not usually practical on its own. We'll discuss improvements below.
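For concreteness, here is a minimal sketch of tabular REINFORCE with a softmax policy. The environment and its reset()/step() interface are assumptions (a Gym-style discrete environment), and it includes an optional discount gamma even though the derivation above uses undiscounted returns:

```python
# Minimal tabular REINFORCE sketch: softmax policy, Monte Carlo returns.
# `env` is assumed to expose reset() -> state and step(a) -> (state, reward, done)
# over discrete states and actions; it is not defined here.
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce(env, n_states, n_actions, episodes=1000, alpha=0.01, gamma=1.0):
    theta = np.zeros((n_states, n_actions))   # policy logits
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        # Sample one trajectory tau ~ pi_theta.
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = rng.choice(n_actions, p=softmax(theta[s]))
            s_next, r, done = env.step(a)     # assumed interface
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        # Empirical returns v_t (optionally discounted).
        returns, v = np.zeros(len(rewards)), 0.0
        for t in reversed(range(len(rewards))):
            v = rewards[t] + gamma * v
            returns[t] = v
        # theta <- theta + alpha * sum_t v_t * grad log pi(a_t | s_t)
        grad = np.zeros_like(theta)
        for s, a, v_t in zip(states, actions, returns):
            g = -softmax(theta[s])
            g[a] += 1.0                       # grad log pi(a|s) = onehot(a) - pi
            grad[s] += v_t * g
        theta += alpha * grad
    return theta
```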

Why learn a policy

Why would we learn by following a policy gradient, instead of by estimating state or action values?

  • A learned policy can be stochastic.
  • A learned policy can act in a continuous space.
  • Policy learning may converge more reliably than value learning.

Potential downsides are that policy learning converges to local (not global) maxima, and the updates can be much higher variance.

Variance reduction

Where does the variance come from in a naive policy gradient estimate? Let's consider a simple model whose variance we can analyze in closed form: a stateless (bandit) setting with two possible actions a \in \{0, 1\}, a Bernoulli policy \pi_\theta parameterized by log-odds \theta,

\pi_\theta(a=1) = 1 / \left(1 + e^{-\theta}\right),

and stochastic reward represented by a random variable r such that r | a has expected value \bar{r}_1 or \bar{r}_0 (for a=1 or a=0, respectively), and variance \rho^2 (independent of the action).

Considering the two cases, under the first action a=1 the score function becomes

\nabla_\theta \log \pi_\theta(a=1) = \nabla_\theta \left(-1 - e^{-\theta}\right) = e^{-\theta}

and similarly for a=0 we have

\nabla_\theta \log \pi_\theta(a=0) = \nabla_\theta \left(-1 - e^{\theta}\right) = -e^\theta.

If we initialize at the uniform distribution over actions, \theta = 0, then these derivatives simplify to 1 and -1 respectively, so we have expected gradient

\begin{align*} \mathbb{E}[g] &= \pi_\theta(a=1) \cdot \mathbb{E}[r | a=1] \cdot e^{-\theta} - \pi_\theta(a=0) \cdot \mathbb{E}[r | a=0] \cdot e^{\theta}\\ &= \frac{1}{2}\left(\bar{r}_1 - \bar{r}_0\right).\end{align*}

This will tend to pull the parameter \theta towards whichever action has larger reward, as we would hope. Moving to the variance, we exploit the identity \text{Var}(g) = \mathbb{E}[g^2] - \mathbb{E}[g]^2, first computing

\begin{align*} \mathbb{E}[g^2] &= \frac{1}{2}\mathbb{E}[r^2 | a=1] \cdot 1^2 + \frac{1}{2}\mathbb{E}[r^2 | a=0] \cdot (-1)^2 \\ &= \frac{1}{2}\left(\rho^2 + \bar{r}_1^2\right) + \frac{1}{2}\left(\rho^2 + \bar{r}_0^2\right)\\ &= \frac{1}{2}\left(\bar{r}_1^2 + \bar{r}_0^2 + 2 \rho^2\right) \end{align*}

(in which the second line uses that same identity again, to express the expected squared reward as the sum of its variance and squared mean), and then plug this in to derive

\begin{align*} \text{Var}(g) &= \mathbb{E}[g^2] - \mathbb{E}[g]^2\\ &= \frac{1}{2}\left(\bar{r}_1^2 + \bar{r}_0^2 + 2\rho^2\right) - \frac{1}{4}\left(\bar{r}_1 - \bar{r}_0\right)^2\\ &= \left(\frac{1}{2}\bar{r}_1 + \frac{1}{2}\bar{r}_0\right)^2 + \rho^2\\ &= \mathbb{E}[r]^2 + \rho^2. \end{align*}

We see that the overall gradient variance is the sum of two terms: the intrinsic variance \rho^2 of the reward, and a term \mathbb{E}[r]^2 that measures the (squared) magnitude of the reward averaged over actions.

Note that this latter term is sensitive to how 'centered' the reward function is: for actions with expected rewards \bar{r}_1 = 5 and \bar{r}_0 = -5 this term would be zero, but if we apply a constant shift to \bar{r}_1 = 105 and \bar{r}_0 = 95 then this term contributes a very large quantity to the variance, even though the two situations are decision-theoretically equivalent. This observation motivates the first of our variance reduction techniques: we should try to 'center' the reward function (more generally, the action-value function) by estimating and subtracting the average or 'baseline' value at each state.
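A quick Monte Carlo sketch of the centering effect, using the single-sample estimator g = (r - b)\,\nabla_\theta \log \pi_\theta(a) for the logistic policy above. The reward numbers are made up, and the point is only the relative sizes of the empirical variances, not the exact constants:

```python
# Empirical gradient variance for the two-action bandit: centered rewards,
# shifted rewards, and shifted rewards with a baseline subtracted.
import numpy as np

def grad_samples(r_means, baseline=0.0, theta=0.0, rho=1.0, n=200_000, seed=0):
    """Single-sample gradients g = (r - baseline) * d/dtheta log pi_theta(a)."""
    rng = np.random.default_rng(seed)
    p1 = 1.0 / (1.0 + np.exp(-theta))              # pi_theta(a=1)
    a = rng.random(n) < p1                          # sampled actions
    r = np.where(a, r_means[1], r_means[0]) + rho * rng.standard_normal(n)
    score = np.where(a, 1.0 - p1, -p1)              # d/dtheta log pi_theta(a)
    return (r - baseline) * score

print(np.var(grad_samples((-5.0, 5.0))))                      # centered rewards
print(np.var(grad_samples((95.0, 105.0))))                    # shifted: much larger
print(np.var(grad_samples((95.0, 105.0), baseline=100.0)))    # baseline restores it
```

Subtracting a baseline close to the mean reward brings the variance back down to that of the centered case.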

So far we've considered only the initial gradient under the uniform policy, but ultimately we are hoping to reach the optimal policy, which in general will put all of its mass on whichever of the two actions yields higher expected reward. What happens in this extreme? Let's redo the analysis, replacing the uniform parameter \theta = 0 with the general case \theta = \log \frac{1-\epsilon}{\epsilon}, so that \pi_\theta puts mass 1-\epsilon on action 1 and only \epsilon on action 0. Now we have

\begin{align*} \mathbb{E}[g] &= \pi_\theta(a=1) \cdot \bar{r}_1 \cdot e^{-\theta} - \pi_\theta(a=0) \cdot \bar{r}_0 \cdot e^{\theta}\\ &= (1-\epsilon) \cdot \bar{r}_1 \cdot \frac{\epsilon}{1-\epsilon} - \epsilon \cdot \bar{r}_0 \cdot \frac{1-\epsilon}{\epsilon} \\ &= \epsilon \cdot \bar{r}_1 - (1 - \epsilon) \cdot \bar{r}_0\end{align*}

and

\begin{align*} \mathbb{E}[g^2] &= (1-\epsilon) \cdot \mathbb{E}[r^2 | a=1] \cdot \left(\frac{\epsilon}{1-\epsilon}\right)^2 + \epsilon \cdot \mathbb{E}[r^2 | a=0] \cdot \left(\frac{-(1-\epsilon)}{\epsilon}\right)^2 \\ &= \frac{\epsilon^2}{1-\epsilon} \left(\rho^2 + \bar{r}_1^2\right) + \frac{(1-\epsilon)^2}{\epsilon}\left(\rho^2 + \bar{r}_0^2\right). \end{align*}

We observe that these recover our previous results in the special case \epsilon = 0.5. But as \epsilon \to 0, we have

\begin{align*} \mathbb{E}[g] &\to -\bar{r}_0\\ \mathbb{E}[g^2] &\approx \frac{1}{\epsilon}\left(\rho^2 + \bar{r}_0^2\right)\\ &\to \infty\text{ as } \epsilon\to 0 \end{align*}

so the variance \mathbb{E}[g^2] - \mathbb{E}[g]^2 becomes infinite. That's not good! Looking back at the two cases, we see that this near-optimal policy will almost always (with probability 1-\epsilon) sample action 1, and the expected gradient \bar{r}_1 \cdot \frac{\epsilon}{1-\epsilon} in that case is essentially zero due to the score function term: a small change in \theta won't appreciably change the log-probability of action 1. The policy will almost never sample action 0, but when it does, the score function and thus the gradient will be huge. This is essentially due to the log-odds policy parameterization.

Baselines

Since the expectation of the score function itself is zero (at each state), we can subtract any action-independent multiple of it, a 'baseline' b(s), without changing the expectation. A particularly natural baseline for the Q-value (motivated by the toy model in the previous section, but which can also be derived by a more general analysis) is the state value V^{\pi}, obtained by averaging Q-values under the policy's action distribution. The thus-'centered' Q-value is called the advantage:

A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)

The advantage function can be estimated indirectly by maintaining separate models of Q and V, which can even be updated at different time scales (corresponding to different \lambda values). Or we can use the temporal difference error \delta_t as a direct estimate of the advantage, since:

\begin{align*} \mathbb{E}_{\pi}\left[\delta_t | s_t, a_t\right] &= \mathbb{E}_{\pi}\left[r_{t+1} + \gamma V^\pi(s_{t+1}) | s_t, a_t\right] - V^\pi(s_t)\\ &= Q^\pi(s_t, a_t) - V^\pi(s_t)\\ &= A^\pi(s_t, a_t) \end{align*}

In general we can of course use longer-timescale temporal differences \delta^{\lambda}_t, since these are all estimates of the same underlying quantity.
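A sketch of how this might look in code, given a trajectory of rewards and an array of estimated state values (both assumed as inputs): with \lambda = 0 it returns the one-step TD errors, and with \lambda > 0 it returns their exponentially weighted sums, as in generalized advantage estimation:

```python
# Advantage estimates from a sampled trajectory and a value estimate V_hat.
# `values` has length len(rewards) + 1 (it includes the final state).
import numpy as np

def advantages(rewards, values, gamma=0.99, lam=0.0):
    rewards, values = np.asarray(rewards), np.asarray(values)
    deltas = rewards + gamma * values[1:] - values[:-1]   # one-step TD errors
    adv = np.zeros_like(deltas)
    acc = 0.0
    for t in reversed(range(len(deltas))):
        acc = deltas[t] + gamma * lam * acc                # accumulate longer-timescale errors
        adv[t] = acc
    return adv
```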

Using a critic

The expression for the policy gradient includes Q-values, and we can use any estimate of these we like. We saw the naive Monte Carlo estimate above, in the REINFORCE algorithm. More generally, we can use modeled Q-values trained using a temporal difference method such as SARSA(\lambda). The model of Q-values is known as the 'critic', and the resulting overall method is an actor-critic method.

In general, the critic's value estimates will be biased, leading to biased policy gradients.

Continual learning and online updates

Using a critic we can apply a policy gradient update at each step based on the estimated return instead of waiting until the end of a trajectory to get its empirical return. Unlike trajectories as a whole, state-action pairs at different timesteps are not independent or identically distributed, so we no longer have the convergence guarantee of stochastic gradient ascent, if we ever did (the critic's biased gradients would also break these guarantees even if we only updated at the trajectory level).

Still, online updating may help the system learn faster in practice (indeed, in the continual-learning setting of a single never-ending trajectory this is the only way for any learning to happen).

Naively, the update to \theta at time t will be the score function \nabla_\theta \log \pi_\theta(a_t | s_t) times some scalar multiple \hat{A}(s_t, a_t): if the action gave us a positive advantage, we increase its probability, otherwise we decrease it. But we may also want to update the policy for previous actions based on our current experience. This can be done by replacing the score function with an eligibility trace that accumulates an exponentially decreasing average of the score functions from recent actions.
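A sketch of a single online actor-critic update with eligibility traces, in the tabular case. The trace decay \lambda, the step sizes, and the convention of scaling the accumulated trace by the TD error are one standard choice among several, and the array shapes here are assumptions:

```python
# One online actor-critic step with eligibility traces (tabular sketch).
# w: state-value weights, theta: policy logits (n_states x n_actions);
# z_w and z_theta are the corresponding traces, updated in place.
import numpy as np

def online_step(theta, w, z_theta, z_w, s, a, r, s_next, done,
                gamma=0.99, lam=0.9, alpha_theta=0.01, alpha_w=0.1):
    # TD error as a direct advantage estimate.
    v_next = 0.0 if done else w[s_next]
    delta = r + gamma * v_next - w[s]

    # Critic trace and update.
    z_w *= gamma * lam
    z_w[s] += 1.0
    w += alpha_w * delta * z_w

    # Actor trace accumulates recent score functions; the update scales it by delta.
    pi = np.exp(theta[s] - theta[s].max()); pi /= pi.sum()
    score = -pi
    score[a] += 1.0                  # grad log pi(a|s) for a softmax policy
    z_theta *= gamma * lam
    z_theta[s] += score
    theta += alpha_theta * delta * z_theta
    return delta
```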

As policy iteration

We can view policy-gradient methods as a form of 'softened' policy iteration, or generalized policy iteration. (Reference: Sergey Levine's Berkeley Deep RL lecture, https://www.youtube.com/watch?v=ySenCHPsKJU.) The loop is:

  1. Estimate A^{\pi_\theta}(s, a) for some or all state-action pairs. (In the simplest case, this estimate is just the Monte Carlo rewards seen in the collected trajectory.)
  2. Update the policy using these estimated advantages.

A full update would set \pi'(s) = \arg\max_a A^{\pi_\theta}(s, a); a policy-gradient step instead only nudges the policy in that direction, which is the sense in which the iteration is 'softened'.

Fully updating on experience

Experience is expensive. Given a sampled trajectory, we should aspire to a proper Bayesian update, which would incorporate all information present in the trajectory to narrow down the possible set of optimal policies. Actually implementing a Bayesian policy optimization would require

  • representing a distribution over optimal policies, rather than just a single policy, and
  • updating that distribution fully in response to new evidence - this would look like fully optimizing a variational objective, not just taking a single gradient step.

In practice, we might at least think of trying to take multiple gradient steps on the sample objective defined by some observed experience. The problem is that after taking the first such step, we're no longer in the on-policy regime and the policy gradient theorem no longer holds. This can be worked around by making sure to not travel too far away in policy space. One approach for this is trust region policy optimization (TRPO).
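A rough sketch of this idea: take several gradient steps on the importance-weighted surrogate \mathbb{E}[(\pi_\theta / \pi_{\text{old}}) \hat{A}], and stop early once an estimate of the KL divergence from the old policy gets too large. This is a simplified stand-in for a proper trust-region update, not TRPO itself, and policy_logp_and_grad is a hypothetical helper returning per-sample log-probabilities and their gradients with respect to a flat parameter vector:

```python
# Several gradient steps on an importance-weighted surrogate built from one
# batch of experience, with a crude KL-based early-stopping rule.
import numpy as np

def surrogate_steps(theta, states, actions, advantages, logp_old,
                    policy_logp_and_grad, alpha=0.01, max_steps=10, max_kl=0.01):
    for _ in range(max_steps):
        logp, grad_logp = policy_logp_and_grad(theta, states, actions)
        approx_kl = np.mean(logp_old - logp)     # sample estimate of KL(pi_old || pi_theta)
        if approx_kl > max_kl:                   # stop if we've drifted too far off-policy
            break
        ratio = np.exp(logp - logp_old)          # importance weights pi_theta / pi_old
        # Per-sample gradient of ratio * A is ratio * A * grad log pi_theta.
        g = np.mean((ratio * advantages)[:, None] * grad_logp, axis=0)
        theta = theta + alpha * g
    return theta
```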

More generally, we can decouple the training into multiple processes, which can operate asynchronously:

  1. One process collects data and stores it for experience replay.
  2. One process uses the replay buffer to train a value function, e.g., via Q-learning.
  3. One process optimizes the policy using the surrogate value function.

This kind of decomposition (which I've discussed in the context of deep deterministic policy gradient methods, though the ideas are more general) seems like a promising strategy for unifying value and policy-based approaches to RL.

Reparameterization and preconditioning

Once we get past all of the RL-specific aspects of gradient estimation, all the usual tricks for speeding up stochastic gradient optimizations still apply. It's important to choose a good parameterization for the actor and critic; ideas like natural gradient and mirror descent are applicable.

Deterministic policy

In continuous action spaces a \in \mathbb{R}^d it can be convenient to use a deterministic policy a = \mu_\theta(s), so that V^{\mu_\theta}(s) = Q(s, \mu_\theta(s)). Then

\begin{align*} \nabla_\theta V^{\mu_\theta}(s) &= \nabla_\theta Q(s, \mu_\theta(s))\\ &= \left(J_{\theta\to\mu_\theta(s)}\right)^T \nabla_a Q(s, a) \end{align*}

where J_{\theta\to\mu_\theta(s)} \in \mathbb{R}^{d\times n} is in general the Jacobian matrix of the policy, giving the sensitivity of each action dimension to each of the n policy parameters. Then we simply need the gradient of Q(s, a) with respect to the action a, evaluated at a = \mu_\theta(s), which we can query directly given a differentiable model of Q.
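A toy numerical sketch of this chain rule, with a linear policy \mu_\theta(s) = Ws and a hand-specified quadratic Q whose action gradient is available in closed form (all of the particular shapes and functions here are made up for illustration):

```python
# Deterministic policy gradient chain rule on a toy example, checked against
# a finite difference. Policy: mu(s) = W s; Q(s, a) = -||a - a_star||^2.
import numpy as np

rng = np.random.default_rng(0)
d_s, d_a = 3, 2
W = rng.standard_normal((d_a, d_s))       # policy parameters theta
s = rng.standard_normal(d_s)
a_star = np.array([1.0, -2.0])            # action maximizing Q at this state

def mu(W_, s_):
    return W_ @ s_

def grad_a_Q(a_):
    return -2.0 * (a_ - a_star)           # gradient of Q(s, a) w.r.t. a

# Chain rule: dQ/dW_ij = sum_k (dQ/da_k)(da_k/dW_ij) = grad_a_Q[i] * s[j],
# i.e. an outer product for a linear policy.
a = mu(W, s)
grad_W = np.outer(grad_a_Q(a), s)         # same shape as W

# Finite-difference check on one entry.
eps = 1e-5
Q_of = lambda W_: -np.sum((mu(W_, s) - a_star) ** 2)
W2 = W.copy(); W2[0, 1] += eps
print(grad_W[0, 1], (Q_of(W2) - Q_of(W)) / eps)   # should be close
```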

Note that if we optimized Q(s, \mu_\theta(s)) to completion with a sufficiently flexible policy family, we would obtain the greedy policy \mu(s) = \arg\max_a Q(s, a); in other words we would simply be doing (deep) Q-learning. Thus we can see this as an approach to Q-learning in continuous action spaces, where interleaving updates of \pi and Q is a case of generalized policy iteration.

Silver et al. (2014) shows that the above expression is also the limit of the usual stochastic policy gradient as the policy variance approaches zero.

Lillicrap et al. (2015) demonstrated that this can work with deep networks if we apply the DQN tricks of experience replay and target networks to learn the Q function.
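One of those tricks, in generic form: maintain a slowly-updated copy of the critic (and actor) parameters for computing bootstrap targets. DQN copies the parameters periodically, while DDPG uses a 'soft' exponential-moving-average update; a minimal sketch of the latter, assuming parameters stored as a dict of arrays:

```python
# 'Soft' target-network update: target parameters track the learned parameters
# with a slow exponential moving average (tau is a small constant, e.g. 1e-3).
def soft_update(target_params, params, tau=1e-3):
    for k in target_params:
        target_params[k] = (1.0 - tau) * target_params[k] + tau * params[k]
    return target_params
```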

Entropy bonus

One approach to encourage exploration (used, e.g., in A3C) is to add the entropy gradient \beta \nabla_\theta H(\pi_\theta(\cdot | s_t)) to the policy gradient at each step. This 'entropy bonus' tries to prevent the policy from collapsing into a deterministic policy, unless the reward function provides really strong pressure for it to do so.
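For a softmax policy the entropy gradient at a state has a simple closed form, \nabla_\theta H = -\pi \odot (\log \pi + H), so the combined per-step update direction is easy to compute. A minimal sketch (the advantage estimate and \beta are assumed inputs):

```python
# Per-step policy gradient plus entropy bonus for a softmax policy at one state.
import numpy as np

def entropy_bonus_update(theta_s, a, adv_hat, beta=0.01):
    """theta_s: logits at the current state; a: sampled action."""
    pi = np.exp(theta_s - theta_s.max()); pi /= pi.sum()
    score = -pi
    score[a] += 1.0                       # grad log pi(a | s)
    H = -np.sum(pi * np.log(pi))          # policy entropy at this state
    grad_H = -pi * (np.log(pi) + H)       # grad_theta H for a softmax policy
    return adv_hat * score + beta * grad_H   # ascent direction
```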

What objective does this procedure optimize? Well, everything happens inside an expectation over states reached by the policy, and we've seen that when we push gradients inside of that expectation we generally have to use the score function trick to account for the change in distribution introduced by a change in policy. But (unlike the usual policy gradient) the entropy bonus term doesn't include the score function! Thus we can view it as a modified gradient

\mathbb{E}_{s\sim\pi_\theta}\left[\nabla_\theta H(\pi_\theta(\cdot | s))\right] = \nabla_\theta \mathbb{E}_{s\sim\pi_\theta}\left[H(\pi_\theta(\cdot | \text{stop\_gradient}(s)))\right]

which ignores the path by which changing \theta will change the visited states s. So this approach will try to increase the entropy of the policy at the currently-visited states, but it makes no attempt to cause the policy to visit states in which it will have high entropy. This myopic approach to maximizing entropy is sometimes called a 'one-step' bonus.

The natural alternative to a one-step bonus is to fold the policy entropy into the reward function itself, so that future entropies propagate into the value and Q functions. This is called maximum-entropy reinforcement learning and seems to be a nicer and more coherent approach with some theoretical benefits.

Criticism

Ben Recht (and others probably) has argued that policy gradient methods are just glorified random search. Because they don't know anything about the reward being optimized, they can only stumble blindly around, and they scale poorly with dimension, making them generally bad algorithms.

This seems like a reasonable objection to naive policy gradient methods. I think it becomes much less potent against actor-critic methods such as deep deterministic policy gradient, since those do learn a model of the value function, and can use gradients of that surrogate model to guide policy learning.