policy gradient: Nonlinear Function
Created: March 29, 2022
Modified: January 06, 2023

policy gradient

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

(see also my deep RL notes from John Schulman's class several years ago, which cover much of the same material)

We can approach reinforcement learning as learning a policy \pi_\theta by following the gradient of its value V^{\pi_\theta}. For simplicity we'll assume a fixed initial state s_0 and fixed-length finite trajectories, but these results can be generalized to discounted-reward or average-reward notions of value in the continuing setting.

The policy gradient theorem says that

\begin{align*} \nabla_\theta V^{\pi_\theta} \propto \mathbb{E}_{s\sim\pi_\theta}\left[\sum_a \left(\nabla_\theta \pi_\theta(a | s)\right) Q^{\pi_\theta}(s, a)\right] \end{align*}

See the proof of the policy gradient theorem. Generally the score function trick is used to refine this into an expectation

\nabla_\theta V^{\pi_\theta} \propto \mathbb{E}_{s, a\sim\pi_\theta}\left[\left(\nabla_\theta \log \pi_\theta(a | s)\right) Q^{\pi_\theta}(s, a)\right]
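As a sanity check, the two forms agree in a toy single-state example. Below is a minimal numerical sketch (a softmax policy over three actions, with made-up Q values) comparing the exact sum over actions against a Monte Carlo estimate of the score-function form:

```python
# Single state, softmax policy: compare sum_a (grad pi(a)) Q(a) with a Monte
# Carlo estimate of E_{a~pi}[grad log pi(a) Q(a)]. Q values are made up.
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, -0.5, 0.1])   # logits for 3 actions
Q = np.array([1.0, 2.5, -0.5])       # hypothetical action values

pi = np.exp(theta - theta.max())
pi /= pi.sum()                       # softmax probabilities

# For a softmax policy, grad_theta log pi(a) = onehot(a) - pi.
scores = np.eye(3) - pi              # row a is the score vector for action a

# Exact gradient: sum_a grad pi(a) Q(a) = sum_a pi(a) * score(a) * Q(a).
exact = (pi[:, None] * scores * Q[:, None]).sum(axis=0)

# Sampled score-function estimator, averaged over many draws.
actions = rng.choice(3, size=200_000, p=pi)
mc = (scores[actions] * Q[actions, None]).mean(axis=0)

print(exact)  # the two printed vectors should be close
print(mc)
```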

The simplest possible policy gradient algorithm, called REINFORCE, computes a naive unbiased estimate of this gradient using a sampled trajectory \tau \sim \pi_{\theta}:

\theta \leftarrow \theta + \alpha \left[\sum_{t=0}^T v_t \cdot \nabla_\theta \log \pi_\theta(a_t | s_t)\right].

where v_t = \sum_{k=t}^T r_k is the empirical return at step t. Although unbiased, this update is exceedingly high-variance and not usually practical on its own. We'll discuss improvements below.
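For concreteness, here is a minimal sketch of tabular REINFORCE with a softmax policy. The environment and its reset()/step() interface are assumptions (a Gym-style discrete environment), and it includes an optional discount gamma even though the derivation above uses undiscounted returns:

```python
# Minimal tabular REINFORCE sketch: softmax policy, Monte Carlo returns.
# `env` is assumed to expose reset() -> state and step(a) -> (state, reward, done)
# over discrete states and actions; it is not defined here.
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce(env, n_states, n_actions, episodes=1000, alpha=0.01, gamma=1.0):
    theta = np.zeros((n_states, n_actions))   # policy logits
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        # Sample one trajectory tau ~ pi_theta.
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = rng.choice(n_actions, p=softmax(theta[s]))
            s_next, r, done = env.step(a)     # assumed interface
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        # Empirical returns v_t (optionally discounted).
        returns, v = np.zeros(len(rewards)), 0.0
        for t in reversed(range(len(rewards))):
            v = rewards[t] + gamma * v
            returns[t] = v
        # theta <- theta + alpha * sum_t v_t * grad log pi(a_t | s_t)
        grad = np.zeros_like(theta)
        for s, a, v_t in zip(states, actions, returns):
            g = -softmax(theta[s])
            g[a] += 1.0                       # grad log pi(a|s) = onehot(a) - pi
            grad[s] += v_t * g
        theta += alpha * grad
    return theta
```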

Why learn a policy

Why would we learn by following a policy gradient, instead of by estimating state or action values?

  • A learned policy can be stochastic.
  • A learned policy can act in a continuous space.
  • Policy learning may converge more reliably than value learning.

Potential downsides are that policy learning converges to local (not global) maxima, and the updates can be much higher variance.

Variance reduction

Where does the variance come from in a naive policy gradient estimate? Let's consider a simple model whose variance we can analyze in closed form: a stateless (bandit) setting with two possible actions a \in \{0, 1\}, a Bernoulli policy \pi_\theta parameterized by log-odds \theta,

\pi_\theta(a=1) = 1 / \left(1 + e^{-\theta}\right),

and stochastic reward represented by a random variable r such that r | a has expected value \bar{r}_1 or \bar{r}_0 (for a=1 or a=0, respectively), and variance \rho^2 (independent of the action).

Considering the two cases, under the first action a=1 the score function becomes

\nabla_\theta \log \pi_\theta(a=1) = \nabla_\theta \left(-1 - e^{-\theta}\right) = e^{-\theta}

and similarly for a=0 we have

\nabla_\theta \log \pi_\theta(a=0) = \nabla_\theta \left(-1 - e^{\theta}\right) = -e^\theta.

If we initialize at the uniform distribution over actions, \theta = 0, then these derivatives simplify to 1 and -1 respectively, so we have expected gradient

\begin{align*} \mathbb{E}[g] &= \pi_\theta(a=1) \cdot \mathbb{E}[r | a=1] \cdot e^{-\theta} - \pi_\theta(a=0) \cdot \mathbb{E}[r | a=0] \cdot e^{\theta}\\ &= \frac{1}{2}\left(\bar{r}_1 - \bar{r}_0\right).\end{align*}

This will tend to pull the parameter \theta towards whichever action has larger reward, as we would hope. Moving to the variance, we exploit the identity \text{Var}(g) = \mathbb{E}[g^2] - \mathbb{E}[g]^2, first computing

\begin{align*} \mathbb{E}[g^2] &= \frac{1}{2}\mathbb{E}[r^2 | a=1] \cdot 1^2 + \frac{1}{2}\mathbb{E}[r^2 | a=0] \cdot (-1)^2 \\ &= \frac{1}{2}\left(\rho^2 + \bar{r}_1^2\right) + \frac{1}{2}\left(\rho^2 + \bar{r}_0^2\right)\\ &= \frac{1}{2}\left(\bar{r}_1^2 + \bar{r}_0^2 + 2 \rho^2\right) \end{align*}

(in which the second line uses that same identity again, to express the expected squared reward as the sum of its variance and squared mean), and then plug this in to derive

\begin{align*} \text{Var}(g) &= \mathbb{E}[g^2] - \mathbb{E}[g]^2\\ &= \frac{1}{2}\left(\bar{r}_1^2 + \bar{r}_0^2 + 2\rho^2\right) - \frac{1}{4}\left(\bar{r}_1 - \bar{r}_0\right)^2\\ &= \left(\frac{1}{2}\bar{r}_1 + \frac{1}{2}\bar{r}_0\right)^2 + \rho^2\\ &= \mathbb{E}[r]^2 + \rho^2. \end{align*}

We see that the overall gradient variance is the sum of two terms: the intrinsic variance \rho^2 of the reward, and a term \mathbb{E}[r]^2 that measures the (squared) magnitude of the reward averaged over actions.

Note that this latter term is sensitive to how 'centered' the reward function is: for actions with expected rewards \bar{r}_1 = 5 and \bar{r}_0 = -5 this term would be zero, but if we apply a constant shift to \bar{r}_1 = 105 and \bar{r}_0 = 95 then this term contributes a very large quantity to the variance, even though the two situations are decision-theoretically equivalent. This observation motivates the first of our variance reduction techniques: we should try to 'center' the reward function (more generally, the action-value function) by estimating and subtracting the average or 'baseline' value at each state.
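A quick Monte Carlo sketch of the centering effect, using the single-sample estimator g = (r - b)\,\nabla_\theta \log \pi_\theta(a) for the logistic policy above. The reward numbers are made up, and the point is only the relative sizes of the empirical variances, not the exact constants:

```python
# Empirical gradient variance for the two-action bandit: centered rewards,
# shifted rewards, and shifted rewards with a baseline subtracted.
import numpy as np

def grad_samples(r_means, baseline=0.0, theta=0.0, rho=1.0, n=200_000, seed=0):
    """Single-sample gradients g = (r - baseline) * d/dtheta log pi_theta(a)."""
    rng = np.random.default_rng(seed)
    p1 = 1.0 / (1.0 + np.exp(-theta))              # pi_theta(a=1)
    a = rng.random(n) < p1                          # sampled actions
    r = np.where(a, r_means[1], r_means[0]) + rho * rng.standard_normal(n)
    score = np.where(a, 1.0 - p1, -p1)              # d/dtheta log pi_theta(a)
    return (r - baseline) * score

print(np.var(grad_samples((-5.0, 5.0))))                      # centered rewards
print(np.var(grad_samples((95.0, 105.0))))                    # shifted: much larger
print(np.var(grad_samples((95.0, 105.0), baseline=100.0)))    # baseline restores it
```

Subtracting a baseline close to the mean reward brings the variance back down to that of the centered case.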

So far we've considered only the initial gradient under the uniform policy, but ultimately we are hoping to reach the optimal policy, which in general will put all of its mass on whichever of the two actions yields higher expected reward. What happens in this extreme? Let's redo the analysis, replacing the uniform parameter \theta = 0 with the general case \theta = \log \frac{1-\epsilon}{\epsilon}, so that \pi_\theta puts mass 1-\epsilon on action 1 and only \epsilon on action 0. Now we have

\begin{align*} \mathbb{E}[g] &= \pi_\theta(a=1) \cdot \bar{r}_1 \cdot e^{-\theta} - \pi_\theta(a=0) \cdot \bar{r}_0 \cdot e^{\theta}\\ &= (1-\epsilon) \cdot \bar{r}_1 \cdot \frac{\epsilon}{1-\epsilon} - \epsilon \cdot \bar{r}_0 \cdot \frac{1-\epsilon}{\epsilon} \\ &= \epsilon \cdot \bar{r}_1 - (1 - \epsilon) \cdot \bar{r}_0\end{align*}

and

\begin{align*} \mathbb{E}[g^2] &= (1-\epsilon) \cdot \mathbb{E}[r^2 | a=1] \cdot \left(\frac{\epsilon}{1-\epsilon}\right)^2 + \epsilon \cdot \mathbb{E}[r^2 | a=0] \cdot \left(\frac{-(1-\epsilon)}{\epsilon}\right)^2 \\ &= \frac{\epsilon^2}{1-\epsilon} \left(\rho^2 + \bar{r}_1^2\right) + \frac{(1-\epsilon)^2}{\epsilon}\left(\rho^2 + \bar{r}_0^2\right). \end{align*}

We observe that these recover our previous results in the special case \epsilon = 0.5. But as \epsilon \to 0, we have

\begin{align*} \mathbb{E}[g] &\to -\bar{r}_0\\ \mathbb{E}[g^2] &\approx \frac{1}{\epsilon}\left(\rho^2 + \bar{r}_0^2\right)\\ &\to \infty\text{ as } \epsilon\to 0 \end{align*}

so the variance \mathbb{E}[g^2] - \mathbb{E}[g]^2 becomes infinite. That's not good! Looking back at the two cases, we see that this near-optimal policy will almost always (with probability 1-\epsilon) sample action 1, and the expected gradient \bar{r}_1 \cdot \frac{\epsilon}{1-\epsilon} in that case is essentially zero due to the score function term: a small change in \theta won't appreciably change the log-probability of action 1. The policy will almost never sample action 0, but when it does, the score function and thus the gradient will be huge. This is essentially due to the log-odds policy parameterization.

Baselines

Since the expectation of the score function itself is zero (at each state), we can subtract any action-independent multiple of it, a 'baseline' b(s), without changing the expectation. A particularly natural baseline for the Q-value (motivated by the toy model in the previous section, but which can also be derived by a more general analysis) is the state value V^{\pi}, obtained by averaging Q-values under the policy's action distribution. The thus-'centered' Q-value is called the advantage:

A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)

The advantage function can be estimated indirectly by maintaining separate models of Q and V, which can even be updated at different time scales (corresponding to different \lambda values). Or we can use the temporal difference error \delta_t as a direct estimate of the advantage, since:

\begin{align*} \mathbb{E}_{\pi}\left[\delta_t | s_t, a_t\right] &= \mathbb{E}_{\pi}\left[r_{t+1} + \gamma V^\pi(s_{t+1}) | s_t, a_t\right] - V^\pi(s_t)\\ &= Q^\pi(s_t, a_t) - V^\pi(s_t)\\ &= A^\pi(s_t, a_t) \end{align*}

In general we can of course use longer-timescale temporal differences \delta^{\lambda}_t, since these are all estimates of the same underlying quantity.
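A sketch of how this might look in code, given a trajectory of rewards and an array of estimated state values (both assumed as inputs): with \lambda = 0 it returns the one-step TD errors, and with \lambda > 0 it returns their exponentially weighted sums, as in generalized advantage estimation:

```python
# Advantage estimates from a sampled trajectory and a value estimate V_hat.
# `values` has length len(rewards) + 1 (it includes the final state).
import numpy as np

def advantages(rewards, values, gamma=0.99, lam=0.0):
    rewards, values = np.asarray(rewards), np.asarray(values)
    deltas = rewards + gamma * values[1:] - values[:-1]   # one-step TD errors
    adv = np.zeros_like(deltas)
    acc = 0.0
    for t in reversed(range(len(deltas))):
        acc = deltas[t] + gamma * lam * acc                # accumulate longer-timescale errors
        adv[t] = acc
    return adv
```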

Using a critic

The expression for the policy gradient includes Q-values, and we can use any estimate of these we like. We saw the naive Monte Carlo estimate above, in the REINFORCE algorithm. More generally, we can use modeled Q-values trained using a temporal difference method such as SARSA(\lambda). The model of Q-values is known as the 'critic', and the resulting overall method is an actor-critic method.

In general, the critic's value estimates will be biased, leading to biased policy gradients.

Continual learning and online updates

Using a critic we can apply a policy gradient update at each step based on the estimated return instead of waiting until the end of a trajectory to get its empirical return. Unlike trajectories as a whole, state-action pairs at different timesteps are not independent or identically distributed, so we no longer have the convergence guarantee of stochastic gradient ascent, if we ever did (the critic's biased gradients would also break these guarantees even if we only updated at the trajectory level).

Still, online updating may help the system learn faster in practice (indeed, in the continual-learning setting of a single never-ending trajectory this is the only way for any learning to happen).

Naively, the update to \theta at time t will be the score function \nabla_\theta \log \pi_\theta(a_t | s_t) times some scalar multiple \hat{A}(s_t, a_t): if the action gave us a positive advantage, we increase its probability, otherwise we decrease it. But we may also want to update the policy for previous actions based on our current experience. This can be done by replacing the score function with an eligibility trace that accumulates an exponentially decreasing average of the score functions from recent actions.
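A sketch of a single online actor-critic update with eligibility traces, in the tabular case. The trace decay \lambda, the step sizes, and the convention of scaling the accumulated trace by the TD error are one standard choice among several, and the array shapes here are assumptions:

```python
# One online actor-critic step with eligibility traces (tabular sketch).
# w: state-value weights, theta: policy logits (n_states x n_actions);
# z_w and z_theta are the corresponding traces, updated in place.
import numpy as np

def online_step(theta, w, z_theta, z_w, s, a, r, s_next, done,
                gamma=0.99, lam=0.9, alpha_theta=0.01, alpha_w=0.1):
    # TD error as a direct advantage estimate.
    v_next = 0.0 if done else w[s_next]
    delta = r + gamma * v_next - w[s]

    # Critic trace and update.
    z_w *= gamma * lam
    z_w[s] += 1.0
    w += alpha_w * delta * z_w

    # Actor trace accumulates recent score functions; the update scales it by delta.
    pi = np.exp(theta[s] - theta[s].max()); pi /= pi.sum()
    score = -pi
    score[a] += 1.0                  # grad log pi(a|s) for a softmax policy
    z_theta *= gamma * lam
    z_theta[s] += score
    theta += alpha_theta * delta * z_theta
    return delta
```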

As policy iteration

We can view policy-gradient methods as a form of 'softened' policy iteration, or generalized policy iteration. (Reference: Sergey Levine's Berkeley Deep RL lecture, https://www.youtube.com/watch?v=ySenCHPsKJU.) The loop is:

  1. Estimate A^{\pi_\theta}(s, a) for some or all state-action pairs. (In the simplest case, this estimate is just the Monte Carlo rewards seen in the collected trajectory.)
  2. Update the policy using these estimated advantages.

A full update would set \pi'(s) = \arg\max_a A^{\pi_\theta}(s, a); a policy-gradient step instead only nudges the policy in that direction, which is the sense in which the iteration is 'softened'.

Fully updating on experience

Experience is expensive. Given a sampled trajectory, we should aspire to a proper Bayesian update, which would incorporate all information present in the trajectory to narrow down the possible set of optimal policies. Actually implementing a Bayesian policy optimization would require

  • representing a distribution over optimal policies, rather than just a single policy, and
  • updating that distribution fully in response to new evidence - this would look like fully optimizing a variational objective, not just taking a single gradient step.

In practice, we might at least think of trying to take multiple gradient steps on the sample objective defined by some observed experience. The problem is that after taking the first such step, we're no longer in the on-policy regime and the policy gradient theorem no longer holds. This can be worked around by making sure to not travel too far away in policy space. One approach for this is trust region policy optimization (TRPO).
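A rough sketch of this idea: take several gradient steps on the importance-weighted surrogate \mathbb{E}[(\pi_\theta / \pi_{\text{old}}) \hat{A}], and stop early once an estimate of the KL divergence from the old policy gets too large. This is a simplified stand-in for a proper trust-region update, not TRPO itself, and policy_logp_and_grad is a hypothetical helper returning per-sample log-probabilities and their gradients with respect to a flat parameter vector:

```python
# Several gradient steps on an importance-weighted surrogate built from one
# batch of experience, with a crude KL-based early-stopping rule.
import numpy as np

def surrogate_steps(theta, states, actions, advantages, logp_old,
                    policy_logp_and_grad, alpha=0.01, max_steps=10, max_kl=0.01):
    for _ in range(max_steps):
        logp, grad_logp = policy_logp_and_grad(theta, states, actions)
        approx_kl = np.mean(logp_old - logp)     # sample estimate of KL(pi_old || pi_theta)
        if approx_kl > max_kl:                   # stop if we've drifted too far off-policy
            break
        ratio = np.exp(logp - logp_old)          # importance weights pi_theta / pi_old
        # Per-sample gradient of ratio * A is ratio * A * grad log pi_theta.
        g = np.mean((ratio * advantages)[:, None] * grad_logp, axis=0)
        theta = theta + alpha * g
    return theta
```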

More generally, we can decouple the training into multiple processes, which can operate asynchronously:

  1. One process collects data and stores it for experience replay.
  2. One process uses the replay buffer to train a value function, e.g., via Q-learning.
  3. One process optimizes the policy using the surrogate value function.

This kind of decomposition (which I've discussed in the context of deep deterministic policy gradient methods, though the ideas are more general) seems like a promising strategy for unifying value and policy-based approaches to RL.

Reparameterization and preconditioning

Once we get past all of the RL-specific aspects of gradient estimation, all the usual tricks for speeding up stochastic gradient optimizations still apply. It's important to choose a good parameterization for the actor and critic; ideas like natural gradient and mirror descent are applicable.

Deterministic policy

In continuous action spaces a \in \mathbb{R}^d it can be convenient to use a deterministic policy a = \mu_\theta(s), so that V^{\mu_\theta}(s) = Q(s, \mu_\theta(s)). Then

\begin{align*} \nabla_\theta V^{\mu_\theta}(s) &= \nabla_\theta Q(s, \mu_\theta(s))\\ &= \left(J_{\theta\to\mu_\theta(s)}\right)^T \nabla_a Q(s, a) \end{align*}

where J_{\theta\to\mu_\theta(s)} \in \mathbb{R}^{d\times n} is in general the Jacobian matrix of the policy, giving the sensitivity of each action dimension to each of the n policy parameters. Then we simply need the gradient of Q(s, a) with respect to the action a, evaluated at a = \mu_\theta(s), which we can query directly given a differentiable model of Q.
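A toy numerical sketch of this chain rule, with a linear policy \mu_\theta(s) = Ws and a hand-specified quadratic Q whose action gradient is available in closed form (all of the particular shapes and functions here are made up for illustration):

```python
# Deterministic policy gradient chain rule on a toy example, checked against
# a finite difference. Policy: mu(s) = W s; Q(s, a) = -||a - a_star||^2.
import numpy as np

rng = np.random.default_rng(0)
d_s, d_a = 3, 2
W = rng.standard_normal((d_a, d_s))       # policy parameters theta
s = rng.standard_normal(d_s)
a_star = np.array([1.0, -2.0])            # action maximizing Q at this state

def mu(W_, s_):
    return W_ @ s_

def grad_a_Q(a_):
    return -2.0 * (a_ - a_star)           # gradient of Q(s, a) w.r.t. a

# Chain rule: dQ/dW_ij = sum_k (dQ/da_k)(da_k/dW_ij) = grad_a_Q[i] * s[j],
# i.e. an outer product for a linear policy.
a = mu(W, s)
grad_W = np.outer(grad_a_Q(a), s)         # same shape as W

# Finite-difference check on one entry.
eps = 1e-5
Q_of = lambda W_: -np.sum((mu(W_, s) - a_star) ** 2)
W2 = W.copy(); W2[0, 1] += eps
print(grad_W[0, 1], (Q_of(W2) - Q_of(W)) / eps)   # should be close
```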

Note that if we optimized Q(s, \mu_\theta(s)) to completion with a sufficiently flexible policy family, we would obtain the greedy policy \mu(s) = \arg\max_a Q(s, a); in other words we would simply be doing (deep) Q-learning. Thus we can see this as an approach to Q-learning in continuous action spaces, where interleaving updates of \pi and Q is a case of generalized policy iteration.

Silver et al. (2014) shows that the above expression is also the limit of the usual stochastic policy gradient as the policy variance approaches zero.

Lillicrap et al. (2015) demonstrated that this can work with deep networks if we apply the DQN tricks of experience replay and target networks to learn the Q function.
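One of those tricks, in generic form: maintain a slowly-updated copy of the critic (and actor) parameters for computing bootstrap targets. DQN copies the parameters periodically, while DDPG uses a 'soft' exponential-moving-average update; a minimal sketch of the latter, assuming parameters stored as a dict of arrays:

```python
# 'Soft' target-network update: target parameters track the learned parameters
# with a slow exponential moving average (tau is a small constant, e.g. 1e-3).
def soft_update(target_params, params, tau=1e-3):
    for k in target_params:
        target_params[k] = (1.0 - tau) * target_params[k] + tau * params[k]
    return target_params
```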

Entropy bonus

One approach to encourage exploration (used, e.g., in A3C) is to add the entropy gradient \beta \nabla_\theta H(\pi_\theta(\cdot | s_t)) to the policy gradient at each step. This 'entropy bonus' tries to prevent the policy from collapsing into a deterministic policy, unless the reward function provides really strong pressure for it to do so.
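For a softmax policy the entropy gradient at a state has a simple closed form, \nabla_\theta H = -\pi \odot (\log \pi + H), so the combined per-step update direction is easy to compute. A minimal sketch (the advantage estimate and \beta are assumed inputs):

```python
# Per-step policy gradient plus entropy bonus for a softmax policy at one state.
import numpy as np

def entropy_bonus_update(theta_s, a, adv_hat, beta=0.01):
    """theta_s: logits at the current state; a: sampled action."""
    pi = np.exp(theta_s - theta_s.max()); pi /= pi.sum()
    score = -pi
    score[a] += 1.0                       # grad log pi(a | s)
    H = -np.sum(pi * np.log(pi))          # policy entropy at this state
    grad_H = -pi * (np.log(pi) + H)       # grad_theta H for a softmax policy
    return adv_hat * score + beta * grad_H   # ascent direction
```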

What objective does this procedure optimize? Well, everything happens inside an expectation over states reached by the policy, and we've seen that when we push gradients inside of that expectation we generally have to use the score function trick to account for the change in distribution introduced by a change in policy. But (unlike the usual policy gradient) the entropy bonus term doesn't include the score function! Thus we can view it as a modified gradient

\mathbb{E}_{s\sim\pi_\theta}\left[\nabla_\theta H(\pi_\theta(\cdot | s))\right] = \nabla_\theta \mathbb{E}_{s\sim\pi_\theta}\left[H(\pi_\theta(\cdot | \text{stop\_gradient}(s)))\right]

which ignores the path by which changing \theta will change the visited states s. So this approach will try to increase the entropy of the policy at the currently-visited states, but it makes no attempt to cause the policy to visit states in which it will have high entropy. This myopic approach to maximizing entropy is sometimes called a 'one-step' bonus.

The natural alternative to a one-step bonus is to fold the policy entropy into the reward function itself, so that future entropies propagate into the value and Q functions. This is called maximum-entropy reinforcement learning and seems to be a nicer and more coherent approach with some theoretical benefits.

Criticism

Ben Recht (and others probably) has argued that policy gradient methods are just glorified random search. Because they don't know anything about the reward being optimized, they can only stumble blindly around, and they scale poorly with dimension, making them generally bad algorithms.

This seems like a reasonable objection to naive policy gradient methods. I think it becomes much less potent against actor-critic methods such as deep deterministic policy gradient, since those do learn a model of the value function, and can use gradients of that surrogate model to guide policy learning.