Created: April 01, 2022
Modified: April 02, 2022

proof of the policy gradient theorem

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

The policy gradient theorem says that

\begin{align*}
\nabla_\theta V^{\pi_\theta} \propto \mathbb{E}_{s\sim\pi_\theta}\left[\sum_a\left(\nabla_\theta \pi_\theta(a | s)\right) Q^{\pi_\theta}(s, a)\right]
\end{align*}

For simplicity we'll assume a fixed initial state $s_0$ and fixed-length finite trajectories, but the result can be generalized to discounted-reward or average-reward notions of value in the continuing setting.
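
For concreteness (my notation, chosen to match the transition probabilities $p(r', s' | s, a)$ used below): a trajectory $\tau = (s_0, a_0, r_1, s_1, \dots, r_T, s_T)$ of length $T$ from the fixed start state has distribution

\begin{align*}
p_\theta(\tau) = \prod_{t=0}^{T-1} \pi_\theta(a_t | s_t)\, p(r_{t+1}, s_{t+1} | s_t, a_t).
\end{align*}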

It seems like there are two proof approaches:

  1. We can either recursively expand the value as a sequence of Q-values, beginning at the start state, or
  2. We can work with rewards of entire trajectories $R(\tau)$, leading to a product with $T^2$ interaction terms between policy choices and rewards, and then remove the 'acausal' terms because they have expectation zero (spelled out just below this list).
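
To spell out the point about acausal terms (this identity is my addition; it's the standard argument): for $t' \le t$, the reward $r_{t'}$ is determined before $a_t$ is sampled, so conditioning on the history up to $s_t$ gives

\begin{align*}
\mathbb{E}\left[\left(\nabla_\theta \log \pi_\theta(a_t | s_t)\right) r_{t'}\right]
&= \mathbb{E}\left[ r_{t'} \sum_{a} \pi_\theta(a | s_t)\, \nabla_\theta \log \pi_\theta(a | s_t)\right]\\
&= \mathbb{E}\left[ r_{t'}\, \nabla_\theta \sum_{a} \pi_\theta(a | s_t)\right] = \mathbb{E}\left[ r_{t'}\, \nabla_\theta 1\right] = 0.
\end{align*}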

Sutton and Barto do the first, while John Schulman does the second in my deep RL notes.

Recursive expansion proof

Reproducing the Sutton and Barto approach for completeness: they expand the value $V^{\pi_\theta}(s_0)$ as an expectation over Q-values, $\sum_{a_0} \pi_\theta(a_0 | s_0) Q^{\pi_\theta}(s_0, a_0)$, push the gradient inside, and then apply the product rule

\begin{align*}
\nabla_\theta \left[\pi_\theta(a_0 | s_0) Q^{\pi_\theta}(s_0, a_0)\right] = \left(\nabla_\theta \pi_\theta(a_0 | s_0)\right) Q^{\pi_\theta}(s_0, a_0) + \pi_\theta(a_0 | s_0) \nabla_\theta Q^{\pi_\theta}(s_0, a_0)
\end{align*}

then recursively expand the gradient of the Q-value:

\begin{align*}
\nabla_\theta Q^{\pi_\theta}(s_0, a_0) &= \nabla_\theta \left[\sum_{r_1, s_1} p(r_1, s_1 | s_0, a_0) \left(r_1 + \sum_{a_1} \pi_\theta(a_1 | s_1) Q^{\pi_\theta}(s_1, a_1)\right)\right]\\
&= \sum_{r_1, s_1} p(r_1, s_1 | s_0, a_0) \sum_{a_1} \nabla_\theta \left[\pi_\theta(a_1 | s_1) Q^{\pi_\theta}(s_1, a_1)\right]\\
&= \sum_{r_1, s_1} p(r_1, s_1 | s_0, a_0) \sum_{a_1} \left(\left(\nabla_\theta \pi_\theta(a_1 | s_1)\right) Q^{\pi_\theta}(s_1, a_1) + \pi_\theta(a_1 | s_1) \nabla_\theta Q^{\pi_\theta}(s_1, a_1)\right)
\end{align*}

so that, plugging this all back in, we get the policy gradient as a sum in which each term $\left(\nabla_\theta \pi_\theta(a | s)\right) Q^{\pi_\theta}(s, a)$ has weight equal to the expected number of times that state $s$ is visited under policy $\pi_\theta$. This differs from the expression at the top, which takes an expectation over states, only by a constant factor of $T$.
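
Spelling out that bookkeeping (this unrolled form is my addition, using Sutton and Barto's notation $\eta(s)$ for expected visit counts):

\begin{align*}
\nabla_\theta V^{\pi_\theta}(s_0)
&= \sum_{t=0}^{T-1} \sum_{s} \Pr(s_t = s \mid s_0, \pi_\theta) \sum_a \left(\nabla_\theta \pi_\theta(a | s)\right) Q^{\pi_\theta}(s, a)\\
&= \sum_{s} \eta(s) \sum_a \left(\nabla_\theta \pi_\theta(a | s)\right) Q^{\pi_\theta}(s, a),
\qquad \eta(s) := \sum_{t=0}^{T-1} \Pr(s_t = s \mid s_0, \pi_\theta),
\end{align*}

and since $\sum_s \eta(s) = T$, normalizing $\eta$ into a probability distribution over states costs exactly the factor of $T$.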

Full trajectories derivation

This follows the policy gradient derivation in my deep RL notes.

Cleaning this up a bit we end up with the expression

\begin{align*}
\nabla_\theta V^{\pi_\theta}(s_0) = \mathbb{E}_\tau \left[\sum_{t=0}^T \gamma^t \left(\nabla_\theta \log \pi_\theta(a_t | s_t)\right) \cdot Q^{\pi_\theta}(s_t, a_t)\right]
\end{align*}

which we can see as a variant of the theorem above in which the expectation over state visitations is split into an expectation over trajectories and the empirical sum over states within a trajectory.
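
As a quick numerical check of this trajectory form (my addition, not part of the original notes or of Schulman's derivation), here's a minimal sketch: it builds a tiny random finite-horizon MDP with a tabular softmax policy, computes the expectation above exactly (using time-indexed Q-values, since the finite-horizon Q depends on the step), and compares against a finite-difference gradient of $V^{\pi_\theta}(s_0)$. All names, sizes, and constants are arbitrary illustrative choices.

```python
# Sanity-check sketch (my addition): exact score-function policy gradient on a
# tiny random finite-horizon MDP vs. a finite-difference gradient of V(s0).
# All sizes, the seed, and the random transition/reward tables are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
S, A, T, gamma = 2, 2, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']: transition probabilities
R = rng.normal(size=(S, A))                  # R[s, a]: expected reward for (s_t, a_t)
theta = rng.normal(size=(S, A))              # tabular softmax policy parameters
s0 = 0

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)  # pi[s, a]

def q_values(theta):
    """Time-indexed Q_t[s, a] for t = 0..T-1, by backward induction (V_T = 0)."""
    pi, V, Qs = policy(theta), np.zeros(S), [None] * T
    for t in reversed(range(T)):
        Qs[t] = R + gamma * P @ V            # Q_t = r + gamma * E[V_{t+1}]
        V = (pi * Qs[t]).sum(axis=1)         # V_t(s) = sum_a pi(a|s) Q_t(s, a)
    return Qs

def value(theta):
    """Exact V^{pi_theta}(s_0) over the T-step horizon."""
    return (policy(theta)[s0] * q_values(theta)[0][s0]).sum()

def score_function_gradient(theta):
    """Exact E_tau[ sum_t gamma^t (grad log pi(a_t|s_t)) Q_t(s_t, a_t) ]."""
    pi, Qs = policy(theta), q_values(theta)
    d = np.zeros(S)
    d[s0] = 1.0                              # distribution of s_t, starting at t = 0
    grad = np.zeros_like(theta)
    for t in range(T):
        for s in range(S):
            for b in range(A):
                # softmax score: d/dtheta[s, b] of log pi(a|s) is 1[a == b] - pi[s, b]
                score_b = (np.arange(A) == b) - pi[s, b]
                grad[s, b] += gamma**t * d[s] * np.sum(pi[s] * score_b * Qs[t][s])
        d = np.einsum('s,sa,sap->p', d, pi, P)   # advance the state distribution
    return grad

def finite_difference_gradient(theta, eps=1e-5):
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        e = np.zeros_like(theta)
        e[idx] = eps
        grad[idx] = (value(theta + e) - value(theta - e)) / (2 * eps)
    return grad

# Largest absolute discrepancy; should be tiny (finite-difference error only).
print(np.max(np.abs(score_function_gradient(theta) - finite_difference_gradient(theta))))
```

Up to finite-difference error, the two gradients should match.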

Note that it's valid to move between sums of empirical rewards $\sum_{t'>t} \gamma^{t'} r_{t'}$ and their expectations (the Q-values $Q^{\pi_\theta}(s_t, a_t)$) because this all happens inside the expectation over $(s_t, a_t)$. I was worried that in general $r_{t'}$ for $t' > t$ is not independent of $\nabla_\theta \log \pi_\theta(a_t | s_t)$, but that doesn't matter, because the latter is just a constant once we condition on $(s_t, a_t)$.
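
Written out, that step is just the tower rule (my notation; the discounting convention here is chosen so that $\mathbb{E}\left[\sum_{t'>t} \gamma^{t'} r_{t'} \mid s_t, a_t\right] = \gamma^t Q^{\pi_\theta}(s_t, a_t)$, matching the $\gamma^t$ factor above):

\begin{align*}
\mathbb{E}_\tau\left[\left(\nabla_\theta \log \pi_\theta(a_t | s_t)\right) \sum_{t'>t} \gamma^{t'} r_{t'}\right]
&= \mathbb{E}_{s_t, a_t}\left[\left(\nabla_\theta \log \pi_\theta(a_t | s_t)\right) \mathbb{E}\left[\sum_{t'>t} \gamma^{t'} r_{t'} \,\middle|\, s_t, a_t\right]\right]\\
&= \mathbb{E}_{s_t, a_t}\left[\left(\nabla_\theta \log \pi_\theta(a_t | s_t)\right) \gamma^t\, Q^{\pi_\theta}(s_t, a_t)\right].
\end{align*}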