Created: April 01, 2022
Modified: April 02, 2022

proof of the policy gradient theorem

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

The policy gradient theorem says that

\begin{align*}
\nabla_\theta V^{\pi_\theta} \propto \mathbb{E}_{s\sim\pi_\theta}\left[\sum_a\left(\nabla_\theta \pi_\theta(a | s)\right) Q^{\pi_\theta}(s, a)\right]
\end{align*}

For simplicity we'll assume a fixed initial state $s_0$ and fixed-length finite trajectories, but the result can be generalized to discounted-reward or average-reward notions of value in the continuing setting.
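
For concreteness (my notation, chosen to match the transition probabilities $p(r', s' | s, a)$ used below): a trajectory $\tau = (s_0, a_0, r_1, s_1, \dots, r_T, s_T)$ of length $T$ from the fixed start state has distribution

\begin{align*}
p_\theta(\tau) = \prod_{t=0}^{T-1} \pi_\theta(a_t | s_t)\, p(r_{t+1}, s_{t+1} | s_t, a_t).
\end{align*}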

It seems like there are two proof approaches:

  1. We can either recursively expand the value as a sequence of Q-values, beginning at the start state, or
  2. We can work with rewards of entire trajectories $R(\tau)$, leading to a product with $T^2$ interaction terms between policy choices and rewards, and then remove the 'acausal' terms because they have expectation zero (spelled out just below this list).
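
To spell out the point about acausal terms (this identity is my addition; it's the standard argument): for $t' \le t$, the reward $r_{t'}$ is determined before $a_t$ is sampled, so conditioning on the history up to $s_t$ gives

\begin{align*}
\mathbb{E}\left[\left(\nabla_\theta \log \pi_\theta(a_t | s_t)\right) r_{t'}\right]
&= \mathbb{E}\left[ r_{t'} \sum_{a} \pi_\theta(a | s_t)\, \nabla_\theta \log \pi_\theta(a | s_t)\right]\\
&= \mathbb{E}\left[ r_{t'}\, \nabla_\theta \sum_{a} \pi_\theta(a | s_t)\right] = \mathbb{E}\left[ r_{t'}\, \nabla_\theta 1\right] = 0.
\end{align*}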

Sutton and Barto do the first, while John Schulman does the second in my deep RL notes.

Recursive expansion proof

Reproducing the Sutton and Barto approach for completeness: they expand the value $V^{\pi_\theta}(s_0)$ as an expectation over Q-values, $\sum_{a_0} \pi_\theta(a_0 | s_0) Q^{\pi_\theta}(s_0, a_0)$, push the gradient inside, and then apply the product rule

\begin{align*}
\nabla_\theta \left[\pi_\theta(a_0 | s_0) Q^{\pi_\theta}(s_0, a_0)\right] = \left(\nabla_\theta \pi_\theta(a_0 | s_0)\right) Q^{\pi_\theta}(s_0, a_0) + \pi_\theta(a_0 | s_0) \nabla_\theta Q^{\pi_\theta}(s_0, a_0)
\end{align*}

then recursively expand the gradient of the Q-value:

\begin{align*}
\nabla_\theta Q^{\pi_\theta}(s_0, a_0) &= \nabla_\theta \left[\sum_{r_1, s_1} p(r_1, s_1 | s_0, a_0) \left(r_1 + \sum_{a_1} \pi_\theta(a_1 | s_1) Q^{\pi_\theta}(s_1, a_1)\right)\right]\\
&= \sum_{r_1, s_1} p(r_1, s_1 | s_0, a_0) \sum_{a_1} \nabla_\theta \left[\pi_\theta(a_1 | s_1) Q^{\pi_\theta}(s_1, a_1)\right]\\
&= \sum_{r_1, s_1} p(r_1, s_1 | s_0, a_0) \sum_{a_1} \left(\left(\nabla_\theta \pi_\theta(a_1 | s_1)\right) Q^{\pi_\theta}(s_1, a_1) + \pi_\theta(a_1 | s_1) \nabla_\theta Q^{\pi_\theta}(s_1, a_1)\right)
\end{align*}

so that, plugging this all back in, we get the policy gradient as a sum in which each term $\left(\nabla_\theta \pi_\theta(a | s)\right) Q^{\pi_\theta}(s, a)$ has weight equal to the expected number of times that state $s$ is visited under policy $\pi_\theta$. This differs from the expression at the top, which takes an expectation over states, only by a constant factor of $T$.
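
Spelling out that bookkeeping (this unrolled form is my addition, using Sutton and Barto's notation $\eta(s)$ for expected visit counts):

\begin{align*}
\nabla_\theta V^{\pi_\theta}(s_0)
&= \sum_{t=0}^{T-1} \sum_{s} \Pr(s_t = s \mid s_0, \pi_\theta) \sum_a \left(\nabla_\theta \pi_\theta(a | s)\right) Q^{\pi_\theta}(s, a)\\
&= \sum_{s} \eta(s) \sum_a \left(\nabla_\theta \pi_\theta(a | s)\right) Q^{\pi_\theta}(s, a),
\qquad \eta(s) := \sum_{t=0}^{T-1} \Pr(s_t = s \mid s_0, \pi_\theta),
\end{align*}

and since $\sum_s \eta(s) = T$, normalizing $\eta$ into a probability distribution over states costs exactly the factor of $T$.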

Full trajectories derivation

This follows the policy gradient derivation in my deep RL notes.

Cleaning this up a bit we end up with the expression

\begin{align*}
\nabla_\theta V^{\pi_\theta}(s_0) = \mathbb{E}_\tau \left[\sum_{t=0}^T \gamma^t \left(\nabla_\theta \log \pi_\theta(a_t | s_t)\right) \cdot Q^{\pi_\theta}(s_t, a_t)\right]
\end{align*}

which we can see as a variant of the theorem above in which the expectation over state visitations is split into an expectation over trajectories and the empirical sum over states within a trajectory.
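
As a quick numerical check of this trajectory form (my addition, not part of the original notes or of Schulman's derivation), here's a minimal sketch: it builds a tiny random finite-horizon MDP with a tabular softmax policy, computes the expectation above exactly (using time-indexed Q-values, since the finite-horizon Q depends on the step), and compares against a finite-difference gradient of $V^{\pi_\theta}(s_0)$. All names, sizes, and constants are arbitrary illustrative choices.

```python
# Sanity-check sketch (my addition): exact score-function policy gradient on a
# tiny random finite-horizon MDP vs. a finite-difference gradient of V(s0).
# All sizes, the seed, and the random transition/reward tables are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
S, A, T, gamma = 2, 2, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']: transition probabilities
R = rng.normal(size=(S, A))                  # R[s, a]: expected reward for (s_t, a_t)
theta = rng.normal(size=(S, A))              # tabular softmax policy parameters
s0 = 0

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)  # pi[s, a]

def q_values(theta):
    """Time-indexed Q_t[s, a] for t = 0..T-1, by backward induction (V_T = 0)."""
    pi, V, Qs = policy(theta), np.zeros(S), [None] * T
    for t in reversed(range(T)):
        Qs[t] = R + gamma * P @ V            # Q_t = r + gamma * E[V_{t+1}]
        V = (pi * Qs[t]).sum(axis=1)         # V_t(s) = sum_a pi(a|s) Q_t(s, a)
    return Qs

def value(theta):
    """Exact V^{pi_theta}(s_0) over the T-step horizon."""
    return (policy(theta)[s0] * q_values(theta)[0][s0]).sum()

def score_function_gradient(theta):
    """Exact E_tau[ sum_t gamma^t (grad log pi(a_t|s_t)) Q_t(s_t, a_t) ]."""
    pi, Qs = policy(theta), q_values(theta)
    d = np.zeros(S)
    d[s0] = 1.0                              # distribution of s_t, starting at t = 0
    grad = np.zeros_like(theta)
    for t in range(T):
        for s in range(S):
            for b in range(A):
                # softmax score: d/dtheta[s, b] of log pi(a|s) is 1[a == b] - pi[s, b]
                score_b = (np.arange(A) == b) - pi[s, b]
                grad[s, b] += gamma**t * d[s] * np.sum(pi[s] * score_b * Qs[t][s])
        d = np.einsum('s,sa,sap->p', d, pi, P)   # advance the state distribution
    return grad

def finite_difference_gradient(theta, eps=1e-5):
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        e = np.zeros_like(theta)
        e[idx] = eps
        grad[idx] = (value(theta + e) - value(theta - e)) / (2 * eps)
    return grad

# Largest absolute discrepancy; should be tiny (finite-difference error only).
print(np.max(np.abs(score_function_gradient(theta) - finite_difference_gradient(theta))))
```

Up to finite-difference error, the two gradients should match.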

Note that it's valid to move between sums of empirical rewards $\sum_{t'>t} \gamma^{t'} r_{t'}$ and their expectations (the Q-values $Q^{\pi_\theta}(s_t, a_t)$) because this all happens inside the expectation over $(s_t, a_t)$. I was worried that in general $r_{t'}$ for $t' > t$ is not independent of $\nabla_\theta \log \pi_\theta(a_t | s_t)$, but that doesn't matter, because the latter is just a constant once we condition on $(s_t, a_t)$.
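
Written out, that step is just the tower rule (my notation; the discounting convention here is chosen so that $\mathbb{E}\left[\sum_{t'>t} \gamma^{t'} r_{t'} \mid s_t, a_t\right] = \gamma^t Q^{\pi_\theta}(s_t, a_t)$, matching the $\gamma^t$ factor above):

\begin{align*}
\mathbb{E}_\tau\left[\left(\nabla_\theta \log \pi_\theta(a_t | s_t)\right) \sum_{t'>t} \gamma^{t'} r_{t'}\right]
&= \mathbb{E}_{s_t, a_t}\left[\left(\nabla_\theta \log \pi_\theta(a_t | s_t)\right) \mathbb{E}\left[\sum_{t'>t} \gamma^{t'} r_{t'} \,\middle|\, s_t, a_t\right]\right]\\
&= \mathbb{E}_{s_t, a_t}\left[\left(\nabla_\theta \log \pi_\theta(a_t | s_t)\right) \gamma^t\, Q^{\pi_\theta}(s_t, a_t)\right].
\end{align*}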