In reinforcement learning, the advantage of a state-action pair under a policy $\pi$ is the improvement in value from taking action $a$ (and then acting according to $\pi$), versus the value of just immediately following the policy:
$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$

The advantage functions like a 'centered' version of the Q-function, isolating the specific effect of the current action, so acting to maximize the advantage $A^\pi(s,\cdot)$ is equivalent to maximizing the value $Q^\pi(s,\cdot)$.
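As a concrete check of that equivalence, here is a minimal numerical sketch (the Q-table, policy, and sizes below are invented for illustration): subtracting $V^\pi(s)$ from each row of $Q^\pi(s,\cdot)$ only shifts the row by a constant, so the greedy action is unchanged.

```python
# Sketch: advantage = Q - V is a per-state constant shift of Q,
# so argmax_a A(s, a) == argmax_a Q(s, a).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

Q = rng.normal(size=(n_states, n_actions))               # stand-in for Q^pi(s, a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)    # arbitrary policy pi(a | s)

V = (pi * Q).sum(axis=1)      # V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
A = Q - V[:, None]            # A^pi(s, a) = Q^pi(s, a) - V^pi(s)

# Greedy actions are identical under A and Q.
assert np.array_equal(A.argmax(axis=1), Q.argmax(axis=1))
print(A)                      # each row of A is the matching row of Q shifted by -V(s)
```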
We can immediately notice a few properties:
- For an optimal policy $\pi^*$ the advantage is non-positive: zero if $a$ is an optimal action in $s$, and negative otherwise.
- The advantage is equal to an expected temporal-difference error (checked numerically in the sketch just after this list):

  $$A^\pi(s_t, a_t) = \mathbb{E}_{r_{t+1},\, s_{t+1} \mid s_t, a_t}\left[r_{t+1} + \gamma V^\pi(s_{t+1})\right] - V^\pi(s_t) = \mathbb{E}\left[\delta_t \mid s_t, a_t\right]$$
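A self-contained sketch of the second property, on a randomly generated toy MDP (every quantity below is invented for illustration, and rewards are taken to be deterministic given $(s,a)$ to keep it short): compute $V^\pi$ exactly from the Bellman equation, then check that averaging sampled TD errors $\delta_t$ for one state-action pair recovers $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ up to Monte Carlo noise.

```python
# Sketch: A^pi(s, a) equals the expectation of the TD error
# delta = r + gamma * V^pi(s') - V^pi(s) over the environment's randomness.
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9
n_states, n_actions = 3, 2

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # reward r(s, a), deterministic here
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi(a | s)

# Exact V^pi from the Bellman equation: V = r_pi + gamma * P_pi @ V
P_pi = np.einsum("sa,sat->st", pi, P)
r_pi = (pi * R).sum(axis=1)
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Exact advantage via Q^pi(s, a) = E[r + gamma * V^pi(s')]
A_exact = R + gamma * P @ V - V[:, None]

# Monte Carlo estimate of E[delta | s, a] for one state-action pair
s, a = 0, 1
next_states = rng.choice(n_states, size=100_000, p=P[s, a])
deltas = R[s, a] + gamma * V[next_states] - V[s]
print(A_exact[s, a], deltas.mean())   # the two agree up to sampling noise
```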
In RL we generally try to maximize the expected sum of discounted rewards:
$$J(\pi) = \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{T-1} \gamma^t r_{t+1}\right] = \mathbb{E}_{s_0}\left[V^\pi(s_0)\right]$$

We can reframe this using the telescoping-sum property of advantages. For any two policies $\pi'$ and $\pi$, first note that we can trivially write
$$J(\pi) = \mathbb{E}_{\tau\sim\pi'}\left[V^\pi(s_0)\right]$$

since the marginal distribution over start states $s_0$ is independent of the policy. Using this (and writing the finite-horizon sums as infinite ones, with rewards and $V^\pi$ taken to be zero after the episode terminates), we have:
$$
\begin{aligned}
J(\pi') - J(\pi) &= \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty}\gamma^t r_{t+1}\right] - \mathbb{E}_{\tau\sim\pi'}\left[V^\pi(s_0)\right] \\
&= \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty}\gamma^t r_{t+1}\right] + \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=1}^{\infty}\gamma^t V^\pi(s_t) - \sum_{t=0}^{\infty}\gamma^t V^\pi(s_t)\right] \qquad \text{(all terms cancel except } V^\pi(s_0)\text{)} \\
&= \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty}\gamma^t\bigl(r_{t+1} + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)\bigr)\right] \\
&= \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty}\gamma^t A^\pi(s_t, a_t)\right] \\
&\propto \mathbb{E}_{s\sim d^{\pi',\gamma}}\,\mathbb{E}_{a\sim\pi'}\left[A^\pi(s,a)\right]
\end{aligned}
$$

showing that the performance of policy $\pi'$ can be measured (up to an additive constant $J(\pi)$) as the expected advantage, under any reference policy $\pi$, of the actions taken by $\pi'$. It doesn't matter which policy's advantages we use, because in the definition of the advantage as an expected temporal-difference error:
$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s) = \mathbb{E}_{r,s'}\left[r + \gamma V^\pi(s')\right] - V^\pi(s)$$

the immediate expectation is only with respect to environmental randomness (the reward $r$ and next state $s'$), and the remaining expectations, hidden inside the $V^\pi$ terms, all cancel out in the telescoping sum, except for the value of the initial state, which gives the additive constant $J(\pi)$.
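The whole identity can be checked numerically. The sketch below (again on a randomly generated, infinite-horizon toy MDP, with every quantity invented for illustration) computes $J(\pi') - J(\pi)$ both directly and through the advantage identity, using the unnormalized discounted state visitation $\rho_{\pi'}(s) = \sum_t \gamma^t \Pr(s_t = s \mid \pi')$ so the final proportionality becomes an equality; running the check with two different reference policies $\pi$ illustrates that the choice of $\pi$ only changes the additive constant $J(\pi)$.

```python
# Sketch: J(pi') - J(pi) == sum_s rho_{pi'}(s) * sum_a pi'(a|s) * A^pi(s, a),
# for any reference policy pi, on a small randomly generated MDP.
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.9
n_states, n_actions = 4, 3

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # reward r(s, a)
d0 = rng.dirichlet(np.ones(n_states))                             # start-state distribution

def random_policy():
    return rng.dirichlet(np.ones(n_actions), size=n_states)

def value(pi):
    """Exact V^pi from the Bellman equation V = r_pi + gamma * P_pi @ V."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = (pi * R).sum(axis=1)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def J(pi):
    return d0 @ value(pi)                       # J(pi) = E_{s0}[V^pi(s0)]

def advantage(pi):
    V = value(pi)
    return R + gamma * P @ V - V[:, None]       # A^pi = Q^pi - V^pi

pi_prime = random_policy()
P_pp = np.einsum("sa,sat->st", pi_prime, P)
# Unnormalized discounted visitation: rho = d0^T (I - gamma * P_{pi'})^{-1}
rho = np.linalg.solve(np.eye(n_states) - gamma * P_pp.T, d0)

for pi in (random_policy(), random_policy()):   # two different reference policies
    adv_term = rho @ (pi_prime * advantage(pi)).sum(axis=1)
    print(J(pi_prime) - J(pi), adv_term)        # the two numbers on each line match
```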