advantage: Nonlinear Function
Created: July 05, 2022
Modified: July 05, 2022

advantage

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

In reinforcement learning, the advantage of a state-action pair under a policy $\pi$ is the improvement in value from taking action $a$ (and then acting according to $\pi$), versus the value of just immediately following the policy:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

The advantage functions like a 'centered' version of the $Q$ function, isolating the specific effect of the current action, so acting to maximize the advantage $A(s, \cdot)$ is equivalent to maximizing the value $Q(s, \cdot)$.
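As a quick sanity check of this 'centering' claim, here is a minimal sketch with made-up tabular arrays (none of these numbers come from the note itself; $V$ is taken as the mean of $Q$ over actions, i.e. the value of a uniform policy whose $Q$-function this is):

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions = 4, 3

Q = rng.normal(size=(num_states, num_actions))  # stand-in for Q^pi(s, a)
V = Q.mean(axis=1)                              # stand-in for V^pi(s) under a uniform pi

A = Q - V[:, None]                              # A^pi(s, a) = Q^pi(s, a) - V^pi(s)

# Subtracting a per-state constant never changes which action is best:
assert np.array_equal(A.argmax(axis=1), Q.argmax(axis=1))
```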

We can immediately notice a few properties:

  1. For an optimal policy $\pi^*$ the advantage is non-positive: zero if $a$ is an optimal action in $s$, and negative otherwise.
  2. The advantage is equal to an expected temporal difference error (a sampling sketch follows after this list):
    $$\begin{align*} A^\pi(s_t, a_t) &= \mathbb{E}_{r_{t+1}, s_{t+1}|a_t}\left[r_{t+1} + \gamma V^\pi(s_{t+1})\right] - V^\pi(s_t)\\ &= \mathbb{E}\left[\delta_t \mid s_t, a_t\right] \end{align*}$$
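Property 2 suggests a simple sampling-based estimator: average one-step TD errors over resampled transitions from $(s, a)$. The `sample_transition` hook and value table `V` below are hypothetical stand-ins for whatever environment access and value estimate you actually have:

```python
import numpy as np

def td_advantage_estimate(sample_transition, V, s, a, gamma=0.99, num_samples=1000):
    """Monte Carlo estimate of A^pi(s, a) = E[delta_t | s_t = s, a_t = a].

    `sample_transition(s, a)` is an assumed hook that draws (reward, next_state)
    from the environment's dynamics; `V` is a state-indexed estimate of V^pi.
    """
    deltas = []
    for _ in range(num_samples):
        r, s_next = sample_transition(s, a)
        deltas.append(r + gamma * V[s_next] - V[s])  # one-step TD error delta_t
    return np.mean(deltas)
```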

Advantage formulation of the RL objective

In RL we generally try to maximize the expected sum of discounted rewards:

$$\begin{align*} J(\pi) &= \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} \gamma^t r_{t+1} \right]\\ &= \mathbb{E}_{s_0}\left[V^\pi (s_0)\right] \end{align*}$$
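For reference, a rough Monte Carlo sketch of this objective; `env_reset`, `env_step`, and `policy` are assumed hooks (not defined anywhere in this note), and truncating at a finite horizon makes this an approximation for long or infinite-horizon problems:

```python
def estimate_objective(env_reset, env_step, policy, gamma=0.99,
                       horizon=200, num_episodes=100):
    """Estimate J(pi) as the average discounted return over sampled trajectories."""
    total = 0.0
    for _ in range(num_episodes):
        s = env_reset()                    # sample s_0
        episode_return, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)                  # a_t ~ pi(. | s_t)
            r, s, done = env_step(s, a)    # observe r_{t+1}, s_{t+1}
            episode_return += discount * r
            discount *= gamma
            if done:
                break
        total += episode_return
    return total / num_episodes
```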

We can reframe this using the telescoping-sum property of advantages. For any two policies $\pi'$ and $\pi$, first note that we can trivially write

$$J(\pi) = \mathbb{E}_{\tau\sim\pi'}\left[V^\pi(s_0)\right]$$

since the marginal distribution on start states $s_0$ is independent of the policy. Using this, we have:

$$\begin{align*}
J(\pi') - J(\pi) &= \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{T-1} \gamma^t r_{t+1} \right] - \mathbb{E}_{\tau\sim \pi'}\left[V^\pi(s_0)\right]\\
&= \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{T-1} \gamma^t r_{t+1} \right] \\
&\qquad+ \mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=1}^\infty \gamma^t V^\pi(s_t) - \sum_{t=0}^\infty \gamma^t V^\pi(s_t)\right]\\
&\qquad\text{(note all terms cancel except $V^\pi(s_0)$)}\\
&= \mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^\infty \gamma^t \left(r_{t+1} + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)\right)\right]\\
&= \mathbb{E}_{\tau\sim \pi'}\left[\sum_{t=0}^\infty \gamma^t A^\pi(s_t, a_t)\right]\\
&\propto \mathbb{E}_{s \sim d_{\pi',\gamma}}\, \mathbb{E}_{a\sim \pi'}\left[A^\pi(s, a)\right]
\end{align*}$$
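This identity (the performance-difference lemma) is easy to check numerically in a small tabular MDP. The MDP, policies, and helper below are made up purely for the check; note also that the code uses the unnormalized discounted occupancy, under which the final line holds with equality rather than just proportionality:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 5, 3, 0.9

# Random tabular MDP (made-up numbers, just to verify the algebra).
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
R = rng.normal(size=(nS, nA))                   # R[s, a]
rho0 = np.full(nS, 1.0 / nS)                    # start-state distribution

def random_policy():
    return rng.dirichlet(np.ones(nA), size=nS)  # pi[s, a]

def values(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)       # state-to-state transitions under pi
    r_pi = np.einsum('sa,sa->s', pi, R)         # expected one-step reward under pi
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V                       # Q[s, a]
    return V, Q

pi, pi_new = random_policy(), random_policy()
V_pi, Q_pi = values(pi)
A_pi = Q_pi - V_pi[:, None]                     # A^pi(s, a)

# Left-hand side: difference in objectives.
V_new, _ = values(pi_new)
lhs = rho0 @ V_new - rho0 @ V_pi

# Right-hand side: discounted occupancy of pi_new, times the expected advantage
# (measured under pi's advantage function) of the actions pi_new takes.
P_new = np.einsum('sa,sat->st', pi_new, P)
d_new = np.linalg.solve(np.eye(nS) - gamma * P_new.T, rho0)  # unnormalized occupancy
rhs = d_new @ np.einsum('sa,sa->s', pi_new, A_pi)

assert np.allclose(lhs, rhs)
```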

showing that the performance of a policy $\pi'$ can be measured (up to the additive constant $J(\pi)$) as the expected advantage, with respect to any policy $\pi$, of the actions taken by $\pi'$. It doesn't matter which policy's advantages we use, because in the definition of advantage as an expected temporal difference error:

$$\begin{align*}
A^\pi(s, a) &= Q^\pi(s, a) - V^\pi(s)\\
&= \mathbb{E}_{r,s'}\left[r + \gamma V^\pi(s')\right] - V^\pi(s)
\end{align*}$$

the immediate expectation is just with respect to environmental randomness (the reward $r$ and next state $s'$), and the remaining expectations inside of $V^\pi$ all cancel out in the telescoping sum, except for the value of the initial state, which gives the additive constant $J(\pi)$.