In reinforcement learning, the advantage of a state-action pair under a policy $\pi$ is the improvement in value from taking action $a$ (and then acting according to $\pi$), versus the value of just immediately following the policy:
$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$

The advantage functions like a 'centered' version of the Q-function, isolating the specific effect of the current action, so acting to maximize the advantage $A^\pi(s,\cdot)$ is equivalent to maximizing the value $Q^\pi(s,\cdot)$.
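As a concrete check of that equivalence, here is a minimal numerical sketch (the Q-table, policy, and sizes below are invented for illustration): subtracting $V^\pi(s)$ from each row of $Q^\pi(s,\cdot)$ only shifts the row by a constant, so the greedy action is unchanged.

```python
# Sketch: advantage = Q - V is a per-state constant shift of Q,
# so argmax_a A(s, a) == argmax_a Q(s, a).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

Q = rng.normal(size=(n_states, n_actions))               # stand-in for Q^pi(s, a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)    # arbitrary policy pi(a | s)

V = (pi * Q).sum(axis=1)      # V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
A = Q - V[:, None]            # A^pi(s, a) = Q^pi(s, a) - V^pi(s)

# Greedy actions are identical under A and Q.
assert np.array_equal(A.argmax(axis=1), Q.argmax(axis=1))
print(A)                      # each row of A is the matching row of Q shifted by -V(s)
```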
We can immediately notice a few properties:
- For an optimal policy $\pi^*$ the advantage is non-positive: zero if $a$ is an optimal action in $s$, and negative otherwise.
- The advantage is equal to an expected temporal-difference error (checked numerically in the sketch just after this list):

  $$A^\pi(s_t, a_t) = \mathbb{E}_{r_{t+1},\, s_{t+1} \mid s_t, a_t}\left[r_{t+1} + \gamma V^\pi(s_{t+1})\right] - V^\pi(s_t) = \mathbb{E}\left[\delta_t \mid s_t, a_t\right]$$
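A self-contained sketch of the second property, on a randomly generated toy MDP (every quantity below is invented for illustration, and rewards are taken to be deterministic given $(s,a)$ to keep it short): compute $V^\pi$ exactly from the Bellman equation, then check that averaging sampled TD errors $\delta_t$ for one state-action pair recovers $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ up to Monte Carlo noise.

```python
# Sketch: A^pi(s, a) equals the expectation of the TD error
# delta = r + gamma * V^pi(s') - V^pi(s) over the environment's randomness.
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9
n_states, n_actions = 3, 2

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # reward r(s, a), deterministic here
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi(a | s)

# Exact V^pi from the Bellman equation: V = r_pi + gamma * P_pi @ V
P_pi = np.einsum("sa,sat->st", pi, P)
r_pi = (pi * R).sum(axis=1)
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Exact advantage via Q^pi(s, a) = E[r + gamma * V^pi(s')]
A_exact = R + gamma * P @ V - V[:, None]

# Monte Carlo estimate of E[delta | s, a] for one state-action pair
s, a = 0, 1
next_states = rng.choice(n_states, size=100_000, p=P[s, a])
deltas = R[s, a] + gamma * V[next_states] - V[s]
print(A_exact[s, a], deltas.mean())   # the two agree up to sampling noise
```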
In RL we generally try to maximize the expected sum of discounted rewards:
$$J(\pi) = \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{T-1} \gamma^t r_{t+1}\right] = \mathbb{E}_{s_0}\left[V^\pi(s_0)\right]$$

We can reframe this using the telescoping-sum property of advantages. For any two policies $\pi'$ and $\pi$, first note that we can trivially write
$$J(\pi) = \mathbb{E}_{\tau\sim\pi'}\left[V^\pi(s_0)\right]$$

since the marginal distribution over start states $s_0$ is independent of the policy. Using this (and writing the finite-horizon sums as infinite ones, with rewards and $V^\pi$ taken to be zero after the episode terminates), we have:
$$
\begin{aligned}
J(\pi') - J(\pi) &= \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty}\gamma^t r_{t+1}\right] - \mathbb{E}_{\tau\sim\pi'}\left[V^\pi(s_0)\right] \\
&= \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty}\gamma^t r_{t+1}\right] + \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=1}^{\infty}\gamma^t V^\pi(s_t) - \sum_{t=0}^{\infty}\gamma^t V^\pi(s_t)\right] \qquad \text{(all terms cancel except } V^\pi(s_0)\text{)} \\
&= \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty}\gamma^t\bigl(r_{t+1} + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)\bigr)\right] \\
&= \mathbb{E}_{\tau\sim\pi'}\left[\sum_{t=0}^{\infty}\gamma^t A^\pi(s_t, a_t)\right] \\
&\propto \mathbb{E}_{s\sim d^{\pi',\gamma}}\,\mathbb{E}_{a\sim\pi'}\left[A^\pi(s,a)\right]
\end{aligned}
$$

showing that the performance of policy $\pi'$ can be measured (up to an additive constant $J(\pi)$) as the expected advantage, under any reference policy $\pi$, of the actions taken by $\pi'$. It doesn't matter which policy's advantages we use, because in the definition of the advantage as an expected temporal-difference error:
$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s) = \mathbb{E}_{r,s'}\left[r + \gamma V^\pi(s')\right] - V^\pi(s)$$

the immediate expectation is only with respect to environmental randomness (the reward $r$ and next state $s'$), and the remaining expectations, hidden inside the $V^\pi$ terms, all cancel out in the telescoping sum, except for the value of the initial state, which gives the additive constant $J(\pi)$.
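The whole identity can be checked numerically. The sketch below (again on a randomly generated, infinite-horizon toy MDP, with every quantity invented for illustration) computes $J(\pi') - J(\pi)$ both directly and through the advantage identity, using the unnormalized discounted state visitation $\rho_{\pi'}(s) = \sum_t \gamma^t \Pr(s_t = s \mid \pi')$ so the final proportionality becomes an equality; running the check with two different reference policies $\pi$ illustrates that the choice of $\pi$ only changes the additive constant $J(\pi)$.

```python
# Sketch: J(pi') - J(pi) == sum_s rho_{pi'}(s) * sum_a pi'(a|s) * A^pi(s, a),
# for any reference policy pi, on a small randomly generated MDP.
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.9
n_states, n_actions = 4, 3

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # reward r(s, a)
d0 = rng.dirichlet(np.ones(n_states))                             # start-state distribution

def random_policy():
    return rng.dirichlet(np.ones(n_actions), size=n_states)

def value(pi):
    """Exact V^pi from the Bellman equation V = r_pi + gamma * P_pi @ V."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = (pi * R).sum(axis=1)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def J(pi):
    return d0 @ value(pi)                       # J(pi) = E_{s0}[V^pi(s0)]

def advantage(pi):
    V = value(pi)
    return R + gamma * P @ V - V[:, None]       # A^pi = Q^pi - V^pi

pi_prime = random_policy()
P_pp = np.einsum("sa,sat->st", pi_prime, P)
# Unnormalized discounted visitation: rho = d0^T (I - gamma * P_{pi'})^{-1}
rho = np.linalg.solve(np.eye(n_states) - gamma * P_pp.T, d0)

for pi in (random_policy(), random_policy()):   # two different reference policies
    adv_term = rho @ (pi_prime * advantage(pi)).sum(axis=1)
    print(J(pi_prime) - J(pi), adv_term)        # the two numbers on each line match
```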