normalized advantage function: Nonlinear Function
Created: July 19, 2022
Modified: July 19, 2022

normalized advantage function

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Instead of modeling $Q(s, a)$ directly, we build a network that outputs $V(s)$ and quantities $\mu(s), P(s)$ defining a quadratic representation of the advantage

$$A(s, a) = -\frac{1}{2}(a - \mu(s))^T P(s) (a - \mu(s))$$

which together imply $Q(s, a) = V(s) + A(s, a)$. Because the advantage is quadratic in $a$ (and concave as long as $P(s)$ is positive definite), we can immediately read off the argmax action: it is $\mu(s)$, where the advantage is zero and $Q(s, \mu(s)) = V(s)$. Effectively, $\mu(s)$ is a policy network,
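
To make the quadratic form concrete, here is a minimal pure-Python sketch of the computation for a single state. The network itself is left out: `V`, `mu`, and `l_entries` stand in for its outputs. The names `build_P`, `advantage`, and `q_value` are hypothetical, and building $P = L L^T$ from a lower-triangular $L$ with an exponentiated diagonal is an assumption here (one common way to keep $P$ positive definite), not something these notes specify.

```python
import math


def build_P(l_entries, dim):
    """Build P = L L^T from flat lower-triangular entries (row-major).

    Assumption: diagonal entries are exponentiated so that L has a strictly
    positive diagonal, which makes P positive definite.
    """
    L = [[0.0] * dim for _ in range(dim)]
    k = 0
    for i in range(dim):
        for j in range(i + 1):
            L[i][j] = math.exp(l_entries[k]) if i == j else l_entries[k]
            k += 1
    # P[i][j] = sum_m L[i][m] * L[j][m]
    return [
        [sum(L[i][m] * L[j][m] for m in range(dim)) for j in range(dim)]
        for i in range(dim)
    ]


def advantage(a, mu, P):
    """A(s, a) = -1/2 (a - mu)^T P (a - mu)."""
    d = [ai - mi for ai, mi in zip(a, mu)]
    n = len(d)
    quad = sum(d[i] * P[i][j] * d[j] for i in range(n) for j in range(n))
    return -0.5 * quad


def q_value(a, V, mu, P):
    """Q(s, a) = V(s) + A(s, a)."""
    return V + advantage(a, mu, P)


# Hypothetical network outputs for one state with a 2-D action space.
V = 1.5
mu = [0.2, -0.4]
P = build_P([0.0, 0.5, -0.3], dim=2)

print(q_value(mu, V, mu, P))           # Q at the greedy action
print(q_value([0.5, 0.1], V, mu, P))   # Q at some other action
```

With $P(s)$ positive definite, the advantage is zero at $a = \mu(s)$ and strictly negative everywhere else, so the greedy action needs no optimization over $a$.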