Created: July 19, 2022
Modified: July 19, 2022
normalized advantage function
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

References:
- Gu et al., Continuous Deep Q-Learning with Model-based Acceleration (2016).
Instead of modeling $Q(s, a)$ directly, we build a network that outputs $\mu(s)$, $V(s)$, and a lower-triangular matrix $L(s)$: quantities defining a quadratic representation of the advantage

$$A(s, a) = -\tfrac{1}{2}\,(a - \mu(s))^\top P(s)\,(a - \mu(s)), \qquad P(s) = L(s)\,L(s)^\top,$$

which together imply $Q(s, a) = V(s) + A(s, a)$. Using a quadratic to represent the advantage allows us to immediately read off the argmax action $\arg\max_a Q(s, a) = \mu(s)$. Effectively, $\mu$ is a policy network.
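A minimal numpy sketch of the quadratic advantage computation (the shapes, the `naf_q` helper name, and the example values are my own; Gu et al. parameterize $P$ through a lower-triangular $L$ with an exponentiated diagonal so that $P = LL^\top$ is positive definite):

```python
import numpy as np

def naf_q(a, mu, V, L_params, d):
    """Compute Q(s, a) = V(s) + A(s, a) for the NAF quadratic advantage.

    a, mu    : action and predicted argmax action, shape (d,)
    V        : scalar state value (network output)
    L_params : d*(d+1)//2 raw outputs filling a lower-triangular L
    """
    # Fill the lower triangle of L from the raw network outputs.
    L = np.zeros((d, d))
    rows, cols = np.tril_indices(d)
    L[rows, cols] = L_params
    # Exponentiate the diagonal so P = L L^T is positive definite.
    L[np.diag_indices(d)] = np.exp(np.diag(L))
    P = L @ L.T
    diff = a - mu
    # Quadratic advantage: maximal (= 0) exactly at a = mu.
    A = -0.5 * diff @ P @ diff
    return V + A, A

# Example: A(s, a) vanishes at a = mu, so Q there equals V.
d = 2
mu = np.array([0.3, -0.1])
L_params = np.array([0.2, 0.1, -0.3])
Q_at_mu, A_at_mu = naf_q(mu, mu, 1.5, L_params, d)      # A = 0, Q = V = 1.5
Q_off, A_off = naf_q(np.array([1.0, 1.0]), mu, 1.5, L_params, d)  # A < 0
```

Because $P$ is positive definite by construction, the advantage is strictly negative away from $\mu(s)$, which is what makes the argmax trivially readable.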