actor-critic: Nonlinear Function
Created: April 17, 2022
Modified: June 22, 2022

actor-critic

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

(references: https://julien-vitay.net/deeprl/ActorCritic.html)

Advantage actor-critic

The advantage function

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

is a 'centered' version of $Q$; in policy gradient methods this corresponds to using the state-value function $V$ as a control variate to reduce variance. Advantage actor-critic methods estimate the advantage function directly. A standard estimate is the $n$-step advantage, which simply plugs in the usual $n$-step temporal-difference estimate of the action value:

$$A^{(n)}_\phi(s_t, a_t) = \sum_{k=0}^{n-1} \gamma^k r_{t+k+1} + \gamma^n V_\phi(s_{t+n+1}) - V_\phi(s_t)$$

As with other TD estimates, we could average this over multiple values of $n$ for finer-grained control over the bias-variance tradeoff.
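To make the estimate concrete, here is a minimal Python sketch (my own, not from the notes or the linked reference) that evaluates the $n$-step advantage for a single starting state, given the rewards collected along a rollout and the critic's values at the two endpoints; the function name and the toy inputs are placeholders.

```python
import numpy as np

def n_step_advantage(rewards, v_start, v_boot, gamma=0.99):
    """n-step advantage estimate for the state where `rewards` begins.

    rewards : the n rewards r_{t+1}, ..., r_{t+n} collected along the rollout
    v_start : critic estimate V_phi at the starting state s_t
    v_boot  : critic estimate V_phi at the state where the rollout is cut off
    """
    n = len(rewards)
    # sum_{k=0}^{n-1} gamma^k r_{t+k+1}
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    # ... + gamma^n * V_phi(bootstrap state) - V_phi(s_t)
    return discounted + gamma ** n * v_boot - v_start

# Toy usage with synthetic numbers standing in for a real rollout and critic.
rng = np.random.default_rng(0)
print(n_step_advantage(rewards=rng.normal(size=5), v_start=0.3, v_boot=0.1))
```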

Parallel / asynchronous actor-critic

(summarizing Mnih et al. 2016, Asynchronous Methods for Deep Reinforcement Learning)

A parallel advantage actor-critic (A2C) algorithm works as follows:

  1. Initialize a global actor $\pi_\theta$ and critic $V_\phi$, and many parallel copies of the environment.
  2. In each environment:
    1. Take $n$ steps, logging the 'minibatch' of $(s_k, a_k, r_{k+1})$ tuples. If a terminal state is reached after $m \le n$ steps, just pretend that the horizon was $m$ rather than $n$, and reset the environment so that the next minibatch starts from an initial state $s_0$.
    2. Compute the TD estimate $Q^{(n-k)}_\pi(s_k, a_k)$ for each $k \in \{0, \ldots, n-1\}$. That is, for each state we compute the longest-horizon estimate possible using the minibatch data: $s_0$ gets an $n$-step estimate, $s_1$ gets an $(n-1)$-step estimate, and so on (see the code sketch after this list).
    3. Accumulate the minibatch actor and critic gradients $d\theta$ and $d\phi$.
  3. Apply the summed gradient update and repeat from step 2.
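Below is a minimal PyTorch sketch of steps 2–3, with synthetic rollouts standing in for real environments and a serial loop standing in for the parallel workers; the network sizes, learning rate, and the `rollout_targets` helper are placeholder assumptions of mine, not the paper's setup. The backward recursion in `rollout_targets` produces exactly the per-state longest-horizon targets described in step 2.2.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Tiny placeholder actor and critic over a toy 4-dim state, 2-action problem.
obs_dim, n_actions, gamma, n_steps, n_envs = 4, 2, 0.99, 5, 8
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.RMSprop(list(actor.parameters()) + list(critic.parameters()), lr=7e-4)

def rollout_targets(rewards, bootstrap_value):
    """Backward recursion R_k = r_{k+1} + gamma * R_{k+1}, seeded with the
    critic's bootstrap value; this yields the (n-k)-step estimates of step 2.2."""
    targets, ret = [], bootstrap_value
    for r in reversed(rewards.tolist()):
        ret = r + gamma * ret
        targets.append(ret)
    return list(reversed(targets))

opt.zero_grad()
for _ in range(n_envs):  # step 2: one n-step minibatch per (synthetic) environment
    states = torch.randn(n_steps + 1, obs_dim)   # s_0 ... s_n (synthetic)
    rewards = torch.randn(n_steps)               # r_1 ... r_n (synthetic)
    dist = Categorical(logits=actor(states[:-1]))
    actions = dist.sample()                      # stand-in for the logged actions

    values = critic(states).squeeze(-1)          # V_phi(s_0) ... V_phi(s_n)
    targets = torch.stack(rollout_targets(rewards, values[-1].detach()))
    advantages = targets - values[:-1]

    # Accumulate d_theta and d_phi (step 2.3); gradients sum across environments.
    actor_loss = -(dist.log_prob(actions) * advantages.detach()).mean()
    critic_loss = advantages.pow(2).mean()
    (actor_loss + critic_loss).backward()
opt.step()  # step 3: apply the summed update
```

A real implementation would also bootstrap with zero instead of the critic's value when a minibatch ends at a terminal state (step 2.1), and would collect the minibatches from the parallel environments rather than sampling synthetic data.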

Note a few design choices here:

  • By using parallel environments we reduce correlation in the state/action pairs. This serves a similar purpose to experience replay, but allows for on-policy learning.
  • Computing TD estimates of different horizons from an $n$-step minibatch effectively averages over $k$-step TD methods for $1 \le k \le n$.

Mnih et al. find that it helps substantially to augment the policy gradient with an entropy regularization term $\beta \nabla_\theta H(\pi_\theta(s_t))$ to improve exploration.
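In code this amounts to adding an entropy bonus to the actor's objective. A hedged sketch with a categorical policy; `beta`, the logits, and the advantages are synthetic placeholders:

```python
import torch
from torch.distributions import Categorical

beta = 0.01                                      # entropy coefficient (placeholder value)
logits = torch.randn(5, 2, requires_grad=True)   # policy logits for 5 states, 2 actions (synthetic)
advantages = torch.randn(5)                      # advantage estimates (synthetic, treated as constants)

dist = Categorical(logits=logits)
actions = dist.sample()

# Maximize log pi(a|s) * A + beta * H(pi(.|s)); equivalently minimize the negation.
actor_loss = -(dist.log_prob(actions) * advantages + beta * dist.entropy()).mean()
actor_loss.backward()   # gradient now includes the beta * grad H(pi_theta(s_t)) term
```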

The Asynchronous Advantage Actor-Critic (A3C) method generalizes this by letting the parallel actors run simulations and apply gradient updates asynchronously (an instance of 'Hogwild!' optimization). They find that this works well, and that it helps to share a global RMSProp optimizer state across the workers.
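As a rough illustration of the Hogwild!-style update (my own sketch, loosely following PyTorch's shared-memory multiprocessing pattern, not the paper's implementation): workers apply lock-free gradient steps directly to parameters living in shared memory. The toy loss stands in for a real A3C rollout loss, and truly sharing the RMSProp statistics across workers would need extra machinery not shown here.

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def worker(shared_model, steps=100):
    # Each worker has its own optimizer, but the parameters live in shared memory,
    # so steps from different processes interleave without locking (Hogwild-style).
    opt = torch.optim.RMSprop(shared_model.parameters(), lr=1e-3)
    for _ in range(steps):
        x = torch.randn(16, 4)                # synthetic "rollout" data
        loss = shared_model(x).pow(2).mean()  # placeholder for the A3C actor/critic loss
        opt.zero_grad()
        loss.backward()
        opt.step()                            # asynchronous, lock-free update

if __name__ == "__main__":
    model = nn.Linear(4, 1)
    model.share_memory()                      # put the parameters in shared memory
    procs = [mp.Process(target=worker, args=(model,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```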