actor-critic: Nonlinear Function
Created: April 17, 2022
Modified: June 22, 2022

actor-critic

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

(references: https://julien-vitay.net/deeprl/ActorCritic.html)

Advantage actor-critic

The advantage function

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

is a 'centered' version of $Q$; in policy gradient methods this corresponds to using the state-value function $V$ as a control variate to reduce variance. Advantage actor-critic methods estimate the advantage function directly. A standard estimate is the $n$-step advantage, which simply plugs in the usual $n$-step temporal-difference estimate of the action value:

$$A^{(n)}_\phi(s_t, a_t) = \sum_{k=0}^{n-1} \gamma^k r_{t+k+1} + \gamma^n V_\phi(s_{t+n+1}) - V_\phi(s_t)$$

As with other TD estimates, we could average this over multiple values of $n$ for finer-grained control over the bias-variance tradeoff.
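To make the estimate concrete, here is a minimal Python sketch (my own, not from the notes or the linked reference) that evaluates the $n$-step advantage for a single starting state, given the rewards collected along a rollout and the critic's values at the two endpoints; the function name and the toy inputs are placeholders.

```python
import numpy as np

def n_step_advantage(rewards, v_start, v_boot, gamma=0.99):
    """n-step advantage estimate for the state where `rewards` begins.

    rewards : the n rewards r_{t+1}, ..., r_{t+n} collected along the rollout
    v_start : critic estimate V_phi at the starting state s_t
    v_boot  : critic estimate V_phi at the state where the rollout is cut off
    """
    n = len(rewards)
    # sum_{k=0}^{n-1} gamma^k r_{t+k+1}
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    # ... + gamma^n * V_phi(bootstrap state) - V_phi(s_t)
    return discounted + gamma ** n * v_boot - v_start

# Toy usage with synthetic numbers standing in for a real rollout and critic.
rng = np.random.default_rng(0)
print(n_step_advantage(rewards=rng.normal(size=5), v_start=0.3, v_boot=0.1))
```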

Parallel / asynchronous actor-critic

(summarizing Mnih et al. 2016, Asynchronous Methods for Deep Reinforcement Learning)

A parallel advantage actor-critic (A2C) algorithm works as follows:

  1. Initialize a global actor $\pi_\theta$ and critic $V_\phi$, and many parallel copies of the environment.
  2. In each environment:
    1. Take $n$ steps, logging the 'minibatch' of $(s_k, a_k, r_{k+1})$ tuples. If a terminal state is reached after $m \le n$ steps, just pretend that the horizon was $m$ rather than $n$, and reset the environment so that the next minibatch starts from an initial state $s_0$.
    2. Compute the TD estimate $Q^{(n-k)}_\pi(s_k, a_k)$ for each $k \in \{0, \ldots, n-1\}$. That is, for each state we compute the longest-horizon estimate possible using the minibatch data: $s_0$ gets an $n$-step estimate, $s_1$ gets an $(n-1)$-step estimate, and so on (see the code sketch after this list).
    3. Accumulate the minibatch actor and critic gradients $d\theta$ and $d\phi$.
  3. Apply the summed gradient update and repeat from step 2.
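Below is a minimal PyTorch sketch of steps 2–3, with synthetic rollouts standing in for real environments and a serial loop standing in for the parallel workers; the network sizes, learning rate, and the `rollout_targets` helper are placeholder assumptions of mine, not the paper's setup. The backward recursion in `rollout_targets` produces exactly the per-state longest-horizon targets described in step 2.2.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Tiny placeholder actor and critic over a toy 4-dim state, 2-action problem.
obs_dim, n_actions, gamma, n_steps, n_envs = 4, 2, 0.99, 5, 8
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.RMSprop(list(actor.parameters()) + list(critic.parameters()), lr=7e-4)

def rollout_targets(rewards, bootstrap_value):
    """Backward recursion R_k = r_{k+1} + gamma * R_{k+1}, seeded with the
    critic's bootstrap value; this yields the (n-k)-step estimates of step 2.2."""
    targets, ret = [], bootstrap_value
    for r in reversed(rewards.tolist()):
        ret = r + gamma * ret
        targets.append(ret)
    return list(reversed(targets))

opt.zero_grad()
for _ in range(n_envs):  # step 2: one n-step minibatch per (synthetic) environment
    states = torch.randn(n_steps + 1, obs_dim)   # s_0 ... s_n (synthetic)
    rewards = torch.randn(n_steps)               # r_1 ... r_n (synthetic)
    dist = Categorical(logits=actor(states[:-1]))
    actions = dist.sample()                      # stand-in for the logged actions

    values = critic(states).squeeze(-1)          # V_phi(s_0) ... V_phi(s_n)
    targets = torch.stack(rollout_targets(rewards, values[-1].detach()))
    advantages = targets - values[:-1]

    # Accumulate d_theta and d_phi (step 2.3); gradients sum across environments.
    actor_loss = -(dist.log_prob(actions) * advantages.detach()).mean()
    critic_loss = advantages.pow(2).mean()
    (actor_loss + critic_loss).backward()
opt.step()  # step 3: apply the summed update
```

A real implementation would also bootstrap with zero instead of the critic's value when a minibatch ends at a terminal state (step 2.1), and would collect the minibatches from the parallel environments rather than sampling synthetic data.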

Note a few design choices here:

  • By using parallel environments we reduce correlation in the state/action pairs. This serves a similar purpose to experience replay, but allows for on-policy learning.
  • Computing TD estimates of different horizons from an $n$-step minibatch effectively averages over $k$-step TD methods for $1 \le k \le n$.

Mnih et al. find that it helps substantially to augment the policy gradient with an entropy regularization term $\beta \nabla_\theta H(\pi_\theta(s_t))$ to improve exploration.
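In code this amounts to adding an entropy bonus to the actor's objective. A hedged sketch with a categorical policy; `beta`, the logits, and the advantages are synthetic placeholders:

```python
import torch
from torch.distributions import Categorical

beta = 0.01                                      # entropy coefficient (placeholder value)
logits = torch.randn(5, 2, requires_grad=True)   # policy logits for 5 states, 2 actions (synthetic)
advantages = torch.randn(5)                      # advantage estimates (synthetic, treated as constants)

dist = Categorical(logits=logits)
actions = dist.sample()

# Maximize log pi(a|s) * A + beta * H(pi(.|s)); equivalently minimize the negation.
actor_loss = -(dist.log_prob(actions) * advantages + beta * dist.entropy()).mean()
actor_loss.backward()   # gradient now includes the beta * grad H(pi_theta(s_t)) term
```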

The Asynchronous Advantage Actor-Critic (A3C) method generalizes this by letting the parallel actors run simulations and apply gradient updates asynchronously (an instance of 'Hogwild!' optimization). They find that this works well, and that it helps to share a global RMSProp optimizer state across the workers.
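As a rough illustration of the Hogwild!-style update (my own sketch, loosely following PyTorch's shared-memory multiprocessing pattern, not the paper's implementation): workers apply lock-free gradient steps directly to parameters living in shared memory. The toy loss stands in for a real A3C rollout loss, and truly sharing the RMSProp statistics across workers would need extra machinery not shown here.

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def worker(shared_model, steps=100):
    # Each worker has its own optimizer, but the parameters live in shared memory,
    # so steps from different processes interleave without locking (Hogwild-style).
    opt = torch.optim.RMSprop(shared_model.parameters(), lr=1e-3)
    for _ in range(steps):
        x = torch.randn(16, 4)                # synthetic "rollout" data
        loss = shared_model(x).pow(2).mean()  # placeholder for the A3C actor/critic loss
        opt.zero_grad()
        loss.backward()
        opt.step()                            # asynchronous, lock-free update

if __name__ == "__main__":
    model = nn.Linear(4, 1)
    model.share_memory()                      # put the parameters in shared memory
    procs = [mp.Process(target=worker, args=(model,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```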