experience replay: Nonlinear Function
Created: April 23, 2022
Modified: September 04, 2022

experience replay

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

The state transitions we observe in reinforcement learning are typically correlated over time, both

  • within a trajectory (obviously) and
  • across trajectories (since the policy changes only slightly from one trajectory to the next).

But stochastic gradient methods require iid samples. To bridge this gap and stabilize learning we can use a replay buffer, which stores a large number of transition tuples $(s, a, r, s')$. Instead of updating immediately on new transitions, we add them to the buffer, evicting the oldest transition once the buffer is full, and then for each update we sample (a batch of) transitions from the buffer uniformly at random.
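As a minimal sketch of the idea (the class and method names here are my own, not from any particular library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer of (s, a, r, s') transitions."""

    def __init__(self, capacity):
        # a deque with maxlen evicts the oldest transition automatically
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform sampling breaks the temporal correlations between transitions
        return random.sample(self.buffer, batch_size)

# usage: add transitions as the agent acts, sample minibatches for each update
# buffer = ReplayBuffer(capacity=100_000)
# buffer.add(s, a, r, s_next)
# batch = buffer.sample(32)
```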

A nice thing about using a buffer is that it decouples the process of collecting experience from the training process. They can run in different threads or on different machines, and if collecting experience is expensive, we can squeeze more 'juice' out of the data by taking multiple gradient steps for each transition we collect. The flip side of this is that replay-based learning is inherently off-policy, since the buffer was populated by older versions of the current policy.

Prioritized experience replay

(reference: Schaul et al., 2015, https://arxiv.org/abs/1511.05952)

Not all transitions are equally valuable. We might do better to focus on the 'surprising' ones: the hard cases where our current estimates are really bad.

One approach to prioritization is to use the temporal-difference (TD) error. We store this error $\delta_i$ in the buffer along with each transition $i$, and update it whenever we take a training step with that transition.

Naively we might consider a greedy prioritization scheme where we always replay the highest-error transition. But this has several flaws:

  • We'll tend to replay the same transition over and over. This is highly nonstationary and can cause us to overfit to those transitions.
  • The TD errors are usually noisy, due to stochastic transitions and/or rewards, so we shouldn't trust them too much.
  • The stored TD errors will be stale in general, since they don't reflect whatever we might have learned since they were last updated, and greedy prioritization means we might never revisit some transitions to correct their errors.

We can address these issues using a stochastic prioritization scheme that gives nonzero probability to every transition. The paper proposes sampling transitions with probabilities $P_i \propto p_i^\alpha$, where $\alpha$ is an (inverse) temperature parameter and $p_i = |\delta_i| + \epsilon$ or $p_i = 1/\text{rank}(i)$ is a positive-valued priority.
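A rough sketch of how the stored priorities might be turned into sampling probabilities under both variants (the function names, `alpha`, and `eps` values are illustrative choices, not taken from the paper's implementation):

```python
import numpy as np

def proportional_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional variant: p_i = |delta_i| + eps, P_i proportional to p_i^alpha."""
    priorities = np.abs(td_errors) + eps
    scaled = priorities ** alpha
    return scaled / scaled.sum()

def rank_based_probabilities(td_errors, alpha=0.7):
    """Rank-based variant: p_i = 1 / rank(i), ranked by |delta_i| in descending order."""
    order = np.argsort(-np.abs(td_errors))           # indices from largest to smallest error
    ranks = np.empty(len(td_errors), dtype=np.int64)
    ranks[order] = np.arange(1, len(td_errors) + 1)  # rank 1 = largest |delta|
    priorities = 1.0 / ranks
    scaled = priorities ** alpha
    return scaled / scaled.sum()

# sample a minibatch of buffer indices according to the priorities
# probs = proportional_probabilities(stored_td_errors)
# idx = np.random.choice(len(stored_td_errors), size=32, p=probs)
```

After a training step on the sampled transitions, their stored TD errors (and hence their priorities) would be overwritten with the newly computed values.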

Prioritized replay also introduces bias, since the replay distribution no longer matches the state-visitation distribution. This is corrected using weighted importance sampling, with the weight

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P_i}\right)^\beta$$

used to multiply $\delta_i$ in the update

$$\theta \leftarrow \theta - \lambda w_i \delta_i \nabla_\theta Q_\theta(s_i, a_i)$$

where it effectively acts as an auxiliary dynamic learning-rate parameter (in addition to the usual parameter $\lambda$). For stability, the weights are normalized by $1/\max_i w_i$ so that they can only decrease gradient magnitude.
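A small sketch of the weight computation for a sampled batch (the function name is mine, and normalizing over the sampled batch rather than the whole buffer is my assumption; the formula itself is the one above):

```python
import numpy as np

def importance_weights(probs, buffer_size, beta):
    """w_i = (1/N * 1/P_i)^beta for a sampled batch, normalized by the max weight."""
    weights = (1.0 / (buffer_size * probs)) ** beta
    # dividing by the max means the weights can only scale gradients down
    return weights / weights.max()

# the weights multiply the TD errors in the gradient step, e.g. via a weighted loss:
# w = importance_weights(probs[idx], len(stored_td_errors), beta=0.5)
# loss = np.mean(w * td_errors[idx] ** 2)
```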

The updates approach unbiasedness as $\beta \to 1$, though this also increases variance. Since early training is crazy biased and nonstationary anyway, the paper suggests linearly annealing from an initial value $\beta_0 < 1$ to $\beta = 1$ so that the updates become unbiased over the course of the training process. On Atari games, the paper says:

For the α and β0 hyperparameters that are introduced by prioritization, we did a coarse grid search (evaluated on a subset of 8 games), and found the sweet spot to be α = 0.7, β0 = 0.5 for the rank-based variant and α = 0.6, β0 = 0.4 for the proportional variant.

Efficient implementation of prioritized replay requires some thinking about data structures; one approach is described in the paper (Schaul et al., 2015).
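For the proportional variant the paper describes a 'sum tree': a binary tree whose internal nodes store the sum of their children's priorities, giving O(log N) priority updates and proportional sampling. A minimal sketch of that idea (the class layout and names are my own):

```python
import numpy as np

class SumTree:
    """Array-backed binary tree: leaves hold priorities, internal nodes hold sums.

    Both updating a leaf's priority and sampling a leaf proportionally to its
    priority take O(log N) time.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        # nodes[1] is the root; leaves live at indices [capacity, 2 * capacity)
        self.nodes = np.zeros(2 * capacity)

    def update(self, idx, priority):
        """Set the priority of leaf `idx` and propagate the change up to the root."""
        i = idx + self.capacity
        delta = priority - self.nodes[i]
        while i >= 1:
            self.nodes[i] += delta
            i //= 2

    def sample(self):
        """Return a leaf index drawn with probability proportional to its priority."""
        u = np.random.uniform(0, self.nodes[1])  # nodes[1] holds the total priority mass
        i = 1
        while i < self.capacity:                 # descend until we hit a leaf
            left = 2 * i
            if u <= self.nodes[left]:
                i = left
            else:
                u -= self.nodes[left]
                i = left + 1
        return i - self.capacity
```

A full prioritized buffer would pair this with storage for the transitions themselves: `update` is called with the new $p_i^\alpha$ whenever a transition is added or its TD error changes, and `sample` picks which stored transition to replay.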