Created: April 23, 2022
Modified: April 23, 2022
target network
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

A general issue with temporal difference learning methods, which 'update a guess towards a guess', is that they can end up 'chasing their own tails' because the target is constantly changing. The solution to this is to keep a second copy of the network parameters, $\bar\theta$, which changes more slowly than the current parameters $\theta$, and use these as the target in the TD error:

$$\delta_t = r_{t+1} + \gamma \, v_{\bar\theta}(s_{t+1}) - v_\theta(s_t)$$
(with analogous expressions for multistep errors and for $Q$-learning).
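For concreteness, here is a minimal sketch (my own, not from any particular paper) of computing this error for a linear value function $v_\theta(s) = \theta^\top s$; the function and variable names are purely illustrative:

```python
import numpy as np

def td_error(theta, theta_bar, s, r, s_next, gamma=0.99):
    """delta = r + gamma * v_{theta_bar}(s') - v_theta(s).
    The bootstrapped target uses the slow parameters theta_bar,
    so it does not shift with every update to theta."""
    return r + gamma * np.dot(theta_bar, s_next) - np.dot(theta, s)

theta = np.zeros(4)        # current (online) parameters
theta_bar = theta.copy()   # target-network parameters, updated separately

s, r, s_next = np.ones(4), 1.0, np.ones(4)
delta = td_error(theta, theta_bar, s, r, s_next)
theta += 0.1 * delta * s   # semi-gradient TD(0) step on the online parameters only
```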
In the simplest case, the target network parameters are frozen and updated infrequently to copy the current parameters ($\bar\theta \leftarrow \theta$): the DQN paper does this every 10000 steps. More elegant is to use polyak averaging, updating

$$\bar\theta \leftarrow (1 - \tau)\,\bar\theta + \tau\,\theta$$

at each step (for some small $\tau$), so that the target is a slowly-evolving average of the parameters from previous steps.
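Both update schemes are short to write out; the following sketch (again my own, with illustrative names and an assumed $\tau = 0.005$) treats the parameters as a dict of numpy arrays:

```python
import numpy as np

def hard_update(target_params, online_params):
    """Frozen-target scheme: copy the online parameters wholesale,
    done only every N steps (10000 in the DQN paper)."""
    for k in target_params:
        target_params[k] = online_params[k].copy()

def polyak_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- (1 - tau) * target + tau * online,
    applied at every step so the target is a slow moving average."""
    for k in target_params:
        target_params[k] = (1.0 - tau) * target_params[k] + tau * online_params[k]

# Example usage with made-up parameter shapes.
online = {"w": np.random.randn(4, 2), "b": np.zeros(2)}
target = {k: v.copy() for k, v in online.items()}

for step in range(1, 20001):
    # ... gradient step on `online` would go here ...
    polyak_update(target, online)      # soft update every step
    # or instead, the DQN-style frozen target:
    # if step % 10000 == 0:
    #     hard_update(target, online)
```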