temporal difference: Nonlinear Function
Created: March 22, 2022
Modified: April 04, 2022

temporal difference

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

From David Silver's slides: TD-learning 'updates a guess towards a guess'.

Sutton and Barto define the temporal difference error as the difference between the updated value estimate after transitioning to the next state (including any reward accrued during the transition) and the value estimate for the current state:

$$\delta_t = r_{t + 1} + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$$

The overall value estimation error across a trajectory can be straightforwardly written as a discounted (telescoping) sum of TD errors.
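
Concretely, holding the estimates $V^\pi$ fixed along the trajectory and writing $v_t = r_{t+1} + \gamma r_{t+2} + \ldots$ for the empirical return (defined again below):

$$
\begin{aligned}
v_t - V^\pi(s_t)
&= \underbrace{r_{t+1} + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)}_{\delta_t} + \gamma\left(v_{t+1} - V^\pi(s_{t+1})\right)\\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2\left(v_{t+2} - V^\pi(s_{t+2})\right)
= \ldots
= \sum_{k=0}^\infty \gamma^k \delta_{t+k}.
\end{aligned}
$$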

The simple learning algorithm that updates the value function to reduce the one-step TD error

$$V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha \delta_t$$

is called TD(0). We can also update to reduce the $n$-step error for arbitrary $n$: in general, let

$$v_t^{(n)} = r_{t+1} + \gamma r_{t + 2} + \ldots + \gamma^n V^\pi(s_{t+n})$$

be the $n$-step return. Then the $n$-step TD-learning update can be written

$$V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha\left(v_t^{(n)} - V^\pi(s_t)\right)$$

Letting $n\to\infty$ we recover the unbiased Monte Carlo value estimate

$$v_t^{(\infty)} = v_t = r_{t+1} + \gamma r_{t+2} + \ldots;$$

smaller values of $n$ introduce bias to reduce variance (bias-variance tradeoff).
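
A minimal tabular sketch of the $n$-step update (the function name and episode encoding are mine, purely illustrative); with `n=1` this is exactly the TD(0) update above:

```python
from collections import defaultdict

def n_step_td_update(V, states, rewards, t, n, alpha, gamma):
    """Apply the n-step TD update to V[states[t]].

    V       : defaultdict(float) mapping state -> value estimate
    states  : states visited so far; must include states[t + n]
    rewards : rewards[k] is the reward for the transition states[k] -> states[k + 1],
              i.e. r_{k+1} in the notes' indexing
    """
    # n-step return: r_{t+1} + γ r_{t+2} + ... + γ^{n-1} r_{t+n} + γ^n V(s_{t+n})
    G = sum(gamma ** k * rewards[t + k] for k in range(n))
    G += gamma ** n * V[states[t + n]]
    # move the estimate toward the n-step return; n = 1 is exactly TD(0)
    V[states[t]] += alpha * (G - V[states[t]])

# hypothetical usage with made-up states and rewards
V = defaultdict(float)
states = ["s0", "s1", "s2", "s3"]
rewards = [0.0, 1.0, 0.0]
n_step_td_update(V, states, rewards, t=0, n=2, alpha=0.1, gamma=0.99)
```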

We can even look at an infinite, geometrically weighted average of the $n$-step returns

$$v_t^{\lambda} = (1 - \lambda)\sum_{n=1}^\infty \lambda^{n-1} v^{(n)}_t$$

and update towards this! For a finite trajectory length $T$ we would write $v_t^{\lambda} = (1 - \lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1} v^{(n)}_t + \lambda^{T-t-1} v_t$, implicitly using the empirical return $v_t$ for all post-termination steps. This is called TD($\lambda$), where $\lambda\to 1$ recovers the unbiased Monte Carlo estimate.
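
To make the finite-horizon formula concrete, here is a brute-force sketch of the forward-view target (function names and episode encoding are mine, purely illustrative):

```python
def n_step_return(V, states, rewards, t, n, gamma):
    """v_t^{(n)}, truncated at episode termination; V is a dict of state values."""
    T = len(rewards)                     # the episode ends after T transitions
    steps = min(n, T - t)
    G = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if n < T - t:                        # bootstrap only if we stopped before the end
        G += gamma ** n * V[states[t + n]]
    return G                             # for n >= T - t this is the empirical return v_t

def lambda_return(V, states, rewards, t, gamma, lam):
    """v_t^λ = (1-λ) Σ_{n=1}^{T-t-1} λ^{n-1} v_t^{(n)} + λ^{T-t-1} v_t."""
    T = len(rewards)
    G = sum(
        (1 - lam) * lam ** (n - 1) * n_step_return(V, states, rewards, t, n, gamma)
        for n in range(1, T - t)
    )
    # the final weight λ^{T-t-1} goes on the full Monte Carlo return
    G += lam ** (T - t - 1) * n_step_return(V, states, rewards, t, T - t, gamma)
    return G
```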

Forward and backward perspectives

Naively, to do a TD($\lambda$) update we'd need to see 'into the future' all the way to the end of the trajectory. This forward view is nice in theory, but a pain to implement and inefficient because it doesn't learn anything until the very end of the trajectory. In practice we implement the backward view, where at each new state we update all previous state values with the appropriate signal from the current TD error.

Concretely, consider an infinite-length trajectory $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, \ldots)$ from the policy $\pi$. The TD($\lambda$) update for $V(s_t)$ from this trajectory is

$$V(s_t) \leftarrow V(s_t) + \alpha\left[v_t^\lambda - V(s_t)\right]$$

where the error driving the update, $\delta_t^\lambda := v_t^\lambda - V(s_t)$, is given by

$$
\begin{alignedat}{2}
v_t^\lambda - V(s_t) &= - V(s_t) &&+ (1 - \lambda) \sum_{n=1}^\infty \lambda^{n-1} v_t^{(n)}\\
&= - V(s_t) &&+ (1 - \lambda) \sum_{n=1}^\infty \lambda^{n-1} \left(r_{t+1} + \gamma r_{t+2} + \ldots + \gamma^n V(s_{t+n})\right)\\
&= - V(s_t) &&+ (1 - \lambda) \lambda^0 \left(r_{t+1} + \gamma V(s_{t+1})\right)\\
& &&+ (1 - \lambda)\lambda^1 \left(r_{t+1} + \gamma r_{t+2} + \gamma^2 V(s_{t+2})\right)\\
& &&+ (1 - \lambda)\lambda^2 \left(r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 V(s_{t+3})\right)\\
& &&\;\vdots\\
&= - V(s_t) &&+ (\lambda\gamma)^0\left(r_{t+1} + (1-\lambda)\gamma V(s_{t+1})\right)\\
& &&+ (\lambda\gamma)^1 \left(r_{t+2} + (1-\lambda)\gamma V(s_{t+2})\right)\\
& &&+ (\lambda\gamma)^2 \left(r_{t+3} + (1-\lambda)\gamma V(s_{t+3})\right)\\
& &&\;\vdots\\
&= &&\;\;\, (\lambda\gamma)^0\left(r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right)\\
& &&+ (\lambda\gamma)^1 \left(r_{t+2} + \gamma V(s_{t+2}) - V(s_{t+1})\right)\\
& &&+ (\lambda\gamma)^2 \left(r_{t+3} + \gamma V(s_{t+3}) - V(s_{t+2})\right)\\
& &&\;\vdots\\
&= \sum_{n=0}^\infty (\lambda\gamma)^n \delta_{t+n}
\end{alignedat}
$$

We see that the update includes terms for all future TD errors, where the error at step $t+n$ is incorporated into the update at step $t$ with weight $(\lambda\gamma)^n$. This justifies the backward-view implementation of TD($\lambda$), in which at each step we update the value estimate for the state from $n$ steps ago (for all $n$) by the current TD error with weight $(\lambda\gamma)^n$. Note that a given state can appear multiple times in the state history, so its total weight may be a sum; e.g., $(\lambda\gamma)^{n_1} + (\lambda\gamma)^{n_2}$ for a state visited twice, at $n_1$ and $n_2$ steps before the current step respectively. The vector containing these per-state total weights is called an eligibility trace.
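
A minimal tabular sketch of the backward view, keeping the eligibility trace as a dict of per-state weights. The `env_step` interface and the names are assumptions for illustration, not from any particular library:

```python
from collections import defaultdict

def td_lambda_episode(env_step, s0, V, alpha, gamma, lam):
    """Backward-view TD(λ) for one episode.

    env_step(s) -> (reward, next_state, done) samples a transition under the
    policy being evaluated; V is a defaultdict(float) of value estimates.
    """
    E = defaultdict(float)                     # eligibility trace (accumulating)
    s, done = s0, False
    while not done:
        r, s_next, done = env_step(s)
        delta = r + gamma * (0.0 if done else V[s_next]) - V[s]
        E[s] += 1.0                            # bump the trace for the current state
        for state in list(E):
            V[state] += alpha * delta * E[state]   # a state visited n steps ago has trace (γλ)^n
            E[state] *= gamma * lam                # decay every trace by γλ
        s = s_next
    return V
```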

Note that online updates introduce non-stationarity: although the forward and backward views give equivalent updates in an offline setting, this is no longer true when we apply the backward updates online, since the update at time $t$ can change the value estimates at steps $> t$.

We could define a more rigorous version of online TD-learning:

  1. At step $1$, update our initial value estimates $V_0$ using $\delta_1$. Call this updated array of state-values $V_1$.
  2. At step 2, where naive online TD($\lambda$) would simply update $V_1$ using $\delta_2$, we instead:
    1. Redo the original $V_0\rightarrow V_1$ update, now using the two-step return $\delta_1 + (\lambda\gamma)\delta_2$; call the result $V_1'$.
    2. Now recompute $\delta_2'$ using the updated $V_1'$, and use that to compute $V_2$.
  3. Continue, so that at each step $t$ you iteratively redo all $t$ updates using the $\lambda$-return at the current horizon.

This algorithm, which Sutton and Barto call the 'online $\lambda$-return algorithm', is computationally impractical but conceptually nice. Intuitively, at each step we pretend that we've reached the end of the episode and compute the value updates prescribed by forward-view (offline) TD($\lambda$) for all states in order, each using the updated values from the previous states. Apparently this often performs better in practice than online TD($\lambda$).
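
To pin the recipe down, here is a literal (and deliberately wasteful) tabular transcription of the steps above as I read them; this is a sketch of the idea, not Sutton and Barto's exact pseudocode, and the names are mine:

```python
def online_lambda_return_sweep(V_start, states, rewards, alpha, gamma, lam):
    """Redo every update for the episode so far at the current horizon.

    V_start : dict of value estimates as of the start of the episode
    states  : [s_0, ..., s_h] visited so far; rewards[k] accompanies s_k -> s_{k+1}
    Returns the value estimates after re-applying all h updates.
    """
    V = dict(V_start)                          # discard the previous horizon's updates
    h = len(rewards)
    for t in range(h):
        # update for s_t: the (λγ)-discounted sum of TD errors out to the current
        # horizon, recomputed with the values already updated in this sweep
        error = 0.0
        for k in range(t, h):
            delta_k = rewards[k] + gamma * V[states[k + 1]] - V[states[k]]
            error += (lam * gamma) ** (k - t) * delta_k
        V[states[t]] += alpha * error
    return V
```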

The online $\lambda$-return algorithm can be made practical in 'nice' settings (tabular or linear value function approximation) by formulating it as an online algorithm using a more complicated eligibility trace called a Dutch trace (as opposed to the eligibility traces used by TD($\lambda$), which are called accumulating traces). Sutton and Barto call this algorithm 'true online TD($\lambda$)'. Apparently this doesn't work in the general case of nonlinear function approximation, though.
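
For the linear case, a sketch of the true online TD($\lambda$) update as I remember it from Sutton and Barto (Dutch trace plus two correction terms); worth double-checking against their pseudocode before relying on it:

```python
import numpy as np

def true_online_td_lambda_episode(features, rewards, w, alpha, gamma, lam):
    """True online TD(λ) with linear values V(s) ≈ w·x(s), for one episode.

    features : [x_0, ..., x_T] feature vectors; the terminal state's features
               are conventionally all zeros so that its value is 0
    rewards  : rewards[k] accompanies the transition x_k -> x_{k+1}
    """
    z = np.zeros_like(w)                       # Dutch trace
    V_old = 0.0
    for k, r in enumerate(rewards):
        x, x_next = features[k], features[k + 1]
        V, V_next = w @ x, w @ x_next
        delta = r + gamma * V_next - V
        # Dutch-trace update (note the extra -αγλ(z·x)x term vs. an accumulating trace)
        z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ x)) * x
        # weight update with the true-online correction terms
        w = w + alpha * (delta + V - V_old) * z - alpha * (V - V_old) * x
        V_old = V_next
    return w
```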

Q-learning

Adapting TD-learning methods to action values gives Q-learning and related methods.
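
For example, the standard one-step tabular Q-learning update (bootstrapping with the greedy next action, so it learns off-policy) looks like this; a minimal sketch with illustrative names:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma, done=False):
    """One-step Q-learning: Q(s,a) += α [r + γ max_a' Q(s',a') - Q(s,a)]."""
    target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)                         # state-action values keyed by (state, action)
q_learning_update(Q, "s0", "left", 1.0, "s1", actions=["left", "right"],
                  alpha=0.1, gamma=0.99)
```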