temporal difference: Nonlinear Function
Created: March 22, 2022
Modified: April 04, 2022

temporal difference

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

From David Silver's slides: TD-learning 'updates a guess towards a guess'.

Sutton and Barto define the temporal difference error as the difference between the updated value estimate after transitioning to the next state (including any reward accrued during the transition) and the value estimate for the current state:

$$\delta_t = r_{t + 1} + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$$

The overall value estimation error across a trajectory can be straightforwardly written as a discounted (telescoping) sum of TD errors.
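
Concretely, holding the estimates $V^\pi$ fixed along the trajectory and writing $v_t = r_{t+1} + \gamma r_{t+2} + \ldots$ for the empirical return (defined again below):

$$
\begin{aligned}
v_t - V^\pi(s_t)
&= \underbrace{r_{t+1} + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)}_{\delta_t} + \gamma\left(v_{t+1} - V^\pi(s_{t+1})\right)\\
&= \delta_t + \gamma\delta_{t+1} + \gamma^2\left(v_{t+2} - V^\pi(s_{t+2})\right)
= \ldots
= \sum_{k=0}^\infty \gamma^k \delta_{t+k}.
\end{aligned}
$$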

The simple learning algorithm that updates the value function to reduce the one-step TD error

$$V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha \delta_t$$

is called TD(0). We can also update to reduce the $n$-step error for arbitrary $n$: in general, let

$$v_t^{(n)} = r_{t+1} + \gamma r_{t + 2} + \ldots + \gamma^n V^\pi(s_{t+n})$$

be the $n$-step return. Then the $n$-step TD-learning update can be written

$$V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha\left(v_t^{(n)} - V^\pi(s_t)\right)$$

Letting $n\to\infty$ we recover the unbiased Monte Carlo value estimate

$$v_t^{(\infty)} = v_t = r_{t+1} + \gamma r_{t+2} + \ldots;$$

smaller values of $n$ introduce bias to reduce variance (bias-variance tradeoff).
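
A minimal tabular sketch of the $n$-step update (the function name and episode encoding are mine, purely illustrative); with `n=1` this is exactly the TD(0) update above:

```python
from collections import defaultdict

def n_step_td_update(V, states, rewards, t, n, alpha, gamma):
    """Apply the n-step TD update to V[states[t]].

    V       : defaultdict(float) mapping state -> value estimate
    states  : states visited so far; must include states[t + n]
    rewards : rewards[k] is the reward for the transition states[k] -> states[k + 1],
              i.e. r_{k+1} in the notes' indexing
    """
    # n-step return: r_{t+1} + γ r_{t+2} + ... + γ^{n-1} r_{t+n} + γ^n V(s_{t+n})
    G = sum(gamma ** k * rewards[t + k] for k in range(n))
    G += gamma ** n * V[states[t + n]]
    # move the estimate toward the n-step return; n = 1 is exactly TD(0)
    V[states[t]] += alpha * (G - V[states[t]])

# hypothetical usage with made-up states and rewards
V = defaultdict(float)
states = ["s0", "s1", "s2", "s3"]
rewards = [0.0, 1.0, 0.0]
n_step_td_update(V, states, rewards, t=0, n=2, alpha=0.1, gamma=0.99)
```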

We can even look at an infinite, geometrically weighted average of the $n$-step returns

$$v_t^{\lambda} = (1 - \lambda)\sum_{n=1}^\infty \lambda^{n-1} v^{(n)}_t$$

and update towards this! For a finite trajectory length $T$ we would write $v_t^{\lambda} = (1 - \lambda)\sum_{n=1}^{T-t-1} \lambda^{n-1} v^{(n)}_t + \lambda^{T-t-1} v_t$, implicitly using the empirical return $v_t$ for all post-termination steps. This is called TD($\lambda$), where $\lambda\to 1$ recovers the unbiased Monte Carlo estimate.
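
To make the finite-horizon formula concrete, here is a brute-force sketch of the forward-view target (function names and episode encoding are mine, purely illustrative):

```python
def n_step_return(V, states, rewards, t, n, gamma):
    """v_t^{(n)}, truncated at episode termination; V is a dict of state values."""
    T = len(rewards)                     # the episode ends after T transitions
    steps = min(n, T - t)
    G = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if n < T - t:                        # bootstrap only if we stopped before the end
        G += gamma ** n * V[states[t + n]]
    return G                             # for n >= T - t this is the empirical return v_t

def lambda_return(V, states, rewards, t, gamma, lam):
    """v_t^λ = (1-λ) Σ_{n=1}^{T-t-1} λ^{n-1} v_t^{(n)} + λ^{T-t-1} v_t."""
    T = len(rewards)
    G = sum(
        (1 - lam) * lam ** (n - 1) * n_step_return(V, states, rewards, t, n, gamma)
        for n in range(1, T - t)
    )
    # the final weight λ^{T-t-1} goes on the full Monte Carlo return
    G += lam ** (T - t - 1) * n_step_return(V, states, rewards, t, T - t, gamma)
    return G
```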

Forward and backward perspectives

Naively, to do a TD($\lambda$) update we'd need to see 'into the future' all the way to the end of the trajectory. This forward view is nice in theory, but a pain to implement and inefficient because it doesn't learn anything until the very end of the trajectory. In practice we implement the backward view, where at each new state we update all previous state values with the appropriate signal from the current TD error.

Concretely, consider an infinite-length trajectory $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, \ldots)$ from the policy $\pi$. The TD($\lambda$) update for $V(s_t)$ from this trajectory is

$$V(s_t) \leftarrow V(s_t) + \alpha\left[v_t^\lambda - V(s_t)\right]$$

where the error driving the update, $\delta_t^\lambda := v_t^\lambda - V(s_t)$, is given by

$$
\begin{alignedat}{2}
v_t^\lambda - V(s_t) &= - V(s_t) &&+ (1 - \lambda) \sum_{n=1}^\infty \lambda^{n-1} v_t^{(n)}\\
&= - V(s_t) &&+ (1 - \lambda) \sum_{n=1}^\infty \lambda^{n-1} \left(r_{t+1} + \gamma r_{t+2} + \ldots + \gamma^n V(s_{t+n})\right)\\
&= - V(s_t) &&+ (1 - \lambda) \lambda^0 \left(r_{t+1} + \gamma V(s_{t+1})\right)\\
& &&+ (1 - \lambda)\lambda^1 \left(r_{t+1} + \gamma r_{t+2} + \gamma^2 V(s_{t+2})\right)\\
& &&+ (1 - \lambda)\lambda^2 \left(r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 V(s_{t+3})\right)\\
& &&\;\vdots\\
&= - V(s_t) &&+ (\lambda\gamma)^0\left(r_{t+1} + (1-\lambda)\gamma V(s_{t+1})\right)\\
& &&+ (\lambda\gamma)^1 \left(r_{t+2} + (1-\lambda)\gamma V(s_{t+2})\right)\\
& &&+ (\lambda\gamma)^2 \left(r_{t+3} + (1-\lambda)\gamma V(s_{t+3})\right)\\
& &&\;\vdots\\
&= &&\;\;\, (\lambda\gamma)^0\left(r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right)\\
& &&+ (\lambda\gamma)^1 \left(r_{t+2} + \gamma V(s_{t+2}) - V(s_{t+1})\right)\\
& &&+ (\lambda\gamma)^2 \left(r_{t+3} + \gamma V(s_{t+3}) - V(s_{t+2})\right)\\
& &&\;\vdots\\
&= \sum_{n=0}^\infty (\lambda\gamma)^n \delta_{t+n}
\end{alignedat}
$$

We see that the update includes terms for all future TD errors, where the error at step $t+n$ is incorporated into the update at step $t$ with weight $(\lambda\gamma)^n$. This justifies the backward-view implementation of TD($\lambda$), in which at each step we update the value estimate for the state from $n$ steps ago (for all $n$) by the current TD error with weight $(\lambda\gamma)^n$. Note that a given state can appear multiple times in the state history, so its total weight may be a sum; e.g., $(\lambda\gamma)^{n_1} + (\lambda\gamma)^{n_2}$ for a state visited twice, at $n_1$ and $n_2$ steps before the current step respectively. The vector containing these per-state total weights is called an eligibility trace.
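
A minimal tabular sketch of the backward view, keeping the eligibility trace as a dict of per-state weights. The `env_step` interface and the names are assumptions for illustration, not from any particular library:

```python
from collections import defaultdict

def td_lambda_episode(env_step, s0, V, alpha, gamma, lam):
    """Backward-view TD(λ) for one episode.

    env_step(s) -> (reward, next_state, done) samples a transition under the
    policy being evaluated; V is a defaultdict(float) of value estimates.
    """
    E = defaultdict(float)                     # eligibility trace (accumulating)
    s, done = s0, False
    while not done:
        r, s_next, done = env_step(s)
        delta = r + gamma * (0.0 if done else V[s_next]) - V[s]
        E[s] += 1.0                            # bump the trace for the current state
        for state in list(E):
            V[state] += alpha * delta * E[state]   # a state visited n steps ago has trace (γλ)^n
            E[state] *= gamma * lam                # decay every trace by γλ
        s = s_next
    return V
```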

Note that online updates introduce non-stationarity: although the forward and backward views give equivalent updates in an offline setting, this is no longer true when we apply the backward updates online, since the update at time $t$ can change the value estimates at steps $> t$.

We could define a more rigorous version of online TD-learning:

  1. At step $1$, update our initial value estimates $V_0$ using $\delta_1$. Call this updated array of state-values $V_1$.
  2. At step 2, where naive online TD($\lambda$) would simply update $V_1$ using $\delta_2$, we instead:
    1. Redo the original $V_0\rightarrow V_1$ update, now using the two-step return $\delta_1 + (\lambda\gamma)\delta_2$; call the result $V_1'$.
    2. Now recompute $\delta_2'$ using the updated $V_1'$, and use that to compute $V_2$.
  3. Continue, so that at each step $t$ you iteratively redo all $t$ updates using the $\lambda$-return at the current horizon.

This algorithm, which Sutton and Barto call the 'online $\lambda$-return algorithm', is computationally impractical but conceptually nice. Intuitively, at each step we pretend that we've reached the end of the episode and compute the value updates prescribed by forward-view (offline) TD($\lambda$) for all states in order, each using the updated values from the previous states. Apparently this often performs better in practice than online TD($\lambda$).
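
To pin the recipe down, here is a literal (and deliberately wasteful) tabular transcription of the steps above as I read them; this is a sketch of the idea, not Sutton and Barto's exact pseudocode, and the names are mine:

```python
def online_lambda_return_sweep(V_start, states, rewards, alpha, gamma, lam):
    """Redo every update for the episode so far at the current horizon.

    V_start : dict of value estimates as of the start of the episode
    states  : [s_0, ..., s_h] visited so far; rewards[k] accompanies s_k -> s_{k+1}
    Returns the value estimates after re-applying all h updates.
    """
    V = dict(V_start)                          # discard the previous horizon's updates
    h = len(rewards)
    for t in range(h):
        # update for s_t: the (λγ)-discounted sum of TD errors out to the current
        # horizon, recomputed with the values already updated in this sweep
        error = 0.0
        for k in range(t, h):
            delta_k = rewards[k] + gamma * V[states[k + 1]] - V[states[k]]
            error += (lam * gamma) ** (k - t) * delta_k
        V[states[t]] += alpha * error
    return V
```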

The online $\lambda$-return algorithm can be made practical in 'nice' settings (tabular or linear value function approximation) by formulating it as an online algorithm using a more complicated eligibility trace called a Dutch trace (as opposed to the eligibility traces used by TD($\lambda$), which are called accumulating traces). Sutton and Barto call this algorithm 'true online TD($\lambda$)'. Apparently this doesn't work in the general case of nonlinear function approximation, though.
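
For the linear case, a sketch of the true online TD($\lambda$) update as I remember it from Sutton and Barto (Dutch trace plus two correction terms); worth double-checking against their pseudocode before relying on it:

```python
import numpy as np

def true_online_td_lambda_episode(features, rewards, w, alpha, gamma, lam):
    """True online TD(λ) with linear values V(s) ≈ w·x(s), for one episode.

    features : [x_0, ..., x_T] feature vectors; the terminal state's features
               are conventionally all zeros so that its value is 0
    rewards  : rewards[k] accompanies the transition x_k -> x_{k+1}
    """
    z = np.zeros_like(w)                       # Dutch trace
    V_old = 0.0
    for k, r in enumerate(rewards):
        x, x_next = features[k], features[k + 1]
        V, V_next = w @ x, w @ x_next
        delta = r + gamma * V_next - V
        # Dutch-trace update (note the extra -αγλ(z·x)x term vs. an accumulating trace)
        z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ x)) * x
        # weight update with the true-online correction terms
        w = w + alpha * (delta + V - V_old) * z - alpha * (V - V_old) * x
        V_old = V_next
    return w
```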

Q-learning

Adapting TD-learning methods to action values gives Q-learning and related methods.
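
For example, the standard one-step tabular Q-learning update (bootstrapping with the greedy next action, so it learns off-policy) looks like this; a minimal sketch with illustrative names:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma, done=False):
    """One-step Q-learning: Q(s,a) += α [r + γ max_a' Q(s',a') - Q(s,a)]."""
    target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)                         # state-action values keyed by (state, action)
q_learning_update(Q, "s0", "left", 1.0, "s1", actions=["left", "right"],
                  alpha=0.1, gamma=0.99)
```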