Modified: April 04, 2022
temporal difference
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector. From David Silver's slides: TD-learning 'updates a guess towards a guess'.
Sutton and Barto define the temporal difference error as the difference between the updated value estimate after transitioning to the next state (including any reward accrued during the transition) and the value estimate for the current state:

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t).$$
The overall value estimation error across a trajectory can be straightforwardly written as the telescoping sum of TD errors.
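Spelling out the telescoping (a quick derivation, assuming the value estimates are held fixed along the trajectory and writing $G_t = \sum_{k \ge t} \gamma^{k-t} r_{k+1}$ for the Monte Carlo return):

$$
\begin{aligned}
G_t - V(s_t) &= r_{t+1} + \gamma G_{t+1} - V(s_t) \\
&= \underbrace{r_{t+1} + \gamma V(s_{t+1}) - V(s_t)}_{\delta_t} + \gamma\bigl(G_{t+1} - V(s_{t+1})\bigr) \\
&= \delta_t + \gamma\,\delta_{t+1} + \gamma^2\bigl(G_{t+2} - V(s_{t+2})\bigr) = \cdots = \sum_{k \ge t} \gamma^{k-t}\,\delta_k .
\end{aligned}
$$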
The simple learning algorithm that updates the value function to reduce the one-step TD error,

$$V(s_t) \leftarrow V(s_t) + \alpha\,\delta_t = V(s_t) + \alpha\bigl(r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\bigr),$$

is called TD(0). We can also update to reduce the $n$-step error for arbitrary $n$: in general, let

$$G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})$$

be the $n$-step return. Then the $n$-step TD-learning update can be written

$$V(s_t) \leftarrow V(s_t) + \alpha\bigl(G_t^{(n)} - V(s_t)\bigr).$$
Letting $n \to \infty$ we recover the unbiased Monte Carlo value estimate $G_t$; smaller values of $n$ introduce bias to reduce variance (bias-variance tradeoff).
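As a concrete illustration, here is a minimal tabular sketch of the TD(0) update and the $n$-step return above (the `trajectory` format, helper names, and hyperparameters are my own, and terminal-state handling is ignored for brevity):

```python
import numpy as np

def td0_update(V, trajectory, alpha=0.1, gamma=0.99):
    """One sweep of tabular TD(0) over a recorded trajectory.

    V          -- numpy array of state-value estimates, indexed by state id
    trajectory -- list of (state, reward, next_state) transitions
    """
    for s, r, s_next in trajectory:
        td_error = r + gamma * V[s_next] - V[s]   # delta_t: new guess minus old guess
        V[s] += alpha * td_error                  # update the guess towards the guess
    return V

def n_step_return(rewards, V, s_n, gamma=0.99):
    """n-step return G_t^(n): n discounted rewards, then bootstrap from V(s_{t+n})."""
    G = sum(gamma**i * r for i, r in enumerate(rewards))   # r_{t+1}, ..., r_{t+n}
    return G + gamma**len(rewards) * V[s_n]
```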
We can even look at the infinite geometrically-weighted average of $n$-step returns

$$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\,G_t^{(n)}$$

and update towards this! For a finite trajectory of length $T$ we would write

$$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{T-t-1}\lambda^{n-1}\,G_t^{(n)} + \lambda^{T-t-1}\,G_t,$$

implicitly using the empirical return $G_t$ for all post-termination steps. This quantity is called the $\lambda$-return, and updating towards it gives TD($\lambda$); $\lambda = 1$ recovers the unbiased Monte Carlo estimate.
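A sketch of computing the finite-episode $\lambda$-return as defined above (variable names and episode encoding are my own):

```python
def lambda_return(t, states, rewards, V, lam=0.9, gamma=0.99):
    """G_t^lambda for a finite recorded episode.

    states[k] is s_k, rewards[k] is r_{k+1}; the episode has T = len(rewards) steps.
    """
    T = len(rewards)
    G_lam, discounted = 0.0, 0.0
    for n in range(1, T - t):                             # bootstrapped n-step returns
        discounted += gamma ** (n - 1) * rewards[t + n - 1]
        G_n = discounted + gamma ** n * V[states[t + n]]  # G_t^(n)
        G_lam += (1.0 - lam) * lam ** (n - 1) * G_n
    G_mc = sum(gamma ** i * r for i, r in enumerate(rewards[t:]))   # empirical return
    return G_lam + lam ** (T - t - 1) * G_mc
```

With this, the forward view of TD($\lambda$) just updates $V(s_t) \leftarrow V(s_t) + \alpha\,(G_t^{\lambda} - V(s_t))$ for each visited state once the episode has finished.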
Forward and backward perspectives
Naively, to do a TD($\lambda$) update we'd need to see 'into the future' all the way to the end of the trajectory. This forward view is nice in theory, but a pain to implement and inefficient because it doesn't learn anything until the very end of the trajectory. In practice we implement the backward view, where at each new state we update all previous state values with the appropriate signal from the current TD error.
Concretely, consider an infinite-length trajectory $s_t, s_{t+1}, s_{t+2}, \ldots$ sampled from the policy $\pi$. The TD($\lambda$) update for $V(s_t)$ from this trajectory is

$$V(s_t) \leftarrow V(s_t) + \alpha\,\Delta_t,$$

where the change to the value estimate is given by

$$\Delta_t = G_t^{\lambda} - V(s_t) = \sum_{k=t}^{\infty} (\gamma\lambda)^{k-t}\,\delta_k.$$

We see that the update includes terms for all future TD errors, where the error at step $k$ is incorporated into the update at step $t$ with weight $(\gamma\lambda)^{k-t}$. This justifies the backward-view implementation of TD($\lambda$), in which at each step we update the value estimate for the state visited $k$ steps ago (for all $k \ge 0$) by the current TD error with weight $(\gamma\lambda)^k$. Note that a given state can appear multiple times in the state history, so its total weight may be a sum, e.g. $(\gamma\lambda)^{j} + (\gamma\lambda)^{k}$ for a state visited twice, at $j$ and $k$ steps before the current step respectively. The vector containing these per-state total weights is called an eligibility trace.
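A minimal sketch of the backward view with accumulating eligibility traces in the tabular setting (the environment/policy interfaces and hyperparameters here are illustrative, not from the text):

```python
import numpy as np

def td_lambda_episode(V, env, policy, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular backward-view TD(lambda) over one episode.

    Assumes a toy environment with reset() -> s and step(a) -> (s_next, r, done),
    integer state ids, and policy(s) -> a.
    """
    z = np.zeros_like(V)                     # eligibility trace: per-state total weights
    s, done = env.reset(), False
    while not done:
        s_next, r, done = env.step(policy(s))
        delta = r + gamma * V[s_next] * (not done) - V[s]   # current TD error
        z *= gamma * lam                     # decay every existing trace by gamma * lambda
        z[s] += 1.0                          # accumulate trace for the current state
        V += alpha * delta * z               # each past state receives delta with its trace weight
        s = s_next
    return V
```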
Note that online updates introduce non-stationarity: although the forward and backward views give equivalent updates in an offline setting, this is no longer true when we apply the backward updates online, since the update at time $t$ changes the value estimates used at later steps $t' > t$.
We could define a more rigorous version of online TD-learning:
- At step 1, update our initial value estimates $V^{(0)}$ using the one-step target $r_1 + \gamma V^{(0)}(s_1)$. Call this updated array of state-values $V^{(1)}$.
- At step 2, where naive online TD($\lambda$) would simply update $V^{(1)}$ using the new TD error $\delta_1$, we instead:
  - Redo the original update to $V(s_0)$, now using the two-step truncated $\lambda$-return $G_{0:2}^{\lambda}$.
  - Now recompute the target for $s_1$ using the updated value estimates, and use that to compute the update to $V(s_1)$.
- Continue, so that at each step you iteratively redo all previous updates using $\lambda$-returns truncated at the current horizon.
This algorithm, which Sutton and Barto call the 'online $\lambda$-return algorithm', is computationally impractical, but conceptually nice. Intuitively, at each step we pretend that we've reached the end of the episode and compute the value updates prescribed by forward-view (offline) TD($\lambda$) for all states in order, each using the updated values from the previous states. Apparently this often performs better in practice than online TD($\lambda$).
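A deliberately inefficient sketch of this idea, redoing every update at each new horizon; the truncated-return helper, episode encoding, and hyperparameters are my own, so treat this as a sketch of the scheme described above rather than the book's exact algorithm:

```python
import numpy as np

def truncated_lambda_return(t, h, states, rewards, V, lam, gamma):
    """G_{t:h}^lambda: lambda-weighted average of n-step returns truncated at horizon h.
    rewards[k] is the reward on the transition out of states[k]."""
    G_lam, discounted = 0.0, 0.0
    for n in range(1, h - t + 1):
        discounted += gamma ** (n - 1) * rewards[t + n - 1]
        if t + n < len(states):                   # bootstrap from the current estimates
            G_n = discounted + gamma ** n * V[states[t + n]]
        else:                                      # ran off the episode: plain return
            G_n = discounted
        weight = lam ** (n - 1) * ((1.0 - lam) if n < h - t else 1.0)
        G_lam += weight * G_n
    return G_lam

def online_lambda_return(V0, states, rewards, alpha=0.1, gamma=0.99, lam=0.9):
    """Conceptual online lambda-return algorithm (O(T^2) per episode):
    at each new horizon h, restart from the initial estimates and redo the
    updates for t = 0, ..., h-1 in order, each using the values updated so far."""
    V = V0.copy()
    for h in range(1, len(rewards) + 1):          # a new step of data has arrived
        V = V0.copy()                             # throw away the previous pass
        for t in range(h):
            G = truncated_lambda_return(t, h, states, rewards, V, lam, gamma)
            V[states[t]] += alpha * (G - V[states[t]])
    return V
```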
The online $\lambda$-return algorithm can be made practical in 'nice' settings (tabular or linear value function approximation) by formulating it as an online algorithm using a more complicated eligibility trace called a Dutch trace (as opposed to the eligibility traces used by TD($\lambda$), which are called accumulating traces). Sutton and Barto call this algorithm 'True online TD($\lambda$)'. Apparently this doesn't work in the general case of nonlinear function approximation, though.
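For reference, a sketch of the linear-function-approximation version with a Dutch trace, following my recollection of the update in Sutton and Barto (treat the exact coefficients as unverified; `features(s)` and the episode format are hypothetical):

```python
import numpy as np

def true_online_td_lambda(w, episode, features, alpha=0.1, gamma=0.99, lam=0.9):
    """True online TD(lambda) with linear values V(s) = w . x(s) and a Dutch trace.

    episode  -- list of (state, reward, next_state, done) transitions
    features -- maps a state to its feature vector x(s)
    """
    z = np.zeros_like(w)              # Dutch eligibility trace
    v_old = 0.0
    for s, r, s_next, done in episode:
        x, x_next = features(s), features(s_next)
        v, v_next = w @ x, (0.0 if done else w @ x_next)
        delta = r + gamma * v_next - v
        # Dutch trace: like an accumulating trace, with a correction term
        z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ x)) * x
        # weight update, with an extra term correcting for the online drift of V(s)
        w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
        v_old = v_next
    return w
```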
Q-learning
Adapting TD-learning methods to action values gives Q-learning and related methods.
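For example, one-step Q-learning replaces the state value with an action value and bootstraps off the greedy action in the next state: $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\bigl(r_{t+1} + \gamma\max_{a'}Q(s_{t+1},a') - Q(s_t,a_t)\bigr)$. A minimal tabular sketch (variable names my own; exploration, e.g. epsilon-greedy, handled elsewhere):

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update: bootstrap from the best next action."""
    target = r + gamma * np.max(Q[s_next]) * (not done)
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```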