eligibility trace: Nonlinear Function
Created: March 29, 2022
Modified: March 29, 2022

eligibility trace

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

A few ways to think about eligibility traces:

  • an explicit accounting of credit assignment
  • a sufficient statistic for the history of the current trajectory
  • a variant of momentum for reinforcement learning, where we update the value function using an exponentially decaying sum of recent gradients.

Concretely, the eligibility trace $\mathbf{e}_t$ at time $t$ has the same shape as the weight vector $\mathbf{w}_t$, and the online TD($\lambda$) algorithm updates them as follows:

$$
\begin{align*}
\mathbf{e}_{t+1} &= \gamma\lambda\mathbf{e}_t + \nabla_{\mathbf{w}_t} V(s_t; \mathbf{w}_t)\\
\mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha \cdot \delta_t \cdot \mathbf{e}_{t+1}
\end{align*}
$$

where $\delta_t = r_{t+1} + \gamma V(s_{t+1}; \mathbf{w}_t) - V(s_t; \mathbf{w}_t)$ is the temporal difference error in the estimated state value. In the tabular case, where $\mathbf{w}_t = V_t$ is just the table of state values, the updates simplify to $E_{t+1}(s) = \gamma\lambda E_t(s) + \mathbf{1}(s = s_t)$ and $V_{t+1}(s) = V_t(s) + \alpha \cdot \delta_t \cdot E_{t+1}(s)$ for all states $s$.
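As a concrete illustration, here is a minimal sketch of one step of the online update above with a linear value function $V(s; \mathbf{w}) = \mathbf{w}^\top \phi(s)$, so that $\nabla_{\mathbf{w}} V(s_t; \mathbf{w}_t) = \phi(s_t)$. The function name, feature arguments, and hyperparameters are hypothetical placeholders, not from any particular library:

```python
import numpy as np

def td_lambda_step(w, e, phi_s, phi_s_next, r, gamma, lam, alpha, terminal=False):
    """One online TD(lambda) step for a linear value function V(s; w) = w . phi(s).

    w          : weight vector
    e          : eligibility trace, same shape as w
    phi_s      : feature vector phi(s_t) of the current state
    phi_s_next : feature vector phi(s_{t+1}) of the next state
    r          : reward r_{t+1}
    Returns the updated (w, e).
    """
    v_next = 0.0 if terminal else w @ phi_s_next
    delta = r + gamma * v_next - w @ phi_s    # TD error delta_t
    e = gamma * lam * e + phi_s               # decay the trace, add grad_w V = phi(s_t)
    w = w + alpha * delta * e                 # one TD error nudges every eligible feature
    return w, e
```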

We see that the eligibility trace vector $\mathbf{e}$ is (by definition) an exponentially decaying sum of value function gradients, allowing a single update to adjust the value estimates of all recently-visited states.
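To make that last point concrete, here is a small tabular sketch; the three-state toy trajectory is made up purely for illustration. A single TD error on the final transition propagates back to every recently-visited state through the trace:

```python
import numpy as np

def tabular_td_lambda(episode, n_states, gamma=0.99, lam=0.9, alpha=0.1):
    """Tabular TD(lambda) over one episode of (s, r, s_next, done) transitions."""
    V = np.zeros(n_states)
    E = np.zeros(n_states)
    for s, r, s_next, done in episode:
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        E = gamma * lam * E        # decay every state's eligibility
        E[s] += 1.0                # the current state becomes fully eligible
        V = V + alpha * delta * E  # one TD error updates all recently-visited states
    return V

# Toy trajectory 0 -> 1 -> 2 with a reward of 1 only on the final transition:
# the final TD error also bumps the values of states 0 and 1 via the trace.
episode = [(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, 2, True)]
print(tabular_td_lambda(episode, n_states=3))
```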