eligibility trace: Nonlinear Function
Created: March 29, 2022
Modified: March 29, 2022

eligibility trace

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

A few ways to think about eligibility traces:

  • an explicit accounting of credit assignment
  • a sufficient statistic for the history of the current trajectory
  • a variant of momentum for reinforcement learning, where we update the value function using an exponentially decaying sum of recent gradients.

Concretely, the eligibility trace $\mathbf{e}_t$ at time $t$ has the same shape as the weight vector $\mathbf{w}_t$, and the online TD($\lambda$) algorithm updates them as follows:

$$
\begin{align*}
\mathbf{e}_{t+1} &= \gamma\lambda\mathbf{e}_t + \nabla_{\mathbf{w}_t} V(s_t; \mathbf{w}_t)\\
\mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha \cdot \delta_t \cdot \mathbf{e}_{t+1}
\end{align*}
$$

where $\delta_t = r_{t+1} + \gamma V(s_{t+1}; \mathbf{w}_t) - V(s_t; \mathbf{w}_t)$ is the temporal difference error in the estimated state value. In the tabular case, where $\mathbf{w}_t = V_t$ is just the table of state values, the updates simplify to $E_{t+1}(s) = \gamma\lambda E_t(s) + \mathbf{1}(s = s_t)$ and $V_{t+1}(s) = V_t(s) + \alpha \cdot \delta_t \cdot E_{t+1}(s)$ for all states $s$.
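As a concrete illustration, here is a minimal sketch of one step of the online update above with a linear value function $V(s; \mathbf{w}) = \mathbf{w}^\top \phi(s)$, so that $\nabla_{\mathbf{w}} V(s_t; \mathbf{w}_t) = \phi(s_t)$. The function name, feature arguments, and hyperparameters are hypothetical placeholders, not from any particular library:

```python
import numpy as np

def td_lambda_step(w, e, phi_s, phi_s_next, r, gamma, lam, alpha, terminal=False):
    """One online TD(lambda) step for a linear value function V(s; w) = w . phi(s).

    w          : weight vector
    e          : eligibility trace, same shape as w
    phi_s      : feature vector phi(s_t) of the current state
    phi_s_next : feature vector phi(s_{t+1}) of the next state
    r          : reward r_{t+1}
    Returns the updated (w, e).
    """
    v_next = 0.0 if terminal else w @ phi_s_next
    delta = r + gamma * v_next - w @ phi_s    # TD error delta_t
    e = gamma * lam * e + phi_s               # decay the trace, add grad_w V = phi(s_t)
    w = w + alpha * delta * e                 # one TD error nudges every eligible feature
    return w, e
```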

We see that the eligibility trace vector $\mathbf{e}$ is (by definition) an exponentially decaying sum of value function gradients, allowing a single update to adjust the value estimates of all recently-visited states.
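To make that last point concrete, here is a small tabular sketch; the three-state toy trajectory is made up purely for illustration. A single TD error on the final transition propagates back to every recently-visited state through the trace:

```python
import numpy as np

def tabular_td_lambda(episode, n_states, gamma=0.99, lam=0.9, alpha=0.1):
    """Tabular TD(lambda) over one episode of (s, r, s_next, done) transitions."""
    V = np.zeros(n_states)
    E = np.zeros(n_states)
    for s, r, s_next, done in episode:
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        E = gamma * lam * E        # decay every state's eligibility
        E[s] += 1.0                # the current state becomes fully eligible
        V = V + alpha * delta * E  # one TD error updates all recently-visited states
    return V

# Toy trajectory 0 -> 1 -> 2 with a reward of 1 only on the final transition:
# the final TD error also bumps the values of states 0 and 1 via the trace.
episode = [(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, 2, True)]
print(tabular_td_lambda(episode, n_states=3))
```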