decision transformer: Nonlinear Function
Created: April 15, 2022
Modified: April 15, 2022


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

paper: Chen, Lu, et al. 2021, https://arxiv.org/abs/2106.01345

Trajectories are represented as sequences:

$\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \ldots, \hat{R}_T, s_T, a_T$

where $\hat{R}_t = \sum_{t'=t}^{T} r_{t'}$ is the return-to-go, i.e., the sum of future rewards (at test time this is initialized to a 'desired' total return and then decremented by the observed reward over the course of the trajectory). This forces the transformer to learn a connection between actions and future rewards.
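
As a concrete illustration, here's a minimal NumPy sketch (my own helper names, not from the paper's code) of computing returns-to-go from a reward sequence and interleaving them with states and actions into the token sequence:

```python
import numpy as np

def returns_to_go(rewards):
    """Suffix sums: R_hat[t] = r_t + r_{t+1} + ... + r_T."""
    return np.cumsum(rewards[::-1])[::-1]

def interleave(rtg, states, actions):
    """Flatten (R_hat_t, s_t, a_t) triples into the sequence fed to the transformer."""
    seq = []
    for r, s, a in zip(rtg, states, actions):
        seq.extend([r, s, a])
    return seq

# e.g. rewards [0, 0, 1] give returns-to-go [1, 1, 1]
print(returns_to_go(np.array([0.0, 0.0, 1.0])))
```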

A transformer (GPT) encodes these sequences and predicts the actions autoregressively, so $a_1$ is predicted from $(\hat{R}_1, s_1)$, $a_2$ from $(\hat{R}_1, s_1, a_1, \hat{R}_2, s_2)$, etc., using up to $K$ previous steps of context. Predicting the states and returns-to-go is also possible (and would give the method a model-based RL flavor), but apparently didn't improve performance in their experiments.
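
A hedged sketch of what the test-time rollout loop might look like (the `model.predict_action` interface and the gym-style `env` are my assumptions, not the paper's API):

```python
def rollout(model, env, target_return, K, max_steps=1000):
    """Condition on a desired return and decrement it by the observed reward each step."""
    returns, states, actions = [target_return], [env.reset()], []
    for _ in range(max_steps):
        # feed only the most recent K steps of (return-to-go, state, action) context
        past_actions = actions[-(K - 1):] if K > 1 else []
        action = model.predict_action(returns[-K:], states[-K:], past_actions)
        state, reward, done, info = env.step(action)
        actions.append(action)
        states.append(state)
        returns.append(returns[-1] - reward)  # decremented return-to-go
        if done:
            break
    return states, actions
```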

This has the shape of a purely supervised method, like behavioral cloning, but it performs much better than naive cloning. It's conceptually similar to doing cloning only on trajectories that achieved the desired return (e.g., if we ask for a return of 1, we select only trajectories with total returns in $[1-\epsilon, 1+\epsilon]$ and clone those), but it does better when data are scarce, since it can use information from the other trajectories, and it can in some cases extrapolate to achieve better returns than the policies in the training set.
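
For contrast, the post-selected cloning baseline described above amounts to something like this (the trajectory dict format is assumed for illustration):

```python
def filter_by_return(trajectories, target, eps):
    """Keep only trajectories whose total reward lies in [target - eps, target + eps];
    behavioral cloning on this subset is the 'post-selected' comparison."""
    return [traj for traj in trajectories
            if abs(sum(traj["rewards"]) - target) <= eps]
```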

The decision transformer does quite well at credit assignment, and, unlike temporal-difference learning, it is not strongly affected by delayed rewards. This is as we'd expect from the intuition of post-selected behavioral cloning, since the training process gets to look ahead at far-future rewards.

A concurrent paper, Janner, Li, and Levine, "Trajectory Transformer" (2021), https://arxiv.org/abs/2106.02039, uses a similar approach, but models per-step rewards $r_t$ instead of returns-to-go, and predicts both future rewards and states. This requires (or allows) it to use beam search to choose actions, rather than pure sampling.
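
Roughly, that beam search over model-predicted continuations could look like the generic sketch below (`model.expand` and `model.score` are hypothetical interfaces, with the score standing in for predicted cumulative reward):

```python
import heapq

def beam_search(model, prefix, width, horizon):
    """Keep the `width` highest-scoring partial sequences at each depth."""
    beams = [(0.0, prefix)]
    for _ in range(horizon):
        candidates = []
        for score, seq in beams:
            for token in model.expand(seq):  # proposed next tokens
                candidates.append((score + model.score(seq, token), seq + [token]))
        beams = heapq.nlargest(width, candidates, key=lambda c: c[0])
    # return the best full sequence; its first new action is the one executed
    return max(beams, key=lambda b: b[0])[1]
```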

To read: