DAgger: Nonlinear Function
Created: October 13, 2022
Modified: October 13, 2022

DAgger

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

The problem of exposure bias (where an autoregressive sequence model goes off the rails of its training distribution) comes up as a major issue in behavioral cloning algorithms for imitation learning. Here the canonical solution is the DAgger ("Dataset Aggregation") algorithm, which starts by training a naïve behavioral cloning model, then enters an interactive loop:

  1. Sample a (state, action) trajectory from the current trained policy.
  2. Ask the expert to label the optimal action at each visited state.
  3. Add these expert-labeled pairs to the original dataset, retrain the policy on the augmented dataset, and repeat.

Thus the training dataset is gradually expanded to include guidance for all the states the model might end up in, even those the expert would never visit on its own. We can think of this as the expert showing the model how to recover from transient distribution shift, so that small errors don't compound.
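To make the loop concrete, here is a minimal Python sketch. The names `env`, `expert_policy`, and `train` are hypothetical interfaces (a Gym-style environment, a callable returning the expert's action for a state, and a supervised learner), not any particular library's API.

```python
def dagger(env, expert_policy, train, n_iters=10, episodes_per_iter=5):
    # Hypothetical interfaces (assumptions, not a specific library):
    #   env.reset() -> state; env.step(action) -> (state, reward, done, info)
    #   expert_policy(state) -> expert action
    #   train(dataset) -> policy fit to (state, action) pairs by supervised learning
    dataset, policy = [], expert_policy  # the first round of rollouts follows the
                                         # expert, i.e. plain behavioral cloning
    for _ in range(n_iters):
        # 1. Sample trajectories from the current policy.
        for _ in range(episodes_per_iter):
            state, done = env.reset(), False
            while not done:
                # 2. Label every visited state with the expert's action,
                #    even though we actually take the current policy's action.
                dataset.append((state, expert_policy(state)))
                state, _, done, _ = env.step(policy(state))
        # 3. Retrain on the aggregated dataset and repeat.
        policy = train(dataset)
    return policy
```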

DAgger works well (provably!) in the RL setting, but the general sequence-modeling setup is different in several ways:

  1. It is often not natural to find an 'expert' who can get bad sequences back on track (when learning the distribution over natural images, we can't just ask Nature to complete an arbitrary partially-generated image for us, especially when that partial generation is itself highly unnatural).
  2. Even if such an expert existed, we train sequence models at scales that would make querying it at every visited state impractical.

Another special property of the RL setting is that we often have access to an explicit reward function. This motivates the extension of DAgger to the AggreVaTe ("Aggregate Values to Imitate") algorithm. Instead of collecting tuples $(s, a^*)$ indicating the expert action at a given state, AggreVaTe collects cost-weighted tuples $(s, a^*, Q^*)$ indicating both the expert action and the expert cost-to-go $Q^*$. I don't fully understand how to think about this, but the intuition is that not all deviations from expert actions are equally bad: we should be really sure to match the expert in cases where it matters, like not driving off a cliff, and can afford to deviate in cases where it doesn't much matter.
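One way to read the cost-weighting (my guess at the mechanics, not the paper's exact objective) is as cost-sensitive classification: the per-state imitation loss gets scaled by how much is at stake in that state. A sketch, assuming a PyTorch-style policy and a hypothetical `stakes` weight derived from $Q^*$:

```python
import torch.nn.functional as F

def cost_weighted_imitation_loss(policy_logits, expert_actions, stakes):
    # policy_logits:  (batch, n_actions) scores from the learned policy
    # expert_actions: (batch,) the expert's action a* at each sampled state
    # stakes:         (batch,) nonnegative per-state weights derived from the
    #                 expert cost-to-go Q*; the exact weighting scheme here is
    #                 an assumption, not AggreVaTe's published recipe.
    per_state = F.cross_entropy(policy_logits, expert_actions, reduction="none")
    # States where a mistake is catastrophic (cliff edges) dominate the loss;
    # states where all actions have similar cost-to-go contribute little.
    return (stakes * per_state).mean()
```

Under this reading, DAgger is the special case where every state gets the same weight.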