Modified: October 13, 2022
exposure bias
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Consider training an autoregressive model of sequence data (text, audio, action sequences in reinforcement learning, etc.), which has the general form

$$p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{1:t-1}),$$

corresponding to the chain rule of probability. The general goal of training is to find parameters $\theta$ that give the maximum likelihood to a set of training sequences $\{x^{(1)}, \ldots, x^{(N)}\}$.
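As a minimal sketch of this factorization (the `next_token_probs` function below is a hypothetical stand-in for any learned model $p_\theta$, not part of any particular library), the log-likelihood of a sequence decomposes into a sum of per-step conditional log-probabilities:

```python
import numpy as np

def sequence_log_likelihood(next_token_probs, sequence):
    """Log-likelihood of `sequence` under the chain-rule factorization.

    `next_token_probs(prefix)` is a hypothetical stand-in for the model
    p_theta: given the prefix x_{1:t-1}, it returns a probability vector
    over the vocabulary for the next token.
    """
    log_lik = 0.0
    for t, token in enumerate(sequence):
        probs = next_token_probs(sequence[:t])   # condition on x_{1:t-1}
        log_lik += np.log(probs[token])          # add log p(x_t | x_{1:t-1})
    return log_lik

# Toy check with a uniform "model" over a 5-token vocabulary:
uniform_model = lambda prefix: np.full(5, 0.2)
print(sequence_log_likelihood(uniform_model, [0, 3, 1]))  # = 3 * log(0.2)
```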
For any given step of the sequence, this corresponds to learning a conditional distribution $p_\theta(x_t \mid x_{1:t-1})$. A problem arises: in all of our training examples, the observed partial sequence $x_{1:t-1}$ comes from the true (empirical) training distribution (assuming we are training with teacher forcing, as is generally necessary). But when we try to generate new sequences at test time, we will be conditioning on a prefix $x_{1:t-1}$ sampled from the model itself. Since models are never perfect, this is in general a different distribution than the one we trained on. Thus we have covariate shift: the model must generate predictions in a regime where it never saw any training data.
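To make the mismatch concrete, here is a rough sketch (reusing the hypothetical `next_token_probs` stand-in from above) of the two ways a conditioning prefix can arise: sliced from a ground-truth training sequence under teacher forcing, or built up from the model's own samples at generation time:

```python
import numpy as np

def teacher_forced_prefixes(true_sequence):
    # Training: every conditioning prefix is a slice of the *true* sequence,
    # so prefixes always come from the (empirical) data distribution.
    return [true_sequence[:t] for t in range(len(true_sequence))]

def free_running_prefixes(next_token_probs, length, rng=None):
    # Generation: each prefix is built from the model's *own* samples, so any
    # imperfection in the model shifts the distribution of prefixes it sees.
    rng = rng or np.random.default_rng()
    prefix = []
    for _ in range(length):
        probs = next_token_probs(prefix)
        prefix.append(int(rng.choice(len(probs), p=probs)))
    return [prefix[:t] for t in range(length)]
```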
This kind of covariate shift, arising in sequence models, seems to most often be called exposure bias in the literature. It may lead to strange predictions, which compound as we feed them back into the model at future steps, generating even greater covariate shift and causing the generative distribution to go further and further off the rails.
We can attempt to mitigate exposure bias by changes to our training procedure, generation procedure, or both.
At generation time, we use decoding techniques, like beam search or nucleus sampling, to try to avoid 'going off the rails' as we generate sequences.
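For instance, here is a minimal sketch of nucleus (top-p) sampling; the exact thresholding details vary across implementations, so treat this as illustrative rather than a reference implementation:

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Sample from the smallest set of tokens whose total probability >= p.

    `probs` is the model's next-token distribution (1-D array summing to 1).
    Truncating the low-probability tail makes a single unlucky sample less
    likely to push the generated prefix far from anything seen in training.
    """
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                 # most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1     # smallest nucleus covering p
    nucleus = order[:cutoff]
    renormalized = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=renormalized))

# Example: with p=0.9 only the top three tokens can ever be sampled here.
print(nucleus_sample(np.array([0.5, 0.3, 0.16, 0.02, 0.02]), p=0.9))
```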
At training time, we can try to explicitly augment our training data to include examples where the model has 'gone off the rails', and teach it how to get back on track. Attempts at this include:
- The DAgger algorithm for imitation learning, which literally includes an expert in the training loop, so they can label what to do in novel situations.
- The scheduled sampling algorithm, which effectively says that the way to get back on track is just to return to the training example, even if that's an incoherent continuation of whatever came before (see the sketch after this list).
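A rough sketch of the scheduled-sampling idea (the `next_token_probs` interface and the annealing suggestion in the final comment are my own illustrative assumptions, not details taken from the paper):

```python
import numpy as np

def scheduled_sampling_prefixes(next_token_probs, true_sequence,
                                use_model_prob, rng=None):
    """Build the conditioning prefixes for one training sequence.

    At each step the previous token comes from the ground truth with
    probability 1 - use_model_prob, and is sampled from the model otherwise.
    The prediction target at step t is still the true token x_t, so the model
    is trained to steer back toward the training example even when its own
    (possibly off-distribution) samples appear in the prefix.
    """
    rng = rng or np.random.default_rng()
    prefix, prefixes = [], []
    for true_token in true_sequence:
        prefixes.append(list(prefix))    # the model conditions on this at step t
        if rng.random() < use_model_prob:
            probs = next_token_probs(prefix)
            prefix.append(int(rng.choice(len(probs), p=probs)))  # model sample
        else:
            prefix.append(true_token)    # teacher forcing
    return prefixes

# Typical usage: anneal use_model_prob upward over training, e.g. start at 0
# (pure teacher forcing) and increase it gradually across epochs.
```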
As Ferenc Huszar points out, exposure bias can be seen as a natural consequence of maximum-likelihood training, or equivalently of minimizing $\mathrm{KL}(p_{\text{data}} \,\|\, p_\theta)$. This objective prefers models that cover all the modes of the true data distribution $p_{\text{data}}$, giving high likelihood to observed sentences, even at the cost of putting some mass in locations outside the support of $p_{\text{data}}$. Methods that target other divergences, such as GANs (which optimize the Jensen-Shannon divergence $\mathrm{JS}(p_{\text{data}} \,\|\, p_\theta)$), therefore avoid the issue to some extent, because these objectives incentivize the model to never generate an implausible sequence, even at the cost of missing some modes of the training distribution.
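Writing out the two divergences makes the asymmetry explicit (using $p_{\text{data}}$ for the data distribution and $p_\theta$ for the model, as above):

$$\mathrm{KL}(p_{\text{data}} \,\|\, p_\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{p_\theta(x)}\right],$$

$$\mathrm{JS}(p_{\text{data}} \,\|\, p_\theta) = \tfrac{1}{2}\,\mathrm{KL}(p_{\text{data}} \,\|\, m) + \tfrac{1}{2}\,\mathrm{KL}(p_\theta \,\|\, m), \qquad m = \tfrac{1}{2}(p_{\text{data}} + p_\theta).$$

The forward KL blows up whenever $p_\theta(x) \approx 0$ for a sequence $x$ that $p_{\text{data}}$ can produce, so maximum likelihood is forced to spread mass over every mode; the JS divergence stays bounded in that case and, through its second term, also penalizes mass that $p_\theta$ places where $p_{\text{data}}$ has essentially none.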