Created: October 13, 2022
Modified: October 13, 2022

scheduled sampling

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Scheduled sampling is a training procedure for sequence models that attempts to mitigate exposure bias: the problem that generation requires the model to continue sequences that may lie outside its training distribution, producing weird results that drift even further from the training distribution and lead to compounding errors.

The idea, introduced by Bengio, Vinyals, Jaitly, and Shazeer (2015), is to stochastically replace the $t$th autoregressive likelihood term $\log p(x_{t+1} | x_{1:t})$ for the training sequence $x_{1:T}$ with a modified term $\log p(x_{t+1} | \tilde{x}_{1:t})$, where

$$\tilde{x}_k = \begin{cases} x_k \text{ (training token)} & \text{with probability } \epsilon_i \\ \hat{x}_k \text{ (model-predicted token)} & \text{otherwise} \end{cases}$$

so that we flip a coin at each point in the sequence to determine whether to use the original training data or fill in a token from the model. A probability of $\epsilon_i = 1$ corresponds to the original (maximum likelihood aka teacher forcing) objective. The probability $\epsilon_i$ is decreased during training, following a 'curriculum' or 'schedule', thus giving rise to the name.
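To make the coin-flip concrete, here is a minimal sketch (not the authors' code) of a scheduled-sampling training step for a toy autoregressive GRU language model in PyTorch. The model, vocabulary size, and linear decay schedule are all illustrative assumptions.

```python
# Minimal scheduled-sampling sketch for a toy autoregressive GRU language model.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim, hidden_dim = 100, 32, 64
embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.GRUCell(embed_dim, hidden_dim)
readout = nn.Linear(hidden_dim, vocab_size)

def scheduled_sampling_loss(x, epsilon):
    """x: LongTensor [batch, T]; epsilon: probability of feeding the true token."""
    batch, T = x.shape
    h = torch.zeros(batch, hidden_dim)
    prev = x[:, 0]                      # condition on the true first token
    loss = 0.0
    for t in range(T - 1):
        h = rnn(embed(prev), h)
        logits = readout(h)
        # Always target the *observed* next token, whatever history was fed in.
        loss = loss + F.cross_entropy(logits, x[:, t + 1])
        # Coin flip: use the training token with probability epsilon,
        # otherwise feed the model's own sampled token.
        model_token = torch.distributions.Categorical(logits=logits).sample()
        use_true = torch.rand(batch) < epsilon
        prev = torch.where(use_true, x[:, t + 1], model_token)
    return loss / (T - 1)

# Linear decay schedule: epsilon_i goes from 1 (teacher forcing) toward 0.
num_steps = 1000
for i in range(num_steps):
    epsilon_i = max(0.0, 1.0 - i / num_steps)
    batch = torch.randint(vocab_size, (8, 20))   # stand-in for real training data
    loss = scheduled_sampling_loss(batch, epsilon_i)
    # loss.backward(); optimizer.step() ...
```

The paper also proposes exponential and inverse-sigmoid decay schedules; the linear decay above is just the simplest choice.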

This procedure seems quite weird: we always target the next token of the observed training sequence, even when we use a model-sampled history which may not fit that token at all. Thus we are really training the model to ignore the history and simply generate the marginal distribution of observed tokens at each timestep. As Ferenc Huszar points out, at probabilities $\epsilon < 1$ this procedure is inconsistent.

In diffusion models

A variant of the approach, proposed for diffusion models by Deng, Kojima, and Rush (2022), is interesting. Diffusion models are sequence models over latent (noisy) data, but ultimately we care only about the final generated image. This is a case where there really is a coherent notion of how to 'get back on track' if the decoding goes off the rails.

For diffusion models the train/test mismatch is that we train the model to denoise partially-corrupted real-world images --- samples from the partial encoder $q(x_t | x_0)$, which just adds Gaussian noise to the training image $x_0$ --- but at test time it's asked to denoise partially-generated samples from the model $p(x_t)$. The proposal is to train the model to denoise samples that have already been partially denoised: at training time we generate an image at a particular noise level by first adding too much noise, then using the model to denoise it back down to the desired level. In a discrete-time model, a single step of this overshooting-and-backtracking process samples from the distribution

$$\tilde{p}(x_t | x_0) = \int p(x_t | x_{t+1})\, q(x_{t+1} | x_0)\, dx_{t+1}.$$

In principle, we should backpropagate the gradient from denoising this sample into the parameters of the backtracking step $p(x_t | x_{t+1})$, but they omit this term for simplicity.
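A minimal sketch of this variant, assuming a standard DDPM noise schedule and an epsilon-prediction network (`eps_model` is a hypothetical stand-in, not the paper's architecture): overshoot to level $t+1$ with $q$, take one model denoising step back to level $t$, detach the result, and regress against the noise implied by the real image $x_0$ at level $t$.

```python
# Sketch of scheduled sampling for a DDPM-style diffusion model.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def q_sample(x0, t, noise):
    """Sample from q(x_t | x0): add Gaussian noise to the clean image in closed form."""
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

def backtracked_sample(eps_model, x0, t):
    """Sample from p~(x_t | x0): overshoot to level t+1 with q, then take one
    reverse (denoising) step with the model back down to level t (0 <= t <= T-2)."""
    x_tp1 = q_sample(x0, t + 1, torch.randn_like(x0))
    eps_hat = eps_model(x_tp1, t + 1)
    # DDPM reverse-step mean under the epsilon parameterization.
    mean = (x_tp1 - betas[t + 1] / (1 - alpha_bars[t + 1]).sqrt() * eps_hat) \
           / alphas[t + 1].sqrt()
    x_t = mean + betas[t + 1].sqrt() * torch.randn_like(x0)
    # Omit the gradient through the backtracking step, as described above.
    return x_t.detach()

def scheduled_sampling_diffusion_loss(eps_model, x0, t):
    """Denoising loss at level t, evaluated on a partially model-denoised sample
    but still targeting the real training image x0."""
    x_t = backtracked_sample(eps_model, x0, t)
    # Noise that x_t implies was added to the *real* x0 at level t.
    eps_target = (x_t - alpha_bars[t].sqrt() * x0) / (1 - alpha_bars[t]).sqrt()
    return ((eps_model(x_t, t) - eps_target) ** 2).mean()

# Usage with a stand-in network (a real model would be e.g. a U-Net conditioned on t):
eps_model = lambda x, t: torch.zeros_like(x)   # hypothetical placeholder
x0 = torch.randn(4, 3, 8, 8)                   # fake batch of "images"
loss = scheduled_sampling_diffusion_loss(eps_model, x0, t=500)
```

Regressing on the implied noise here is equivalent (up to weighting) to regressing the model's denoised estimate onto the real $x_0$, which matches the spirit of always targeting the observed data.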

Is this objective consistent? We could ask this in a strong sense: will a model trained this way