teacher forcing
Created: October 13, 2022
Modified: October 13, 2022

teacher forcing

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Something that confused me for a while is that people in certain communities talk about 'teacher forcing' as though it's a trick or a hack, something that had to be invented. But it's actually just the way that maximum-likelihood training works for sequence models.

Consider training an autoregressive model of sequence data (text, audio, action sequences in reinforcement learning, etc.). Such a model has the general form

$$\log p_\theta(x_{1:T}) = \log p_\theta(x_1) + \sum_{t=1}^{T-1} \log p_\theta(x_{t+1} \mid x_{1:t})$$
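As a concrete check on the chain rule, here's a minimal sketch using a made-up toy model (for brevity it's first-order, so each conditional depends only on the previous token; the probability tables are arbitrary): the joint probability of a sequence is exactly the product of its conditional terms, i.e. the sum in log space.

```python
import math

# Hypothetical toy model over the vocabulary {0, 1}: a marginal p(x_1)
# and a first-order conditional p(x_{t+1} | x_t), both made-up tables.
p_first = {0: 0.6, 1: 0.4}
p_next = {
    0: {0: 0.7, 1: 0.3},  # p(x_{t+1} | x_t = 0)
    1: {0: 0.2, 1: 0.8},  # p(x_{t+1} | x_t = 1)
}

def log_prob(seq):
    """Chain rule: log p(x_{1:T}) = log p(x_1) + sum_t log p(x_{t+1} | x_t)."""
    lp = math.log(p_first[seq[0]])
    for t in range(len(seq) - 1):
        lp += math.log(p_next[seq[t]][seq[t + 1]])
    return lp

# The joint probability of (0, 1, 1) computed directly is 0.6 * 0.3 * 0.8.
assert abs(math.exp(log_prob((0, 1, 1))) - 0.6 * 0.3 * 0.8) < 1e-12
```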

corresponding to the chain rule of probability. The general goal of training is to find parameters $\theta$ that maximize the likelihood of a set of training sequences $x^{(1)}, \ldots, x^{(n)}$.

As seen in the formula above, the log-likelihood of a training sequence is simply a sum of terms: for each step $t$ of the sequence, we compute the conditional log-likelihood $\log p_\theta(x_{t+1} \mid x_{1:t})$ of the token $x_{t+1}$ given all previous tokens. To evaluate the likelihood of a training sequence, we simply evaluate each term, using the relevant parts of the training sequence. Using the training data as the conditioning information $x_{1:t}$ is sometimes called 'teacher forcing', but here we see that it's simply what falls out of optimizing the maximum-likelihood objective for a sequence model. It's actually not obvious what else you would do.
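To make the 'teacher forcing' computation explicit, here's a minimal numpy sketch. Everything here is hypothetical: the model is just random weights mapping the previous token's embedding to next-token logits, and the function names are my own. The point is only that the conditioning prefix at each step comes from the training sequence itself, not from the model's own samples.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 5, 8

# Made-up toy autoregressive model: embed the previous token, apply a
# linear layer, softmax over the vocabulary. Weights are random.
embed = rng.normal(size=(VOCAB, DIM))
W = rng.normal(size=(DIM, VOCAB))

def log_softmax(z):
    z = z - z.max()  # numerical stability
    return z - np.log(np.exp(z).sum())

def teacher_forced_log_lik(seq, log_p_first):
    """log p(x_1) + sum_t log p(x_{t+1} | x_{1:t}).

    'Teacher forcing' just means that the prefix we condition on at each
    step is the true training data, not model output. (This toy model
    conditions only on the previous token, for brevity.)
    """
    lp = log_p_first[seq[0]]
    for t in range(len(seq) - 1):
        logits = embed[seq[t]] @ W  # condition on the *true* token x_t
        lp += log_softmax(logits)[seq[t + 1]]
    return lp

# Evaluate the log-likelihood of a training sequence under a uniform
# marginal for the first token.
ll = teacher_forced_log_lik((2, 0, 4, 1), np.log(np.full(VOCAB, 1.0 / VOCAB)))
```

Maximizing this quantity over the weights (here `embed` and `W`) is exactly maximum-likelihood training; no extra mechanism is involved.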