Modified: October 13, 2022
teacher forcing
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Something that confused me for a while is that people in certain communities talk about 'teacher forcing' as though it's a trick or a hack, something that had to be invented. But it's actually just the way that maximum-likelihood training works for sequence models.
Consider training an autoregressive model of sequence data (text, audio, action sequences in reinforcement learning, etc.). Such a model has the general form

$$p_\theta(x_1, \ldots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \ldots, x_{t-1}),$$

corresponding to the chain rule of probability. The general goal of training is to find parameters $\theta$ that assign maximum likelihood to a set of training sequences $\{x^{(i)}\}_{i=1}^N$, i.e., that maximize

$$\sum_{i=1}^{N} \log p_\theta\big(x^{(i)}\big) = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \log p_\theta\big(x^{(i)}_t \mid x^{(i)}_{<t}\big).$$
As seen in the formula above, the log-likelihood of a training sequence is simply a sum of terms: for each step of the sequence, we compute the conditional log-likelihood of that token given all previous tokens. To evaluate the likelihood of a training sequence, we simply evaluate each term, using the ground-truth prefix from the training sequence as the conditioning context. Using the training data as the conditioning information is sometimes called 'teacher forcing', but here we see that it's simply what falls out of optimizing the maximum-likelihood objective for a sequence model. It's actually not obvious what else you would do.
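To make the point concrete, here is a minimal sketch (Python/NumPy; the toy model and the names `next_token_probs` and `W` are stand-ins I made up for illustration, not any particular library's API) of how the log-likelihood of one training sequence is evaluated. The only thing that could be called 'teacher forcing' here is that the conditioning context at each step is taken from the training sequence itself:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 5

# Stand-in autoregressive model, purely for illustration: a softmax over
# logits that depend only on the most recent token of the prefix (row 0 of W
# handles the empty prefix). In practice this would be an RNN or Transformer.
W = rng.normal(size=(VOCAB_SIZE + 1, VOCAB_SIZE))

def next_token_probs(prefix: list[int]) -> np.ndarray:
    last = prefix[-1] + 1 if prefix else 0
    logits = W[last]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def sequence_log_likelihood(tokens: list[int]) -> float:
    """Teacher-forced log-likelihood: each step is conditioned on the
    ground-truth prefix from the training sequence, never on samples
    drawn from the model itself."""
    total = 0.0
    for t, token in enumerate(tokens):
        probs = next_token_probs(tokens[:t])   # condition on x_{<t} from the data
        total += np.log(probs[token])          # add log p_theta(x_t | x_{<t})
    return total

# Evaluate one 'training' sequence of token ids.
print(sequence_log_likelihood([2, 0, 3, 3, 1]))
```

Training just sums this quantity over the dataset and follows its gradient; at no point does the model's own output get fed back in as conditioning context.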