teacher forcing
Created: October 13, 2022
Modified: October 13, 2022

teacher forcing

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Something that confused me for a while is that people in certain communities talk about 'teacher forcing' as though it's a trick or a hack, something that had to be invented. But it's actually just the way that maximum-likelihood training works for sequence models.

Consider training an autoregressive model of sequence data (text, audio, action sequences in reinforcement learning, etc.). Such a model has the general form

$$\log p_\theta(x_{1:T}) = \log p_\theta(x_1) + \sum_{t=1}^{T-1} \log p_\theta(x_{t+1} \mid x_{1:t})$$
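As a concrete check on the chain rule, here's a minimal sketch using a made-up toy model (for brevity it's first-order, so each conditional depends only on the previous token; the probability tables are arbitrary): the joint probability of a sequence is exactly the product of its conditional terms, i.e. the sum in log space.

```python
import math

# Hypothetical toy model over the vocabulary {0, 1}: a marginal p(x_1)
# and a first-order conditional p(x_{t+1} | x_t), both made-up tables.
p_first = {0: 0.6, 1: 0.4}
p_next = {
    0: {0: 0.7, 1: 0.3},  # p(x_{t+1} | x_t = 0)
    1: {0: 0.2, 1: 0.8},  # p(x_{t+1} | x_t = 1)
}

def log_prob(seq):
    """Chain rule: log p(x_{1:T}) = log p(x_1) + sum_t log p(x_{t+1} | x_t)."""
    lp = math.log(p_first[seq[0]])
    for t in range(len(seq) - 1):
        lp += math.log(p_next[seq[t]][seq[t + 1]])
    return lp

# The joint probability of (0, 1, 1) computed directly is 0.6 * 0.3 * 0.8.
assert abs(math.exp(log_prob((0, 1, 1))) - 0.6 * 0.3 * 0.8) < 1e-12
```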

corresponding to the chain rule of probability. The general goal of training is to find parameters $\theta$ that maximize the likelihood of a set of training sequences $x^{(1)}, \ldots, x^{(n)}$.

As seen in the formula above, the log-likelihood of a training sequence is simply a sum of terms: for each step $t$ of the sequence, we compute the conditional log-likelihood $\log p_\theta(x_{t+1} \mid x_{1:t})$ of the token $x_{t+1}$ given all previous tokens. To evaluate the likelihood of a training sequence, we simply evaluate each term, using the relevant parts of the training sequence. Using the training data as the conditioning information $x_{1:t}$ is sometimes called 'teacher forcing', but here we see that it's simply what falls out of optimizing the maximum-likelihood objective for a sequence model. It's actually not obvious what else you would do.
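To make the 'teacher forcing' computation explicit, here's a minimal numpy sketch. Everything here is hypothetical: the model is just random weights mapping the previous token's embedding to next-token logits, and the function names are my own. The point is only that the conditioning prefix at each step comes from the training sequence itself, not from the model's own samples.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 5, 8

# Made-up toy autoregressive model: embed the previous token, apply a
# linear layer, softmax over the vocabulary. Weights are random.
embed = rng.normal(size=(VOCAB, DIM))
W = rng.normal(size=(DIM, VOCAB))

def log_softmax(z):
    z = z - z.max()  # numerical stability
    return z - np.log(np.exp(z).sum())

def teacher_forced_log_lik(seq, log_p_first):
    """log p(x_1) + sum_t log p(x_{t+1} | x_{1:t}).

    'Teacher forcing' just means that the prefix we condition on at each
    step is the true training data, not model output. (This toy model
    conditions only on the previous token, for brevity.)
    """
    lp = log_p_first[seq[0]]
    for t in range(len(seq) - 1):
        logits = embed[seq[t]] @ W  # condition on the *true* token x_t
        lp += log_softmax(logits)[seq[t + 1]]
    return lp

# Evaluate the log-likelihood of a training sequence under a uniform
# marginal for the first token.
ll = teacher_forced_log_lik((2, 0, 4, 1), np.log(np.full(VOCAB, 1.0 / VOCAB)))
```

Maximizing this quantity over the weights (here `embed` and `W`) is exactly maximum-likelihood training; no extra mechanism is involved.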