decoding: Nonlinear Function
Created: July 02, 2022
Modified: July 02, 2022

decoding

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

References:

How should we actually generate text from a neural language (sequence) model?

Formally, suppose we have an autoregressive model that represents the probability of a sequence $x = x_1, \ldots, x_N$ as a product of conditional probabilities:

$$p(x) = \prod_{n=1}^N p(x_n \mid x_{:n-1})$$

We'll treat these conditional probabilities as given, fixed by the model and any prompt text we may have provided (without loss of generality, we can fold the effect of a prompt into the conditional probabilities).

The obvious strategy is pure sampling: we simply sample from $p(x)$ by sampling from each conditional distribution in succession. This would be the right thing to do if we had access to the 'true' distribution of natural language $p_\text{human}(x)$ (supposing for the moment that such a thing exists[1]). But in practice it often generates incoherent text. Why would this be?
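For concreteness, here is a minimal sketch of pure (ancestral) sampling. It assumes a hypothetical `next_token_logits(prefix)` function that returns the model's unnormalized scores over the vocabulary given the tokens so far, plus an end-of-sequence id; both are stand-ins, not any particular library's API.

```python
import numpy as np

def sample_sequence(next_token_logits, eos_id, max_len=100, rng=None):
    """Pure (ancestral) sampling: draw each token from the model's full
    conditional distribution p(x_n | x_{:n-1}) until EOS or max_len."""
    rng = rng or np.random.default_rng()
    tokens = []
    for _ in range(max_len):
        logits = next_token_logits(tokens)     # hypothetical model call
        probs = np.exp(logits - logits.max())  # softmax, shifted for stability
        probs /= probs.sum()
        token = int(rng.choice(len(probs), p=probs))
        if token == eos_id:
            break
        tokens.append(token)
    return tokens
```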

The basic problem is that all models are wrong. In particular, our model $p(x)$ will inevitably sample a token that a human wouldn't have produced. Then things go off the rails, because the model is now operating outside of its training regime, so future productions can be arbitrarily weird (this is related to the motivation for exposure bias at training time, and is an instance of the more general issue that comes up in behavioral cloning). If the model is decently calibrated, it should have more uncertainty about productions in these circumstances, so each successive token will be sampled from a higher-entropy distribution than it would be within the training domain, producing samples with higher perplexity than human-generated text.

How can we keep the model 'on the rails'? One intuition is that we're more likely to find errors in the tails of the distribution, because low-probability outcomes are observed less often in the training data. So maybe we should focus on higher-probability outcomes.

The extreme version of this is maximum-likelihood decoding, where we try to find the single most probable x under $p(x)$. This is usually not feasible to do exactly, since the space of possible sequences is combinatorially large, but we can approximate it with algorithms such as greedy decoding or beam search. In practice, however, the results are often repetitive and have much lower perplexity than human-generated text.
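As a contrast with the sampling sketch above, here is greedy decoding under the same hypothetical `next_token_logits` interface: at each step we take the argmax instead of a random draw. Beam search generalizes this by keeping the top $B$ partial sequences rather than just one.

```python
import numpy as np

def greedy_decode(next_token_logits, eos_id, max_len=100):
    """Greedy approximation to maximum-likelihood decoding: at each step,
    commit to the single most probable next token."""
    tokens = []
    for _ in range(max_len):
        logits = next_token_logits(tokens)  # hypothetical model call, as above
        token = int(np.argmax(logits))      # argmax instead of sampling
        if token == eos_id:
            break
        tokens.append(token)
    return tokens
```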

The nucleus sampling paper observes that human-written text is generally not the most probable text, because language needs to be surprising in order to convey information. Of course, the true model $p_\text{human}(x)$ would take this into account, so that the maximum-likelihood sample would be reasonable, but empirically this doesn't carry over to actual models. Presumably the maximization will tend to exploit any imperfections in our model.

Perplexity as a goal: the nucleus sampling paper also claims that we want generations that match the perplexity of human-generated text under the model. Certainly, matching the distribution of human-generated text should mean matching all statistics of that distribution, of which perplexity under any particular model is just one example.
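For concreteness, the perplexity of a sequence $x$ under the model is the exponentiated average negative log-probability per token:

$$\text{PPL}(x) = \exp\left( -\frac{1}{N} \sum_{n=1}^N \log p(x_n \mid x_{:n-1}) \right)$$

so 'matching perplexity' means choosing a decoding rule whose outputs score about as probable per token, under the model, as human-written text does.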

Some tools we can use to calibrate perplexity:

Temperature scaling: divide the logits by a temperature $T$ before the softmax; $T < 1$ concentrates mass on high-probability tokens (lower entropy), while $T > 1$ flattens the distribution.

Top-$k$ sampling: restrict sampling to the $k$ most probable next tokens and renormalize.

Reranking: generate several candidate sequences and rerank them by some score (e.g., length-normalized log-probability).

Nucleus sampling: introduced by Holtzman et al., 2020. Instead of using a fixed top $k$, we restrict to however many of the most-likely possibilities we need to reach some cumulative probability $p$. For example, given $p=0.9$, if some single word has probability $0.9$ we would choose that word deterministically, while if there are 90 words each with probability $0.01$ (plus a tail of lower-probability words), we would sample from that set. (A sketch combining these truncation rules follows below.)
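A minimal NumPy sketch of how these truncation rules can combine at a single decoding step; the function and parameter names here are illustrative, not any library's API:

```python
import numpy as np

def truncated_sample(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample one token after temperature scaling and optional
    top-k and/or nucleus (top-p) truncation of the distribution."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature  # temperature scaling
    probs = np.exp(logits - logits.max())                   # softmax
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]      # token ids, most probable first
    keep = np.ones(len(probs), dtype=bool)

    if top_k is not None:
        keep[order[top_k:]] = False      # keep only the k most probable tokens

    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # smallest prefix of the sorted tokens whose total mass reaches top_p
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[order[cutoff:]] = False

    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()                 # renormalize over the kept set
    return int(rng.choice(len(probs), p=probs))
```

For example, `truncated_sample(logits, temperature=0.8, top_p=0.9)` sharpens the distribution slightly and then samples from the smallest set of tokens covering 90% of the resulting probability mass; with a single token of probability $0.9$, that set is a singleton, matching the example above.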

Q: is there a 'correct' approach to drawing samples under model mismatch? I'm not sure what the right formalism would be. But: if nucleus sampling is the answer, what is the question?

[1]: Is the true distribution of natural language even a well-defined quantity? Even supposing we had a dataset of all language that humans have ever produced, spoken or written, this would cover only a small fraction of the language that could be produced. To generalize fully we would need to model the actual data-generating mechanism, but the true data-generating mechanism is just human life, including the memories, conceptual frameworks, emotions, goals, plans, and relationships of all humans, as well as the state of the physical world that they inhabit. All models are wrong, including language models, because the only way to be 'right' would be to have a map as big as the territory. But maybe that territory projects onto a true set of conditional probabilities for language, so that a conceptually valid distribution $p_\text{human}$ exists even if we can't learn it?