tokenize
Created: October 07, 2022
Modified: February 13, 2023

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

How should a machine learning model represent text? Word-level and character-level features are obvious options, but both have drawbacks.

Using words as atomic elements is attractive since words have inherent semantics. The downsides are that this leads to a large vocabulary size and inhibits generalization, since even closely related forms (help, helps, helped) will be seen by the model as totally different tokens, each with its own meaning that must be learned from scratch. At test-time, words that didn't appear in the training corpus must be represented by an unknown token UNK, which provides no useful information.

On the other hand, a character-level representation allows the model to 'see deeper' into the text, potentially generalizing across related words by associating meanings with shared substrings. And there's no such thing as an unknown character: the set of possible characters is fixed, and much smaller than the set of possible words. The downsides are that individual characters in Western alphabets don't carry semantics of their own, and that using characters as tokens requires many more tokens to represent a given text.

Modern models tend to use approaches that are somewhere in the middle.

Byte-pair encoding (BPE) is the tokenization method used by GPT and related models. BPE tends to learn chunks corresponding to common subwords.

To 'train' a BPE tokenizer:

  1. Initialize the vocabulary to include all bytes that appear in the training corpus. These may represent individual characters (ASCII) or even parts of characters (multi-byte UTF-8 sequences).
  2. 'Pre-tokenize' the corpus by splitting it into individual words. This will reduce the complexity from O(n^2) in the corpus size to O(n^2) in the length of the longest word, which is dramatically better. The cost is that BPE will learn 'at most' a word-level tokenization, which is not a big deal.
  3. Find the sequential pair of tokens from the current vocabulary that occurs most often in the list of training words, and add the merged pair to the vocabulary. For example, if this is the pair ['t', 'h'], we'd add a new merged token th to the vocabulary.
  4. Repeat until the desired number of merges has occurred, or until we hit a desired vocabulary size (note that the vocabulary size may increase, decrease, or stay the same after a merge, depending on whether the component tokens occur individually in the corpus).

(note that here and below I'm just trying to give a conceptual account of the tokenization; optimized implementations may structure their computation differently from what I describe).
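
To make this concrete, here's a minimal character-level sketch of the training loop (whitespace pre-tokenization only, no byte-level handling); the function and variable names are mine and don't correspond to any particular library:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE trainer over a whitespace pre-tokenized corpus (conceptual sketch)."""
    word_counts = Counter(corpus.split())
    # Represent each distinct word as a tuple of single-character tokens.
    words = {w: tuple(w) for w in word_counts}

    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for w, toks in words.items():
            for pair in zip(toks, toks[1:]):
                pair_counts[pair] += word_counts[w]
        if not pair_counts:
            break  # every word is already a single token
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the new merge rule to every word.
        new_words = {}
        for w, toks in words.items():
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            new_words[w] = tuple(out)
        words = new_words
    return merges

print(train_bpe("the thin thing that helped", num_merges=3))
# [('t', 'h'), ('th', 'i'), ('thi', 'n')] on this toy corpus (ties broken by insertion order)
```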

There are a few ways of thinking about encoding new text using the trained token set, corresponding to different forms of greedy algorithms:

  • Recursively apply the merge rules learned during training, in the order that they were learned. (thus the test-time parsing matches the training)
  • Recursively pick out the longest possible token that occurs as a substring of each word.
  • Recursively pick out the longest possible token that occurs as a prefix of each word.
  • etc.

The encoding problem is actually ambiguous. Suppose we want to encode the string penisland using the vocabulary ['pen', 'island', 'penis', 'land']; depending on how we do it we might end up with penis-land or pen-island, which have very different semantics.
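
A toy illustration of how the choice of greedy strategy changes the outcome on exactly this example (the helper functions are mine; a real tokenizer would also need single-character or byte fallbacks so that encoding never fails):

```python
VOCAB = {'pen', 'island', 'penis', 'land'}

def encode_longest_prefix(text, vocab):
    """Greedy: repeatedly strip the longest vocab token that prefixes the remaining text."""
    tokens = []
    while text:
        match = max((t for t in vocab if text.startswith(t)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token matches a prefix of {text!r}")
        tokens.append(match)
        text = text[len(match):]
    return tokens

def encode_longest_substring(text, vocab):
    """Greedy: split on the longest vocab token found anywhere, then recurse on both sides."""
    if not text:
        return []
    match = max((t for t in vocab if t in text), key=len, default=None)
    if match is None:
        raise ValueError(f"no token occurs in {text!r}")
    left, _, right = text.partition(match)
    return encode_longest_substring(left, vocab) + [match] + encode_longest_substring(right, vocab)

print(encode_longest_prefix('penisland', VOCAB))     # ['penis', 'land']
print(encode_longest_substring('penisland', VOCAB))  # ['pen', 'island']
```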

Unigram LM tokenization. This method is a bit more principled than byte-pair encoding; there is some recent evidence (Kudo, 2018; Bostrom and Durrett, 2020) that its tokenizations are more linguistically sound and perform better on downstream tasks.

To train the tokenizer:

  1. After pre-tokenizing the corpus into individual words, initialize the vocabulary to contain all substrings of all words that exist in the corpus. For example, if the corpus consisted only of the words ['to', 'be'], we'd initialize the vocabulary as ['t', 'o', 'b', 'e', 'to', 'be']. We also keep track of the frequency of each token in the corpus, so we can compute probabilities (this is the 'unigram language model').
  2. Compute the probability of each word in the corpus under the current vocabulary and token probabilities. Note that a word may have multiple segmentations, but we can marginalize over these with the HMM forward algorithm (see the sketch after this list) to get an overall probability. Multiply these word probabilities (weighted by word frequency) to get the probability of the whole corpus.
  3. For each token, also compute the probability of the corpus if that token were removed from the vocabulary. Then remove the token whose removal decreases the corpus probability by the smallest amount (in practice, you'd speed things up by removing the top K such tokens).
  4. Repeat until reaching the target vocabulary size.
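
Here's the marginalization sketch referenced in step 2: a forward recursion over segmentation boundaries within a single word. The token probabilities below are made up for illustration:

```python
def word_marginal_prob(word, token_probs):
    """Sum the probability of all segmentations of `word` under a unigram model
    (the forward recursion). `token_probs` maps token -> probability."""
    n = len(word)
    # alpha[i] = total probability of all segmentations of word[:i]
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for i in range(1, n + 1):
        for j in range(i):
            tok = word[j:i]
            if tok in token_probs:
                alpha[i] += alpha[j] * token_probs[tok]
    return alpha[n]

# Example with made-up probabilities: P('to') = P(t)P(o) + P(to), summing over both segmentations.
probs = {'t': 0.2, 'o': 0.2, 'b': 0.2, 'e': 0.2, 'to': 0.1, 'be': 0.1}
print(word_marginal_prob('to', probs))  # 0.14, up to float rounding
# Corpus log-probability would then be sum(count[w] * log(word_marginal_prob(w, probs))) over words.
```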

Given a trained vocabulary (including probabilities for each token), we encode new text using the Viterbi algorithm to find the highest-probability token sequence, which is in general uniquely defined (outside of toy cases).
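
And a matching sketch of Viterbi decoding over the same kind of unigram model (again with made-up probabilities):

```python
import math

def viterbi_encode(word, token_probs):
    """Find the single highest-probability segmentation of `word` under a unigram model."""
    n = len(word)
    best_logp = [-math.inf] * (n + 1)   # best log-prob of any segmentation of word[:i]
    best_logp[0] = 0.0
    back = [0] * (n + 1)                # start index of the last token on the best path to i
    for i in range(1, n + 1):
        for j in range(i):
            tok = word[j:i]
            if tok in token_probs:
                score = best_logp[j] + math.log(token_probs[tok])
                if score > best_logp[i]:
                    best_logp[i], back[i] = score, j
    if best_logp[n] == -math.inf:
        raise ValueError("word cannot be segmented with this vocabulary")
    # Trace back the best segmentation.
    tokens, i = [], n
    while i > 0:
        j = back[i]
        tokens.append(word[j:i])
        i = j
    return tokens[::-1]

print(viterbi_encode('to', {'t': 0.2, 'o': 0.2, 'to': 0.1}))  # ['to'], since 0.1 > 0.2 * 0.2
```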

Thoughts: it's a little unsatisfying that tokenization is a separate step from the model itself. Even a 'good' tokenization still creates semantic ambiguity; it commits to one of potentially several possible views of a word, usually without the benefit of context.

From a statistical standpoint, ideally we'd present the model with several tokenizations (even up to a word-level view and down to a character-level view) and let the model itself decide which of these to attend to. This is probably infeasible in practice, because with soft attention you still pay the cost for modeling all of these tokenizations (at which point you might as well just stick to the character-level model); meanwhile, hard attention sacrifices both gradients and batchability.

research idea: given infinite compute, with vanilla transformers, are character-level models statistically optimal? Or is there a clever way to architect the lower layers to 'build in' a prior for a tokenizer-like structure that turns out to be statistically helpful?