Anatomy of a Tokenizer from First Principles

Why Transformers need integers, and how Unicode, words, UTF-8 bytes, and Byte Pair Encoding each solve one piece of the problem.

Abstract technical illustration of glyph fragments and data cells flowing through a tokenizer pipeline.

The core technology behind almost every Large Language Model is the Transformer, a deep learning architecture introduced in 2017 that kickstarted the modern AI revolution we see today. What you may be surprised to learn is that Transformers do not operate on text. It operates on sequences of integers.

Here’s an analogy. Imagine you have this extraordinarily smart friend who only understands Mandarin. You, on the other hand, only understand English. You would like to ask your smart friend questions. To do this, you first need a way to translate your English into Mandarin (and their answers back into English).

Now, instead of a smart friend, you have a Transformer, and instead of Mandarin, the only language it understands is sequences of integers. To ask it a question, you need something that turns your text into integers. That something is a tokenizer. A tokenizer turns text into a sequence of integers, where each integer is the token ID of a token.

Text passing through a tokenizer and becoming a sequence of integers. — Text converted into token IDs

It is important to understand that the integers in the sequence are understood to be labels rather than quantities. It helps to think of the sequence less as a list of numbers and more as a language the Transformer understands. Like any language, the token’s meaning comes from position in the sequence and the tokens around it, not from the magnitude of the integer itself.

Every mapping from text to sequence of integers must satisfy two properties. First, the mapping must be deterministic: the same text must always produce the same sequence of integers. Second, it must be injective: no two different texts may map to the same sequence of integers. Injectivity is what lets us reverse the process and decode the model’s integer outputs back into text without ambiguity.

This gives us a precise question: what is the best way to build a deterministic, injective mapping from text to sequence of integers?

Through answering this question, we are able to build the modern tokenizer from first principles. We’ll start with the most basic idea, examine its drawbacks, and keep on iterating. Through this process, you’ll be able to understand why every component exists (and what breaks when we take it away).

Attempt #1: Just use Unicode

The problem is to come up with a deterministic, injective mapping from text to sequence of integers. It turns out, text is made up of characters, and every character already has a number. This is Unicode. It basically is a giant lookup table that assigns a unique integer (a code point) to essentially every character humanity has ever written down, from Latin letters to Chinese hanzi to even emojis.

The lowercase letter c is code point 99, a is 97, and t is 116. Converting text to integers becomes very easy: look each character up.

The word cat being converted into the integers 99, 97, 116. — Unicode code points for `cat`

Are we done? Unfortunately, no.

The hidden cost: sequence length

A Transformer’s attention mechanism compares every position in the sequence with every other position, meaning the work scales with the square of sequence length. A sequence twice as long is roughly four times as expensive. Therefore, we care a lot about compression.

Below is an ordinary sentence, tokenized the Unicode way.

An English sentence tokenized into 44 Unicode code points. — Character-level tokenization of a sentence

Notice that the model pays a full token for every space, comma, and character it sees. This is terrible! English text is drenched in repetition. The fragments ing, tion, the, pre, show up constantly, and Unicode tokenization has no way to take advantage of that. To Unicode, the is just three unrelated numbers that happen to sit next to each other a lot.

Unicode solves representation, but it does not solve compression. Sequences become punishingly long, making training these models and running inference expensive.

The fix seems obvious. If one token per character is too fine-grained, use a bigger unit. Don’t tokenize letters, tokenize words.

Attempt #2: Give every word its own integer

Maybe we should have started here in the first place. After all, we think in words, not characters. To create a word tokenizer, we build a dictionary: give the the integer 0, give quick the integer 1, fox the integer 2, and so on for every possible word (again, what integer we map every individual word to does not actually matter, as long as the mapping is deterministic and injective). To tokenize, look up each word and replace it with its corresponding integer.

The same sentence tokenized into 10 word tokens instead of 44 characters. — Word-level tokenization of a sentence

Wow. Instead of spending forty-four tokens, we need to spend only ten. We probably just saved a small nuclear reactor worth of energy. Are we done?

No. In fact, there are a couple of problems, one of which is fatal.

Problem one: frequent and rare cost exactly the same

In our word vocabulary, every word is exactly one token, no matter how often it appears. the, which turns up in nearly every sentence ever written, gets one slot. syzygy, a word that I didn’t even know existed before writing this blog, also gets one slot.

This makes the vocabulary (the possible integers that can appear in the sequence) enormous and sparse. To cover a language, you’d need hundreds of thousands of words. Language is also long-tailed: most words are rare, so our vocabulary ends up holding words the model almost never sees and therefore barely learns. We end up spending the same representational budget on things we encounter a million times and things we encounter once.

Also, notice what isn’t shared. run, runs, running, and runner become four unrelated integers, despite sharing a root. The vocabulary can’t utilize this structure.

Problem two: the vocabulary is not exhaustive

Language is open-ended, with new words coined constantly: rizz, mogging, looksmaxxing. No matter how exhaustive you try to make your word vocabulary, there will exist a sentence that contains a word not in your vocabulary.

A new word that has no entry in the word vocabulary, producing an out-of-vocabulary error. — Out-of-vocabulary word lookup

When this happens, it really sucks. A word with no entry in the table is out-of-vocabulary. It is kind of like how there are some words in Mandarin with no equivalent English translation. You might think that you can just create a new entry in your vocabulary for new words, but tokenizers are frozen the moment you ship it. That is not a feasible solution.

This problem is really what kills the possibility of using word tokenizers. It is worth noting that Unicode tokenizers have a milder version of this problem, where new emojis and scripts are added every year, leading to out-of-vocabulary characters.

How do we solve the problem of universality, where every possible character or word is covered in our vocabulary?

Attempt #3: Bytes

How do we solve the problem of representing every possible character or word, even with additional characters or words being added in the future? It turns out this problem has already been solved. After all, computers are able to represent every possible character or word (even ones that have not been created yet), and the way they do that is by using bytes.

UTF-8: characters as bytes

A byte encoding is a way of writing a character down as a sequence of bytes. UTF-8 is the dominant one: a set of rules that maps each character to a sequence of bytes. (It does not replace Unicode. Unicode still assigns the numbers, and UTF-8 only decides how to write them down.) It is variable-width, so a character’s code point is encoded in one to four bytes, with the common cases kept small.

A table showing UTF-8 encoding the characters A, e-acute, and the euro sign as one, two, and three bytes. — UTF-8 byte encoding examples

A character can be more than one token

The byte tokenizer has a vocabulary of exactly 256 entries: one token ID for each byte value, 0 through 255. This means that when a character takes more than one byte, it is represented by a sequence of byte tokens.

Take the euro sign, €. In UTF-8 it is three bytes, E2 82 AC, which is 226, 130, 172 in decimal. A byte-level tokenizer turns € into the token sequence [226, 130, 172].

Notice that this mapping is still deterministic and injective. No other character maps to this token sequence, meaning the sequence [226, 130, 172] is an exact, reversible encoding of €. Meaning arises from the pattern of tokens, rather than just the individual tokens itself (just like how the meaning of cat arises from the letters c, a, t being next to each other, not from the individual letters itself).

With byte tokenization, we get a fixed vocabulary of exactly 256 symbols, a universal representation, and no out-of-vocabulary problem, ever. Nothing can fall outside the range of 0 to 255.

With Unicode tokenization or word tokenization, we were always one new emoji from a gap in integer sequence. With bytes, there are no gaps at all.

The catch: sequences become much longer

The word cafe shown as four characters but five UTF-8 bytes. — Byte expansion for non-ASCII text

The word café is four characters. In UTF-8 it is five bytes, because é takes two. And that is the mild case. Consider the Chinese word 你好 (“hello”). It is two characters. In UTF-8 it is six bytes, three per character. A byte-level tokenizer would hand the model six tokens for a two-character greeting.

So for everything outside of plain ASCII, which is to say most of the languages on Earth, going down to bytes did not shorten our sequences. It lengthened them. We traded one problem for another.

Which lands us on a half-victory, the mirror image of the word attempt:

Bytes give us a finite, universal, gap-free vocabulary and long sequences. Words gave us short sequences and an unbounded vocabulary. Neither extreme is right. But notice we have now got something we lacked before: a closed, gap-free base alphabet we can build on top of. What is still missing is the frequency-awareness, short codes for common patterns, that characters, words, and bytes all lack.

Attempt #4: Let’s find some patterns

Imagine we train on billions of words of text. Now picture how many times the byte sequence for ing appears in it. Millions. Tens of millions. The same is true of tion, of the, of pre and un and establish. These fragments occur, in essentially identical form, an astronomical number of times.

Common word stems paired with their frequent suffixes and prefixes. — Repeated structure in natural language

And every single time one of them appears, our byte-level tokenizer dutifully spells it out from scratch: t, i, o, n. Four tokens. It has encoded tion a million times and learned nothing from the experience. Each occurrence costs exactly as much as the first. This leads to the simple idea:

If a sequence of bytes shows up all the time, give it its own symbol. Spell tion out once, assign it a single number, and from then on pay one token for it instead of four. The more common the pattern, the more that one-time investment pays off. Rare patterns can stay spelled out; we only want to “buy” the frequent ones. This idea leads to the next section.

Byte Pair Encoding: a compression algorithm on top of bytes

This is an oversimplification of the algorithm, but the core idea is to start with a vocabulary of 256 tokens, each representing one byte. Find the most common pair of tokens and represent it as a new token. Now, our vocabulary has 257 tokens. Keep repeating this process until some max vocabulary size. Now, we have a vocabulary where the most frequent byte sequences are represented as one token instead of multiple, improving our compression ratio.

Byte tokens t, i, o, n merging step by step into a single token tion. — Byte Pair Encoding merge steps

With this tokenization strategy, a token becomes simply a useful sequence of bytes. Sometimes a token represents a word (the). Sometimes it represents a fragment of a word (tion). The model sees the world in frequent byte-chunks, not in words. Since all text can be represented as a sequence of bytes, and all bytes can now be represented by a sequence of tokens, and each token has its own unique integer, we have a way to map text to sequences of integers.

Look at what this quietly resolves. The word attempt wanted frequent units to be single tokens, and BPE delivers exactly that: the and and really do become one token each, so ordinary text is as compact as word-tokenization ever promised. But unlike a word vocabulary, BPE never runs out. A word it has never seen isn’t out-of-vocabulary; it just falls back to smaller pieces, and in the worst case to raw bytes, which can spell anything. The vocabulary stays finite and closed, yet frequency now decides how many tokens a piece of text costs. The sparsity and frequency-blindness of Attempt #2 are gone: common patterns are cheap, rare ones are still representable, and run / running / runner can finally share the pieces they have in common.

The complete pipeline

Putting it all together, here is the whole journey from text to token IDs.

Tokenizer pipeline overview

You first encode the string with UTF-8 down to bytes, giving you a universal, gap-free alphabet. You compress those bytes with BPE, which fuses frequent patterns into single symbols and improves compression. You then look up each byte pattern in a vocabulary to find its token IDs, which are the integers the transformer needs.

Conclusion

We now have a good way of transforming text into sequence of integers so that we can leverage the “intelligence” of Transformers! There are still many details that I left out, including talking about training tokenizers, pretokenization, and other details that were out of scope for this blog. To learn more about this subject, I encourage you to take a look at Stanford CS 336’s Language Modeling from Scratch, which talks about all the concepts I mentioned here. If you so dare, I also encourage you to build your own transformer, which I am currently doing here.