Tokenizers
A tokenizer is the tool that converts text into tokens, and back again. It sits between human language and the model: every prompt is encoded into token IDs before the model sees it, and the model's output IDs are decoded back into readable text. Each model ships with its own tokenizer, and that choice quietly affects efficiency, language support, and how many tokens your text costs.
💡 In one line: A tokenizer is the translator between text and token IDs; it encodes your input for the model and decodes the model's output back into words.
What a Tokenizer Does: Encode & Decode
A tokenizer works in two directions:
- Encode: text → tokens → token IDs (numbers the model uses).
- Decode: token IDs → tokens → text (what you read).
The model only ever sees IDs; the tokenizer is the bridge on both ends.
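A minimal sketch of the two directions, assuming a hypothetical four-word vocabulary and plain whitespace splitting (real tokenizers use a learned sub-word vocabulary instead):

```python
# Toy vocabulary, invented for illustration only.
VOCAB = {"the": 0, "cat": 1, "sat": 2, "down": 3}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def encode(text: str) -> list[int]:
    """text -> tokens -> token IDs."""
    tokens = text.lower().split()
    return [VOCAB[token] for token in tokens]


def decode(ids: list[int]) -> str:
    """token IDs -> tokens -> text."""
    return " ".join(ID_TO_TOKEN[token_id] for token_id in ids)


ids = encode("The cat sat down")
print(ids)          # [0, 1, 2, 3]
print(decode(ids))  # the cat sat down
```

Note that the round trip here is lossy (the capital "T" is gone) because this toy version lowercases its input; whether a real tokenizer preserves case depends on the model.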
Levels of Tokenization
There are three ways to split text, with very different trade-offs:
- Word-level: split on whole words. Simple, but needs a huge vocabulary and can't handle unknown (out-of-vocabulary) words.
- Character-level: split into individual characters. Tiny vocabulary and handles anything, but produces very long sequences, and single characters carry little meaning on their own.
- Sub-word level: split into common pieces. The best of both: a manageable vocabulary that can still represent any word. This is what modern LLMs use.
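The three levels can be compared on one string. This is only a sketch: the piece list `PIECES` is made up for the example, and the greedy longest-match splitter stands in for a real learned sub-word tokenizer.

```python
text = "unbelievable results"

# Word-level: one token per whitespace-separated word.
word_tokens = text.split()

# Character-level: one token per character (the space counts too).
char_tokens = list(text)

# Sub-word level: greedy longest match against a small, hypothetical piece list.
PIECES = {"un", "believ", "able", "result", "s", " "}


def subword_split(s: str, pieces: set[str]) -> list[str]:
    tokens, i = [], 0
    while i < len(s):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(s), i, -1):
            if s[i:j] in pieces or j == i + 1:
                tokens.append(s[i:j])
                i = j
                break
    return tokens


subword_tokens = subword_split(text, PIECES)

print(len(word_tokens), word_tokens)        # 2 tokens
print(len(char_tokens))                     # 20 tokens
print(len(subword_tokens), subword_tokens)  # 6 tokens
```

The sub-word split lands between the two extremes: far fewer tokens than characters, yet built from a handful of reusable pieces rather than one vocabulary entry per word.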
Sub-word Tokenization Algorithms
Modern tokenizers build their sub-word vocabulary using methods like:
- BPE (Byte-Pair Encoding): start from characters and repeatedly merge the most frequent adjacent pair into a new token, building up common sub-words.
- Byte-level BPE: operates on bytes rather than characters, so it can handle any input, including emoji and text in any language.
- WordPiece: used by BERT; chooses merges by how much they improve the likelihood of the training data rather than by raw frequency.
- Unigram / SentencePiece: Unigram is a probabilistic algorithm, and SentencePiece is a library that implements it (along with BPE) in a language-agnostic way that doesn't rely on spaces. Common in multilingual models.
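To see why working on bytes covers every character, here is a small sketch using Python's built-in UTF-8 encoding. The helper names `to_byte_ids` and `from_byte_ids` are illustrative, and no merges are applied, so this shows only the base layer a byte-level BPE would start from.

```python
def to_byte_ids(text: str) -> list[int]:
    """Map text to its UTF-8 bytes; every value falls in the range 0..255."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Inverse of to_byte_ids."""
    return bytes(ids).decode("utf-8")


for ch in ["a", "é", "日", "🙂"]:
    print(repr(ch), "->", to_byte_ids(ch))
# 'a' -> [97]
# 'é' -> [195, 169]
# '日' -> [230, 151, 165]
# '🙂' -> [240, 159, 153, 130]

sample = "héllo 日本 🙂"
assert from_byte_ids(to_byte_ids(sample)) == sample
```

Because there are only 256 possible byte values, a base vocabulary of 256 entries is enough to represent any string, with no need for an unknown-token fallback.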
How a Tokenizer is "Trained"
A tokenizer is built from a corpus: it learns which sub-word pieces (and merges) are most useful, based on frequency in the data. The result is a fixed vocabulary plus, for BPE-style tokenizers, an ordered list of merge rules. This is done once, and then the same tokenizer is reused for all encoding and decoding.
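As a rough illustration of that training step, the following toy BPE trainer learns merge rules from a tiny word list. It is a simplified sketch, not how any particular library implements it: the function name `train_bpe` is hypothetical, there is no pre-tokenization or end-of-word marker, and ties are broken by whichever pair was seen first.

```python
from collections import Counter


def train_bpe(words: list[str], num_merges: int):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word starts as a tuple of single characters, with its count.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first-seen pair wins
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    # Fixed vocabulary: the base characters plus every merged piece.
    vocab = sorted({c for word in words for c in word} | {a + b for a, b in merges})
    return merges, vocab


merges, vocab = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(vocab)
```

After two merges the frequent piece "low" has become a single vocabulary entry, which is exactly the "fixed vocabulary plus merge rules" outcome described above.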
The Vocabulary
The vocabulary is the fixed set of all tokens the model knows, often tens of thousands of entries. Each token maps to a unique ID. A bigger vocabulary means fewer tokens per text (more is packed into each token), but also a larger embedding table to store.
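The storage side of that trade-off is simple arithmetic: an embedding table holds one vector per token ID. The vocabulary sizes and the hidden size of 4096 below are illustrative assumptions, not figures for any specific model.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in an embedding table: one vector per token ID."""
    return vocab_size * hidden_size


# Hypothetical sizes, chosen only to show how the table scales.
for vocab_size in (32_000, 128_000):
    params = embedding_params(vocab_size, hidden_size=4096)
    print(f"{vocab_size:>7} tokens -> {params / 1e6:.0f}M embedding weights")
#   32000 tokens -> 131M embedding weights
#  128000 tokens -> 524M embedding weights
```

Quadrupling the vocabulary quadruples the table, so the shorter sequences have to be worth the extra parameters.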
Special Tokens
Tokenizers also add special tokens that mark structure:
- [BOS] / [EOS]: beginning / end of sequence
- [PAD]: padding that brings sequences in a batch to equal length
- [UNK]: stands in for any token not in the vocabulary
- [CLS] / [SEP] / [MASK]: used by BERT
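A small sketch of how these tokens show up in practice. The vocabulary and ID assignments here are hypothetical (real tokenizers define their own special tokens and IDs), and the splitting is plain whitespace.

```python
# Hypothetical vocabulary: four special tokens plus two ordinary words.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3, "hello": 4, "world": 5}


def encode(text: str) -> list[int]:
    """Map words to IDs, using [UNK] for unknown words, then wrap in [BOS]/[EOS]."""
    ids = [VOCAB.get(token, VOCAB["[UNK]"]) for token in text.lower().split()]
    return [VOCAB["[BOS]"]] + ids + [VOCAB["[EOS]"]]


def pad_batch(batch: list[list[int]]) -> list[list[int]]:
    """Right-pad every sequence with [PAD] up to the longest one in the batch."""
    max_len = max(len(ids) for ids in batch)
    return [ids + [VOCAB["[PAD]"]] * (max_len - len(ids)) for ids in batch]


batch = pad_batch([
    encode("hello world"),
    encode("hello"),
    encode("hello there world"),  # "there" is not in the vocabulary
])
for row in batch:
    print(row)
# [2, 4, 5, 3, 0]
# [2, 4, 3, 0, 0]
# [2, 4, 1, 5, 3]
```

Every row now has the same length, each starts with [BOS] and carries an [EOS], and the unknown word has been replaced by the [UNK] ID.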
Why the Tokenizer Choice Matters
- Different models, different token counts: the same text can tokenize to different numbers of tokens across models.
- Efficiency: fewer tokens means lower cost and faster inference.
- Language coverage: English-optimised tokenizers often use more tokens for other languages or code.
- Matching is required: you must use a model's own tokenizer. IDs produced by a different tokenizer point at the wrong vocabulary entries, so the model receives (or you decode) nonsense.
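The matching requirement is easy to demonstrate with two toy vocabularies that assign the same IDs to different tokens. Both vocabularies are invented for this sketch.

```python
# Two hypothetical tokenizers: the same IDs mean different things in each.
VOCAB_A = {"the": 0, "cat": 1, "sat": 2}
VOCAB_B = {"sat": 0, "the": 1, "dog": 2}


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    return [vocab[token] for token in text.split()]


def decode(ids: list[int], vocab: dict[str, int]) -> str:
    id_to_token = {token_id: token for token, token_id in vocab.items()}
    return " ".join(id_to_token[token_id] for token_id in ids)


ids = encode("the cat sat", VOCAB_A)
print(ids)                   # [0, 1, 2]
print(decode(ids, VOCAB_A))  # the cat sat
print(decode(ids, VOCAB_B))  # sat the dog
```

The IDs are valid in both vocabularies, so nothing raises an error; the text is just silently wrong. That is why mismatches can be hard to spot.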
Code Example
This shows the full round trip: text → IDs → tokens → back to text. (Runs with pip install transformers.)
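One way the round trip might look with the Hugging Face transformers library is sketched below. The model name `bert-base-uncased` and the sample sentence are assumptions chosen for illustration, and the helper names `round_trip` and `demo` are hypothetical; any model's tokenizer can be substituted.

```python
def round_trip(tokenizer, text: str):
    """Return (ids, tokens, decoded) for `text` using a Hugging Face-style tokenizer."""
    ids = tokenizer.encode(text, add_special_tokens=False)  # text -> IDs
    tokens = tokenizer.convert_ids_to_tokens(ids)           # IDs -> tokens
    decoded = tokenizer.decode(ids)                         # IDs -> text
    return ids, tokens, decoded


def demo(model_name: str = "bert-base-uncased") -> None:
    # Imported here so round_trip itself has no hard dependency on transformers.
    from transformers import AutoTokenizer

    # Downloads the tokenizer files on first use, then reads them from the local cache.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    ids, tokens, decoded = round_trip(tokenizer, "Tokenizers are unbelievably useful.")
    print("IDs:   ", ids)
    print("Tokens:", tokens)
    print("Text:  ", decoded)
```

Call `demo()` to run it. The decoded text may not match the input character for character: an uncased model lowercases it, for example. Passing a different `model_name` shows how the same sentence splits into a different number of tokens.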
Summary
- A tokenizer encodes text into token IDs and decodes IDs back into text.
- Tokenization can be word, character, or sub-word level; modern LLMs use sub-word.
- Sub-word vocabularies are built with algorithms like BPE, WordPiece, and Unigram (often via the SentencePiece library).
- A tokenizer has a fixed vocabulary and adds special tokens for structure.
- Each model has its own tokenizer, which affects token counts, cost, and language support.