Splitting Text into Tokens
Before an LLM can read your text, a tokenizer must split that text into tokens. This splitting process, called tokenization, happens in a few clear steps. Walking through them shows exactly how a raw string of characters becomes the list of tokens the model counts and processes. (For the unit itself see Tokens, and for the tool see Tokenizers; this article is about the splitting process.)
💡 In one line: Splitting text into tokens means normalising it, cutting it into rough pieces, then breaking those into known sub-word tokens — and mapping each to an ID.
The Goal
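Turn a raw string into a list of tokens (and then token IDs). The split must be consistent (same text → same tokens every time) and, as far as possible, reversible (you can decode the tokens back into the original text, or into its normalized form when normalization discards details such as case or accents).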
The Steps of Splitting
Tokenization usually runs in four stages:
- Normalize — clean the text: handle whitespace, Unicode, accents, and (for some models) lowercase it.
- Pre-tokenize — split into rough pieces, typically on whitespace and punctuation.
- Sub-word splitting — apply the tokenizer's learned rules (e.g. BPE merges) to break those pieces into known sub-word tokens.
- Map to IDs — convert each token into its vocabulary ID.
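The four stages can be sketched end to end in a few lines of Python. This is a toy illustration, not any real tokenizer's implementation: the vocabulary `VOCAB`, the function names, and the "whole word if known, otherwise single characters" rule in stage 3 are all assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical toy vocabulary: a few whole words plus single letters as a fallback.
# Real vocabularies are learned from data and hold tens of thousands of entries.
VOCAB = {
    tok: i
    for i, tok in enumerate(
        ["[UNK]", "the", "cat", "sat", "."] + list("abcdefghijklmnopqrstuvwxyz")
    )
}

def normalize(text):
    # Stage 1: Unicode-normalize, lowercase, and collapse runs of whitespace.
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())

def pre_tokenize(text):
    # Stage 2: rough pieces, either a run of word characters or one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

def split_subwords(piece):
    # Stage 3 (simplified stand-in for learned rules):
    # keep the piece whole if it is in the vocabulary, else fall back to characters.
    return [piece] if piece in VOCAB else list(piece)

def tokenize(text):
    tokens = []
    for piece in pre_tokenize(normalize(text)):
        tokens.extend(split_subwords(piece))
    # Stage 4: look up each token's ID, using [UNK] for anything not in the vocabulary.
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]
    return tokens, ids

print(tokenize("The  CAT sat."))
print(tokenize("dog"))
```

The first call yields the four known tokens `the`, `cat`, `sat`, `.`; the second shows an unseen word falling back to the single letters `d`, `o`, `g`.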
How Sub-word Splitting Decides Where to Cut
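The interesting part is step 3, sub-word splitting. The tokenizer holds a learned, ordered list of merges (frequent character/sub-word pairs). Starting from the smallest pieces, such as single characters or bytes, it applies those merges in priority order, combining pieces into progressively larger known sub-words: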
- Common words stay whole: "learning" might be one token.
- Rarer words get split: "unbelievable" → "un" + "believ" + "able".
So frequent words are efficient (one token), while any word — even one never seen before — can still be represented by falling back to smaller pieces.
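The merge procedure can be made concrete with a minimal BPE-style sketch. The merge list `MERGES` below is invented for illustration (a real tokenizer learns thousands of merges from a corpus), and the function name `bpe` is an assumption of this example.

```python
# Hypothetical merge list, in priority order (earlier = learned earlier = applied first).
MERGES = [
    ("i", "n"), ("in", "g"), ("l", "e"), ("a", "r"),
    ("le", "ar"), ("lear", "n"), ("learn", "ing"), ("u", "n"),
]

def bpe(word, merges=MERGES):
    """Split one word into sub-words by repeatedly applying the best-ranked merge."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    pieces = list(word)  # start from single characters
    while len(pieces) > 1:
        pairs = [(pieces[i], pieces[i + 1]) for i in range(len(pieces) - 1)]
        # Pick the adjacent pair with the highest priority (lowest rank).
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == best:
                merged.append(pieces[i] + pieces[i + 1])
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces

print(bpe("learning"))    # ['learning']
print(bpe("unlearning"))  # ['un', 'learning']
print(bpe("earning"))     # ['e', 'ar', 'n', 'ing']
```

With this toy merge list, the frequent word collapses into a single token, while "earning", which the merges were not built for, is still representable as smaller pieces.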
Why Split This Way?
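- Handles any word: unknown words break into known sub-pieces, so "unknown word" failures are avoided as long as the vocabulary covers every character (or byte) of the input.
- Stays efficient: common words remain single tokens.
- Reversible: the pieces can be joined back into the text that was split (the normalized text, if normalization changed anything).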
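One way to guarantee that any input can be represented is to fall back all the way to bytes. The text above does not say which fallback a given tokenizer uses, so treat the following as an illustrative assumption: the helper names are hypothetical, and the `<0xNN>` spelling is just one convention for writing byte tokens.

```python
def to_byte_tokens(text):
    # Every string has a UTF-8 encoding, so 256 byte tokens cover any input.
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]

def from_byte_tokens(tokens):
    # Reverse the mapping: parse the two hex digits of each token, then decode.
    return bytes(int(tok[3:5], 16) for tok in tokens).decode("utf-8")

tokens = to_byte_tokens("hi")
print(tokens)                    # ['<0x68>', '<0x69>']
print(from_byte_tokens(tokens))  # hi
```

The trade-off is length: a character outside ASCII costs several byte tokens, which is why this is a last resort rather than the normal path.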
What Influences the Split
- Spaces: often attached to the following token (so " cat" and "cat" can differ).
- Capitalization and punctuation: change how pieces are cut.
- The specific tokenizer: its vocabulary and merge rules decide the exact split, which is why different models tokenize the same text differently.
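The effect of spaces and capitalization is easy to see with a small pre-tokenizer that attaches a leading space to the piece that follows it. The regular expression here is a simplified assumption for illustration, not the pattern of any particular model.

```python
import re

def pre_tokenize(text):
    # A word or punctuation run, each with an optional single leading space attached;
    # any whitespace left over is kept as its own piece so nothing is lost.
    return re.findall(r" ?\w+| ?[^\w\s]+|\s+", text)

text = "The cat, the Cat"
pieces = pre_tokenize(text)
print(pieces)  # ['The', ' cat', ',', ' the', ' Cat']

# "The" / " the" and " cat" / " Cat" are different strings,
# so a vocabulary lookup would treat them as different tokens.
print(len(set(pieces)))

# Because spaces travel with the pieces, joining them restores the input exactly.
print("".join(pieces) == text)
```

Keeping the space inside the token is what lets the pieces be concatenated back into the input without guessing where the spaces were.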
Splitting in Code
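With a WordPiece tokenizer such as BERT's, the word "tokenization" typically comes out as token + ##ization: the ## marks a continuation sub-word (a piece that attaches to the previous one). This is how a longer word is split into known pieces. (Tokenizers of this kind ship with the Hugging Face transformers library; install it with pip install transformers.)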
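The ## convention can be reproduced without downloading a model, using a toy greedy longest-match splitter in the style of WordPiece. The vocabulary below is hypothetical and tiny, and the function names are assumptions of this sketch; a real tokenizer loads a learned vocabulary instead.

```python
# Hypothetical toy vocabulary; "##" entries may only appear after the first piece.
VOCAB = {"token", "##ization", "##ize", "split", "##s", "text", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match split of a single word into WordPiece-style tokens."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word becomes unknown
        tokens.append(match)
        start = end
    return tokens

def join_pieces(tokens):
    # Undo the split by dropping the continuation markers.
    return "".join(t[2:] if t.startswith("##") else t for t in tokens)

print(wordpiece("tokenization"))  # ['token', '##ization']
print(wordpiece("splits"))        # ['split', '##s']
print(join_pieces(wordpiece("tokenization")))  # tokenization
```

Note that this toy version returns `[UNK]` when no piece matches, which is the failure mode a vocabulary with full character or byte coverage is meant to avoid.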
Summary
- Splitting text into tokens runs in four steps: normalize → pre-tokenize → sub-word split → map to IDs.
- Sub-word splitting uses learned merges to keep common words whole and break rare ones into pieces.
- This makes tokenization robust (handles any word), efficient, and reversible.
- Spaces, capitalization, and the tokenizer's rules all influence exactly where the cuts happen.
- The result is the list of token IDs the model actually reads.