Split into Tokens

Last updated: Jun 30, 2026

Author: Vinay Adari

Splitting Text into Tokens

Before an LLM can read your text, a tokenizer must split it into tokens. This splitting process, called tokenization, happens in a few clear steps. Walking through them shows exactly how a raw string of characters becomes the list of tokens the model counts and processes. (For the unit itself see Tokens, and for the tool see Tokenizers; this article is about the splitting process.)

💡 In one line: Splitting text into tokens means normalising it, cutting it into rough pieces, then breaking those into known sub-word tokens — and mapping each to an ID.

The Goal

Turn a raw string into a list of tokens (and then token IDs). The split must be consistent (same text → same tokens every time) and, ideally, reversible (you can decode the tokens back into the original text, or into its normalized form when normalization discards details such as case or accents).

The Steps of Splitting

Tokenization usually runs in four stages:

1. Normalize: clean the text by handling whitespace, Unicode, accents, and (for some models) lowercasing it.
2. Pre-tokenize: split into rough pieces, typically on whitespace and punctuation.
3. Sub-word splitting: apply the tokenizer's learned rules (e.g. BPE merges) to break those pieces into known sub-word tokens.
4. Map to IDs: convert each token into its vocabulary ID.
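The four stages can be sketched end to end in a few lines of Python. This is a minimal illustration, not any real tokenizer's implementation: TOY_VOCAB, the regular expression, and the function names are assumptions made for the example, and stage 3 is reduced to a whole-piece lookup so that the overall flow stays visible.

```python
import re
import unicodedata

# Hypothetical toy vocabulary; a real tokenizer learns tens of thousands of entries.
TOY_VOCAB = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3, ".": 4}

def normalize(text):
    """Stage 1: Unicode-normalize, lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())

def pre_tokenize(text):
    """Stage 2: rough split into word runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def split_subwords(piece):
    """Stage 3 (stand-in): keep a known piece whole, otherwise emit [UNK]."""
    return [piece] if piece in TOY_VOCAB else ["[UNK]"]

def to_ids(tokens):
    """Stage 4: look up each token's ID in the vocabulary."""
    return [TOY_VOCAB[token] for token in tokens]

def tokenize(text):
    pieces = pre_tokenize(normalize(text))
    tokens = [tok for piece in pieces for tok in split_subwords(piece)]
    return tokens, to_ids(tokens)

tokens, ids = tokenize("The  CAT sat.")
print(tokens)  # ['the', 'cat', 'sat', '.']
print(ids)     # [1, 2, 3, 4]
```

Because this toy normalizer lowercases and collapses spaces, "The  CAT sat." and "the cat sat." produce the same IDs, which is the kind of consistency the goal asks for.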

How Sub-word Splitting Decides Where to Cut

The interesting part is step 3. In a BPE tokenizer, the tokenizer holds a learned, ordered list of merges (frequent character/sub-word pairs). Starting from the smallest pieces (single characters or bytes), it applies those merges in priority order, combining neighbours into progressively larger known sub-words:

Common words stay whole: "learning" might be one token.
Rarer words get split: "unbelievable" might become "un" + "believ" + "able".

So frequent words are efficient (one token), while any word — even one never seen before — can still be represented by falling back to smaller pieces.
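A small sketch may make the merge process concrete. The merge list below is invented for illustration (a real tokenizer learns many thousands of merges from its training corpus), and bpe_split is a hypothetical helper name; the loop itself follows the usual BPE rule of repeatedly applying the highest-priority merge that is present.

```python
# Hypothetical merge list in learned priority order (earlier = higher priority).
MERGES = [
    ("l", "e"), ("le", "a"), ("lea", "r"), ("lear", "n"),
    ("i", "n"), ("in", "g"), ("learn", "ing"),
]

def bpe_split(word, merges):
    """Split one pre-tokenized word by applying BPE merges in priority order."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) rank, if any is known.
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(bpe_split("learning", MERGES))  # ['learning']
print(bpe_split("learns", MERGES))    # ['learn', 's']
print(bpe_split("zing", MERGES))      # ['z', 'ing']
```

With this toy list, the frequent word collapses into one token, while the unseen words still come out as a sequence of smaller known pieces.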

Why Split This Way?

Handles any word: unknown words break into known sub-pieces, so "unknown word" failures become rare, and byte-level vocabularies avoid them entirely.
Stays efficient: common words remain single tokens.
Reversible: the pieces can be joined back into the text, exactly so when normalization does not discard information such as case or accents.
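The robustness and reversibility claims are easiest to see at the lowest fallback level, where every string reduces to its UTF-8 bytes. The sketch below is an assumption-laden simplification (encode_bytes and decode_bytes are hypothetical helpers, and no merges are applied), but it shows why a byte-level base vocabulary of 256 entries can represent any input and decode it back exactly.

```python
def encode_bytes(text):
    """Map any string to IDs in the range 0..255 via its UTF-8 bytes."""
    return list(text.encode("utf-8"))

def decode_bytes(ids):
    """Invert encode_bytes: rebuild the string from its byte IDs."""
    return bytes(ids).decode("utf-8")

original = "Zxqv naïve 🙂"
ids = encode_bytes(original)
round_trip = decode_bytes(ids)
print(round_trip == original)  # True: nothing was lost

# If a lossy normalization step (here, lowercasing) runs first,
# decoding returns the normalized text rather than the original.
lossy = decode_bytes(encode_bytes(original.lower()))
print(lossy == original)       # False: the capital "Z" is gone
```

Sub-word merges sit on top of this base layer, so they shorten the ID sequence without changing what can be represented.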

What Influences the Split

Spaces — often attached to the following token (so " cat" and "cat" can differ).
Capitalization and punctuation — change how pieces are cut.
The specific tokenizer — its vocabulary and merge rules decide the exact split, which is why different models tokenize the same text differently.
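The effect of spaces and capitalization can be shown with a simplified pre-tokenizer. The pattern below is an ASCII-only approximation written for this example, loosely in the style of pre-tokenizers that attach a leading space to the following word; it is not the exact pattern of any particular model, and TOY_IDS is a made-up vocabulary.

```python
import re

# Simplified, ASCII-only pattern (an assumption for illustration):
# an optional leading space is kept attached to the word, number, or
# punctuation run that follows it.
PATTERN = re.compile(r" ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+")

def pre_tokenize(text):
    return PATTERN.findall(text)

# Hypothetical IDs: each surface form is a separate vocabulary entry.
TOY_IDS = {"cat": 5, " cat": 6, "Cat": 7}

print(pre_tokenize("Cat, the cat"))  # ['Cat', ',', ' the', ' cat']
print(pre_tokenize("cat"))           # ['cat']
print(TOY_IDS["cat"], TOY_IDS[" cat"], TOY_IDS["Cat"])  # 5 6 7
```

Since "cat", " cat", and "Cat" are different strings, a vocabulary built this way can assign each of them its own ID, which is why the same word may cost a different number of tokens depending on where it appears.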

Splitting in Code

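In a WordPiece tokenizer such as BERT's, a word like "tokenization" comes out as token + ##ization: the ## marks a continuation sub-word (a piece that attaches to the previous one). This is how a longer word is split into known pieces. (Tokenizers of this kind are available through the Hugging Face transformers library, installed with pip install transformers.)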
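To see where the ## pieces come from without downloading a model, here is a minimal sketch of greedy longest-match splitting in the WordPiece style. The vocabulary is a tiny invented set and wordpiece_split is a hypothetical helper, so real tokenizers will differ in vocabulary and details; only the matching strategy is the point.

```python
# Hypothetical toy vocabulary; "##" marks pieces that continue a word.
TOY_VOCAB = {"token", "##ization", "##ize", "un", "##believ", "##able", "learning"}

def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no known piece fits at this position
        tokens.append(match)
        start = end
    return tokens

print(wordpiece_split("tokenization", TOY_VOCAB))  # ['token', '##ization']
print(wordpiece_split("unbelievable", TOY_VOCAB))  # ['un', '##believ', '##able']
print(wordpiece_split("learning", TOY_VOCAB))      # ['learning']
print(wordpiece_split("xyz", TOY_VOCAB))           # ['[UNK]']
```

The last call shows the limit of a small character-level vocabulary: when no piece matches, the whole word falls back to an unknown token.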

Summary

Splitting text into tokens runs in four steps: normalize → pre-tokenize → sub-word split → map to IDs.
Sub-word splitting uses learned merges to keep common words whole and break rare ones into pieces.
This makes tokenization robust (handles any word), efficient, and, when normalization is lossless, reversible.
Spaces, capitalization, and the tokenizer's rules all influence exactly where the cuts happen.
The result is the list of token IDs the model actually reads.