Splitting Text into Tokens

Before an LLM can read your text, it must split it into tokens. This splitting process — tokenization — happens in a few clear steps. Walking through them shows exactly how a raw string of characters becomes the neat list of tokens the model counts and processes. (For the unit itself see Tokens, and for the tool see Tokenizers — this article is about the splitting process.)

💡 In one line: Splitting text into tokens means normalising it, cutting it into rough pieces, then breaking those into known sub-word tokens — and mapping each to an ID.

The Goal

Turn a raw string into a list of tokens (and then token IDs). The split must be consistent (same text → same tokens every time) and reversible (you can decode the tokens back into the text, exactly so unless normalization discarded details such as case or accents).

The Steps of Splitting

Tokenization usually runs in four stages:

  1. Normalize — clean the text: handle whitespace, Unicode, accents, and (for some models) lowercase it.
  2. Pre-tokenize — split into rough pieces, typically on whitespace and punctuation.
  3. Sub-word splitting — apply the tokenizer's learned rules (e.g. BPE merges) to break those pieces into known sub-word tokens.
  4. Map to IDs — convert each token into its vocabulary ID.
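
The four stages can be sketched end to end in a few lines of Python. This is a minimal illustration, not how any particular tokenizer is implemented: the vocabulary, the function names, and the regular expression are all assumptions made for the example, and stage 3 is a stand-in that falls back to single characters instead of applying learned merges.

```python
import re
import unicodedata

# Hypothetical toy vocabulary: a few whole words, then single characters as a fallback.
WORDS = ["the", "cafe", "is", "open"]
CHARS = list("abcdefghijklmnopqrstuvwxyz!?.,")
VOCAB = {token: idx for idx, token in enumerate(WORDS + CHARS)}

def normalize(text):
    """Stage 1: strip accents, lowercase, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFD", text)
    without_accents = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return " ".join(without_accents.lower().split())

def pre_tokenize(text):
    """Stage 2: rough pieces, here runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def split_subwords(piece):
    """Stage 3 (stand-in): keep a known piece whole, otherwise fall back to characters."""
    return [piece] if piece in VOCAB else list(piece)

def map_to_ids(tokens):
    """Stage 4: look each token up in the vocabulary (KeyError outside the toy alphabet)."""
    return [VOCAB[token] for token in tokens]

def tokenize(text):
    tokens = []
    for piece in pre_tokenize(normalize(text)):
        tokens.extend(split_subwords(piece))
    return tokens, map_to_ids(tokens)

tokens, ids = tokenize("  The Café is OPEN! ")
print(tokens)  # ['the', 'cafe', 'is', 'open', '!']
print(ids)     # [0, 1, 2, 3, 30]
```

Note that this normalizer throws away case and the accent on "é", so decoding these IDs would give back the normalized text rather than the exact input.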

How Sub-word Splitting Decides Where to Cut

The interesting part is step 3. The tokenizer holds a learned, ordered list of merges (frequent character/sub-word pairs). Starting from single characters or bytes, it applies those merges in priority order, combining pieces into progressively larger known sub-words:

  • Common words stay whole: "learning" might be one token.
  • Rarer words get split: "unbelievable" → "un" + "believ" + "able".

So frequent words are efficient (one token), while any word — even one never seen before — can still be represented by falling back to smaller pieces.
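
To make the merge idea concrete, here is a small sketch of BPE-style splitting. The merge list is hand-written for this example (a real tokenizer learns thousands of merges from a corpus), and the function name bpe_split is hypothetical. The loop repeatedly picks the highest-priority merge that applies to the current pieces and stops when none is left.

```python
# Hypothetical merge list, highest priority first. Real lists are learned from data.
MERGES = [
    ("u", "n"), ("l", "e"), ("i", "n"), ("in", "g"), ("a", "r"),
    ("le", "ar"), ("lear", "n"), ("learn", "ing"),
    ("b", "le"), ("a", "ble"), ("b", "e"), ("l", "i"),
    ("be", "li"), ("e", "v"), ("beli", "ev"),
]

def bpe_split(word, merges=MERGES):
    """Split one pre-tokenized word by applying merges in priority order."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    pieces = list(word)  # start from single characters
    while len(pieces) > 1:
        pairs = [(pieces[i], pieces[i + 1]) for i in range(len(pieces) - 1)]
        # Pick the adjacent pair with the best (lowest) rank, if any is known.
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == best:
                merged.append(pieces[i] + pieces[i + 1])
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces

print(bpe_split("learning"))      # ['learning']
print(bpe_split("unbelievable"))  # ['un', 'believ', 'able']
print(bpe_split("zq"))            # ['z', 'q']
```

The last call shows the fallback: with no applicable merges, the word is still represented, just as single characters.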

Why Split This Way?

  • Handles any word — unknown words break into known sub-pieces, so there are no "unknown word" failures.
  • Stays efficient — common words remain single tokens.
  • Reversible — the pieces can always be joined back into the original text.
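
These three properties can be checked on a toy tokenizer. The sketch below assumes a hypothetical vocabulary of a few whole words plus every lowercase letter and the space character, and it uses greedy longest-match lookup rather than learned merges to keep the code short. The names encode and decode are illustrative only.

```python
# Hypothetical vocabulary: a few whole words plus single characters as a fallback.
WORDS = ["learning", "token", "un", "able"]
CHARS = list("abcdefghijklmnopqrstuvwxyz ")
VOCAB = {token: idx for idx, token in enumerate(WORDS + CHARS)}
ID_TO_TOKEN = {idx: token for token, idx in VOCAB.items()}

def encode(text):
    """Greedy sketch: at each position take the longest vocabulary entry that fits."""
    ids, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            if text[pos:end] in VOCAB:
                ids.append(VOCAB[text[pos:end]])
                pos = end
                break
        else:
            # Only reachable for characters outside the toy alphabet.
            raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
    return ids

def decode(ids):
    """Join the token strings back together."""
    return "".join(ID_TO_TOKEN[i] for i in ids)

text = "unlearnable token"
ids = encode(text)

print(len(encode("learning")))  # efficient: a common word is a single token
print(len(encode("zzz")))       # robust: an unseen word falls back to characters
print(decode(ids) == text)      # reversible: the round trip restores the text
```

The round trip is exact here because this sketch does no normalization; a tokenizer that lowercases or strips accents first can only restore the normalized text.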

What Influences the Split

  • Spaces — often attached to the following token (so " cat" and "cat" can differ).
  • Capitalization and punctuation — change how pieces are cut.
  • The specific tokenizer — its vocabulary and merge rules decide the exact split, which is why different models tokenize the same text differently.
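
The effect of spaces and capitalization shows up as early as the pre-tokenization step. The pattern below is a simplified sketch in the spirit of pre-tokenizers that attach a leading space to the following piece. It is an assumption for illustration, not the exact regular expression of any model.

```python
import re

# Simplified, hypothetical pattern: an optional leading space stays with the
# word or punctuation run that follows it; leftover whitespace is its own piece.
PATTERN = re.compile(r" ?\w+| ?[^\w\s]+|\s+")

def pre_tokenize(text):
    return PATTERN.findall(text)

print(pre_tokenize("cat and cat"))  # ['cat', ' and', ' cat']
print(pre_tokenize("Cat, cat!"))    # ['Cat', ',', ' cat', '!']
```

The first "cat" and the later " cat" are different strings, so a vocabulary can assign them different IDs, and "Cat" is different again. Because every character lands in exactly one piece, joining the pieces gives back the original text.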

Splitting in Code
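
Here is a minimal, library-free sketch of WordPiece-style splitting, the greedy longest-match approach used by BERT-family tokenizers. The vocabulary and the function name wordpiece are hypothetical, chosen so that the example stays small. A real tokenizer loads a learned vocabulary of tens of thousands of entries.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues the current word.
VOCAB = {"[UNK]", "token", "##ization", "##s", "split", "##ting", "is", "fun"}

def wordpiece(word, vocab=VOCAB, unk="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # not at the start of the word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # nothing in the vocabulary fits this word
        pieces.append(match)
        start = end
    return pieces

text = "tokenization is fun"
tokens = [piece for word in text.split() for piece in wordpiece(word)]
print(tokens)  # ['token', '##ization', 'is', 'fun']
```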


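In WordPiece-style output such as BERT's, a word like "tokenization" can come out as token + ##ization: the ## marks a continuation sub-word (a piece that attaches to the previous one). This is how a longer word is split into known pieces. (To try this with a real tokenizer, pip install transformers.)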

Summary

  • Splitting text into tokens runs in four steps: normalize → pre-tokenize → sub-word split → map to IDs.
  • Sub-word splitting uses learned merges to keep common words whole and break rare ones into pieces.
  • This makes tokenization robust (handles any word), efficient, and reversible.
  • Spaces, capitalization, and the tokenizer's rules all influence exactly where the cuts happen.
  • The result is the list of token IDs the model actually reads.