Tokens & Embeddings
A Transformer doesn't understand words; it only understands numbers. So before any text can enter the model, it must be converted into numbers in two steps: tokenization (split the text into tokens) and embedding (turn each token into a vector of numbers that captures its meaning). This is the very first thing that happens to your input, and it's where language becomes math.
💡 In one line: Tokens split text into pieces, and embeddings turn each piece into a meaningful vector the model can work with.
What is a Token?
A token is a chunk of text the model processes. It's often a word, but not always: many tokenizers use subwords (word-pieces). Common words become a single token, while rare or complex words get split into smaller pieces:
- "cat" → ["cat"]
- "unhappiness" → ["un", "happiness"]
- "Generative" → ["gen", "erative"]
Why subwords? They strike a balance: a manageable vocabulary size, while still being able to represent any word (even ones never seen before) by breaking it into known pieces. The process of splitting text is called tokenization.
Tokenization in Code
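Here is a minimal sketch of subword tokenization, assuming a tiny hand-written vocabulary (`TOY_VOCAB`) and a simple greedy longest-match rule. Both are hypothetical and chosen only to reproduce the splits shown above. Real tokenizers (for example BPE or WordPiece) learn their vocabulary from data and differ in the details.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of pieces.
TOY_VOCAB = {"cat", "un", "happiness", "gen", "erative", "s"}

def tokenize_word(word, vocab):
    """Greedily split one word into the longest known pieces, left to right."""
    word = word.lower()
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to an unknown marker.
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces

def tokenize(text, vocab=TOY_VOCAB):
    """Split on whitespace, then split each word into subword pieces."""
    tokens = []
    for word in text.split():
        tokens.extend(tokenize_word(word, vocab))
    return tokens

print(tokenize("cat"))          # ['cat']
print(tokenize("unhappiness"))  # ['un', 'happiness']
print(tokenize("Generative"))   # ['gen', 'erative']
print(tokenize("cats"))         # ['cat', 's']
```

Note how "cats" is handled even though it is not in the vocabulary: it is built from the known pieces "cat" and "s".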
(The exact tokens and IDs depend on the tokenizer; these are illustrative.)
Token IDs and the Vocabulary
Each token maps to a unique integer, its token ID, drawn from the model's vocabulary (its full list of known tokens). So the text becomes a list of IDs. The model works entirely with these numbers.
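The mapping between tokens and IDs can be sketched as a pair of dictionaries. The vocabulary and its IDs below are made up for illustration, and the `encode` and `decode` helpers are hypothetical names, not a specific library's API.

```python
# Hypothetical vocabulary: the IDs are arbitrary and chosen for illustration.
vocab = {"[UNK]": 0, "cat": 1, "un": 2, "happiness": 3, "gen": 4, "erative": 5}
id_to_token = {token_id: token for token, token_id in vocab.items()}

def encode(tokens):
    """Map each token to its ID; unknown tokens map to the [UNK] ID."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

def decode(token_ids):
    """Map each ID back to its token string."""
    return [id_to_token[token_id] for token_id in token_ids]

ids = encode(["un", "happiness", "cat"])
print(ids)          # [2, 3, 1]
print(decode(ids))  # ['un', 'happiness', 'cat']
```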
What is an Embedding?
A token ID like 9006 is just an index; it carries no meaning on its own. An embedding fixes this: it maps each token ID to a dense vector, a list of numbers (often hundreds long, e.g. 768 values) that captures the token's meaning.
These vectors are learned during training. Tokens that appear in similar contexts tend to end up with similar vectors, so similarity in meaning becomes something the model can measure mathematically, for example as the distance or angle between two vectors.
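```python
import torch
import torch.nn as nn

vocab_size = 30000  # number of tokens in the vocabulary
embed_dim = 768     # length of each embedding vector
# A learnable lookup table with one embed_dim-sized row per token ID
embedding = nn.Embedding(vocab_size, embed_dim)
token_ids = torch.tensor([9006, 8743, 9932])  # illustrative IDs
vectors = embedding(token_ids)                # look up one row per ID
print(vectors.shape)  # torch.Size([3, 768]): 3 tokens, each a 768-dim vector
```
The Embedding Space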
Embeddings live in a high-dimensional space with a remarkable property: meaning becomes geometry.
- Similar words sit close together: "cat," "dog," and "kitten" cluster in the same region.
- Relationships become directions. The famous example is king − man + woman ≈ queen: the "gender" relationship is a consistent direction in the space.
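Both properties can be sketched with a few hand-picked 3-dimensional vectors. These values are hypothetical and not learned. They are chosen so that the geometry is easy to see, and in real learned embeddings the analogy holds only approximately. Cosine similarity is used here as one common way to measure how close two vectors are.

```python
import math

# Hand-picked 3-dimensional toy vectors (hypothetical, not learned).
# Dimensions loosely stand for: royalty, gender, animal-ness.
vectors = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "cat":   [0.0,  0.1, 1.0],
    "dog":   [0.0,  0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (close to 1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(target, exclude):
    """Return the word whose vector is most similar to target."""
    candidates = [word for word in vectors if word not in exclude]
    return max(candidates, key=lambda word: cosine_similarity(target, vectors[word]))

# Similar words are closer to each other than to unrelated words.
cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
cat_king = cosine_similarity(vectors["cat"], vectors["king"])
print(cat_dog > cat_king)  # True

# king - man + woman, computed element by element.
analogy = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]
print(nearest(analogy, exclude={"king", "man", "woman"}))  # queen
```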
Where They Fit in the Transformer
Tokens and embeddings are the input layer of every Transformer:
- Your text is tokenized into tokens.
- Tokens become token IDs.
- IDs are turned into embedding vectors.
- Positional encoding is added (next subtopic), and the result flows into the encoder/decoder.
At the very end, the model predicts the next token's ID, which is decoded back into text.
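The steps above can be sketched end to end in plain Python. Everything here is a stand-in: the vocabulary, the sizes, the random tables (which take the place of learned weights), and the `prepare_input` helper are all hypothetical. Adding a per-position vector is just one simple way to represent the positional step covered in the next subtopic.

```python
import random

random.seed(0)  # make the random stand-in values repeatable

# Hypothetical toy vocabulary and sizes, chosen only for illustration.
vocab = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3}
id_to_token = {token_id: token for token, token_id in vocab.items()}
embed_dim = 4
max_len = 8

def random_table(rows, cols):
    """Stand-in for a learned table: one random vector per row."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

token_table = random_table(len(vocab), embed_dim)  # one vector per token ID
position_table = random_table(max_len, embed_dim)  # one vector per position

def prepare_input(text):
    tokens = text.lower().split()                            # 1. tokenize
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]     # 2. token IDs
    token_vectors = [token_table[i] for i in ids]            # 3. embedding lookup
    # 4. add a position vector to each token vector, element by element
    model_input = [
        [v + p for v, p in zip(vec, position_table[pos])]
        for pos, vec in enumerate(token_vectors)
    ]
    return ids, model_input

ids, model_input = prepare_input("The cat sat")
print(ids)                                     # [1, 2, 3]
print(len(model_input), len(model_input[0]))   # 3 4

# At the output end, a predicted ID is mapped back to text.
predicted_id = 2  # hypothetical model output
print(id_to_token[predicted_id])               # cat
```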
Key Terms
| Term | Meaning |
|---|---|
| Token | A chunk of text (word or subword) |
| Tokenization | Splitting text into tokens |
| Token ID | The integer index of a token |
| Vocabulary | The full set of tokens a model knows |
| Embedding | A dense vector that captures a token's meaning |
| Embedding dimension | The length of that vector (e.g. 768) |
- A Transformer works on numbers, so text is first tokenized and then embedded.
- Tokens are chunks of text (often subwords); each maps to an integer token ID from the vocabulary.
- An embedding turns each ID into a dense vector that captures meaning, learned during training.
- In the embedding space, similar words cluster together and relationships form consistent directions (king − man + woman ≈ queen).
- Together, tokens and embeddings form the input layer of every Transformer, the bridge from language to math.