Positional Encoding
Transformers process every token in a sequence at the same time, in parallel. This is what makes them fast, but it creates a problem: with everything processed at once, the model has no idea what order the words came in. Yet word order is essential to meaning. Positional Encoding is the solution: it injects information about each token's position into its embedding, so the model knows the sequence order.
In one line: Positional encoding adds word-order information to embeddings, because a Transformer processes all tokens at once and would otherwise be order-blind.
Why is Positional Encoding Needed?
An RNN reads tokens one at a time, so order is built in automatically. A Transformer doesn't β it sees the whole sequence simultaneously. Without position information, these two sentences would look identical to the model:
- "The cat sat on the mat."
- "The mat sat on the cat."
Same words, completely different meaning. So we must explicitly tell the model where each token sits in the sequence.
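The order-blindness can be checked numerically. The sketch below is a minimal illustration, not a full Transformer layer: it assumes a single attention head with identity projections (no learned weights) and a hypothetical random embedding table. With no position information, "cat" receives exactly the same output vector in both sentences, even though it sits at a different position in each.

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head self-attention with identity projections (illustrative only)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])           # (seq, seq) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)      # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # each row sums to 1
    return weights @ x                                # weighted mix of token vectors

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
embedding_table = rng.normal(size=(len(vocab), 8))    # hypothetical random embeddings

sentence_a = ["the", "cat", "sat", "on", "the", "mat"]
sentence_b = ["the", "mat", "sat", "on", "the", "cat"]

x_a = embedding_table[[vocab[w] for w in sentence_a]]
x_b = embedding_table[[vocab[w] for w in sentence_b]]

out_a = self_attention(x_a)
out_b = self_attention(x_b)

# "cat" is at index 1 in sentence A and at index 5 in sentence B.
cat_same = np.allclose(out_a[1], out_b[5])
print("cat gets the same output in both sentences:", cat_same)
```

Swapping two tokens only swaps the corresponding output rows, so nothing in the outputs distinguishes one ordering from the other.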
The Core Idea
The approach is simple: give each position a unique signature vector, and add it to the token's embedding. After this, every token's vector encodes both its meaning and its position.
input vector = token embedding + positional encoding
Sinusoidal Positional Encoding
The original Transformer used a clever sinusoidal scheme. Each position gets a vector built from sine and cosine waves of different frequencies:
PE(pos, 2i) = sin( pos / 10000^(2i/d) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/d) )
where pos is the token's position, i indexes the sine/cosine dimension pairs (i = 0, 1, ..., d/2 - 1), and d is the embedding dimension. Why this works well:
- Unique: every position gets a distinct pattern.
- Smooth: nearby positions have similar encodings.
- Relative positions: for a fixed offset k, PE(pos + k) is a linear function of PE(pos), so the model can readily learn "how far apart" two tokens are.
- Generalises: the formula is defined for any position, so encodings exist for sequences longer than those seen in training.
Code Example
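The following is a minimal NumPy sketch of the sinusoidal formulas given earlier in this article. The function name `sinusoidal_positional_encoding` and the `base` parameter are illustrative choices, and the sketch assumes an even embedding dimension.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int, base: float = 10000.0) -> np.ndarray:
    """Return a (max_len, d_model) matrix whose row `pos` holds PE(pos, :)."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    positions = np.arange(max_len)[:, np.newaxis]           # shape (max_len, 1)
    two_i = np.arange(0, d_model, 2)[np.newaxis, :]         # shape (1, d_model/2), values 2i
    angles = positions / np.power(base, two_i / d_model)    # pos / 10000^(2i/d)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)    # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16)
```

Position 0 always encodes to alternating 0s and 1s (sin 0 and cos 0), and every row has the same length, sqrt(d/2), because each sine/cosine pair contributes exactly 1 to the squared norm.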
Fixed vs. Learned Positional Encodings
There are two common families:
| Type | How it works | Trade-off |
|---|---|---|
| Fixed (sinusoidal) | Computed by a formula, not trained | No extra parameters; defined for any position, including lengths not seen in training |
| Learned | Position vectors learned during training | Simple, but limited to the maximum length used in training |
Some models (like BERT and GPT) use learned positional embeddings, treating positions much like tokens.
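A learned positional embedding is essentially a lookup table with one trainable row per position. The sketch below is a simplified illustration: the sizes and the names `position_table` and `lookup_positions` are hypothetical, and training is not shown. It also makes the length limit concrete, since there is simply no row for a position beyond the table.

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, d_model = 8, 4    # hypothetical sizes for illustration

# One row per position. In a real model these values would be updated by
# gradient descent; here they just keep their random initial values.
position_table = rng.normal(scale=0.02, size=(max_len, d_model))

def lookup_positions(seq_len: int) -> np.ndarray:
    """Return the position vectors for positions 0 .. seq_len - 1."""
    if seq_len > max_len:
        raise ValueError(f"sequence length {seq_len} exceeds trained maximum {max_len}")
    return position_table[:seq_len]

print(lookup_positions(5).shape)   # (5, 4)
```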
How It's Used in the Transformer
Positional encoding sits right at the input:
- Tokens become embeddings.
- A positional encoding is added to each embedding (element-wise).
- The combined vectors flow into the first encoder/decoder block.
So by the time data reaches attention, every token "knows" both what it is and where it is.
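The input steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the embedding table is random rather than trained, the token IDs are made up, and `sinusoidal_pe` is a compact helper written for this example.

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    """Compact sinusoidal encoding, shape (seq_len, d_model); assumes even d_model."""
    pos = np.arange(seq_len)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / 10000 ** (two_i / d_model)
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 16
embedding_table = rng.normal(size=(vocab_size, d_model))   # stand-in for trained embeddings

token_ids = np.array([5, 42, 7, 42])            # hypothetical IDs; token 42 appears twice
token_embeddings = embedding_table[token_ids]   # step 1: tokens become embeddings
block_input = token_embeddings + sinusoidal_pe(len(token_ids), d_model)  # step 2: element-wise add

# The two copies of token 42 share an embedding but differ once position is added.
print(np.allclose(token_embeddings[1], token_embeddings[3]))   # True
print(np.allclose(block_input[1], block_input[3]))             # False
```

`block_input` is what would flow into the first encoder/decoder block. Some implementations also scale the token embeddings before the addition; that step is omitted here.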
Modern Variants
The field has moved beyond the original sinusoids. Newer methods used in modern LLMs include:
- RoPE (Rotary Positional Embeddings): rotates vectors by position; very popular in current LLMs.
- Relative position encodings β encode distances between tokens directly.
- ALiBi (Attention with Linear Biases): biases attention by distance, helping with long contexts.
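To make the "rotates vectors by position" idea concrete, here is a rough sketch of RoPE applied to a single vector. It assumes an even dimension and rotates consecutive pairs of dimensions; real implementations differ in how they pair dimensions and operate on batched query/key tensors, so treat the function `rope` as illustrative only.

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate the pairs (x[0], x[1]), (x[2], x[3]), ... by the angle pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)    # one rotation frequency per pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin       # standard 2-D rotation
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)    # hypothetical query and key vectors

# Same offset (4 positions apart), very different absolute positions.
score_near_start = rope(q, 3) @ rope(k, 7)
score_shifted = rope(q, 103) @ rope(k, 107)
print(np.isclose(score_near_start, score_shifted))   # True
```

Because rotations preserve dot products up to the difference in angle, the attention score between a rotated query and key depends only on how far apart the two positions are, not on where they sit in the sequence.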
Summary
- Transformers process tokens in parallel, so they need positional encoding to know word order.
- Each position gets a signature vector that is added to the token embedding.
- The original method is sinusoidal: sine and cosine waves of varying frequencies give each position a unique, smoothly varying encoding that is also defined for sequences longer than those seen in training.
- Encodings can be fixed (sinusoidal) or learned; modern LLMs often use variants like RoPE.
- The result: every token's input vector encodes both meaning and position.