Positional Encoding

Transformers process every token in a sequence at the same time, in parallel. This is what makes them fast, but it creates a problem: with everything processed at once, the model has no idea what order the words came in. Yet word order is essential to meaning. Positional encoding is the solution: it injects information about each token's position into its embedding, so the model knows the sequence order.

💡 In one line: Positional encoding adds word-order information to embeddings, because a Transformer processes all tokens at once and would otherwise be order-blind.

Why is Positional Encoding Needed?

An RNN reads tokens one at a time, so order is built in automatically. A Transformer doesn't: it sees the whole sequence simultaneously, and self-attention on its own treats the input as an unordered set of tokens. Without position information, these two sentences would look identical to the model:

  • "The cat sat on the mat."
  • "The mat sat on the cat."

Same words, completely different meaning. So we must explicitly tell the model where each token sits in the sequence.
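
To make the order-blindness concrete, here is a minimal sketch in plain Python. It assumes a toy single-head attention with no learned projections and made-up 2-d embeddings (the EMB table and the self_attention helper are illustrative, not taken from any library). Without positional information, the output for "cat" comes out the same in both sentences:

```python
import math

# Toy 2-d embeddings; the values are arbitrary and purely illustrative.
EMB = {
    "the": [0.1, 0.3],
    "cat": [0.9, 0.2],
    "sat": [0.4, 0.8],
    "on": [0.2, 0.5],
    "mat": [0.7, 0.6],
}

def self_attention(vectors):
    """Unscaled dot-product attention with identity Q/K/V projections."""
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) for k in vectors]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out = [sum(w * v[d] for w, v in zip(weights, vectors)) / total
               for d in range(len(q))]
        outputs.append(out)
    return outputs

s1 = "the cat sat on the mat".split()
s2 = "the mat sat on the cat".split()
out1 = self_attention([EMB[w] for w in s1])
out2 = self_attention([EMB[w] for w in s2])

# "cat" is the 2nd token in s1 and the last token in s2.
cat1 = out1[s1.index("cat")]
cat2 = out2[s2.index("cat")]
same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(cat1, cat2))
print(same)
```

Each token attends to the same collection of vectors in both sentences, so nothing in the output reveals who sat on whom.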

The Core Idea

The approach is simple: give each position a unique signature vector, and add it to the token's embedding. After this, every token's vector encodes both its meaning and its position.

input vector = token embedding + positional encoding

Sinusoidal Positional Encoding

The original Transformer used a clever sinusoidal scheme. Each position gets a vector built from sine and cosine waves of different frequencies:

PE(pos, 2i)   = sin( pos / 10000^(2i/d) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/d) )


where pos is the token's position, i indexes the sine/cosine pair (i = 0, 1, ..., d/2 - 1), and d is the embedding dimension. Why this works well:

  • Unique: every position gets a distinct pattern.
  • Smooth: nearby positions have similar encodings.
  • Relative positions: for any fixed offset k, PE(pos + k) is a linear function of PE(pos), so the model can easily learn "how far apart" two tokens are.
  • Generalises: the formula is defined for any position, so it can in principle be applied to sequences longer than those seen in training.
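
The relative-position property can be checked with the angle-addition identities. In the sketch below, \omega_i is a shorthand introduced here for the frequency of pair i; it is not notation from the formulas above. For a fixed offset k, each sine/cosine pair at position pos + k is a rotation of the pair at position pos, and the rotation depends only on k:

```latex
\begin{aligned}
\omega_i &= 10000^{-2i/d} \\
\begin{pmatrix} PE(pos+k,\, 2i) \\ PE(pos+k,\, 2i+1) \end{pmatrix}
&= \begin{pmatrix} \sin(\omega_i (pos+k)) \\ \cos(\omega_i (pos+k)) \end{pmatrix} \\
&= \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} PE(pos,\, 2i) \\ PE(pos,\, 2i+1) \end{pmatrix}
\end{aligned}
```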

Code Example
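
The sinusoidal formulas can be sketched in a few lines of plain Python. This is a minimal, dependency-free illustration rather than a reference implementation: the function name sinusoidal_positional_encoding and the base parameter are choices made here, and a real model would normally build the table with an array library.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model, base=10000.0):
    """Return a seq_len x d_model table; row pos encodes position pos."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so sine and cosine pair up")
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(d_model // 2):
            # Same angle for the sine (even index) and cosine (odd index).
            angle = pos / (base ** (2 * i / d_model))
            pe[pos][2 * i] = math.sin(angle)
            pe[pos][2 * i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(seq_len=4, d_model=6)
for row in pe:
    print([round(x, 3) for x in row])
```

Plain lists keep the sketch self-contained, but note that + on Python lists concatenates. An element-wise sum needs an explicit loop or an array library such as NumPy.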


The positional encoding matrix (here called pe) is then simply added to the token embeddings:

input_vectors = token_embeddings + pe

Fixed vs. Learned Positional Encodings

There are two common families:

Type                | How it works                              | Trade-off
Fixed (sinusoidal)  | Computed by a formula, not trained        | Generalises to longer/unseen sequences
Learned             | Position vectors learned during training  | Simple, but limited to trained length

Some models (like BERT and GPT) use learned positional embeddings, treating positions much like tokens.
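
A learned positional embedding is essentially a lookup table with one trainable row per position. The sketch below is a toy version under stated assumptions: the class name LearnedPositionalEmbedding, the 0.02 initialisation scale, and the max_len parameter are illustrative, and no training loop is shown. It also shows the trade-off in action: a position beyond the table has no vector at all.

```python
import random

class LearnedPositionalEmbedding:
    """Toy lookup table with one vector per position."""

    def __init__(self, max_len, d_model, seed=0):
        rng = random.Random(seed)
        # Small random starting values; training would update them
        # like any other weight in the model.
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                      for _ in range(max_len)]

    def __call__(self, positions):
        return [self.table[p] for p in positions]

pos_emb = LearnedPositionalEmbedding(max_len=8, d_model=4)
vectors = pos_emb(range(5))  # positions 0..4 fit inside the table

try:
    pos_emb([8])  # position 8 was never allocated
    overflow_ok = True
except IndexError:
    overflow_ok = False

print(len(vectors), overflow_ok)
```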

How It's Used in the Transformer

Positional encoding sits right at the input:

  1. Tokens become embeddings.
  2. A positional encoding is added to each embedding (element-wise).
  3. The combined vectors flow into the first encoder/decoder block.

So by the time data reaches attention, every token "knows" both what it is and where it is.
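
The three steps can be sketched end to end. Everything named here (VOCAB, EMBEDDINGS, D_MODEL, positional_encoding, embed) is a hypothetical toy setup with made-up values, and the sketch assumes the sinusoidal scheme for step 2:

```python
import math

VOCAB = {"the": 0, "cat": 1, "sat": 2}  # hypothetical toy vocabulary
D_MODEL = 4
# Hypothetical embedding table: one row per vocabulary entry.
EMBEDDINGS = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of a single position."""
    vec = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        vec += [math.sin(angle), math.cos(angle)]
    return vec

def embed(tokens):
    inputs = []
    for pos, tok in enumerate(tokens):
        emb = EMBEDDINGS[VOCAB[tok]]             # step 1: token -> embedding
        pe = positional_encoding(pos, D_MODEL)   # step 2: encoding for this slot
        inputs.append([e + p for e, p in zip(emb, pe)])  # element-wise sum
    return inputs  # step 3: these vectors go into the first block

x = embed(["the", "cat", "sat"])
y = embed(["cat", "the", "sat"])
print(x[1] == y[0])  # "cat" at position 1 vs. "cat" at position 0
print(x[2] == y[2])  # "sat" at position 2 in both
```

The same word gets a different input vector when it moves, while a word that stays in the same slot keeps the same vector.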

Modern Variants

The field has moved beyond the original sinusoids. Newer methods used in modern LLMs include:

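  • RoPE (Rotary Positional Embeddings): rotates query and key vectors by an angle proportional to position, so attention scores depend on relative offsets; very popular in current LLMs.
  • Relative position encodings: encode distances between tokens directly.
  • ALiBi (Attention with Linear Biases): adds a penalty to attention scores that grows linearly with distance, helping with long contexts.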
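
To give a feel for two of these, here is a rough sketch in plain Python. The rope, dot, and alibi_bias helpers are simplified illustrations written for this article, not any library's API. The sketch assumes the interleaved-pair convention for RoPE and a single hand-picked slope for ALiBi, and it leaves out the rest of the attention computation.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each (even, odd) pair of vec by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def alibi_bias(query_pos, key_pos, slope):
    """Penalty added to an attention score (assumes key_pos <= query_pos)."""
    return -slope * (query_pos - key_pos)

q = [0.3, -1.2, 0.8, 0.5]  # arbitrary example query
k = [1.0, 0.4, -0.6, 0.9]  # arbitrary example key

# The same offset (3) at different absolute positions gives the same score.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
print(math.isclose(score_a, score_b, abs_tol=1e-9))

# ALiBi: a key 3 tokens back is penalised by slope * 3.
print(alibi_bias(10, 7, slope=0.5))
```

In both cases the position signal enters at the attention step rather than being added to the input embeddings.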

Summary

  • Transformers process tokens in parallel, so they need positional encoding to know word order.
  • Each position gets a signature vector that is added to the token embedding.
  • The original method is sinusoidal (sine/cosine waves of varying frequencies), which is unique, smooth, and can in principle extend to longer sequences.
  • Encodings can be fixed (sinusoidal) or learned; modern LLMs often use variants like RoPE.
  • The result: every token's input vector encodes both meaning and position.