Encoder & Decoder
The Transformer is built from two kinds of stacks: the Encoder and the Decoder. They look similar (both are stacks of repeated blocks), but they play complementary roles. The encoder's job is to understand the input; the decoder's job is to generate the output. Understanding how they differ and how they connect is the key to understanding both the original Transformer and the model variants (BERT, GPT, T5) that use one stack, the other, or both.
💡 In one line: The encoder reads and understands the input; the decoder generates the output one token at a time, using the encoder's understanding.
The Encoder: Understanding the Input
The encoder takes the input sequence (embeddings + positional encoding) and transforms it into a set of rich, context-aware representations — one vector per input token.
Its defining feature is bidirectional self-attention: every token can look at every other token, both to its left and right. This lets the encoder build a deep, full-context understanding of the input.
Each encoder block has two sub-layers:
- Self-Attention (bidirectional)
- Feed-Forward Network
…each wrapped in Add & Norm (a residual connection followed by layer normalization). The block is repeated N times, with each block's output feeding the next.
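The two sub-layers and their Add & Norm wrappers can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions, not a reference implementation: it uses a single attention head, random untrained weights, no biases, and a layer norm without learned gain. The names `encoder_block` and `init_block` and the sizes (`d_model = 8`, `d_ff = 16`, 5 tokens, 2 blocks) are chosen here for illustration only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each token vector; the learned gain and bias are omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v  # no mask: every token sees every token

def encoder_block(x, p):
    # Sub-layer 1: bidirectional self-attention, then Add & Norm.
    x = layer_norm(x + self_attention(x, p["Wq"], p["Wk"], p["Wv"]))
    # Sub-layer 2: position-wise feed-forward network (ReLU), then Add & Norm.
    ffn = np.maximum(0, x @ p["W1"]) @ p["W2"]
    return layer_norm(x + ffn)

def init_block(rng, d_model, d_ff):
    # Random weights stand in for trained parameters.
    shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model),
              "Wv": (d_model, d_model), "W1": (d_model, d_ff),
              "W2": (d_ff, d_model)}
    return {name: rng.normal(scale=0.1, size=shape)
            for name, shape in shapes.items()}

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens, n_blocks = 8, 16, 5, 2
blocks = [init_block(rng, d_model, d_ff) for _ in range(n_blocks)]

x = rng.normal(size=(n_tokens, d_model))  # stand-in for embeddings + positional encoding
for p in blocks:                          # the block is repeated N times
    x = encoder_block(x, p)

print(x.shape)  # (5, 8): one context-aware vector per input token
```

The output keeps the input's shape, which is what allows identical blocks to be stacked.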
The Decoder: Generating the Output
The decoder produces the output sequence one token at a time (auto-regressively). It has two kinds of attention:
- Masked self-attention: each position can attend only to itself and earlier positions, never future ones. This makes sense: when generating, the future tokens don't exist yet.
- Cross-attention: the decoder looks at the encoder's output, so it can use the meaning of the input while generating.
Each decoder block has three sub-layers:
- Masked Self-Attention
- Cross-Attention (to the encoder's output)
- Feed-Forward Network
…each wrapped in Add & Norm, repeated N times.
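The three decoder sub-layers can be sketched the same way. The code below is a simplified illustration, assuming a single attention head, random untrained weights, and no biases or learned layer-norm parameters. The names `attention`, `decoder_block`, `memory` (the encoder's output), and `params` are hypothetical, picked for this example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(queries_from, keys_values_from, W, causal=False):
    Wq, Wk, Wv = W
    q = queries_from @ Wq
    k, v = keys_values_from @ Wk, keys_values_from @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if causal:
        # Hide every position to the right of the current one.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def decoder_block(y, memory, p):
    # Sub-layer 1: masked self-attention over the output so far.
    y = layer_norm(y + attention(y, y, p["self"], causal=True))
    # Sub-layer 2: cross-attention; queries from the decoder,
    # keys and values from the encoder's output.
    y = layer_norm(y + attention(y, memory, p["cross"]))
    # Sub-layer 3: position-wise feed-forward network.
    ffn = np.maximum(0, y @ p["W1"]) @ p["W2"]
    return layer_norm(y + ffn)

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16

def random_projections():
    return [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3)]

params = {"self": random_projections(), "cross": random_projections(),
          "W1": rng.normal(scale=0.1, size=(d_model, d_ff)),
          "W2": rng.normal(scale=0.1, size=(d_ff, d_model))}

memory = rng.normal(size=(5, d_model))  # stand-in for the encoder output (5 input tokens)
y = rng.normal(size=(3, d_model))       # stand-in for 3 embedded output tokens
out = decoder_block(y, memory, params)
print(out.shape)  # (3, 8): one vector per output position
```

Note that the input and output sequences may have different lengths: the cross-attention step maps 3 decoder positions onto 5 encoder positions without any reshaping.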
The Three Types of Attention
The encoder–decoder design uses attention in three distinct ways:
| Attention type | Where | What it does |
|---|---|---|
| Self-attention (bidirectional) | Encoder | Input tokens attend to all input tokens |
| Masked self-attention | Decoder | Output tokens attend only to themselves and earlier output tokens |
| Cross-attention | Decoder | Output tokens (queries) attend to the encoder's output (keys and values) |
(The mechanics of each are covered in the Attention Mechanism subtopics.)
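One way to see that these are three uses of the same operation is to compute only the attention weights and compare their shapes and zero patterns. The sketch below assumes scaled dot-product attention and skips the learned query/key/value projections for brevity; `attention_weights` and the toy sizes are illustrative names and values, not part of any library.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_weights(queries, keys, causal=False):
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    if causal:
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores)  # each row sums to 1

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))  # 5 input-token vectors
dec = rng.normal(size=(3, 8))  # 3 output-token vectors generated so far

encoder_self = attention_weights(enc, enc)               # input -> input
decoder_self = attention_weights(dec, dec, causal=True)  # output -> earlier output
cross = attention_weights(dec, enc)                      # output -> input

print(encoder_self.shape, decoder_self.shape, cross.shape)  # (5, 5) (3, 3) (3, 5)
print(np.round(decoder_self, 2))  # zeros above the diagonal: no attention to the future
```

The only differences are where the queries and keys come from and whether the causal mask is applied.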
How They Connect
The flow is straightforward:
- The encoder runs once over the full input and produces its representations.
- The decoder generates the output token by token. At each step, every decoder block uses cross-attention to pull in that same encoder output.
Cross-attention is the bridge between input and output — it's how a translation model, for example, keeps the output faithful to the source sentence.
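The control flow described above (encode once, then decode step by step) can be sketched as a greedy decoding loop. Everything model-related here is a deliberately tiny stand-in: `encode` and `decode_step` are hypothetical functions with random weights, the vocabulary size and the `BOS`/`EOS` ids are made up, and a real decoder would run its full stack (including masked self-attention over all of `out_ids`) inside `decode_step`.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL, BOS, EOS = 10, 8, 0, 1       # toy sizes and special-token ids (assumed)
embed = rng.normal(size=(VOCAB, D_MODEL))    # random stand-in for a trained embedding table
calls = {"encode": 0}

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encode(src_ids):
    """Stand-in for the encoder stack: one vector per input token."""
    calls["encode"] += 1
    return embed[src_ids]

def decode_step(memory, out_ids):
    """Stand-in for the decoder stack: logits for the next token."""
    query = embed[out_ids[-1]]                               # newest output token
    weights = softmax(query @ memory.T / np.sqrt(D_MODEL))   # cross-attention over the input
    context = weights @ memory
    return (query + context) @ embed.T

def greedy_decode(src_ids, max_len=6):
    memory = encode(src_ids)       # the encoder runs once
    out_ids = [BOS]
    for _ in range(max_len):       # the decoder runs once per generated token
        next_id = int(np.argmax(decode_step(memory, out_ids)))
        out_ids.append(next_id)
        if next_id == EOS:
            break
    return out_ids

result = greedy_decode([3, 4, 5])
print(result, "encoder calls:", calls["encode"])
```

The point of the sketch is that `memory` is computed a single time and then passed unchanged into every decoding step.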
Why Mask the Decoder?
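During generation, the model predicts the next token from the tokens produced so far. During training, however, the entire target sequence is fed to the decoder at once, so without a mask each position could see the very token it is supposed to predict and would simply "cheat" by copying the answer. Masking hides future positions so that training matches the real generation setting, where the future is genuinely unknown.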
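The leak can be demonstrated directly: change only the last token of a sequence and check whether the outputs at earlier positions move. The sketch below assumes plain dot-product self-attention without learned projections, and the helper name `earlier_positions_change` is made up for this example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, causal):
    scores = x @ x.T / np.sqrt(x.shape[-1])
    if causal:
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ x

rng = np.random.default_rng(0)
targets = rng.normal(size=(4, 8))  # 4 embedded target tokens, fed in together as in training
altered = targets.copy()
altered[3] = rng.normal(size=8)    # change only the last (future) token

def earlier_positions_change(causal):
    before = self_attention(targets, causal)
    after = self_attention(altered, causal)
    return not np.allclose(before[:3], after[:3])

print("unmasked:", earlier_positions_change(causal=False))  # True: the future leaks backwards
print("masked:  ", earlier_positions_change(causal=True))   # False: earlier outputs are unaffected
```

With the mask in place, the output at each position is a function of that position and its past only, which is exactly the information available at generation time.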
Encoder-only, Decoder-only, Encoder–Decoder
Not every model uses both stacks — and this choice defines the major model families:
| Design | Example | Best for |
|---|---|---|
| Encoder-only | BERT | Understanding tasks (classification, embeddings) |
| Decoder-only | GPT | Text generation (most modern LLMs) |
| Encoder–Decoder | T5, original Transformer | Sequence-to-sequence (translation, summarisation) |
These three variants are the subject of the next subtopic.
Summary
- The Transformer has an Encoder (understands input) and a Decoder (generates output).
- The encoder uses bidirectional self-attention to see the whole input at once.
- The decoder uses masked self-attention (current and past positions only) plus cross-attention to the encoder.
- Cross-attention is the bridge connecting input understanding to output generation.
- Models can be encoder-only (BERT), decoder-only (GPT), or encoder–decoder (T5) — explored next.