Encoder & Decoder in Transformers

Last updated: Jun 27, 2026

Author: Vinay Adari

Encoder & Decoder

The Transformer is built from two kinds of stacks: the Encoder and the Decoder. They look similar, since both are stacks of repeated blocks, but they play different roles. The encoder's job is to understand the input; the decoder's job is to generate the output. Understanding how they differ and how they connect is the key to understanding both the original Transformer and the model variants (BERT, GPT, T5) that use one stack, the other, or both.

💡 In one line: The encoder reads and understands the input; the decoder generates the output one token at a time, using the encoder's understanding.

The Encoder: Understanding the Input

The encoder takes the input sequence (embeddings + positional encoding) and transforms it into a set of rich, context-aware representations — one vector per input token.

Its defining feature is bidirectional self-attention: every token can look at every other token, both to its left and right. This lets the encoder build a deep, full-context understanding of the input.

Each encoder block has two sub-layers:

Self-Attention (bidirectional)
Feed-Forward Network

…each wrapped in Add & Norm (a residual connection followed by layer normalisation). The block is repeated N times.
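The encoder block described above can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions, not a reference implementation: it uses a single attention head, omits biases and the learned scale and shift of layer normalisation, and applies Add & Norm after each sub-layer. The parameter names (w_q, w_k, w_v, w_1, w_2) and the sizes are made up for the example.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Simplified: no learned gain or bias (an assumption for brevity)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores)  # no mask: every token attends to every token
    return weights @ v, weights

def encoder_block(x, p):
    attn_out, weights = self_attention(x, p["w_q"], p["w_k"], p["w_v"])
    x = layer_norm(x + attn_out)                      # Add & Norm around sub-layer 1
    ffn_out = np.maximum(0, x @ p["w_1"]) @ p["w_2"]  # position-wise feed-forward (ReLU)
    x = layer_norm(x + ffn_out)                       # Add & Norm around sub-layer 2
    return x, weights

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 16, 5  # hypothetical toy sizes
params = {
    "w_q": 0.1 * rng.normal(size=(d_model, d_model)),
    "w_k": 0.1 * rng.normal(size=(d_model, d_model)),
    "w_v": 0.1 * rng.normal(size=(d_model, d_model)),
    "w_1": 0.1 * rng.normal(size=(d_model, d_ff)),
    "w_2": 0.1 * rng.normal(size=(d_ff, d_model)),
}

x = rng.normal(size=(seq_len, d_model))  # stands in for embeddings + positional encoding
out, weights = encoder_block(x, params)
print(out.shape, weights.shape)  # (5, 8) (5, 5)
```

The output keeps one vector per input token, and the attention weight matrix is fully populated, which is what "bidirectional" means here. Stacking N such blocks amounts to feeding `out` into the next block with its own parameters.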

The Decoder: Generating the Output

The decoder produces the output sequence one token at a time (auto-regressively). It has two kinds of attention:

Masked self-attention: each position can attend only to itself and earlier positions, never future ones. This makes sense: when generating, the future tokens don't exist yet.
Cross-attention: each decoder position attends over the encoder's output, so it can use the meaning of the input while generating.

Each decoder block has three sub-layers:

Masked Self-Attention
Cross-Attention (to the encoder's output)
Feed-Forward Network

…each wrapped in Add & Norm, repeated N times.
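To make the three sub-layers concrete, here is a hedged NumPy sketch of a single decoder block. It assumes a single head, no biases, and a simplified layer normalisation without learned parameters. The `memory` array stands in for the encoder's output, and all names and sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attention(q_in, kv_in, w, mask=None):
    """Queries come from q_in; keys and values come from kv_in."""
    w_q, w_k, w_v = w
    q, k, v = q_in @ w_q, kv_in @ w_k, kv_in @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # blocked positions get zero weight
    weights = softmax(scores)
    return weights @ v, weights

def decoder_block(y, memory, p):
    t = y.shape[0]
    causal = np.tril(np.ones((t, t), dtype=bool))  # position i sees positions 0..i

    a, self_w = attention(y, y, p["self"], mask=causal)  # 1. masked self-attention
    y = layer_norm(y + a)
    c, cross_w = attention(y, memory, p["cross"])        # 2. cross-attention to the encoder
    y = layer_norm(y + c)
    f = np.maximum(0, y @ p["w_1"]) @ p["w_2"]           # 3. feed-forward network
    y = layer_norm(y + f)
    return y, self_w, cross_w

rng = np.random.default_rng(0)
d_model, d_ff, src_len, tgt_len = 8, 16, 6, 4  # hypothetical toy sizes

def proj():
    return tuple(0.1 * rng.normal(size=(d_model, d_model)) for _ in range(3))

params = {
    "self": proj(),
    "cross": proj(),
    "w_1": 0.1 * rng.normal(size=(d_model, d_ff)),
    "w_2": 0.1 * rng.normal(size=(d_ff, d_model)),
}

memory = rng.normal(size=(src_len, d_model))  # stand-in for the encoder's output
y = rng.normal(size=(tgt_len, d_model))       # stand-in for embedded output tokens
out, self_w, cross_w = decoder_block(y, memory, params)
print(out.shape, self_w.shape, cross_w.shape)  # (4, 8) (4, 4) (4, 6)
```

The self-attention weights form a lower-triangular 4 x 4 matrix, while the cross-attention weights are a full 4 x 6 matrix: every output position may look at every input position, but only at output positions up to and including its own.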

The Three Types of Attention

The encoder–decoder design uses attention in three distinct ways:

Attention type	Where	What it does
Self-attention (bidirectional)	Encoder	Input tokens attend to all input tokens
Masked self-attention	Decoder	Output tokens attend only to themselves and earlier output tokens
Cross-attention	Decoder	Output tokens attend to the encoder's output

(The mechanics of each are covered in the Attention Mechanism subtopics.)

How They Connect

The flow is straightforward:

The encoder runs once over the full input and produces its representations.
The decoder generates the output token by token. At every decoder block, cross-attention pulls in the encoder's output.

Cross-attention is the bridge between input and output — it's how a translation model, for example, keeps the output faithful to the source sentence.
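The flow above can be sketched as a greedy decoding loop. The example below is a toy under stated assumptions: `encode` and `decode_step` are hypothetical stand-ins for the real encoder and decoder stacks, the weights are random and untrained (so the generated token ids are meaningless), and the BOS and EOS ids are made up. The point is only the control flow: the encoder is called once, and the decoder is called once per generated token with the same encoder output.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8  # hypothetical toy sizes
BOS, EOS = 0, 1              # hypothetical special-token ids
embed = rng.normal(size=(vocab_size, d_model))
w_out = rng.normal(size=(d_model, vocab_size))
calls = {"encoder": 0, "decoder": 0}

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def encode(src_ids):
    """Stand-in for the encoder stack: one vector per source token."""
    calls["encoder"] += 1
    return np.tanh(embed[src_ids])

def decode_step(memory, tgt_ids):
    """Stand-in for the decoder stack: returns logits for the next token."""
    calls["decoder"] += 1
    # Crude summary of the tokens generated so far (in place of masked self-attention)
    query = embed[tgt_ids].mean(axis=0)
    # Cross-attention: weight each source position by its similarity to the query
    weights = softmax(memory @ query / np.sqrt(d_model))
    context = weights @ memory
    return (query + context) @ w_out

def greedy_decode(src_ids, max_len=6):
    memory = encode(src_ids)  # the encoder runs once over the full input
    tgt = [BOS]
    for _ in range(max_len):  # the decoder runs once per output token
        logits = decode_step(memory, np.array(tgt))
        next_id = int(np.argmax(logits))
        tgt.append(next_id)
        if next_id == EOS:
            break
    return tgt

output = greedy_decode(np.array([2, 3, 4, 5]))
print(output, calls)
```

Because `memory` does not change during generation, it is computed once and reused at every step.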

Why Mask the Decoder?

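During generation, the model predicts the next token from the tokens produced so far. During training, however, the full target sequence is fed to the decoder at once so that all positions can be trained in parallel. Without a mask, each position could see the very token it is supposed to predict and would simply "cheat" by copying the answer. Masking hides future positions so that training matches the real generation setting, where the future is genuinely unknown.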
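A small experiment illustrates the effect of the mask. The sketch below is an assumption-laden simplification: it uses the input itself as queries, keys, and values, with no learned projections. It changes only the last token of a sequence and then checks whether the outputs at the earlier positions move.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, causal):
    # Simplification: queries = keys = values = x (no learned projections)
    scores = x @ x.T / np.sqrt(x.shape[-1])
    if causal:
        t = x.shape[0]
        mask = np.tril(np.ones((t, t), dtype=bool))  # position i sees positions 0..i
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))          # 5 tokens, 4 dimensions (toy sizes)
x_changed = x.copy()
x_changed[-1] = rng.normal(size=4)   # alter only the last ("future") token

# Compare the outputs at the first four positions before and after the change
masked_same = np.allclose(self_attention(x, True)[:-1],
                          self_attention(x_changed, True)[:-1])
unmasked_same = np.allclose(self_attention(x, False)[:-1],
                            self_attention(x_changed, False)[:-1])
print(masked_same, unmasked_same)
```

With the causal mask, the earlier positions are unaffected by the change, so `masked_same` is True. Without it, information from the last token leaks backwards and `unmasked_same` should come out False.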

Encoder-only, Decoder-only, Encoder–Decoder

Not every model uses both stacks — and this choice defines the major model families:

Design	Example	Best for
Encoder-only	BERT	Understanding tasks (classification, embeddings)
Decoder-only	GPT	Text generation (most modern LLMs)
Encoder–Decoder	T5, original Transformer	Sequence-to-sequence (translation, summarisation)

These three variants are the subject of the next subtopic.

Summary

The Transformer has an Encoder (understands input) and a Decoder (generates output).
The encoder uses bidirectional self-attention to see the whole input at once.
The decoder uses masked self-attention (current and earlier positions only) plus cross-attention to the encoder.
Cross-attention is the bridge connecting input understanding to output generation.
Models can be encoder-only (BERT), decoder-only (GPT), or encoder–decoder (T5) — explored next.