Transformer Architecture

Last updated: Jun 24, 2026

Author: Vinay Adari

The Transformer is the architecture behind most modern generative AI, including large language models and many image generators. Introduced in 2017 in the paper "Attention Is All You Need" (Vaswani et al.), it threw out the step-by-step recurrence of RNNs and replaced it with one powerful idea: attention. By processing an entire sequence in parallel and letting every token look directly at every other token, the Transformer addressed the two main limitations of RNNs and LSTMs: slow sequential training and difficulty with long-range dependencies.

This article gives the high-level map of the architecture. Each component named here gets its own deep-dive in the following subtopics.

💡 In one line: A Transformer processes a whole sequence at once and uses attention to let every token relate to every other token — making it fast, powerful, and scalable.

The Big Picture: Encoder & Decoder

The original Transformer has two main stacks:

Encoder — reads the input sequence and turns it into rich, context-aware representations (it understands the input).
Decoder — uses the encoder's output, plus what it has generated so far, to produce the output sequence one token at a time (it generates).

This encoder–decoder design was built for sequence-to-sequence tasks like translation (e.g. English → French).
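
The division of labour between the two stacks can be sketched as a generation loop: the encoder runs once over the input, and the decoder runs once per output token. This is a minimal sketch, not a real model. The functions `encode` and `decode_step`, the token ids `BOS` and `EOS`, and the random weights are all hypothetical stand-ins chosen for illustration, so the generated ids are meaningless.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 12, 8
BOS, EOS = 0, 1  # hypothetical "start" and "end" token ids

# Random toy weights; a trained model would learn these.
embed = rng.normal(size=(vocab_size, d_model))
out_proj = rng.normal(size=(d_model, vocab_size))

def encode(src_ids):
    """Toy stand-in for the encoder stack: one vector per input token."""
    return embed[src_ids]

def decode_step(memory, generated_ids):
    """Toy stand-in for the decoder stack: scores for the next token."""
    state = embed[generated_ids].mean(axis=0) + memory.mean(axis=0)
    return state @ out_proj

def greedy_generate(src_ids, max_len=10):
    memory = encode(src_ids)      # the encoder runs once
    generated = [BOS]
    for _ in range(max_len):      # the decoder runs once per new token
        logits = decode_step(memory, generated)
        next_id = int(np.argmax(logits))
        generated.append(next_id)
        if next_id == EOS:
            break
    return generated

output_ids = greedy_generate(np.array([3, 4, 5]))
print(output_ids)
```

The point of the sketch is the control flow: `memory` is computed once and reused, while the list of generated tokens grows by one on every pass through the loop.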

The Data Flow, Step by Step

Here's how information moves through a Transformer:

Tokenisation & Embeddings: the input text is split into tokens, and each token is mapped to a numeric vector (its embedding).
Positional Encoding: because the model reads all tokens at once, with no recurrence to imply order, position information is added to the embeddings so the model knows where each token sits.
Encoder stack: the embeddings pass through N encoder blocks with the same structure but separate weights (N = 6 in the original paper), each refining the representation using attention.
Decoder stack: the decoder takes the output generated so far (also embedded and positionally encoded) and passes it through N decoder blocks, attending both to itself and to the encoder's output.
Linear + Softmax: the decoder's final vector is projected to one score per vocabulary token, and softmax turns the scores into a probability distribution. The next token is then picked from it, either by taking the most likely one (greedy decoding) or by sampling.
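
The input and output ends of this pipeline can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the sinusoidal positional encoding follows the formula from the 2017 paper, but the sizes, the random embedding table, the random output projection, and the token ids are made up for illustration, and the encoder and decoder stacks are skipped entirely.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Position vectors of shape (seq_len, d_model); assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # 0, 2, 4, ...
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8          # toy sizes
token_ids = np.array([3, 1, 4, 1])   # pretend output of a tokeniser

# Steps 1 and 2: look up embeddings, then add position information.
embedding_table = rng.normal(size=(vocab_size, d_model))
x = embedding_table[token_ids] + sinusoidal_positional_encoding(len(token_ids), d_model)

# Steps 3 and 4 (the encoder and decoder stacks) would transform x here.

# Step 5: project the last position to vocabulary scores, then softmax.
output_projection = rng.normal(size=(d_model, vocab_size))
probs = softmax(x[-1] @ output_projection)
next_token = int(np.argmax(probs))   # greedy choice of the next token
```

Note that token id 1 appears twice in `token_ids` but ends up with two different vectors in `x`, because the positional encoding differs at positions 1 and 3.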

Inside an Encoder Block

Each of the N encoder blocks contains two sub-layers:

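Multi-Head Self-Attention: lets each token attend to every token in the input, with several attention heads running in parallel.
Feed-Forward Network (FFN): a two-layer network applied to each token position independently, using the same weights at every position.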

Each sub-layer is wrapped with a residual connection and layer normalization ("Add & Norm") for stable training.
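
One encoder block can be sketched in NumPy as follows. This is an illustrative sketch, not a reference implementation: it uses the post-norm ordering of the original paper (normalise after adding the residual), a ReLU feed-forward network, and a simplified layer norm with no learnable scale or shift. The parameter names (`w_q`, `w_k`, `w_v`, `w_o`, `w1`, `b1`, `w2`, `b2`), the sizes, and the random weights are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Simplified: normalises each token vector, no learnable gain or bias.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)   # each token's attention over all tokens
    heads = weights @ v                  # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

def feed_forward(x, w1, b1, w2, b2):
    # Applied to every token position independently with shared weights.
    return np.maximum(0, x @ w1 + b1) @ w2 + b2

def encoder_block(x, p, num_heads):
    attn = multi_head_self_attention(x, p["w_q"], p["w_k"], p["w_v"], p["w_o"], num_heads)
    x = layer_norm(x + attn)             # Add & Norm around attention
    ffn = feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"])
    return layer_norm(x + ffn)           # Add & Norm around the FFN

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 16, 32       # toy sizes
params = {
    "w_q": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w_k": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w_v": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w_o": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w1": rng.normal(scale=0.1, size=(d_model, d_ff)), "b1": np.zeros(d_ff),
    "w2": rng.normal(scale=0.1, size=(d_ff, d_model)), "b2": np.zeros(d_model),
}
x = rng.normal(size=(seq_len, d_model))
out = encoder_block(x, params, num_heads=4)
```

Because the output has the same shape as the input, blocks of this kind can be stacked N times without any glue code between them.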

Inside a Decoder Block

Each of the N decoder blocks has three sub-layers:

Masked Multi-Head Self-Attention — attends to previously generated tokens only (it can't peek at future ones).
Cross-Attention — attends to the encoder's output, connecting input and output.
Feed-Forward Network (FFN).

Again, each sub-layer uses Add & Norm.
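
The two attention sub-layers that distinguish the decoder can be sketched with one shared attention function. This is a simplified sketch: it uses a single head and skips the learned query, key, and value projections, so the inputs are used directly. The sequence lengths and random vectors are made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; mask is True where attending is allowed."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # blocked positions get ~zero weight
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 8
decoder_x = rng.normal(size=(4, d_model))    # 4 output tokens generated so far
encoder_out = rng.normal(size=(6, d_model))  # 6 input tokens from the encoder

# Masked self-attention: token i may only attend to tokens 0..i.
causal_mask = np.tril(np.ones((4, 4), dtype=bool))
_, self_weights = attention(decoder_x, decoder_x, decoder_x, mask=causal_mask)

# Cross-attention: queries come from the decoder, keys and values from the encoder.
cross_out, cross_weights = attention(decoder_x, encoder_out, encoder_out)

print(np.round(self_weights, 2))   # upper triangle is all zeros
print(cross_weights.shape)         # one row per output token, one column per input token
```

The only difference between the two calls is where the keys and values come from and whether a mask is applied, which is why the same attention routine serves both sub-layers.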

Key Components at a Glance

Component	Role	Deep-dive later in
Embeddings	Turn tokens into vectors	Tokens & Embeddings
Positional Encoding	Inject word order	Positional Encoding
Multi-Head Attention	Relate every token to every other	Attention Mechanism
Feed-Forward Network	Per-token transformation	Transformer Block
Add & Norm	Residual + layer normalization for stability	Transformer Block
Linear + Softmax	Predict the next token	Next Token Prediction

Why It Works So Well

Parallel processing: during training the whole sequence is processed at once, which maps well onto GPUs. (Generation still produces one token at a time.)
Attention: any token can relate directly to any other in a single step, so long-range dependencies do not have to survive a long chain of recurrent updates.
Stacked layers: repeating blocks build progressively richer representations.
Scaling: larger Transformers trained on more data and compute have improved predictably in practice (the empirical scaling laws), which is a major reason LLMs are built on them.

A Preview of the Variants

Not every model uses both stacks. Later in this topic we'll see:

Encoder-only (like BERT) — great for understanding tasks.
Decoder-only (like GPT) — built for generation (the basis of most LLMs).
Encoder–Decoder (like T5) — for sequence-to-sequence tasks.

Summary

The Transformer replaces recurrence with attention, processing sequences in parallel.
It has an Encoder (understands input) and a Decoder (generates output), each a stack of N blocks.
Data flows: tokens → embeddings + positional encoding → encoder → decoder → linear + softmax → next token.
Encoder blocks use self-attention + FFN; decoder blocks use masked self-attention, cross-attention, and an FFN.
Its parallelism, attention, and scalability make it the foundation of most modern generative AI. The next subtopics explore it piece by piece.