Transformer Architecture
The Transformer is the architecture behind most modern Generative AI, from chatbots to many image generators. Introduced in 2017 in the paper "Attention Is All You Need," it dropped the step-by-step recurrence of RNNs and replaced it with one powerful idea: attention. By processing an entire sequence in parallel and letting every token look directly at every other token, the Transformer removed the two main bottlenecks of RNNs and LSTMs: strictly sequential computation and difficulty carrying information across long distances.
This article gives the high-level map of the architecture. Each component named here gets its own deep-dive in the following subtopics.
💡 In one line: A Transformer processes a whole sequence at once and uses attention to let every token relate to every other token — making it fast, powerful, and scalable.
The Big Picture: Encoder & Decoder
The original Transformer has two main stacks:
- Encoder — reads the input sequence and turns it into rich, context-aware representations (it understands the input).
- Decoder — uses the encoder's output, plus what it has generated so far, to produce the output sequence one token at a time (it generates).
This encoder–decoder design was built for sequence-to-sequence tasks like translation (e.g. English → French).
The Data Flow, Step by Step
Here's how information moves through a Transformer:
- Tokenisation & Embeddings — the input text is split into tokens, and each token becomes a numeric vector (embedding).
- Positional Encoding — since the model reads everything at once (no order from recurrence), position information is added to the embeddings so the model knows token order.
- Encoder stack — the embeddings pass through N identical encoder blocks, each refining the representation using attention.
- Decoder stack — the decoder takes the output generated so far (also embedded + positionally encoded) and passes it through N decoder blocks, attending both to itself and to the encoder's output.
- Linear + Softmax — the decoder's final vector is projected to one score per vocabulary token and turned into a probability distribution; the next token is then picked from it, either the most likely one (greedy decoding) or by sampling.
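The input and output ends of this flow can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the toy sizes, the names `embedding_table` and `output_proj`, and the hard-coded `token_ids` are all assumptions made for the example, the weights are random (so the predicted token is meaningless), and the encoder and decoder stacks are replaced by a pass-through placeholder. The positional encoding uses the sinusoidal scheme from the original paper.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encoding; d_model is assumed to be even."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 16                     # toy sizes, chosen for illustration

# Hypothetical randomly initialised weights; a real model learns these.
embedding_table = rng.normal(size=(vocab_size, d_model))
output_proj = rng.normal(size=(d_model, vocab_size))

token_ids = np.array([3, 17, 42, 8])             # stand-in for a tokeniser's output

# Embeddings plus positional encoding.
x = embedding_table[token_ids] + sinusoidal_positions(len(token_ids), d_model)

# Placeholder: the encoder and decoder stacks would transform x here.
hidden = x

# Linear + softmax on the last position, then greedy choice of the next token.
logits = hidden[-1] @ output_proj
probs = softmax(logits)
next_token = int(np.argmax(probs))
```

Greedy `argmax` is used here only because it is the simplest selection rule; sampling from `probs` is the usual alternative.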
Inside an Encoder Block
Each of the N encoder blocks contains two sub-layers:
- Multi-Head Self-Attention — lets each token look at all other tokens in the input.
- Feed-Forward Network (FFN) — a two-layer network applied to each token position independently, with the same weights at every position.
Each sub-layer is wrapped with a residual connection and layer normalization ("Add & Norm") for stable training.
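As a rough sketch, one encoder block could look like this in NumPy. Several things are simplifying assumptions: the parameter names in `params` are invented for the example, the weights are random, layer normalization has no learnable scale or shift, and the block applies normalization after the residual add (the ordering used in the original paper; other orderings exist).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each token vector; learnable scale/shift omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                    # every token attends to every token
    heads = weights @ v                                   # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

def feed_forward(x, w1, b1, w2, b2):
    # Two linear layers with a ReLU in between, applied to each position independently.
    return np.maximum(0, x @ w1 + b1) @ w2 + b2

def encoder_block(x, p, n_heads):
    # Sub-layer 1: self-attention, then Add & Norm.
    x = layer_norm(x + multi_head_self_attention(
        x, p["w_q"], p["w_k"], p["w_v"], p["w_o"], n_heads))
    # Sub-layer 2: feed-forward network, then Add & Norm.
    x = layer_norm(x + feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"]))
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, n_heads = 5, 16, 64, 4            # toy sizes
params = {
    "w_q": rng.normal(size=(d_model, d_model)) * 0.1,
    "w_k": rng.normal(size=(d_model, d_model)) * 0.1,
    "w_v": rng.normal(size=(d_model, d_model)) * 0.1,
    "w_o": rng.normal(size=(d_model, d_model)) * 0.1,
    "w1": rng.normal(size=(d_model, d_ff)) * 0.1,
    "b1": np.zeros(d_ff),
    "w2": rng.normal(size=(d_ff, d_model)) * 0.1,
    "b2": np.zeros(d_model),
}

x = rng.normal(size=(seq_len, d_model))
out = encoder_block(x, params, n_heads)
```

Because the output has the same shape as the input, N such blocks can be stacked by feeding `out` into the next block.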
Inside a Decoder Block
Each of the N decoder blocks has three sub-layers:
- Masked Multi-Head Self-Attention — attends to previously generated tokens only (it can't peek at future ones).
- Cross-Attention — attends to the encoder's output, connecting input and output.
- Feed-Forward Network (FFN).
Again, each sub-layer uses Add & Norm.
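The two attention sub-layers that distinguish a decoder block can be illustrated with a single-head sketch. To keep it short, this example assumes no learned projections, no Add & Norm, and no FFN; `encoder_out` and `decoder_in` are random stand-ins for real activations. What it shows is only the masking and where the queries, keys, and values come from.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; mask is True where attending is allowed."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)    # blocked positions get ~zero weight
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 8
encoder_out = rng.normal(size=(6, d_model))      # 6 input tokens, already encoded
decoder_in = rng.normal(size=(4, d_model))       # 4 output tokens generated so far

# Masked self-attention: position i may attend only to positions <= i.
causal_mask = np.tril(np.ones((4, 4), dtype=bool))
self_out, self_weights = attention(decoder_in, decoder_in, decoder_in, mask=causal_mask)

# Cross-attention: queries come from the decoder, keys and values from the encoder.
cross_out, cross_weights = attention(self_out, encoder_out, encoder_out)
```

The upper triangle of `self_weights` is zero, which is the "can't peek at future tokens" rule, while `cross_weights` has one row per output token and one column per input token.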
Key Components at a Glance
| Component | Role | Deep-dive later in |
|---|---|---|
| Embeddings | Turn tokens into vectors | Tokens & Embeddings |
| Positional Encoding | Inject word order | Positional Encoding |
| Multi-Head Attention | Relate every token to every other | Attention Mechanism |
| Feed-Forward Network | Per-token transformation | Transformer Block |
| Add & Norm | Residual + layer normalization for stability | Transformer Block |
| Linear + Softmax | Predict the next token | Next Token Prediction |
Why It Works So Well
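- Parallel processing — during training the whole sequence is handled at once, so training is fast and GPU-friendly. (Generation still produces one token at a time.)
- Attention — any token can relate directly to any other in a single step, so long-range dependencies don't have to survive many recurrent updates the way they do in an RNN.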
- Stacked layers — repeating blocks build progressively richer representations.
- Scales well — bigger Transformers trained on more data keep improving in a fairly predictable way (scaling laws), which is a major reason LLMs are built on them.
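The first two points can be made concrete with a toy comparison. The sketch below assumes a bare-bones tanh RNN cell and unprojected single-head attention, both with random inputs and weights, purely to contrast the shape of the computation rather than to model either architecture faithfully.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4
x = rng.normal(size=(seq_len, d))

# Recurrent style: step t needs the hidden state from step t-1,
# so the loop over positions cannot be parallelised.
w_x = rng.normal(size=(d, d))
w_h = rng.normal(size=(d, d))
h = np.zeros(d)
rnn_states = []
for t in range(seq_len):
    h = np.tanh(x[t] @ w_x + h @ w_h)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Attention style: all pairwise token scores come from one matrix product,
# so every position is processed at the same time.
scores = x @ x.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attn_out = weights @ x
```

In the RNN, information from the first token reaches the last only through five successive updates of `h`; in the attention version, `weights[5, 0]` connects them directly.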
A Preview of the Variants
Not every model uses both stacks. Later in this topic we'll see:
- Encoder-only (like BERT) — great for understanding tasks.
- Decoder-only (like GPT) — built for generation (the basis of most LLMs).
- Encoder–Decoder (like T5) — for sequence-to-sequence tasks.
Summary
- The Transformer replaces recurrence with attention, processing sequences in parallel.
- It has an Encoder (understands input) and a Decoder (generates output), each a stack of N blocks.
- Data flows: tokens → embeddings + positional encoding → encoder → decoder → linear + softmax → next token.
- Encoder blocks use self-attention + FFN; decoder blocks use masked self-attention, add cross-attention over the encoder's output, and end with an FFN.
- Its parallelism, attention, and scalability make it the foundation of most modern Generative AI. Each piece is explored in the next subtopics.