Layer Normalization
Layer normalization is the "Norm" in every "Add & Norm" step of a Transformer. Right after the residual addition, it rescales the values flowing through the network so they stay in a consistent range. It is a small ingredient but an important one: it makes training faster and more stable, which is a large part of why deep Transformers can be trained reliably.
💡 In one line: Layer normalization rescales each token's vector to a stable range (mean 0, std 1), keeping deep Transformers easy to train.
Why Normalize?
In a deep network, the values (activations) passing from layer to layer can grow or shrink wildly. When that happens, training becomes slow and unstable — gradients explode or vanish, and each layer keeps having to adapt to shifting input ranges.
Normalization fixes this by rescaling values to a consistent distribution, so every layer receives inputs that are well-behaved and predictable.
How Layer Normalization Works
Layer norm operates on each token's vector independently. For a token's feature vector x:
- Compute the mean (μ) and standard deviation (σ) across its features.
- Normalize: (x − μ) / σ. The vector now has mean 0 and standard deviation 1.
- Scale and shift, element-wise, using two learnable parameters, γ (gamma) and β (beta):
LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β
The γ and β let the model re-adjust the scale and offset if that helps — so normalization doesn't force a rigid distribution. (ε is a tiny constant for numerical safety.)
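The steps above can be sketched in a few lines of dependency-free Python. This is a minimal illustration rather than a production implementation: the function name `layer_norm`, the default `eps=1e-5`, and the sample vector are assumptions made for the example, and ε is added to the variance inside the square root, which is how common implementations apply it.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's feature vector, then scale and shift it."""
    n = len(x)
    mean = sum(x) / n
    # Population variance (divide by n) over the token's own features.
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)  # eps guards against division by zero
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

# Illustrative input: one token with four features.
token = [2.0, 4.0, 6.0, 8.0]
gamma = [1.0] * len(token)  # scale of 1 and shift of 0 leave the
beta = [0.0] * len(token)   # normalized values unchanged

normalized = layer_norm(token, gamma, beta)
print([round(v, 3) for v in normalized])  # [-1.342, -0.447, 0.447, 1.342]
```

With γ = 1 and β = 0 the output has mean 0 and standard deviation very close to 1. Passing other values for `gamma` and `beta` moves the output to a different scale and offset, which is the freedom the learnable parameters give the model.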
Layer Norm vs. Batch Norm
| Aspect | Batch Normalization | Layer Normalization |
|---|---|---|
| Normalizes across | The batch (per feature) | The features (per token) |
| Depends on batch size? | Yes | No |
| Variable-length sequences | Awkward | Handles them naturally |
| Works at batch size 1? | Poorly | Yes |
| Typically used in | CNNs | Transformers |
This is why Transformers use layer norm: it works per token, independent of batch size or sequence length — perfect for text, where sequences vary in length.
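The difference in normalization axis can be made concrete with a small sketch. The helper names (`standardize`, `layer_norm_rows`, `batch_norm_columns`) and the sample data are hypothetical, and both functions are simplified: they leave out γ and β, and the batch-norm version ignores the running statistics a real implementation keeps for inference.

```python
import math

def standardize(values, eps=1e-5):
    """Rescale a list of numbers to roughly mean 0 and std 1."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm_rows(batch):
    # Layer norm: statistics come from each row (one token's features).
    return [standardize(row) for row in batch]

def batch_norm_columns(batch):
    # Batch norm: statistics come from each column (one feature across the batch).
    columns = [standardize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

# The same token placed in two different batches.
token = [1.0, 2.0, 3.0]
batch_a = [token, [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
batch_b = [token, [10.0, 0.0, -10.0], [0.0, 0.0, 0.0]]

ln_a = layer_norm_rows(batch_a)[0]
ln_b = layer_norm_rows(batch_b)[0]
bn_a = batch_norm_columns(batch_a)[0]
bn_b = batch_norm_columns(batch_b)[0]

print(ln_a == ln_b)  # True: layer norm ignores the rest of the batch
print(bn_a == bn_b)  # False: batch norm depends on its batch neighbors
```

The token's layer-norm output is identical in both batches, while its batch-norm output changes with whatever else happens to be in the batch.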
Where It Sits: "Add & Norm"
Layer norm is applied right after the residual addition:
output = LayerNorm( x + SubLayer(x) )
This wrapping is applied around both the attention sub-layer and the FFN. That's exactly what "Add & Norm" means: Add the residual, then Normalize.
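📌 Note: The original Transformer applies the norm after the residual addition ("Post-LN"), as in the formula above. Many modern models use Pre-LN instead, normalizing the sub-layer's input and leaving the residual path untouched: output = x + SubLayer(LayerNorm(x)). This tends to make very deep models more stable to train.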
Code Example
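The following is a minimal, dependency-free sketch of "Add & Norm" applied to a short sequence, in both the Post-LN and Pre-LN orderings. It is an illustration under stated assumptions, not a reference implementation: `toy_sublayer` is a hypothetical stand-in for attention or the FFN, γ and β are fixed at 1 and 0 rather than learned, and all function names and sample values are invented for the example.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's features, then apply scale (gamma) and shift (beta)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the FFN (assumption for this sketch).
    return [0.5 * v + 1.0 for v in x]

def add_and_norm_post(x, sublayer, gamma, beta):
    """Post-LN: add the residual first, then normalize the sum."""
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(summed, gamma, beta)

def add_and_norm_pre(x, sublayer, gamma, beta):
    """Pre-LN: normalize the sub-layer input; the residual path is left as-is."""
    out = sublayer(layer_norm(x, gamma, beta))
    return [a + b for a, b in zip(x, out)]

def mean(values):
    return sum(values) / len(values)

d_model = 4
gamma = [1.0] * d_model  # fixed here; learned in a real model
beta = [0.0] * d_model

# A sequence of two tokens; each token is handled independently.
sequence = [[1.0, 2.0, 4.0, 9.0], [0.5, -1.0, 3.0, 2.5]]

post_out = [add_and_norm_post(t, toy_sublayer, gamma, beta) for t in sequence]
pre_out = [add_and_norm_pre(t, toy_sublayer, gamma, beta) for t in sequence]

for t_post, t_pre in zip(post_out, pre_out):
    print(f"post-LN mean: {abs(mean(t_post)):.3f}   pre-LN mean: {mean(t_pre):.3f}")
```

In the Post-LN ordering every output token is itself normalized, so its mean is approximately 0. In the Pre-LN ordering only the sub-layer's input is normalized, so the output keeps the scale of the residual stream; models built this way commonly add one more normalization after the last block.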
Why It Matters
- Stable, faster training — values stay in a sensible range.
- Smoother gradients — fewer exploding/vanishing problems.
- Consistent inputs — each layer sees a predictable distribution.
- Batch-independent — ideal for variable-length sequences.
Key Points
| Term | Meaning |
|---|---|
| μ, σ | Mean and std of a token's features |
| γ, β | Learnable scale and shift |
| Normalizes | Each token across its own features |
| The "Norm" in | "Add & Norm" |
Summary
- Layer normalization rescales each token's vector to mean 0, std 1, then applies learnable γ and β.
- It keeps activations stable, making deep Transformers train faster and more reliably.
- Unlike batch norm, it works per token — independent of batch size — which suits variable-length sequences.
- It's the "Norm" in "Add & Norm," applied after the residual addition around attention and the FFN.
- Together with the attention sub-layer, the FFN, residual connections, and layer normalization complete the Transformer block.