Layer Normalization

Layer Normalization is the "Norm" in every "Add & Norm" step of a Transformer. Right after the residual addition, it rescales the values flowing through the network to keep them in a consistent, well-behaved range. It is a small ingredient, but an important one: it makes training faster, smoother, and more stable, which helps deep Transformers train reliably.

💡 In one line: Layer normalization rescales each token's vector to a stable range (mean 0, std 1), keeping deep Transformers easy to train.

Why Normalize?

In a deep network, the values (activations) passing from layer to layer can grow or shrink wildly. When that happens, training becomes slow and unstable — gradients explode or vanish, and each layer keeps having to adapt to shifting input ranges.

Normalization fixes this by rescaling values to a consistent distribution, so every layer receives inputs that are well-behaved and predictable.

How Layer Normalization Works

Layer norm operates on each token's vector independently. For a token's feature vector x:

  1. Compute the mean (μ) and standard deviation (σ) across its features.
  2. Normalize: (x − μ) / σ → now the vector has mean 0 and standard deviation 1.
  3. Scale and shift using two learnable parameters, γ (gamma) and β (beta):
     LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β


γ and β are learned per feature, so the model can restore whatever scale and offset work best; normalization doesn't force a rigid distribution. (ε is a tiny constant that keeps the division safe when σ is close to zero.)
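
The three steps above can be sketched in plain Python for a single token. This is a minimal illustration, not a library implementation: the function name layer_norm, the example vector, and the default eps=1e-5 are assumptions, and ε is placed inside the square root, as common libraries do.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's feature vector, then scale and shift it."""
    n = len(x)
    # Step 1: mean and (population) variance across the features.
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    # Step 2: normalize; eps keeps the division safe when var is near 0.
    inv_std = 1.0 / math.sqrt(var + eps)
    x_hat = [(v - mu) * inv_std for v in x]
    # Step 3: per-feature learnable scale (gamma) and shift (beta).
    return [g * h + b for g, h, b in zip(gamma, x_hat, beta)]

token = [2.0, 4.0, 6.0, 8.0]     # assumed example values
gamma = [1.0] * len(token)       # identity scale
beta = [0.0] * len(token)        # no shift
out = layer_norm(token, gamma, beta)
print([round(v, 3) for v in out])  # → [-1.342, -0.447, 0.447, 1.342]
```

With γ = 1 and β = 0 the output has mean 0 and standard deviation very close to 1; other values of γ and β would stretch and shift each feature.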

Layer Norm vs. Batch Norm

Aspect                      Batch Normalization        Layer Normalization
Normalizes across           The batch (per feature)    The features (per token)
Depends on batch size?      Yes                        No
Variable-length sequences   Awkward                    Handles them naturally
Works at batch size 1?      Poorly                     Yes
Used in                     CNNs                       Transformers

This is why Transformers use layer norm: it works per token, independent of batch size or sequence length — perfect for text, where sequences vary in length.
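
The difference in normalization axis can be checked with a small sketch. It is simplified under stated assumptions: γ and β are omitted, batch norm uses training-mode batch statistics only, and the helper names (normalize, layer_norm_rows, batch_norm_cols) and the data are made up for illustration.

```python
import math

def normalize(values, eps=1e-5):
    """Shift and scale a list of numbers to mean 0 and (about) std 1."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return [(v - mu) / math.sqrt(var + eps) for v in values]

def layer_norm_rows(batch):
    # Layer norm: statistics come from each token's own features (a row).
    return [normalize(row) for row in batch]

def batch_norm_cols(batch):
    # Batch norm: statistics come from each feature across the batch (a column).
    cols = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*cols)]

token = [1.0, 2.0, 3.0]
batch_a = [token, [0.0, 0.0, 6.0]]
batch_b = [token, [10.0, -5.0, 1.0]]

# Layer norm gives the same result for `token` in both batches ...
ln_same = layer_norm_rows(batch_a)[0] == layer_norm_rows(batch_b)[0]
# ... while batch norm's result for `token` depends on its batch-mates.
bn_same = batch_norm_cols(batch_a)[0] == batch_norm_cols(batch_b)[0]
print(ln_same, bn_same)  # → True False
```

The same token is normalized identically by layer norm no matter what else is in the batch, which is the batch-independence described above.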

Where It Sits: "Add & Norm"

Layer norm is applied right after the residual addition:

output = LayerNorm( x + SubLayer(x) )

…around both the attention sub-layer and the FFN. That's exactly what "Add & Norm" means: Add the residual, then Normalize.

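📌 Note: The original Transformer applied layer norm after the residual addition ("Post-LN"), as in the formula above. Many modern models use Pre-LN instead, which normalizes the sub-layer's input and leaves the residual path untouched: output = x + SubLayer(LayerNorm(x)). This tends to make very deep models more stable to train.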
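
The two placements can be compared side by side in a short sketch. Everything here is illustrative: toy_sublayer is a hypothetical stand-in for attention or the FFN, the function names post_ln and pre_ln are assumptions, and γ and β are fixed at 1 and 0 to keep the code short.

```python
import math

def layer_norm(x, eps=1e-5):
    """Layer norm with gamma = 1 and beta = 0, to keep the sketch short."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the FFN.
    return [3.0 * v + 1.0 for v in x]

def post_ln(x, sublayer):
    # Post-LN: normalize after the residual addition.
    return layer_norm(add(x, sublayer(x)))

def pre_ln(x, sublayer):
    # Pre-LN: normalize the sub-layer's input; the residual path is untouched.
    return add(x, sublayer(layer_norm(x)))

x = [1.0, 2.0, 4.0]
post = post_ln(x, toy_sublayer)
pre = pre_ln(x, toy_sublayer)
print(post)
print(pre)
```

In the Post-LN result the whole output is normalized, so its mean is 0. In the Pre-LN result the original x passes through the residual path unchanged, so the output keeps x's own scale plus whatever the sub-layer adds.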

Code Example
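
A minimal pure-Python sketch of one "Add & Norm" step, applied to sequences of different lengths. It is an illustration under stated assumptions rather than a reference implementation: the class and function names, the example vectors, and eps=1e-5 are chosen here, and toy_sublayer is a hypothetical position-wise stand-in for a real sub-layer (real attention would also mix information across tokens).

```python
import math

class LayerNorm:
    """Per-token layer normalization with learnable scale and shift."""

    def __init__(self, d_model, eps=1e-5):
        self.gamma = [1.0] * d_model  # scale, initialized to 1
        self.beta = [0.0] * d_model   # shift, initialized to 0
        self.eps = eps

    def __call__(self, x):
        mu = sum(x) / len(x)
        var = sum((v - mu) ** 2 for v in x) / len(x)
        inv_std = 1.0 / math.sqrt(var + self.eps)
        return [g * (v - mu) * inv_std + b
                for g, v, b in zip(self.gamma, x, self.beta)]

def toy_sublayer(x):
    # Hypothetical position-wise sub-layer (a ReLU), for illustration only.
    return [max(0.0, v) for v in x]

def add_and_norm(tokens, sublayer, norm):
    """Add the residual, then normalize each token independently."""
    out = []
    for x in tokens:
        summed = [a + b for a, b in zip(x, sublayer(x))]  # Add
        out.append(norm(summed))                          # Norm
    return out

norm = LayerNorm(d_model=4)
short_seq = [[0.5, -1.0, 2.0, 0.0]]                       # 1 token
long_seq = [[0.5, -1.0, 2.0, 0.0],                        # 3 tokens
            [3.0, 3.5, 2.5, 3.0],
            [10.0, 0.0, -10.0, 5.0]]

short_out = add_and_norm(short_seq, toy_sublayer, norm)
long_out = add_and_norm(long_seq, toy_sublayer, norm)

# The shared first token is normalized identically in both sequences.
print(short_out[0] == long_out[0])  # → True
```

Because the statistics are computed per token, the same LayerNorm object handles a 1-token and a 3-token sequence without any padding or batch-level bookkeeping.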


Why It Matters

  • Stable, faster training — values stay in a sensible range.
  • Smoother gradients — fewer exploding/vanishing problems.
  • Consistent inputs — each layer sees a predictable distribution.
  • Batch-independent — ideal for variable-length sequences.

Key Points

Term             Meaning
μ, σ             Mean and std of a token's features
γ, β             Learnable scale and shift
Normalizes       Each token across its own features
The "Norm" in    "Add & Norm"

Summary

  • Layer normalization rescales each token's vector to mean 0, std 1, then applies learnable γ and β.
  • It keeps activations stable, making deep Transformers train faster and more reliably.
  • Unlike batch norm, it works per token — independent of batch size — which suits variable-length sequences.
  • It's the "Norm" in "Add & Norm," applied after the residual addition around attention and the FFN.
  • Together with the attention sub-layer, the FFN, residual connections, and layer normalization complete the Transformer block.