Layer Normalization

Last updated: Jun 27, 2026

Author: Vinay Adari

Layer normalization is the "Norm" in every "Add & Norm" step of a Transformer. Right after the residual addition, it rescales the values flowing through the network so they stay in a consistent, well-behaved range. It is a small ingredient, but an important one: it makes training faster, smoother, and more stable, which is a large part of why very deep Transformers can be trained in practice.

💡 In one line: Layer normalization rescales each token's vector to a stable range (mean 0, std 1), keeping deep Transformers easy to train.

Why Normalize?

In a deep network, the values (activations) passing from layer to layer can grow or shrink wildly. When that happens, training becomes slow and unstable — gradients explode or vanish, and each layer keeps having to adapt to shifting input ranges.

Normalization fixes this by rescaling values to a consistent distribution, so every layer receives inputs that are well-behaved and predictable.

How Layer Normalization Works

Layer norm operates on each token's vector independently. For a token's feature vector x:

Compute the mean (μ) and standard deviation (σ) across its features.
Normalize: (x − μ) / σ → now the vector has mean 0 and standard deviation 1.
Scale and shift using two learnable parameters, γ (gamma) and β (beta):

LayerNorm(x) = γ · (x − μ) / (σ + ε) + β

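The γ and β let the model re-adjust the scale and offset of each feature if that helps, so normalization doesn't force a rigid distribution. ε is a tiny constant (for example 1e-5) that prevents division by zero when σ is very small. Many implementations, including PyTorch's nn.LayerNorm, place ε inside the square root instead and divide by √(σ² + ε); the effect is the same.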
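
The three steps above can be traced on a tiny vector. This is a minimal sketch in plain Python that follows the formula exactly as written here, with ε added to σ; the function name `layer_norm` and the sample values are illustrative assumptions, not part of any library.

```python
from statistics import fmean, pstdev

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm for ONE token vector: gamma * (x - mu) / (sigma + eps) + beta."""
    mu = fmean(x)       # step 1: mean across the token's features
    sigma = pstdev(x)   # step 1: (population) standard deviation
    # steps 2 and 3: normalize, then scale and shift per feature
    return [g * (v - mu) / (sigma + eps) + b for v, g, b in zip(x, gamma, beta)]

x = [2.0, 4.0, 6.0, 8.0]  # made-up feature vector for one token

# With gamma = 1 and beta = 0 the result is just the normalized vector.
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print([round(v, 3) for v in y])  # [-1.342, -0.447, 0.447, 1.342]

# gamma and beta let the model move away from mean 0 / std 1 again.
y2 = layer_norm(x, gamma=[2.0] * 4, beta=[1.0] * 4)
print(round(fmean(y2), 3), round(pstdev(y2), 3))  # 1.0 2.0
```

In a real model γ and β are learned during training; here they are fixed by hand only to show their effect.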

Layer Norm vs. Batch Norm

Aspect	Batch Normalization	Layer Normalization
Normalizes across	The batch (per feature)	The features (per token)
Depends on batch size?	Yes	No
Variable-length sequences	Awkward	Handles them naturally
Works at batch size 1?	Poorly	Yes
Used in	CNNs	Transformers

This is why Transformers use layer norm: it works per token, independent of batch size or sequence length — perfect for text, where sequences vary in length.
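
The contrast in the table can be illustrated with a small, simplified sketch. It normalizes the same token once over its own features (layer norm style) and once per feature over the batch (batch norm style). The helper names are hypothetical, and the learnable γ/β as well as the running statistics that real batch norm keeps for inference are left out on purpose.

```python
from statistics import fmean, pstdev

EPS = 1e-5

def standardize(values):
    """Rescale a list of numbers to roughly mean 0, std 1."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]

def layer_norm_rows(batch):
    # Each row (token) is normalized across its own features.
    return [standardize(row) for row in batch]

def batch_norm_columns(batch):
    # Each column (feature) is normalized across the batch.
    cols = [standardize(col) for col in zip(*batch)]
    return [list(row) for row in zip(*cols)]

token = [1.0, 2.0, 3.0]
batch_a = [token, [10.0, 20.0, 30.0]]  # same token, different neighbours
batch_b = [token, [-5.0, 0.0, 5.0]]

ln_a, ln_b = layer_norm_rows(batch_a)[0], layer_norm_rows(batch_b)[0]
bn_a, bn_b = batch_norm_columns(batch_a)[0], batch_norm_columns(batch_b)[0]

print(ln_a == ln_b)  # True: layer norm ignores the rest of the batch
print(bn_a == bn_b)  # False: batch norm output depends on the neighbours

# With a batch of one, every per-feature std is 0 and the signal collapses.
print(batch_norm_columns([token])[0])  # [0.0, 0.0, 0.0]
```

The last line shows why batch statistics are unreliable at batch size 1, while the layer norm result for the token is unaffected.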

Where It Sits: "Add & Norm"

Layer norm is applied right after the residual addition:

output = LayerNorm( x + SubLayer(x) )

…around both the attention sub-layer and the FFN. That's exactly what "Add & Norm" means: Add the residual, then Normalize.

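📌 Note: The original Transformer normalizes after the residual addition, as in the formula above ("Post-LN"). Many modern models use Pre-LN instead: they normalize the sub-layer's input and leave the residual path untouched, output = x + SubLayer(LayerNorm(x)). This tends to make very deep models more stable to train.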
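
The two orderings differ only in where the normalization is placed. The following is a minimal sketch under stated assumptions: `toy_sublayer` is a hypothetical stand-in for attention or the FFN, and γ and β are omitted (γ = 1, β = 0) to keep the wiring visible.

```python
from statistics import fmean, pstdev

def layer_norm(x, eps=1e-5):
    # Simplified: no learnable gamma/beta.
    mu, sigma = fmean(x), pstdev(x)
    return [(v - mu) / (sigma + eps) for v in x]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def post_ln_block(x, sublayer):
    # Post-LN ("Add & Norm"): LayerNorm(x + SubLayer(x))
    return layer_norm(add(x, sublayer(x)))

def pre_ln_block(x, sublayer):
    # Pre-LN: x + SubLayer(LayerNorm(x)); the residual path carries x unchanged
    return add(x, sublayer(layer_norm(x)))

def toy_sublayer(x):
    # Hypothetical placeholder for attention or the FFN.
    return [0.5 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
post = post_ln_block(x, toy_sublayer)
pre = pre_ln_block(x, toy_sublayer)

print(round(fmean(post), 3), round(pstdev(post), 3))  # 0.0 1.0
print(round(fmean(pre), 3))                           # 2.5
```

The Post-LN output is itself normalized, while the Pre-LN output keeps the scale of x. Models that use Pre-LN typically apply one more layer norm after the final block for that reason.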

Code Example
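
A minimal NumPy sketch of layer normalization applied to a whole batch of token vectors. The function name, the tensor sizes, and the random input are assumptions chosen for illustration; the formula follows the one given earlier, with ε added to σ.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize the last axis (the features) of x, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)     # one mean per token
    sigma = x.std(axis=-1, keepdims=True)   # one std per token
    return gamma * (x - mu) / (sigma + eps) + beta

rng = np.random.default_rng(0)
batch, seq_len, d_model = 2, 3, 8           # hypothetical sizes
x = rng.normal(loc=5.0, scale=3.0, size=(batch, seq_len, d_model))

gamma = np.ones(d_model)                    # learnable in a real model
beta = np.zeros(d_model)                    # learnable in a real model

y = layer_norm(x, gamma, beta)

print(y.shape)                                       # (2, 3, 8)
print(np.allclose(y.mean(axis=-1), 0.0, atol=1e-6))  # True
print(np.allclose(y.std(axis=-1), 1.0, atol=1e-3))   # True
```

Every one of the 2 × 3 tokens is normalized on its own, so the result does not change if the batch size or sequence length changes. In PyTorch the same operation is available as torch.nn.LayerNorm(d_model), which holds γ and β as learnable weight and bias and places ε inside the square root.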

Why It Matters

Stable, faster training — values stay in a sensible range.
Smoother gradients — fewer exploding/vanishing problems.
Consistent inputs — each layer sees a predictable distribution.
Batch-independent — ideal for variable-length sequences.

Key Points

Term	Meaning
μ, σ	Mean and std of a token's features
γ, β	Learnable scale and shift
Normalizes	Each token across its own features
The "Norm" in	"Add & Norm"

Summary

Layer normalization rescales each token's vector to mean 0, std 1, then applies learnable γ and β.
It keeps activations stable, making deep Transformers train faster and more reliably.
Unlike batch norm, it works per token — independent of batch size — which suits variable-length sequences.
It's the "Norm" in "Add & Norm," applied after the residual addition around attention and the FFN.
Together with attention, the FFN, residual connections, and layer normalization complete the Transformer block.