Residual Connections

Transformers are deep — they stack many blocks, and each block has several sub-layers. But training very deep networks is hard: gradients fade, and useful information can get lost as it passes through layer after layer. Residual connections (also called skip connections) are a simple, elegant fix that makes deep Transformers trainable. They are the "Add" in every "Add & Norm" step you've seen in the architecture.

💡 In one line: A residual connection adds a sub-layer's input back to its output, so information and gradients can flow straight through deep networks.

What is a Residual Connection?

It's a shortcut that adds the input of a sub-layer directly to its output:

output = x + SubLayer(x)


Instead of forcing the sub-layer to learn the entire transformation from scratch, a residual connection lets it learn only the residual: the change to apply to x. The original input is carried forward unchanged on the skip path, so the sub-layer does not have to reproduce it to keep it available to later layers.
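
To make the formula concrete, here is a minimal sketch in plain Python. The names residual, small_update, and zero_update are hypothetical stand-ins chosen for illustration; a real sub-layer would be attention or a feed-forward network operating on tensors.

```python
from typing import Callable, List

Vector = List[float]


def residual(sublayer: Callable[[Vector], Vector], x: Vector) -> Vector:
    """Return x + sublayer(x), added element by element."""
    fx = sublayer(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def small_update(x: Vector) -> Vector:
    # Hypothetical sub-layer: proposes a small change to each element.
    return [0.1 * xi for xi in x]


def zero_update(x: Vector) -> Vector:
    # Hypothetical sub-layer that has learned nothing: proposes no change.
    return [0.0 for _ in x]


x = [1.0, -2.0, 3.0]
print(residual(small_update, x))  # x nudged by the sub-layer's output
print(residual(zero_update, x))   # [1.0, -2.0, 3.0], the input passes through
```

Note that when the sub-layer outputs zeros, the wrapped layer is exactly the identity. That is the sense in which the sub-layer only has to learn the change, not the whole mapping.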

Where They Are in the Transformer Block

Every sub-layer in a Transformer block is wrapped in a residual connection:

  • Around attention: x + Attention(x)
  • Around the FFN: x + FeedForward(x)

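After the addition, layer normalization is applied (the next subtopic), which is why the step is labelled "Add & Norm." This is the ordering used in the original Transformer; many later models apply the normalization before the sub-layer instead, but the residual addition itself is the same.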
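
The two wrapped sub-layers and the "Add & Norm" step can be sketched as follows. This is an illustration under stated assumptions, not a real Transformer: toy_attention and toy_ffn are hypothetical placeholders that act on a single vector, and layer_norm omits the learned scale and shift that real implementations include.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize x to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def add_and_norm(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """'Add': x + sublayer(x). 'Norm': layer normalization of the sum."""
    added = [xi + fi for xi, fi in zip(x, sublayer(x))]
    return layer_norm(added)


def toy_attention(x: Vector) -> Vector:
    # Hypothetical placeholder; real attention mixes information across tokens.
    return [0.5 * xi for xi in x]


def toy_ffn(x: Vector) -> Vector:
    # Hypothetical placeholder; a real FFN is two linear layers with an activation.
    return [max(0.0, xi) for xi in x]


def transformer_block(x: Vector) -> Vector:
    x = add_and_norm(x, toy_attention)  # x + Attention(x), then norm
    x = add_and_norm(x, toy_ffn)        # x + FeedForward(x), then norm
    return x


block_output = transformer_block([1.0, 2.0, 3.0, 4.0])
print(block_output)
```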

Why Residual Connections Matter

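  • They mitigate vanishing gradients. The derivative of x + SubLayer(x) with respect to x is the identity plus the sub-layer's own derivative, so during backpropagation part of the gradient always flows directly through the skip path, and even very deep stacks stay trainable.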
  • They preserve information. The original input is always passed forward, so early features aren't lost deep in the network.
  • They make learning easier. Each sub-layer only learns a small adjustment (the residual), not a full transformation.
  • They enable depth. They are a large part of what allows Transformers to stack dozens or hundreds of layers successfully.
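
The gradient argument can be checked with a toy calculation. The sketch below assumes each layer is a scalar function f(x) = weight * x, which is far simpler than a real sub-layer, and applies the chain rule across the stack. The function name gradient_through_stack and the values 0.01 and 50 are chosen purely for illustration.

```python
def gradient_through_stack(weight: float, depth: int, use_skip: bool) -> float:
    """d(output)/d(input) for `depth` stacked scalar layers f(x) = weight * x."""
    grad = 1.0
    for _ in range(depth):
        local = weight      # derivative of weight * x
        if use_skip:
            local += 1.0    # derivative of x + weight * x is 1 + weight
        grad *= local       # chain rule: multiply the per-layer derivatives
    return grad


plain = gradient_through_stack(weight=0.01, depth=50, use_skip=False)
skip = gradient_through_stack(weight=0.01, depth=50, use_skip=True)

print(plain)  # about 1e-100: the gradient has effectively vanished
print(skip)   # about 1.64: the skip path keeps the gradient usable
```

Without the skip path, the gradient is a product of 50 small numbers. With it, every factor includes the 1 contributed by the identity path, so the product stays near 1 in this toy setting.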

(The idea was popularized by ResNet in 2015, which used skip connections to train very deep image networks; the Transformer borrowed it.)

The Intuition

Think of editing a document: instead of rewriting the whole page, you keep the original and add your changes on top. The original is never lost, and you only focus on what to change. A residual connection does the same — it keeps x and lets the sub-layer add improvements. You can also picture the skip path as a highway that lets gradients travel freely back through the network.

Code Example
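
A minimal sketch of a residual wrapper in plain Python. The class name Residual and the toy_sublayer function are hypothetical, and x is a plain float here so the example runs without any libraries; in a real model x would be a tensor and the sub-layer would be attention or an FFN.

```python
class Residual:
    """Wrap any callable sub-layer so that calling it returns x + sublayer(x).

    Assumes x supports element-wise "+" (a float here, a tensor in a real
    model). Plain Python lists would not work, since "+" concatenates them.
    """

    def __init__(self, sublayer):
        self.sublayer = sublayer

    def __call__(self, x):
        return x + self.sublayer(x)  # the residual connection


def toy_sublayer(x):
    # Hypothetical sub-layer: proposes a 10% change to its input.
    return 0.1 * x


# Stack three wrapped sub-layers, as a deep model would stack many.
stack = [Residual(toy_sublayer) for _ in range(3)]

x = 2.0
for layer in stack:
    x = layer(x)

print(x)  # approximately 2.662, i.e. 2.0 * 1.1 ** 3
```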


That single x + is the whole trick — simple, but essential for deep models.

Key Points

Aspect          Detail
Formula         output = x + SubLayer(x)
Also called     Skip connection
Used around     Attention and the FFN (each sub-layer)
Main benefit    Stable gradients + preserved information
The "Add" in    "Add & Norm"

Summary

  • A residual connection adds a sub-layer's input back to its output: x + SubLayer(x).
  • It's the "Add" in "Add & Norm," wrapping both attention and the FFN.
  • It mitigates vanishing gradients, preserves information, and lets each sub-layer learn only a small change.
  • This is what allows Transformers to be very deep and still train well.
  • It was popularized by ResNet and is now standard in nearly all deep architectures.