Cross-Attention
Self-attention lets tokens attend within one sequence. Cross-attention lets one sequence attend to a different sequence. It's the mechanism that connects the encoder and decoder — letting the decoder, while generating output, look back at the encoder's understanding of the input. If self-attention is how a sequence understands itself, cross-attention is how the output stays connected to the input.
💡 In one line: Cross-attention lets the decoder attend to the encoder's output — the bridge that keeps generated output aligned with the input.
Self-Attention vs. Cross-Attention
The difference is just where Q, K, and V come from:
| | Self-Attention | Cross-Attention |
|---|---|---|
| Query (Q) | The same sequence | The decoder (target) |
| Key (K) | The same sequence | The encoder (source) |
| Value (V) | The same sequence | The encoder (source) |
The actual computation — scaled dot-product attention — is identical. Only the sources of the vectors change: in cross-attention, the Query comes from one sequence and the Key/Value from another.
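Written out, the shared computation and the two ways of sourcing its inputs look like this. The symbols $X$, $X_{\text{dec}}$, $X_{\text{enc}}$, the projection matrices $W^{Q}, W^{K}, W^{V}$, and the key dimension $d_k$ are notation assumed here for illustration.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\text{Self-attention:}\quad Q = X W^{Q},\quad K = X W^{K},\quad V = X W^{V}

\text{Cross-attention:}\quad Q = X_{\text{dec}} W^{Q},\quad K = X_{\text{enc}} W^{K},\quad V = X_{\text{enc}} W^{V}
```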
Where Cross-Attention Lives
Cross-attention sits inside the decoder block of encoder–decoder models (the original Transformer, T5). Recall the order of the sublayers in the decoder block:
- Masked self-attention (the decoder attends to itself)
- Cross-attention (the decoder attends to the encoder's output)
- Feed-Forward Network
It also appears in multimodal models — for example, letting text tokens attend to image features.
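The sublayer order above can be sketched in NumPy. This is a deliberately simplified, hypothetical decoder block: it skips the learned Q/K/V projections and layer normalization, keeps the residual connections used in the original Transformer, and all names (`decoder_block`, `w_ff1`, `w_ff2`) and sizes are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; mask is True where attending is allowed."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # blocked positions get ~zero weight
    return softmax(scores) @ v

def decoder_block(dec_x, enc_out, w_ff1, w_ff2):
    """Simplified sketch: no learned Q/K/V projections, no layer norm."""
    t = dec_x.shape[0]
    causal = np.tril(np.ones((t, t), dtype=bool))  # token i sees tokens 0..i only
    # 1. Masked self-attention: Q, K, and V all come from the decoder.
    x = dec_x + attention(dec_x, dec_x, dec_x, mask=causal)
    # 2. Cross-attention: Q from the decoder, K and V from the encoder output.
    x = x + attention(x, enc_out, enc_out)
    # 3. Position-wise feed-forward network (ReLU between two linear maps).
    x = x + np.maximum(x @ w_ff1, 0.0) @ w_ff2
    return x

rng = np.random.default_rng(0)
d_model = 8                                  # illustrative size
dec_x = rng.normal(size=(4, d_model))        # 4 target tokens
enc_out = rng.normal(size=(6, d_model))      # 6 source tokens
w_ff1 = rng.normal(size=(d_model, 32)) * 0.1
w_ff2 = rng.normal(size=(32, d_model)) * 0.1

out = decoder_block(dec_x, enc_out, w_ff1, w_ff2)
print(out.shape)  # (4, 8)
```

The output keeps the decoder's length (4 tokens) even though the encoder has 6: cross-attention reads from the source but produces one vector per target position.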
How It Works
- The Query comes from the decoder's current state — what it's trying to generate right now.
- The Keys and Values come from the encoder's output — the understood input.
- The decoder token effectively asks: "Which parts of the input are relevant to what I'm generating now?" — then attends to them and blends in their values.
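The three steps above can be traced for a single decoder token. This is a minimal sketch with random vectors standing in for real model states; the dimension `d_k = 4` and the 5 encoder positions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 4

query = rng.normal(size=d_k)         # one decoder token's Query
keys = rng.normal(size=(5, d_k))     # Keys for 5 encoder positions
values = rng.normal(size=(5, d_k))   # Values for the same 5 positions

# "Which parts of the input are relevant?" -> one score per encoder position.
scores = keys @ query / np.sqrt(d_k)

# Softmax turns the scores into weights that sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Blend the encoder Values using those weights.
context = weights @ values

print(weights.shape, context.shape)  # (5,) (4,)
```

The resulting `context` vector is what the decoder token takes away from the input at this step.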
Translation Example
Translating "The cat sat" → "Die Katze saß": when the decoder generates "Katze" (German for "cat"), cross-attention typically places most of its weight on "cat" in the source. This soft alignment is how the output stays faithful to the input: at each step, the decoder pulls the relevant meaning from the encoder.
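A toy version of this alignment can be built by hand. The vectors below are hand-picked for illustration, not learned weights from any real translation model: each source token gets its own axis, and the hypothetical Query for the "Katze" step points mostly along the "cat" axis.

```python
import numpy as np

source = ["The", "cat", "sat"]

# Hand-picked 3-d Keys (NOT learned): one axis per source token.
keys = np.array([[4.0, 0.0, 0.0],    # "The"
                 [0.0, 4.0, 0.0],    # "cat"
                 [0.0, 0.0, 4.0]])   # "sat"

# Hypothetical Query for the decoder step that emits "Katze".
query_katze = np.array([0.5, 4.0, 0.5])

scores = keys @ query_katze / np.sqrt(keys.shape[1])
weights = np.exp(scores - scores.max())
weights /= weights.sum()

for token, w in zip(source, weights):
    print(f"{token:>3}: {w:.3f}")

focus = source[int(np.argmax(weights))]
print("most attended source token:", focus)
```

With these made-up vectors, nearly all of the weight lands on "cat". In a trained model the weights are usually softer and spread over several source tokens.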
QKV in Cross-Attention
| Vector | Comes from |
|---|---|
| Query (Q) | Decoder (the target being generated) |
| Key (K) | Encoder output (the source) |
| Value (V) | Encoder output (the source) |
Code Example
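A minimal NumPy sketch of single-head cross-attention, assuming a source of 10 tokens and a target of 7. The sizes `d_model = 16` and `d_k = 8`, the random inputs, and the function name `cross_attention` are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, w_q, w_k, w_v):
    q = decoder_states @ w_q                  # (target_len, d_k)  Query from the decoder
    k = encoder_output @ w_k                  # (source_len, d_k)  Key from the encoder
    v = encoder_output @ w_v                  # (source_len, d_k)  Value from the encoder
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (target_len, source_len)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(42)
d_model, d_k = 16, 8                          # illustrative sizes
encoder_output = rng.normal(size=(10, d_model))   # 10 source tokens
decoder_states = rng.normal(size=(7, d_model))    # 7 target tokens
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = cross_attention(decoder_states, encoder_output, w_q, w_k, w_v)
print(weights.shape)  # (7, 10)
print(output.shape)   # (7, 8)
```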
With a 7-token target and a 10-token source, the attention matrix is 7 × 10 (target × source): each generated token attends across all input tokens.
Why It Matters
- Connects input understanding to output generation.
- Keeps outputs faithful to inputs — essential for translation and summarisation.
- Enables multimodal AI — e.g. an image caption attending to image regions.
Summary
- Cross-attention lets one sequence (the decoder) attend to another (the encoder's output).
- Only the sources differ from self-attention: Query from the decoder, Key/Value from the encoder.
- It lives in the decoder block (after masked self-attention) and is the bridge between input and output.
- In translation, it aligns each generated word with the relevant input word.
- It's the same scaled dot-product attention — applied across two sequences instead of within one.