Cross-Attention

Self-attention lets tokens attend within one sequence. Cross-attention lets one sequence attend to a different sequence. It's the mechanism that connects the encoder and decoder — letting the decoder, while generating output, look back at the encoder's understanding of the input. If self-attention is how a sequence understands itself, cross-attention is how the output stays connected to the input.

💡 In one line: Cross-attention lets the decoder attend to the encoder's output — the bridge that keeps generated output aligned with the input.

Self-Attention vs. Cross-Attention

The difference is just where Q, K, and V come from:

Vector    | Self-Attention    | Cross-Attention
Query (Q) | The same sequence | The decoder (target)
Key (K)   | The same sequence | The encoder (source)
Value (V) | The same sequence | The encoder (source)

The actual computation — scaled dot-product attention — is identical. Only the sources of the vectors change: in cross-attention, the Query comes from one sequence and the Key/Value from another.
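For reference, the shared computation can be written out. The notation is an assumption made for illustration: X is the single sequence in self-attention, Y holds the decoder states, H is the encoder output, W_Q, W_K, W_V are learned projection matrices, and d_k is the key dimension.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\text{Self-attention:}\quad Q = X W_Q,\; K = X W_K,\; V = X W_V

\text{Cross-attention:}\quad Q = Y W_Q,\; K = H W_K,\; V = H W_V
```

The first line is the same in both cases; only the inputs to the projections differ.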

Where Cross-Attention Lives

Cross-attention sits inside the decoder block of encoder–decoder models (the original Transformer, T5). Recall the decoder block order:

  1. Masked self-attention (the decoder attends to itself)
  2. Cross-attention (the decoder attends to the encoder's output)
  3. Feed-Forward Network
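The three steps above can be sketched in NumPy. This is a simplified illustration rather than a full Transformer layer: learned Q/K/V projections, multiple heads, and layer normalisation are left out, residual connections are kept, and the names `decoder_block`, `w1`, and `w2` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; mask is True where attending is allowed."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def decoder_block(y, enc_out, w1, w2):
    """Hypothetical, simplified decoder block (no projections, no layer norm)."""
    t = y.shape[0]
    causal = np.tril(np.ones((t, t), dtype=bool))  # position i sees positions <= i
    # 1. Masked self-attention: Q, K, V all come from the decoder sequence.
    y = y + attention(y, y, y, mask=causal)
    # 2. Cross-attention: Q from the decoder, K and V from the encoder output.
    y = y + attention(y, enc_out, enc_out)
    # 3. Position-wise feed-forward network with a ReLU.
    y = y + np.maximum(y @ w1, 0.0) @ w2
    return y

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
enc_out = rng.normal(size=(5, d_model))   # 5 source positions (assumed)
y = rng.normal(size=(3, d_model))         # 3 target positions so far (assumed)
w1 = rng.normal(size=(d_model, d_ff)) * 0.1
w2 = rng.normal(size=(d_ff, d_model)) * 0.1

out = decoder_block(y, enc_out, w1, w2)
print(out.shape)  # (3, 8)
```

Only step 1 uses the causal mask. In step 2 every target position may look at every source position, because the whole input is already available.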

It also appears in multimodal models — for example, letting text tokens attend to image features.

How It Works

  • The Query comes from the decoder's current state — what it's trying to generate right now.
  • The Keys and Values come from the encoder's output: the understood input.
  • The decoder token effectively asks: "Which parts of the input are relevant to what I'm generating now?" — then attends to them and blends in their values.
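The three bullets above can be traced for a single decoder token. This is a minimal sketch under stated assumptions: the sizes are arbitrary, and the projection matrices `w_q`, `w_k`, `w_v` are random stand-ins for weights a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
enc_out = rng.normal(size=(6, d))   # encoder output: 6 source positions (assumed)
dec_state = rng.normal(size=(d,))   # decoder state for the token being generated

# Learned projections in a real model; random stand-ins here.
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

q = dec_state @ w_q   # Query: from the decoder
k = enc_out @ w_k     # Keys: from the encoder output
v = enc_out @ w_v     # Values: from the encoder output

scores = k @ q / np.sqrt(d)              # one relevance score per source position
weights = np.exp(scores - scores.max())
weights /= weights.sum()                 # softmax: positive weights that sum to 1
context = weights @ v                    # weighted blend of the source values

print(weights.shape, context.shape)  # (6,) (4,)
```

The `weights` vector is the answer to "which parts of the input are relevant right now?", and `context` is what the decoder token takes away from the input.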

Translation Example

Translating "The cat sat" → "Die Katze saß": when the decoder generates "Katze" (German for "cat"), cross-attention focuses on "cat" in the source. This alignment is how the output stays faithful to the input: the decoder pulls the right meaning from the encoder at each step.
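A toy version of this alignment can be built by hand. The vectors below are hand-picked for illustration, not taken from a trained model: each target query is made to point the same way as the key of its source counterpart, so the alignment is far sharper than what a real model typically learns.

```python
import numpy as np

source = ["The", "cat", "sat"]
target = ["Die", "Katze", "saß"]

# Hand-picked toy vectors (an assumption, not learned weights).
keys = np.eye(3) * 4.0      # one key per source token
queries = np.eye(3) * 4.0   # one query per target token

scores = queries @ keys.T / np.sqrt(keys.shape[-1])
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax, shape (3, 3)

for i, tok in enumerate(target):
    j = int(weights[i].argmax())
    print(f"{tok!r} attends most to {source[j]!r}")
```

With these vectors, the row for "Katze" puts most of its weight on "cat". Each row of `weights` is one target token's distribution over the source tokens.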

QKV in Cross-Attention

Vector    | Comes from
Query (Q) | Decoder (the target being generated)
Key (K)   | Encoder output (the source)
Value (V) | Encoder output (the source)

Code Example
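A minimal NumPy sketch of single-head cross-attention, assuming a source of 10 tokens and a target of 7 tokens. The function name `cross_attention`, the model width, and the random stand-in weights are all illustrative choices.

```python
import numpy as np

def cross_attention(dec_states, enc_out, w_q, w_k, w_v):
    """Single-head cross-attention; returns (output, attention weights)."""
    q = dec_states @ w_q                      # (tgt_len, d_k), from the decoder
    k = enc_out @ w_k                         # (src_len, d_k), from the encoder
    v = enc_out @ w_v                         # (src_len, d_v), from the encoder
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (tgt_len, src_len)
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)  # softmax over source positions
    return attn @ v, attn

rng = np.random.default_rng(0)
src_len, tgt_len, d_model = 10, 7, 16
enc_out = rng.normal(size=(src_len, d_model))     # stand-in for the encoder output
dec_states = rng.normal(size=(tgt_len, d_model))  # stand-in for the decoder states
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

output, attn = cross_attention(dec_states, enc_out, w_q, w_k, w_v)
print(output.shape)  # (7, 16)
print(attn.shape)    # (7, 10)
```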


With a target of 7 tokens and a source of 10 tokens, the attention matrix is 7 × 10 (target × source): each generated token attends across all input tokens.

Why It Matters

  • Connects input understanding to output generation.
  • Keeps outputs faithful to inputs — essential for translation and summarisation.
  • Enables multimodal AI — e.g. an image caption attending to image regions.

Summary

  • Cross-attention lets one sequence (the decoder) attend to another (the encoder's output).
  • Only the sources differ from self-attention: Query from the decoder, Key/Value from the encoder.
  • It lives in the decoder block (after masked self-attention) and is the bridge between input and output.
  • In translation, it aligns each generated word with the relevant input word.
  • It's the same scaled dot-product attention — applied across two sequences instead of within one.