Cross-Attention

Last updated: Jun 29, 2026

Author: Vinay Adari

Self-attention lets tokens attend within one sequence. Cross-attention lets one sequence attend to a different sequence. It is the mechanism that connects the encoder and decoder: while generating output, the decoder can look back at the encoder's representation of the input. If self-attention is how a sequence understands itself, cross-attention is how the output stays connected to the input.

💡 In one line: Cross-attention lets the decoder attend to the encoder's output — the bridge that keeps generated output aligned with the input.

Self-Attention vs. Cross-Attention

The difference is just where Q, K, and V come from:

Vector	Self-Attention	Cross-Attention
Query (Q)	The same sequence	The decoder (target)
Key (K)	The same sequence	The encoder (source)
Value (V)	The same sequence	The encoder (source)

The actual computation — scaled dot-product attention — is identical. Only the sources of the vectors change: in cross-attention, the Query comes from one sequence and the Key/Value from another.
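
Written out, the shared computation looks as follows. The symbols are assumptions introduced here for illustration: X is the single sequence used in self-attention, X_dec and X_enc are the decoder and encoder sequences, W^Q, W^K, W^V are learned projection matrices, and d_k is the key dimension.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\text{Self-attention:}\quad Q = X W^{Q},\; K = X W^{K},\; V = X W^{V}

\text{Cross-attention:}\quad Q = X_{\mathrm{dec}} W^{Q},\; K = X_{\mathrm{enc}} W^{K},\; V = X_{\mathrm{enc}} W^{V}
```

The first line is the same in both cases; only the inputs to the projections differ.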

Where Cross-Attention Lives

Cross-attention sits inside the decoder block of encoder–decoder models (the original Transformer, T5). Recall the decoder block order:

1. Masked self-attention (the decoder attends to itself)
2. Cross-attention (the decoder attends to the encoder's output)
3. Feed-Forward Network
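
The ordering above can be sketched in a few lines of NumPy. This is a simplified illustration, not a faithful Transformer layer: it assumes a single head, omits the learned Q/K/V projections and layer normalisation, and uses hypothetical names (`attention`, `decoder_block`, `w_ffn1`, `w_ffn2`) and random inputs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; mask is True where attending is allowed.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def decoder_block(dec_x, enc_out, w_ffn1, w_ffn2):
    t = dec_x.shape[0]
    causal = np.tril(np.ones((t, t), dtype=bool))  # position i sees positions <= i

    # 1. Masked self-attention: Q, K, V all come from the decoder.
    h = dec_x + attention(dec_x, dec_x, dec_x, mask=causal)
    # 2. Cross-attention: Q from the decoder, K and V from the encoder output.
    h = h + attention(h, enc_out, enc_out)
    # 3. Position-wise feed-forward network with a ReLU.
    h = h + np.maximum(0, h @ w_ffn1) @ w_ffn2
    return h

rng = np.random.default_rng(0)
d_model = 8
dec_x = rng.normal(size=(4, d_model))    # 4 target tokens (illustrative)
enc_out = rng.normal(size=(6, d_model))  # 6 source tokens (illustrative)
w_ffn1 = rng.normal(size=(d_model, 32)) * 0.1
w_ffn2 = rng.normal(size=(32, d_model)) * 0.1

out = decoder_block(dec_x, enc_out, w_ffn1, w_ffn2)
print(out.shape)  # (4, 8)
```

The output keeps the decoder's length: cross-attention reads from the encoder but always produces one vector per decoder position.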

It also appears in multimodal models — for example, letting text tokens attend to image features.

How It Works

The Query comes from the decoder's current state — what it's trying to generate right now.
The Keys and Values come from the encoder's output — the understood input.
The decoder token effectively asks: "Which parts of the input are relevant to what I'm generating now?" — then attends to them and blends in their values.
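
For a single decoder position t, these steps amount to a weighted average over the source. The notation is assumed here for illustration: q_t is the decoder's query, k_j and v_j are the key and value of source token j, n is the source length, and d_k is the key dimension.

```latex
\alpha_{t,j} = \frac{\exp\!\left(q_t \cdot k_j / \sqrt{d_k}\right)}{\sum_{j'=1}^{n} \exp\!\left(q_t \cdot k_{j'} / \sqrt{d_k}\right)},
\qquad
c_t = \sum_{j=1}^{n} \alpha_{t,j}\, v_j
```

The weights α_{t,j} are non-negative and sum to 1 over the source tokens, so c_t is a blend of encoder values weighted by how relevant each source token is to the current query.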

Translation Example

Translating "The cat sat" → "Die Katze saß": when the decoder generates "Katze" (German for cat), cross-attention typically places most of its weight on "cat" in the source. This soft alignment is how the output stays faithful to the input: at each step, the decoder pulls the relevant meaning from the encoder.
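
A toy version of this step can be reproduced with hand-picked vectors. Everything numeric below is an assumption made for illustration: a trained model learns its keys, values, and queries, whereas here the query for "Katze" is simply chosen to point in the same direction as the key for "cat".

```python
import numpy as np

source = ["The", "cat", "sat"]

# Hand-picked toy keys, one per source token (assumption, not learned).
keys = np.array([
    [1.0, 0.0, 0.0],  # "The"
    [0.0, 1.0, 0.0],  # "cat"
    [0.0, 0.0, 1.0],  # "sat"
])
values = keys.copy()  # values reuse the same toy vectors for simplicity

# Toy query for the decoder step that generates "Katze".
query = np.array([0.0, 4.0, 0.0])

# Scaled dot-product scores of the query against every source key.
scores = keys @ query / np.sqrt(query.shape[0])

# Softmax over the source tokens.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Context vector: the blend of source values the decoder receives.
context = weights @ values

for token, weight in zip(source, weights):
    print(f"{token:>4}: {weight:.2f}")
```

In this setup "cat" receives the largest weight, while "The" and "sat" share the remainder equally.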

QKV in Cross-Attention

Vector	Comes from
Query (Q)	Decoder (the target being generated)
Key (K)	Encoder output (the source)
Value (V)	Encoder output (the source)

Code Example
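
A minimal NumPy sketch of single-head cross-attention, assuming a source of 10 tokens and a target of 7 tokens. The dimensions, the random inputs, and the names `cross_attention`, `w_q`, `w_k`, and `w_v` are illustrative assumptions rather than any particular library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, w_q, w_k, w_v):
    q = decoder_states @ w_q   # queries come from the decoder (target)
    k = encoder_output @ w_k   # keys come from the encoder (source)
    v = encoder_output @ w_v   # values come from the encoder (source)

    scores = q @ k.T / np.sqrt(q.shape[-1])  # shape: (target_len, source_len)
    weights = softmax(scores, axis=-1)       # each row sums to 1 over the source
    return weights @ v, weights

rng = np.random.default_rng(42)
d_model, d_k = 16, 16
src_len, tgt_len = 10, 7

encoder_output = rng.normal(size=(src_len, d_model))
decoder_states = rng.normal(size=(tgt_len, d_model))
w_q, w_k, w_v = (
    rng.normal(size=(d_model, d_k)) / np.sqrt(d_model) for _ in range(3)
)

output, weights = cross_attention(decoder_states, encoder_output, w_q, w_k, w_v)
print(weights.shape)  # (7, 10)
print(output.shape)   # (7, 16)
```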

With a 7-token target and a 10-token source, the attention matrix is 7 × 10 (target × source): each generated token has one weight for every input token.

Why It Matters

Connects input understanding to output generation.
Keeps outputs faithful to inputs — essential for translation and summarisation.
Enables multimodal AI — e.g. an image caption attending to image regions.

Summary

Cross-attention lets one sequence (the decoder) attend to another (the encoder's output).
Only the sources differ from self-attention: Query from the decoder, Key/Value from the encoder.
It lives in the decoder block (after masked self-attention) and is the bridge between input and output.
In translation, it aligns each generated word with the relevant input word.
It's the same scaled dot-product attention — applied across two sequences instead of within one.