Query, Key, Value (QKV)
In the self-attention article, we saw tokens score, weight, and blend each other. But how are those scores actually computed? The answer is the mechanism at the core of attention: every token produces three vectors — a Query (Q), a Key (K), and a Value (V). Understanding QKV is understanding how attention really works.
💡 In one line: Each token asks a question (Query), every token offers a label (Key) and some content (Value); matching queries to keys decides how much of each value to blend in.
The Search Analogy
The easiest way to grasp QKV is to think of searching a library:
- Query — what you're looking for ("books about space").
- Key — the label or index of each item (each book's topic tag).
- Value — the actual content (the book itself).
You compare your query against every key to see how relevant each item is, then retrieve a blend of the values, weighted by relevance.
Attention does exactly this: each token creates a Query (what it's looking for), and every token offers a Key (what it contains) and a Value (its information).
Where Q, K, V Come From
Q, K, and V are all derived from the same token embedding x, by multiplying it with three learned weight matrices:
Q = x · W_Q
K = x · W_K
V = x · W_V
So Q, K, and V are three different learned "views" of the same token. The matrices W_Q, W_K, W_V are learned during training — they're what the model tunes to make attention useful.
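As a rough illustration of these projections, here is a minimal NumPy sketch. The sizes (`n_tokens`, `d_model`, `d_k`) and the random matrices are assumptions made for the example; in a real model the three matrices come from training.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model, d_k = 4, 8, 3   # illustrative sizes, not from the article

# One embedding per token, stacked as rows: shape (n_tokens, d_model).
x = rng.normal(size=(n_tokens, d_model))

# Random stand-ins for the learned matrices; a trained model would supply these.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Three different "views" of the same embeddings.
Q = x @ W_Q
K = x @ W_K
V = x @ W_V

print(Q.shape, K.shape, V.shape)  # (4, 3) (4, 3) (4, 3)
```

Stacking the embeddings as rows means a single matrix product projects every token at once, and row `i` of `Q` is the query for token `i`.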
How Attention Uses Q, K, V
The full process — scaled dot-product attention — has four steps:
- Scores — take the dot product of each Query with every Key: Q · Kᵀ. A high dot product means high relevance.
- Scale — divide by √dₖ (the square root of the key dimension) to keep the numbers stable.
- Softmax — turn the scores into weights that sum to 1.
- Weighted sum — multiply the weights by the Values and add them up → the output.
Put together, that's the famous formula:
Attention(Q, K, V) = softmax( Q·Kᵀ / √dₖ ) · V
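The four steps can be traced by hand for a single query. The sketch below uses hand-picked toy numbers (an assumption for illustration, not values from a trained model): one query compared against three keys, each key paired with a value.

```python
import numpy as np

# Hand-picked toy numbers: one query, three keys, three values.
q = np.array([1.0, 0.0])                 # what this token is looking for
K = np.array([[1.0, 0.0],                # key 0: points the same way as q
              [0.0, 1.0],                # key 1: orthogonal to q
              [-1.0, 0.0]])              # key 2: opposite to q
V = np.array([[10.0, 0.0],
              [0.0, 10.0],
              [-10.0, 0.0]])
d_k = K.shape[-1]

scores = K @ q                           # 1. dot product with every key
scaled = scores / np.sqrt(d_k)           # 2. divide by sqrt(d_k)
weights = np.exp(scaled - scaled.max())  # 3. softmax (max subtracted for stability)
weights /= weights.sum()
output = weights @ V                     # 4. weighted sum of the values

print(scores)    # raw relevance of each key to the query
print(weights)   # non-negative, sums to 1, largest for key 0
print(output)    # blend of the value rows, pulled toward value 0
```

Key 0 matches the query best, so its value dominates the blend, but the other values still contribute a little. Attention retrieves a soft mixture rather than a single item.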
Why Three Different Vectors?
Why not just compare tokens directly? Separating Q, K, and V adds crucial flexibility:
- Query ≠ Key — a token can ask for something different from what it offers. The word "it" can query for "a nearby noun" while its own key advertises something else, such as "I am a pronoun".
- Value ≠ Key — the signal used for matching (Key) can differ from the information actually retrieved (Value).
This separation is what makes attention so expressive.
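One way to see the flexibility is to compare scores computed directly from embeddings with scores computed from separate query and key projections. In this sketch the embeddings and matrices are random stand-ins (an assumption for illustration). Direct comparison always yields a symmetric score matrix, so token i must rate token j exactly as j rates i. Separate projections remove that constraint.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))        # 3 tokens, 4-dim embeddings (toy sizes)
W_Q = rng.normal(size=(4, 4))      # random stand-ins for learned matrices
W_K = rng.normal(size=(4, 4))

direct = x @ x.T                        # comparing tokens directly
projected = (x @ W_Q) @ (x @ W_K).T     # comparing queries against keys

print(np.allclose(direct, direct.T))        # True: score(i, j) == score(j, i)
print(np.allclose(projected, projected.T))  # False for these random matrices
```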
Why Scale by √dₖ?
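When dₖ is large, the dot products can become very large in magnitude: if the components of a query and a key are roughly independent with zero mean and unit variance, their dot product has variance dₖ, so typical scores grow like √dₖ. Large scores push softmax into saturated regions where gradients are tiny. Dividing by √dₖ brings the scores back to roughly unit variance, so training stays stable.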
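The effect can be checked numerically. The following sketch assumes query and key components are independent draws with mean 0 and variance 1 (a simplifying assumption, since real activations need not behave this way). It measures the spread of dot products before and after scaling, then compares how peaked softmax is in each case.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

# Assumption: query and key components are independent, mean 0, variance 1.
stds = {}
for d_k in (4, 64, 1024):
    q = rng.normal(size=(2000, d_k))
    k = rng.normal(size=(2000, d_k))
    dots = (q * k).sum(axis=1)                       # 2000 sample dot products
    stds[d_k] = (dots.std(), (dots / np.sqrt(d_k)).std())
    print(d_k, stds[d_k])   # raw spread grows like sqrt(d_k); scaled stays near 1

# Effect on softmax for one query against 8 keys at d_k = 1024.
d_k = 1024
raw = rng.normal(size=(8, d_k)) @ rng.normal(size=d_k)
peak_raw = softmax(raw).max()
peak_scaled = softmax(raw / np.sqrt(d_k)).max()
print(peak_raw, peak_scaled)   # the unscaled softmax is the more peaked of the two
```

A softmax whose largest weight is close to 1 is nearly one-hot, which is the saturated regime where little gradient flows to the other scores.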
Code Example
This is the complete attention computation in a handful of lines.
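A minimal NumPy sketch of that computation follows. It assumes a single attention head, no masking, and no batch dimension. The function names `softmax` and `attention` and the toy sizes are choices made for this example, and the random matrices stand in for learned weights.

```python
import numpy as np

def softmax(z, axis=-1):
    """Softmax along `axis`; the max is subtracted first for numerical stability."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q·Kᵀ / √dₖ) · V for Q: (n, d_k), K: (m, d_k), V: (m, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, m): scores, then scale
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # (n, d_v): weighted sum of values

# Toy usage: 4 tokens, random stand-ins for embeddings and learned matrices.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 3)) for _ in range(3))

out, weights = attention(x @ W_Q, x @ W_K, x @ W_V)
print(out.shape)             # (4, 3)
print(weights.sum(axis=1))   # each entry is 1 up to floating-point error
```

Returning the weights alongside the output is a convenience for inspection: row `i` of `weights` shows how much token `i` draws from every other token.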
Q, K, V at a Glance
| Vector | Role | Analogy |
|---|---|---|
| Query (Q) | What this token is looking for | Your search request |
| Key (K) | What each token offers, for matching | A book's index label |
| Value (V) | The information each token provides | The book's content |
Summary
- Every token produces three vectors — Query, Key, Value — via learned matrices W_Q, W_K, W_V.
- Attention = match Queries to Keys (dot product), scale, softmax, then blend the Values.
- The formula: Attention(Q,K,V) = softmax(Q·Kᵀ / √dₖ) · V.
- Keeping Q, K, V separate lets a token ask, offer, and provide different things — making attention expressive.
- This QKV mechanism is run many times in parallel as multi-head attention — covered next.