Query, Key, Value (QKV)

Last updated: Jun 29, 2026

Author :

Vinay Adari

Query, Key, Value (QKV)

In the self-attention article, we saw tokens score, weight, and blend each other. But how are those scores actually computed? The answer is the mechanism at the core of attention: every token produces three vectors — a Query (Q), a Key (K), and a Value (V). Understanding QKV is understanding how attention really works.

💡 In one line: Each token asks a question (Query), every token offers a label (Key) and some content (Value); matching queries to keys decides how much of each value to blend in.

The Search Analogy

The easiest way to grasp QKV is to think of searching a library:

Query — what you're looking for ("books about space").
Key — the label or index of each item (each book's topic tag).
Value — the actual content (the book itself).

You compare your query against every key to see how relevant each item is, then retrieve a blend of the values, weighted by relevance.

Attention does exactly this: each token creates a Query (what it's looking for), and every token offers a Key (what it contains) and a Value (its information).

Where Q, K, V Come From

Q, K, and V are all derived from the same token embedding x, by multiplying it with three learned weight matrices:

Q = x · W_Q      K = x · W_K      V = x · W_V

So Q, K, and V are three different learned "views" of the same token. The matrices W_Q, W_K, W_V are learned during training — they're what the model tunes to make attention useful.

How Attention Uses Q, K, V

The full process — scaled dot-product attention — has four steps:

Scores — take the dot product of each Query with every Key: Q · Kᵀ. A high dot product means high relevance.
Scale — divide by √dₖ (the square root of the key dimension) to keep the numbers stable.
Softmax — turn the scores into weights that sum to 1.
Weighted sum — multiply the weights by the Values and add them up → the output.

Put together, that's the famous formula:

Attention(Q, K, V) = softmax( Q·Kᵀ / √dₖ ) · V

Why Three Different Vectors?

Why not just compare tokens directly? Separating Q, K, and V adds crucial flexibility:

Query ≠ Key — a token can ask for something different from what it offers. The word "it" can query for "a nearby noun" while offering its own key.
Value ≠ Key — the signal used for matching (Key) can differ from the information actually retrieved (Value).

This separation is what makes attention so expressive.

Why Scale by √dₖ?

When dₖ is large, the dot products can become very big, which pushes softmax into regions where gradients are tiny (saturation). Dividing by √dₖ keeps the scores in a sensible range, so training stays stable.

Code Example

This is the complete attention computation in a handful of lines.

Q, K, V at a Glance

Vector	Role	Analogy
Query (Q)	What this token is looking for	Your search request
Key (K)	What each token offers, for matching	A book's index label
Value (V)	The information each token provides	The book's content

Summary

Every token produces three vectors — Query, Key, Value — via learned matrices W_Q, W_K, W_V.
Attention = match Queries to Keys (dot product), scale, softmax, then blend the Values.
The formula: Attention(Q,K,V) = softmax(Q·Kᵀ / √dₖ) · V.
Keeping Q, K, V separate lets a token ask, offer, and provide different things — making attention expressive.
This QKV mechanism is run many times in parallel as multi-head attention — covered next.