Top-K Sampling

Last updated: Jun 30, 2026

Author: Vinay Adari

Top-K

Top-k is a sampling control that restricts the model to the k most likely tokens at each generation step. It's the simpler cousin of top-p: rather than a candidate set whose size adapts to the distribution, it keeps a fixed number of candidates and ignores everything else. This removes the unlikely, low-quality tokens in the tail while still allowing some variety.

💡 In one line: Top-k keeps only the k most likely tokens and samples from those, discarding the rest.

What Top-K Does

At each step, top-k:

Sorts tokens by probability.
Keeps the top k (e.g. k = 40).
Discards the rest and renormalises the kept probabilities.
Samples the next token from those k.
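
The four steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function name top_k_sample, the toy probabilities, and the use of a dict in place of a real vocabulary are all assumptions made for the example.

```python
import random

def top_k_sample(probs, k, rng=random):
    """Sample one token from the k most likely entries of `probs`.

    `probs` is a hypothetical dict mapping token -> probability.
    """
    # 1. Sort tokens by probability, highest first.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    # 2. Keep the top k.
    kept = ranked[:k]
    # 3. Renormalise so the kept probabilities sum to 1.
    total = sum(p for _, p in kept)
    tokens = [t for t, _ in kept]
    weights = [p / total for _, p in kept]
    # 4. Sample the next token from those k.
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy next-token distribution (made-up numbers).
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "xylophone": 0.05}
print(top_k_sample(probs, k=2))  # only "cat" or "dog" can come out
```

With k = 2, "fish" and "xylophone" are never returned, however many times the function is called.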

So a token outside the top k has zero chance of being chosen, no matter what.

A Quick Range

k = 1 → greedy (always the single most likely token).
k = 40 → a common default with controlled variety.
Very large k → little or no filtering (once k reaches the vocabulary size, every token is considered).

Higher k means more diversity; lower k means more focus.
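
To make the two ends of this range concrete, the sketch below renormalises a toy distribution for several values of k. The helper top_k_probs and the numbers are illustrative assumptions, not part of any particular library.

```python
def top_k_probs(probs, k):
    """Return the renormalised distribution over the k most likely tokens."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Toy distribution over a four-token vocabulary (made-up numbers).
probs = {"cat": 0.5, "dog": 0.25, "fish": 0.125, "bird": 0.125}

greedy = top_k_probs(probs, 1)        # only "cat" survives, with probability 1.0
narrow = top_k_probs(probs, 2)        # "cat" and "dog", rescaled to sum to 1
unfiltered = top_k_probs(probs, 100)  # k exceeds the vocabulary: nothing is removed

print(greedy)
print(narrow)
print(unfiltered)
```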

Top-K vs. Top-P

This is the key comparison:

Aspect	Top-k	Top-p (nucleus)
Keeps	A fixed number of tokens, k	The smallest set of tokens whose probabilities sum to at least p
Set size	Always k	Adapts to the model's confidence
Simplicity	Very simple	Slightly more involved

The limitation of top-k is that a fixed k can't adapt to how confident the model is:

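When the model is confident (one or two tokens hold nearly all the probability), a fixed k still keeps unlikely tokens that should have been dropped.
When the model is uncertain (probability is spread across many reasonable tokens), a fixed k can cut off perfectly good candidates.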

Top-p was designed to fix exactly this by adjusting the set size automatically.
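
The difference shows up when the two filters are run on a peaked and a flat distribution. The sketch below is a rough illustration: the helpers top_k_set and top_p_set, the thresholds k = 3 and p = 0.9, and both toy distributions are assumptions chosen for the example.

```python
def top_k_set(probs, k):
    """Tokens kept by top-k: always the k most likely."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:k]

def top_p_set(probs, p):
    """Tokens kept by top-p: the smallest prefix whose probability mass reaches p."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    kept, mass = [], 0.0
    for token in ranked:
        kept.append(token)
        mass += probs[token]
        if mass >= p:
            break
    return kept

# Confident model: one token dominates.
confident = {"Paris": 0.96, "Lyon": 0.02, "Rome": 0.01, "banana": 0.01}
# Uncertain model: several tokens are equally plausible.
uncertain = {"red": 0.2, "blue": 0.2, "green": 0.2, "black": 0.2, "white": 0.2}

print(len(top_k_set(confident, 3)), len(top_p_set(confident, 0.9)))  # 3 1
print(len(top_k_set(uncertain, 3)), len(top_p_set(uncertain, 0.9)))  # 3 5
```

Top-k keeps three tokens in both cases, so it lets two near-zero tokens through when the model is confident and drops two equally good ones when it is uncertain. Top-p shrinks to one token in the first case and grows to five in the second.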

Combining Controls

Top-k is often used with other controls:

Top-k + temperature — top-k limits the candidates, temperature sets randomness within them.
Some setups apply both top-k and top-p for extra control.
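
A combined sampler might look like the sketch below. The function sample_next, its parameter names, and the toy logits are hypothetical. The order used here (temperature, then top-k, then top-p) is one reasonable choice and an assumption of this sketch; real libraries may differ. Temperature is assumed to be greater than zero.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample a token from hypothetical `logits` (dict: token -> raw score)."""
    # Temperature rescales the logits. It changes how peaked the distribution
    # is, but not the ranking of the tokens.
    scaled = {t: s / temperature for t, s in logits.items()}
    ranked = sorted(scaled, key=scaled.get, reverse=True)

    # Top-k: keep a fixed number of candidates.
    if top_k is not None:
        ranked = ranked[:top_k]

    # Softmax over the surviving candidates (subtract the max for stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[t] - peak) for t in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: of those, keep the smallest prefix whose mass reaches top_p.
    if top_p is not None:
        mass, cutoff = 0.0, len(ranked)
        for i, p in enumerate(probs):
            mass += p
            if mass >= top_p:
                cutoff = i + 1
                break
        ranked, probs = ranked[:cutoff], probs[:cutoff]

    # random.choices treats the weights as relative, so no further
    # renormalisation is needed after the top-p cut.
    return rng.choices(ranked, weights=probs, k=1)[0]

# Toy logits (made-up numbers).
logits = {"the": 3.0, "a": 2.5, "one": 1.0, "zebra": -2.0}
print(sample_next(logits, temperature=0.7, top_k=3, top_p=0.9))
```

With these settings top-k removes "zebra", and top-p then removes "one", so only "the" or "a" can be returned.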

Code Example

Only the top-k tokens keep a non-zero probability; the logits of all other tokens are set to -inf, which softmax turns into a probability of exactly 0.
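
A minimal sketch of this masking step in plain Python follows. The names top_k_filter and softmax and the toy logits are assumptions for illustration, and the code assumes 1 <= k <= len(logits); real implementations typically do the same thing on tensors.

```python
import math

def top_k_filter(logits, k):
    """Set every logit outside the k largest to -inf."""
    # The k-th largest logit is the cut-off; ties at the cut-off are all kept.
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]

def softmax(logits):
    """Convert logits to probabilities; exp(-inf) evaluates to 0.0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # toy scores for a 5-token vocabulary
filtered = top_k_filter(logits, k=2)   # [2.0, 1.0, -inf, -inf, -inf]
probs = softmax(filtered)              # last three entries are exactly 0.0

print(filtered)
print(probs)
```

Masking the logits before the softmax means renormalisation happens automatically: the two surviving probabilities already sum to 1.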

When to Use It

Top-k is simple and predictable, which makes it easy to reason about.
Because k is fixed, however, many modern setups prefer top-p or combine the two.

Summary

Top-k keeps only the k most likely tokens and samples from them.
k = 1 is greedy; k ≈ 40 is a common default; larger k means more variety.
Its weakness is being fixed — it can't adapt to the model's confidence like top-p does.
It's often combined with temperature (and sometimes top-p).
Use it for simple, predictable filtering — but top-p is frequently preferred for its adaptiveness.