Top-K
Top-k is an output control that restricts the model to choosing from only the k most likely tokens at each step. It's the simpler cousin of top-p: instead of a candidate set whose size adapts to the distribution, it keeps a fixed number of candidates and ignores everything else. This cuts out unlikely, low-quality tokens while still allowing some variety.
💡 In one line: Top-k keeps only the k most likely tokens and samples from those, discarding the rest.
What Top-K Does
At each step, top-k:
- Sorts tokens by probability.
- Keeps the top k (e.g. k = 40).
- Discards the rest and renormalises the kept probabilities.
- Samples the next token from those k.
So a token outside the top k has zero chance of being chosen, no matter what.
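The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production sampler: the function names `top_k_filter` and `sample_top_k` and the toy distribution are made up for this example.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise their probabilities."""
    # Sort tokens by probability (highest first) and keep the first k.
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    # Renormalise so the kept probabilities sum to 1 again.
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}

def sample_top_k(probs, k, rng=random):
    """Sample one token from the top-k filtered distribution."""
    filtered = top_k_filter(probs, k)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution, for illustration only.
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "xylophone": 0.05}
filtered = top_k_filter(probs, k=2)
next_token = sample_top_k(probs, k=2)
```

With k = 2, only "cat" and "dog" survive, and their probabilities are rescaled to sum to 1; "fish" and "xylophone" can never be sampled.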
A Quick Range
- k = 1 → greedy (always the single most likely token).
- k = 40 → a common default with controlled variety.
- Very large k → almost no filtering; once k reaches the vocabulary size, every token is considered.
Higher k means more diversity; lower k means more focus.
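One rough way to see this is to measure the entropy (in bits) of the renormalised top-k distribution: zero means fully deterministic, larger means more spread out. The sketch below uses a made-up five-token distribution and a hypothetical helper `top_k_entropy`; the trend shown holds for this toy example and is not a general proof.

```python
import math

def top_k_entropy(probs, k):
    """Entropy in bits of the distribution left after top-k filtering."""
    # Keep the k largest probabilities and renormalise them.
    kept = sorted(probs.values(), reverse=True)[:k]
    total = sum(kept)
    return -sum((p / total) * math.log2(p / total) for p in kept)

# Hypothetical next-token distribution, for illustration only.
probs = {"the": 0.40, "a": 0.25, "this": 0.20, "one": 0.10, "zebra": 0.05}

greedy = top_k_entropy(probs, 1)        # k = 1: a single candidate, no randomness
medium = top_k_entropy(probs, 3)        # k = 3: some variety
unfiltered = top_k_entropy(probs, 1000) # k above the vocabulary size: nothing removed
```

For this distribution the entropy rises from zero at k = 1 towards the entropy of the full distribution as k grows.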
Top-K vs. Top-P
This is the key comparison:
| | Top-k | Top-p (nucleus) |
|---|---|---|
| Keeps | A fixed number k of tokens | The smallest set whose probabilities sum to at least p |
| Set size | Always the same | Adapts to confidence |
| Simplicity | Very simple | Slightly more involved |
The limitation of top-k is that a fixed k can't adapt to how confident the model is:
- When the model is confident, a fixed k can still let in unlikely tokens that should have been excluded.
- When the model is uncertain, the same k can cut off perfectly good tokens.
Top-p was designed to fix exactly this by adjusting the set size automatically.
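To make the contrast concrete, here is a small sketch comparing the two filters on a confident and an uncertain distribution. The helper names `top_k_set` and `top_p_set` and both distributions are invented for illustration.

```python
def top_k_set(probs, k):
    """The k most likely tokens, most likely first."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

def top_p_set(probs, p):
    """Smallest set of most likely tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    return kept

# Hypothetical distributions, for illustration only.
confident = {"Paris": 0.96, "Lyon": 0.02, "Nice": 0.015, "banana": 0.005}
uncertain = {"red": 0.22, "blue": 0.21, "green": 0.20, "black": 0.19, "white": 0.18}

k_confident = top_k_set(confident, 3)    # still keeps two very unlikely tokens
p_confident = top_p_set(confident, 0.9)  # shrinks to the one dominant token
k_uncertain = top_k_set(uncertain, 3)    # drops two tokens that are nearly as likely
p_uncertain = top_p_set(uncertain, 0.9)  # grows to cover all plausible tokens
```

With k = 3 the candidate set has three tokens in both cases, while with p = 0.9 it shrinks to one token when the model is confident and grows to five when it is not.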
Combining Controls
Top-k is often used with other controls:
- Top-k + temperature — top-k limits the candidates, temperature sets randomness within them.
- Some setups apply both top-k and top-p for extra control.
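A minimal sketch of how these controls might be chained is shown below. The order used here (temperature, then top-k, then top-p) and the function name `sample` are assumptions for illustration; real libraries may order or implement the steps differently.

```python
import math
import random

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample a token id after temperature scaling, top-k, then top-p filtering.

    Assumes temperature > 0 and, if given, 1 <= top_k and 0 < top_p <= 1.
    """
    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Softmax, subtracting the max for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    # (token id, probability) pairs, most likely first.
    pairs = sorted(enumerate(e / total for e in exps), key=lambda ip: ip[1], reverse=True)
    if top_k is not None:
        # Top-k: keep a fixed number of candidates and renormalise.
        pairs = pairs[:top_k]
        mass = sum(p for _, p in pairs)
        pairs = [(i, p / mass) for i, p in pairs]
    if top_p is not None:
        # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = [], 0.0
        for i, p in pairs:
            kept.append((i, p))
            cumulative += p
            if cumulative >= top_p:
                break
        pairs = kept
    ids = [i for i, _ in pairs]
    weights = [p for _, p in pairs]  # random.choices handles unnormalised weights
    return rng.choices(ids, weights=weights, k=1)[0]

# Hypothetical logits for a five-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
token_id = sample(logits, temperature=0.8, top_k=3, top_p=0.9)
```

Here top-k caps the candidates at three, top-p may trim that set further, and temperature decides how sharply the sampler favours the front-runner.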
Code Example
Only the top-k tokens keep a non-zero probability; the logits of the rest are set to -inf, which softmax turns into a probability of 0.
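A minimal pure-Python sketch of that masking, assuming raw logits as input. The helper names `mask_top_k` and `softmax` and the example logits are illustrative, not taken from any particular library.

```python
import math

def mask_top_k(logits, k):
    """Set every logit outside the top k to -inf (assumes 1 <= k <= len(logits))."""
    # The k-th largest logit is the cut-off; ties at the cut-off are all kept here.
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]

def softmax(logits):
    """Convert logits to probabilities; -inf logits become exactly 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 0.5, 1.0, -1.0]
masked = mask_top_k(logits, k=2)  # tokens 1 and 3 are masked out
probs = softmax(masked)           # tokens 0 and 2 share all the probability mass
```

Masking in logit space means the softmax does the renormalisation for free: the surviving tokens automatically sum to 1.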
When to Use It
- Use it when you want filtering that is simple, predictable, and easy to reason about.
- Because k is fixed, many modern setups prefer top-p, or combine the two.
Summary
- Top-k keeps only the k most likely tokens and samples from them.
- k = 1 is greedy; k ≈ 40 is a common default; larger k means more variety.
- Its weakness is being fixed — it can't adapt to the model's confidence like top-p does.
- It's often combined with temperature (and sometimes top-p).
- Use it for simple, predictable filtering — but top-p is frequently preferred for its adaptiveness.