Top-K

Top-k is an output control that restricts the model to choosing from only the k most likely tokens at each step. It's the simpler cousin of top-p: instead of an adaptive group, it keeps a fixed number of candidates and ignores everything else. This cuts out unlikely, low-quality tokens while still allowing some variety.

💡 In one line: Top-k keeps only the k most likely tokens and samples from those, discarding the rest.

What Top-K Does

At each step, top-k:

  1. Sorts tokens by probability.
  2. Keeps the top k (e.g. k = 40).
  3. Discards the rest and renormalises the kept probabilities.
  4. Samples the next token from those k.

So a token outside the top k has zero chance of being chosen at that step, however close it came to making the cut.
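
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than any library's implementation: the function name top_k_filter and the example distribution are made up for this sketch, and a real model would work over tens of thousands of tokens.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise their probabilities."""
    # Step 1: sort tokens by probability, highest first.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    # Steps 2 and 3: keep the top k, discard the rest.
    kept = ranked[:k]
    total = sum(p for _, p in kept)
    # Renormalise so the kept probabilities sum to 1 again.
    return {token: p / total for token, p in kept}

# Hypothetical next-token distribution, invented for illustration.
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "xylophone": 0.05}

filtered = top_k_filter(probs, k=2)
print(filtered)  # cat is about 0.625, dog about 0.375; the others are gone

# Step 4: sample the next token from the k survivors.
next_token = random.choices(list(filtered), weights=list(filtered.values()))[0]
print(next_token)
```

With k = 2, "fish" and "xylophone" can never be sampled, even though "fish" had a 15% chance before filtering.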

A Quick Range

  • k = 1 → greedy (always the single most likely token).
  • k = 40 → a common default with controlled variety.
  • Very large k → little or no filtering (once k reaches the vocabulary size, every token is considered).

Higher k means more diversity; lower k means more focus.
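
To see the range in action, the sketch below draws many samples at different values of k and records which tokens ever appear. The distribution, the helper name sample_top_k, and the seed are assumptions chosen for illustration only.

```python
import random

def sample_top_k(probs, k, rng):
    """Draw one token from the k most likely entries of probs."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in ranked]
    weights = [p for _, p in ranked]
    # random.choices accepts unnormalised weights, so no explicit
    # renormalisation is needed here.
    return rng.choices(tokens, weights=weights)[0]

# Hypothetical distribution, invented for illustration.
probs = {"the": 0.4, "a": 0.25, "this": 0.15,
         "that": 0.1, "one": 0.06, "zebra": 0.04}

rng = random.Random(0)  # fixed seed so runs are repeatable
distinct = {}
for k in (1, 3, 100):
    distinct[k] = {sample_top_k(probs, k, rng) for _ in range(1000)}
    print(k, sorted(distinct[k]))
```

With k = 1 only "the" ever appears, with k = 3 the samples stay within the three most likely tokens, and with k = 100 (larger than this toy vocabulary) nothing is filtered out.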

Top-K vs. Top-P

This is the key comparison:

                Top-k                Top-p (nucleus)
  Keeps         A fixed number k     A set summing to p
  Set size      Always the same      Adapts to confidence
  Simplicity    Very simple          Slightly more involved

The limitation of top-k is that a fixed k can't adapt to how confident the model is:

  • When the model is confident, k might keep some unlikely tokens it shouldn't.
  • When the model is uncertain, k might cut off perfectly good tokens.

Top-p was designed to fix exactly this by adjusting the set size automatically.
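
The difference shows up clearly on two toy distributions, one confident and one uncertain. The sketch below is an illustration under assumed numbers: both distributions and the helper names top_k_set and top_p_set are invented here, and top-p is taken to mean the smallest set of top tokens whose cumulative probability reaches p.

```python
def top_k_set(probs, k):
    """The k most likely tokens."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

def top_p_set(probs, p):
    """Smallest prefix of the ranked tokens whose probability reaches p."""
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    return kept

# Hypothetical distributions, invented for illustration.
confident = {"Paris": 0.96, "Lyon": 0.02, "Rome": 0.01,
             "Nice": 0.005, "Oslo": 0.005}
uncertain = {"red": 0.22, "blue": 0.21, "green": 0.2,
             "black": 0.19, "white": 0.18}

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    print(name, len(top_k_set(dist, 3)), len(top_p_set(dist, 0.9)))
# confident 3 1
# uncertain 3 5
```

With k = 3, top-k keeps "Lyon" and "Rome" in the confident case and drops "black" and "white" in the uncertain case, while top-p with p = 0.9 shrinks to one token and grows to five.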

Combining Controls

Top-k is often used with other controls:

  • Top-k + temperature — top-k limits the candidates, temperature sets randomness within them.
  • Some setups apply both top-k and top-p for extra control.
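
One way the controls might be chained is sketched below. The ordering used here (temperature, then top-k, then top-p on the renormalised survivors) is an assumption for illustration; real libraries differ in the order and details. The function name next_token_distribution and the logits are hypothetical.

```python
import math

def renormalise(probs):
    total = sum(probs.values())
    return {token: p / total for token, p in probs.items()}

def next_token_distribution(logits, temperature=1.0, k=None, p=None):
    """Apply temperature, then top-k, then top-p (one possible ordering)."""
    # Temperature (must be > 0) rescales logits; it never changes their ranking.
    scaled = {token: value / temperature for token, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    probs = renormalise({t: math.exp(v - peak) for t, v in scaled.items()})
    ranked = sorted(probs, key=probs.get, reverse=True)
    if k is not None:
        # Top-k: keep a fixed number of candidates.
        ranked = ranked[:k]
        probs = renormalise({t: probs[t] for t in ranked})
    if p is not None:
        # Top-p: keep the smallest prefix whose cumulative probability reaches p.
        kept, cumulative = [], 0.0
        for token in ranked:
            kept.append(token)
            cumulative += probs[token]
            if cumulative >= p:
                break
        probs = renormalise({t: probs[t] for t in kept})
    return probs

# Hypothetical logits, invented for illustration.
logits = {"sun": 3.0, "moon": 2.0, "star": 1.0, "rock": 0.0, "sock": -1.0}
cool = next_token_distribution(logits, temperature=0.5, k=3)
warm = next_token_distribution(logits, temperature=2.0, k=3)
both = next_token_distribution(logits, temperature=1.0, k=3, p=0.9)
print(sorted(cool) == sorted(warm))  # same three candidates at either temperature
print(cool["sun"] > warm["sun"])     # lower temperature concentrates on the favourite
print(sorted(both))                  # top-p trims the top-k set further
```

In this sketch temperature changes how probability is spread among the kept tokens but not which tokens top-k keeps, since dividing logits by a positive number preserves their order.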

Code Example
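
The original listing is not available here, so what follows is a minimal stand-in sketch in plain Python, assuming the model's output is a list of logits indexed by token id. The function name top_k_sample and the example logits are hypothetical; production code would typically use a tensor library instead.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Mask all but the k largest logits with -inf, then sample via softmax."""
    k = max(1, min(k, len(logits)))
    # The k-th largest logit acts as the cutoff. If several logits tie
    # with the cutoff, this simple version keeps all of them.
    cutoff = sorted(logits, reverse=True)[k - 1]
    masked = [x if x >= cutoff else float("-inf") for x in logits]
    # Softmax over the masked logits; exp(-inf) evaluates to 0.0.
    peak = max(masked)
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw a token id in proportion to the remaining probabilities.
    token_id = rng.choices(range(len(probs)), weights=probs)[0]
    return token_id, probs

# Hypothetical logits for a five-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
token_id, probs = top_k_sample(logits, k=2)
print(token_id)  # always 0 or 1
print(probs)     # entries 2, 3 and 4 are exactly 0.0
```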


Only the top-k tokens keep a non-zero probability; the logits of the rest are set to -inf, which softmax turns into a probability of 0.

When to Use It

  • Use it when you want simple, predictable filtering that is easy to reason about.
  • Because k is fixed, many modern setups prefer top-p or combine the two.

Summary

  • Top-k keeps only the k most likely tokens and samples from them.
  • k = 1 is greedy; k ≈ 40 is a common default; larger k means more variety.
  • Its weakness is being fixed — it can't adapt to the model's confidence like top-p does.
  • It's often combined with temperature (and sometimes top-p).
  • Use it for simple, predictable filtering — but top-p is frequently preferred for its adaptiveness.