Top-P (Nucleus Sampling)

Top-p, also called nucleus sampling, is an output control that limits which tokens the model is allowed to choose from. Instead of considering every possible token, it keeps only the most likely ones (the smallest group whose probabilities add up to at least p) and samples from that group. Its clever feature is that the size of that group adapts to how confident the model is at each step.

💡 In one line: Top-p keeps only the smallest set of top tokens whose probabilities sum to p, then samples from that set, and the set size adapts to the model's confidence.

What Top-P Does

After the model produces its probability distribution, top-p:

  1. Sorts tokens from most to least likely.
  2. Adds them up until the cumulative probability reaches p (e.g. 0.9).
  3. Keeps that set (the nucleus) and discards the rest.
  4. Samples the next token from the nucleus only, with the kept probabilities renormalized so they sum to 1.
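
The four steps above can be sketched in plain Python. This is a minimal illustration on a made-up five-token distribution, not any library's implementation; the names `build_nucleus` and `sample_from_nucleus` and the toy probabilities are assumptions chosen for the example.

```python
import random

def build_nucleus(probs, p):
    """Return the smallest list of (token, prob) pairs, most likely first,
    whose probabilities sum to at least p."""
    # Step 1: sort tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    # Step 2: accumulate until the running total reaches p.
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Step 3: everything not in `nucleus` is discarded.
    return nucleus

def sample_from_nucleus(nucleus, rng):
    """Step 4: sample one token from the nucleus only."""
    tokens = [token for token, _ in nucleus]
    weights = [prob for _, prob in nucleus]
    # random.choices treats weights as relative, so this renormalizes implicitly.
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution (toy values, not from a real model).
probs = {"cat": 0.6, "dog": 0.2, "bird": 0.12, "fish": 0.05, "xylophone": 0.03}

nucleus = build_nucleus(probs, p=0.9)
kept_tokens = [token for token, _ in nucleus]
print(kept_tokens)  # ['cat', 'dog', 'bird']

rng = random.Random(0)  # seeded only to make the run repeatable
next_token = sample_from_nucleus(nucleus, rng)
print(next_token)
```

Here "fish" and "xylophone" form the discarded tail: the first three tokens already cover about 0.92 of the probability mass.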

So the unlikely "tail" of weird tokens is cut off, while natural variety is preserved.

Why It's Adaptive (the Key Idea)

The crucial thing about top-p is that the nucleus size changes from step to step:

  • When the model is confident (one token dominates), few tokens are needed to reach p → output stays focused.
  • When the model is unsure (probabilities are spread out), many tokens are needed to reach p → output gets more variety.

This automatic adjustment is a main reason top-p is so widely used.
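
To make the adaptive behaviour concrete, here is a small sketch that counts how many tokens the nucleus holds for a peaked and a flat distribution at the same p. The helper name `nucleus_size` and both distributions are invented for illustration.

```python
def nucleus_size(probs, p):
    """Count how many of the most likely tokens are needed to reach p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    # Floating-point rounding left the total just under p: keep everything.
    return len(probs)

# Toy distributions (assumed values).
confident = [0.95, 0.02, 0.01, 0.01, 0.01]  # one token dominates
unsure = [0.22, 0.21, 0.20, 0.19, 0.18]     # probability is spread out

confident_size = nucleus_size(confident, p=0.9)
unsure_size = nucleus_size(unsure, p=0.9)
print(confident_size)  # 1
print(unsure_size)     # 5
```

The value of p is identical in both calls; only the shape of the distribution changes the nucleus size.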

Top-P vs. Temperature

They do different jobs and are often used together:

  • Temperature reshapes the whole distribution (makes it peaked or flat).
  • Top-p truncates the distribution (cuts off the unlikely tail).

A common setup: temperature sets the randomness, top-p removes the worst options.
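
The two controls also interact, because top-p is applied to the distribution that temperature has already reshaped. The sketch below applies a temperature-scaled softmax to a fixed set of made-up logits and then counts the nucleus at p = 0.9. The function names and logits are assumptions for illustration, and applying temperature before top-p is one common ordering rather than a universal rule.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature gives a more peaked result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_indices(probs, p):
    """Indices of the smallest set of top tokens whose probabilities reach p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

logits = [3.0, 2.0, 1.0, 0.0, -1.0]  # hypothetical model outputs

sizes = {}
for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature)   # temperature reshapes
    kept = nucleus_indices(probs, p=0.9)   # top-p truncates
    sizes[temperature] = len(kept)

print(sizes)  # {0.5: 2, 1.0: 3, 2.0: 4}
```

With the same p, a higher temperature flattens the distribution and so lets more tokens into the nucleus, which is one reason to avoid pushing both settings high at once.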

Top-P vs. Top-K

            Top-p (nucleus)            Top-k
Keeps       A set summing to p         A fixed number k of tokens
Set size    Adapts to confidence       Always the same
Strength    Flexible, context-aware    Simple and predictable

Top-p adapts to each step; top-k always keeps the same count regardless of confidence.
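
A short comparison sketch, using made-up distributions and hypothetical helper names `top_k` and `top_p`, shows the difference in how many tokens each method keeps:

```python
def top_k(probs, k):
    """Keep the k most likely probabilities, whatever the distribution looks like."""
    return sorted(probs, reverse=True)[:k]

def top_p(probs, p):
    """Keep the smallest set of top probabilities whose sum reaches p."""
    kept, cumulative = [], 0.0
    for prob in sorted(probs, reverse=True):
        kept.append(prob)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

# Toy distributions (assumed values).
confident = [0.95, 0.02, 0.01, 0.01, 0.01]
unsure = [0.22, 0.21, 0.20, 0.19, 0.18]

for name, probs in (("confident", confident), ("unsure", unsure)):
    k_count = len(top_k(probs, k=3))
    p_count = len(top_p(probs, p=0.9))
    print(name, "top-k keeps", k_count, "top-p keeps", p_count)
# confident top-k keeps 3 top-p keeps 1
# unsure top-k keeps 3 top-p keeps 5
```

In the confident case, k = 3 still lets two low-probability tokens through; in the unsure case it cuts off two tokens that are nearly as likely as the rest.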

Typical Values

  • p = 1.0: consider all tokens (no filtering).
  • p = 0.9: a common default; keep the top ~90% of probability mass.
  • p = 0.5: more focused and conservative.

Lower p → more focused/safe; higher p → more diverse.
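
As a rough illustration of these settings, the sketch below sweeps the three values over one made-up distribution. The helper `nucleus_size` and the probabilities are assumptions for the example.

```python
def nucleus_size(probs, p):
    """Count how many of the most likely tokens are needed to reach p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    # Rounding can leave the total a hair under 1.0, so p = 1.0 falls through to here.
    return len(probs)

probs = [0.6, 0.2, 0.12, 0.05, 0.03]  # toy distribution

sizes = {p: nucleus_size(probs, p) for p in (0.5, 0.9, 1.0)}
print(sizes)  # {0.5: 1, 0.9: 3, 1.0: 5}
```

At p = 0.5 only the single most likely token survives here, which makes the output effectively deterministic for this step.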

Why Use Top-P?

  • Removes the unlikely tail: avoids strange or incoherent tokens.
  • Keeps natural variety: doesn't over-restrict like a small fixed k might.
  • Adapts automatically: no need to guess the right number of candidates.

Code Example
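
A minimal sketch in plain Python, offered as an illustration rather than a production implementation. The function names and the toy probabilities are assumptions. `nucleus_simplified` stops before the running total would exceed p, while `nucleus_full` also keeps the token that crosses p.

```python
def nucleus_simplified(sorted_probs, p):
    """Accumulate while the running total stays at or below p.
    The token that would cross p is dropped."""
    kept, cumulative = [], 0.0
    for prob in sorted_probs:
        if cumulative + prob > p:
            break
        kept.append(prob)
        cumulative += prob
    return kept

def nucleus_full(sorted_probs, p):
    """Accumulate until the running total reaches p, keeping the crossing token."""
    kept, cumulative = [], 0.0
    for prob in sorted_probs:
        kept.append(prob)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

# Toy distribution, already sorted from most to least likely.
probs = [0.6, 0.2, 0.12, 0.05, 0.03]

simplified = nucleus_simplified(probs, p=0.9)
full = nucleus_full(probs, p=0.9)
print(simplified)  # [0.6, 0.2]
print(full)        # [0.6, 0.2, 0.12]

# Edge case: when the top token alone exceeds p, the simplified loop keeps nothing.
peaked = [0.95, 0.03, 0.02]
print(nucleus_simplified(peaked, p=0.9))  # []
print(nucleus_full(peaked, p=0.9))        # [0.95]
```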


This builds the nucleus by accumulating probabilities up to p. (Simplified; real implementations also keep the token that crosses p.)

Tips

  • p ≈ 0.9 is a solid default for natural, varied text.
  • Combine it with a sensible temperature rather than maximising both.
  • Lower p for more precise, predictable output.

Summary

  • Top-p (nucleus sampling) keeps the smallest set of top tokens summing to p, then samples from it.
  • Its nucleus size adapts: small when the model is confident, larger when it's unsure.
  • It truncates the distribution (vs. temperature, which reshapes it) and is often used together with temperature.
  • Unlike top-k (a fixed count), top-p adjusts automatically to context.
  • A typical default is p ≈ 0.9; lower for focus, higher for diversity.