Max Tokens

Max tokens caps how long the model's response can be. Unlike temperature, top-p, and top-k, which control which tokens are chosen, max tokens controls how many tokens are generated. It's a simple but essential setting for managing cost, speed, and runaway output.

💡 In one line: Max tokens sets the maximum length of the model's response; generation stops when the limit is reached.

What Max Tokens Does

It sets an upper limit on the number of tokens the model will generate. Generation stops at whichever comes first:

  1. The model reaches the max tokens limit, or
  2. The model produces an end-of-sequence / stop token (it finished naturally).
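
The two stopping conditions can be sketched as a small loop. This is a toy simulation rather than any real model API: the token stream is a plain Python list, and the names `EOS` and `generate` are assumptions made up for illustration.

```python
from typing import Iterable, List, Tuple

EOS = "<eos>"  # hypothetical end-of-sequence marker used only in this sketch


def generate(token_stream: Iterable[str], max_tokens: int) -> Tuple[List[str], str]:
    """Collect tokens until the cap is hit or the stream emits EOS.

    Returns the generated tokens and the reason generation stopped.
    """
    output: List[str] = []
    for token in token_stream:
        if len(output) >= max_tokens:
            return output, "max_tokens"  # the hard cap was reached first
        if token == EOS:
            return output, "end_of_sequence"  # the model finished naturally
        output.append(token)
    return output, "stream_exhausted"  # the toy stream simply ran out


stream = ["The", "cat", "sat", EOS]
print(generate(stream, max_tokens=10))  # stops on EOS
print(generate(stream, max_tokens=2))   # stops on the cap
```

With a generous cap the toy model finishes on its own; with a cap of 2 the same stream is cut after two tokens.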

It Controls Output Length Only

Max tokens limits only the output, that is, the tokens the model generates. It does not limit your input (the prompt). Both the input and the output, however, share the context window (see Context Window).

Relationship to the Context Window

A key constraint:

input tokens  +  max output tokens  ≤  context window

If you set max tokens too high alongside a long prompt, you can exceed the window. Always leave room for the output within the budget.
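
As a rough sketch of that budget arithmetic, the helper below clamps a requested max tokens value to whatever room the prompt leaves in the window. The function names and the window size of 8192 are illustrative assumptions, not values from any particular model.

```python
def max_output_budget(context_window: int, input_tokens: int) -> int:
    """Tokens left for the output once the prompt is counted (never negative)."""
    return max(context_window - input_tokens, 0)


def clamp_max_tokens(requested: int, context_window: int, input_tokens: int) -> int:
    """Lower the requested cap so that input + output fits the window."""
    return min(requested, max_output_budget(context_window, input_tokens))


# Illustrative numbers only: an 8192-token window and a 7000-token prompt.
print(clamp_max_tokens(requested=2000, context_window=8192, input_tokens=7000))
```

Here the prompt leaves 1192 tokens of room, so a request for 2000 output tokens is reduced to 1192. How a real service reacts to an over-budget request varies, so checking before sending is the safer habit.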

What Happens When the Limit Is Hit

If the model reaches max tokens before finishing, the output is simply cut off, often mid-sentence. The model does not wrap up or summarise; it just stops. So set the limit high enough for a complete answer.
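
Because a truncated answer can look like a finished one, it can help to check why generation stopped. The sketch below assumes a response object that carries a stop reason. The `Completion` class and its `stop_reason` values are hypothetical stand-ins, since real APIs use their own field names and values.

```python
from dataclasses import dataclass


@dataclass
class Completion:
    """Hypothetical response shape; real APIs name these fields differently."""
    text: str
    stop_reason: str  # assumed values: "max_tokens" or "end_of_sequence"


def was_truncated(completion: Completion) -> bool:
    """True when the cap, not the model, ended the response."""
    return completion.stop_reason == "max_tokens"


def render(completion: Completion) -> str:
    """Mark cut-off responses so they are not mistaken for complete ones."""
    if was_truncated(completion):
        return completion.text + " [truncated]"
    return completion.text


print(render(Completion("The capital of", "max_tokens")))
print(render(Completion("The capital of France is Paris.", "end_of_sequence")))
```

A caller could use the same check to retry with a higher limit instead of showing a marker.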

Why It Matters

  • Cost: you pay per output token, so the cap controls spend.
  • Latency: fewer tokens means a faster response.
  • Safety: prevents runaway, excessively long outputs.
  • Predictability: bounds the size of every response.

Max Tokens vs. Stop Sequences

Both end generation, but differently:

              Max tokens        Stop sequences
Triggers on   A token count     A specific string appearing
Type          Hard length cap   Content-based stop

They're often used together: a stop sequence ends the output cleanly, and max tokens acts as the safety cap.
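
A minimal sketch of the two working together, again as a toy simulation over a list of tokens rather than a real API call. The function name `generate_with_stop` and the reason strings are assumptions for illustration.

```python
from typing import Iterable, Tuple


def generate_with_stop(tokens: Iterable[str], max_tokens: int, stop: str) -> Tuple[str, str]:
    """Append tokens until the stop string appears or the token cap is hit."""
    text = ""
    count = 0
    for token in tokens:
        if count >= max_tokens:
            return text, "max_tokens"  # safety cap reached
        text += token
        count += 1
        cut = text.find(stop)
        if cut != -1:
            # Content-based stop: drop the stop string and anything after it.
            return text[:cut], "stop_sequence"
    return text, "stream_exhausted"


tokens = ["A: ", "4", "\n", "Q: ", "next"]
print(generate_with_stop(tokens, max_tokens=50, stop="\nQ:"))  # clean ending
print(generate_with_stop(tokens, max_tokens=3, stop="###"))    # cap takes over
```

In the first call the stop string ends the answer right after "A: 4". In the second call the stop string never appears, so the cap ends generation after three tokens.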

Practical Guidance

  • Set it to your expected answer length + a buffer.
  • Short answers → small (e.g. 50–100 tokens).
  • Long content → larger (e.g. 1000+ tokens).
  • Too low → truncated answers; too high → wasted budget and possible window overflow.
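
The "expected length + a buffer" rule can be written as a one-line helper. The 25% default buffer is an arbitrary assumption chosen for illustration, not a recommended value, and `suggest_max_tokens` is a hypothetical name.

```python
import math


def suggest_max_tokens(expected_tokens: int, buffer_ratio: float = 0.25) -> int:
    """Return the expected answer length plus a proportional buffer."""
    if expected_tokens < 0 or buffer_ratio < 0:
        raise ValueError("expected_tokens and buffer_ratio must be non-negative")
    return expected_tokens + math.ceil(expected_tokens * buffer_ratio)


print(suggest_max_tokens(80))    # short answer
print(suggest_max_tokens(1000))  # long content
```

With the default ratio, an expected 80-token answer gets a cap of 100 and an expected 1000-token answer gets a cap of 1250.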

Summary

  • Max tokens caps the number of tokens the model generates.
  • Generation stops at the limit or when the model finishes naturally, whichever comes first.
  • It limits the output only, but input + output must fit the context window.
  • Hitting the limit truncates the response (possibly mid-sentence), so allow enough room.
  • It controls cost, speed, and length, and pairs with stop sequences for clean endings.