Model Parameters

Last updated: Jun 29, 2026

Author: Vinay Adari

Model Parameters (Generation Settings)

Last updated: Jun 21, 2026
Author: Aspirant Edu Team

When you use an LLM, you can control how it generates text through a set of adjustable settings, often called model parameters, generation parameters, or inference parameters. These do not change the model's trained weights. They tune its behaviour at generation time: how random, how long, and how focused the output is. Learning these settings is what lets you get focused, consistent answers in one moment and creative, varied writing the next, all from the same model.

💡 In one line: Generation parameters like temperature and top-p control how an LLM picks its next token — shaping output without changing the model itself.

Two Meanings of "Parameters"

It's worth separating two ideas that share the same word:

Trainable parameters (weights) — the billions of learned numbers inside the model, fixed after training. (Covered in Parameters & Model Size.)
Generation parameters — the settings you adjust when calling the model to shape its output. This article covers these.

Most of these settings act directly on the next-token prediction step (covered in its own article): they decide how a token is chosen from the model's probability distribution. Max tokens and stop sequences instead decide when generation ends.

Temperature

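Temperature controls randomness. The logits are divided by the temperature T before softmax:

Low (→ 0): the distribution becomes peaked, so the model almost always picks the top token. Focused, near-deterministic. (A value of exactly 0 is usually treated as always picking the top token.)
High (above 1): the distribution flattens, so less likely tokens get a real chance. Creative, varied, riskier. At exactly 1 the model's distribution is left unchanged.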
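
To make the scaling concrete, here is a minimal sketch in plain Python. The function name and the three example logits are made up for illustration; real inference libraries do the same arithmetic on tensors over the whole vocabulary.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        # Dividing by zero is undefined; T = 0 is normally handled as argmax.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Running it shows the top token's share shrinking as T rises, while the least likely token's share grows.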

Top-k and Top-p

These limit which tokens can be sampled:

Top-k — only consider the k most likely tokens, ignore the rest. (e.g. k = 40)
Top-p (nucleus) — consider the smallest set of tokens whose probabilities add up to p. (e.g. p = 0.9) It adapts: few tokens when the model is confident, more when it's unsure.

Top-p is often the preferred default because its cutoff adapts to the model's confidence at each step, whereas top-k always keeps the same number of tokens.
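
The two filters can be sketched as follows, assuming the probabilities have already been computed. The function names and the two example distributions are illustrative, not taken from any particular library.

```python
def _renormalise(probs, keep):
    """Zero out tokens not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalise(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalise(probs, keep)

confident = [0.92, 0.05, 0.02, 0.01]  # model is sure of the next token
unsure = [0.30, 0.25, 0.25, 0.20]     # model is spread across options

print(top_p_filter(confident, 0.9))  # a single token survives
print(top_p_filter(unsure, 0.9))     # all four tokens survive
print(top_k_filter(unsure, 2))       # always exactly two tokens survive
```

With the same p = 0.9, the nucleus holds one token for the confident distribution and four for the unsure one, which is the adaptive behaviour described above.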

Max Tokens

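Max tokens caps the length of the generated output. It's essential for controlling cost and response size, since longer outputs use more compute. It is a hard cutoff, not a target: if the limit is reached, the output is simply cut off at that point.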

Frequency and Presence Penalties

These reduce repetition:

Frequency penalty — penalises a token the more often it has already appeared (discourages repeating the same words).
Presence penalty — penalises a token if it has appeared at all (encourages introducing new topics).
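
One common way to apply both penalties is to subtract them from the logits before sampling. The sketch below assumes that additive form; the exact formula varies by provider, and the function name and example values are hypothetical.

```python
from collections import Counter

def apply_repetition_penalties(logits, generated_ids,
                               frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that already appear in the output."""
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= frequency_penalty * count  # grows with every repeat
        adjusted[token_id] -= presence_penalty           # flat, applied once if seen
    return adjusted

logits = [1.0, 1.0, 1.0, 1.0]  # four candidate tokens, equally likely
history = [2, 2, 2, 0]         # token 2 was used three times, token 0 once
adjusted = apply_repetition_penalties(
    logits, history, frequency_penalty=0.5, presence_penalty=0.3
)
print(adjusted)
```

Token 2 is pushed down much further than token 0 because the frequency term scales with its count, while unused tokens 1 and 3 are left alone.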

Stop Sequences

Stop sequences are strings that tell the model to stop generating as soon as they appear — useful for ending output cleanly (e.g. stop at "\n\n" or a custom marker).
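
The max-tokens cap and stop sequences both decide when generation ends, and a toy loop shows how they interact. This is a sketch with a scripted stand-in for the model; the "stop" and "length" labels and the choice to trim the stop string from the result are assumptions for illustration, since providers differ on both.

```python
from itertools import cycle, islice

def generate(token_stream, max_tokens, stop_sequences=()):
    """Collect tokens until the cap is hit or a stop sequence appears."""
    text = ""
    for token in islice(token_stream, max_tokens):
        text += token
        for stop in stop_sequences:
            cut = text.find(stop)
            if cut != -1:
                return text[:cut], "stop"  # trim the stop string itself
    return text, "length"  # ran out of token budget

def fake_token_stream():
    """Scripted stand-in for a model: repeats the same tokens forever."""
    return cycle(["Hello", " world", ".", "\n\n", "More", " text"])

print(generate(fake_token_stream(), max_tokens=50, stop_sequences=["\n\n"]))
print(generate(fake_token_stream(), max_tokens=2))
```

The first call ends cleanly at the blank line; the second is truncated after two tokens because the cap is reached first.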

Quick Reference

Parameter	What it does	Typical range
Temperature	Randomness of choices	0 – 2 (often 0 – 1)
Top-k	Limit to top k tokens	1 – 100
Top-p	Nucleus sampling threshold	0 – 1
Max tokens	Cap output length	task-dependent
Frequency penalty	Reduce repeated words	0 – 2
Presence penalty	Encourage new topics	0 – 2
Stop sequences	End generation	string(s)

Code Example

These settings are passed per request — you can change them anytime without retraining.
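
As a minimal sketch of what "per request" means, the snippet below builds two payloads for the same model with different settings. GenerationSettings and build_request are hypothetical helpers, not a real client library, and the field names follow a common convention that your provider may spell differently.

```python
from dataclasses import dataclass, asdict, field

@dataclass
class GenerationSettings:
    """Hypothetical container for per-request generation settings."""
    temperature: float = 0.7
    top_p: float = 0.9
    max_tokens: int = 256
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0
    stop: list = field(default_factory=list)

def build_request(prompt, settings):
    """Assemble a request payload; a real client would send it to the provider."""
    return {"prompt": prompt, **asdict(settings)}

# Same model, different behaviour: only the settings change between calls.
factual = build_request(
    "What is the boiling point of water at sea level?",
    GenerationSettings(temperature=0.2, max_tokens=100),
)
creative = build_request(
    "Write a short poem about the sea.",
    GenerationSettings(temperature=0.9, max_tokens=300, stop=["\n\n\n"]),
)
print(factual)
print(creative)
```

Nothing about the model is modified here: each request simply carries its own settings.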

Practical Tips

Factual / precise tasks → low temperature (0 – 0.3).
Creative tasks → higher temperature (0.7 – 1.0).
Use top-p ≈ 0.9 as a solid default.
Set max tokens to control length and cost.
Usually combine temperature with top-p (or top-k) — not all three aggressively at once.

Summary

Generation parameters control how an LLM produces output — without changing its weights.
Temperature sets randomness; top-k and top-p limit which tokens can be sampled.
Max tokens caps length; frequency/presence penalties reduce repetition; stop sequences end output.
The sampling settings act on the next-token prediction step, shaping how a token is picked from the probability distribution.
Tune them per task: low temperature for facts, higher for creativity.