Next Token Prediction

Last updated: Jun 29, 2026

Author: Vinay Adari


Everything inside a generative Transformer builds up to one deceptively simple task: predict the next token. Given a sequence of tokens, the model produces a probability for every possible next token, picks one, and repeats. This single mechanism — next token prediction — is what powers text generation in all large language models. Master it, and you understand how an LLM actually "writes."

💡 In one line: An LLM generates text by repeatedly predicting a probability distribution over the next token, choosing a token from it, and appending that token to the sequence.

The Final Step: From Vectors to a Token

After the input passes through all the Transformer blocks, each token has a final, context-rich vector. To predict the next token, the model takes the last token's vector and runs it through an output head:

Linear layer — projects the vector to the size of the vocabulary, producing one score per possible token. These raw scores are called logits.
Softmax — converts the logits into a probability distribution over the whole vocabulary (all values between 0 and 1, summing to 1).
Select — choose a token from that distribution.
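As a rough illustration of the Linear step, the sketch below projects one vector to vocabulary-sized logits with NumPy. The sizes, the random weights `W_out` and `b_out`, and the helper name `output_head` are all made up for this example; a trained model would use learned weights and far larger dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab_size = 8, 5  # toy sizes, chosen only for illustration

# Hypothetical output-head parameters; a trained model would have learned these.
W_out = rng.normal(size=(d_model, vocab_size))
b_out = np.zeros(vocab_size)


def output_head(last_hidden: np.ndarray) -> np.ndarray:
    """Project the last token's vector to one logit per vocabulary entry."""
    return last_hidden @ W_out + b_out


# Stand-in for the context-rich vector of the last token.
last_hidden = rng.normal(size=d_model)
logits = output_head(last_hidden)
print(logits.shape)  # (5,)
```

Softmax and selection then operate on `logits`.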

Logits and Softmax

Logits are the raw, unbounded scores — one per token in the vocabulary.
Softmax turns them into probabilities. A token with a higher logit gets a higher probability.

So the model never outputs a word directly — it outputs a probability for every word it knows.
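For reference, the standard softmax formula can be written as follows, where the symbols are notation introduced here: z_i is the logit of token i, V is the vocabulary size, and p_i is the resulting probability.

```latex
p_i = \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}}, \qquad i = 1, \dots, V
```

As a small worked case, the logits (2, 1, 0) become probabilities of roughly (0.665, 0.245, 0.090): the ordering is preserved and the values sum to 1.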

How a Token is Chosen (Decoding Strategies)

Once we have the probabilities, how do we pick? This is decoding, and the choice shapes the output's style:

Strategy	How it picks	Effect
Greedy	Always picks the highest-probability token	Deterministic, but can be repetitive
Temperature	Divides the logits by T before softmax, then samples	T < 1 = more focused, T > 1 = more varied
Top-k	Samples from the k most probable tokens (renormalized)	Controlled variety
Top-p (nucleus)	Samples from the smallest set of tokens whose cumulative probability reaches p	Adaptive variety

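This is why the same prompt can give different answers: every strategy except greedy draws the token at random from the distribution. It is also why a "temperature" setting makes a model more varied when raised and more deterministic when lowered.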
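The strategies in the table can be sketched in a few lines of NumPy. This is a simplified illustration, not how any particular library implements them; the function names `apply_temperature`, `top_k_filter`, and `top_p_filter` and the example logits are invented for this sketch.

```python
import numpy as np


def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()


def apply_temperature(logits, temperature):
    """Divide logits by T > 0 before softmax; T < 1 sharpens, T > 1 flattens."""
    return softmax(np.asarray(logits, dtype=float) / temperature)


def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]            # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()


# Example: temperature first, then nucleus filtering, then one random draw.
rng = np.random.default_rng(0)
probs = apply_temperature([2.0, 1.0, 0.0], temperature=0.7)
token_id = int(rng.choice(len(probs), p=top_p_filter(probs, p=0.9)))
```

Greedy decoding needs no filter at all: it is simply `np.argmax(probs)`. In practice these filters are often combined, for example temperature followed by top-p as in the last lines above.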

The Generation Loop (Autoregressive)

Generation is a loop (the same one from the GPT article):

Predict the next token.
Append it to the sequence.
Feed the longer sequence back in.
Predict again, and repeat until a stop signal, such as an end-of-sequence token or a maximum length.

One token at a time, the model writes a whole response.
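The loop can be sketched with a toy stand-in for the model. Everything here is hypothetical: the five-word vocabulary, the `TOY_LOGITS` table, and the `toy_model` function, which looks only at the last token, whereas a real LLM conditions on the whole sequence. Greedy selection is used to keep the example deterministic.

```python
import numpy as np

VOCAB = ["<eos>", "the", "cat", "sat", "down"]
EOS_ID = 0

# Hypothetical stand-in for a trained model: row i holds the logits for the
# token that follows token i.
TOY_LOGITS = np.array([
    [5.0, 0.0, 0.0, 0.0, 0.0],  # after <eos>  -> <eos>
    [0.0, 0.0, 5.0, 0.0, 0.0],  # after "the"  -> "cat"
    [0.0, 0.0, 0.0, 5.0, 0.0],  # after "cat"  -> "sat"
    [0.0, 0.0, 0.0, 0.0, 5.0],  # after "sat"  -> "down"
    [5.0, 0.0, 0.0, 0.0, 0.0],  # after "down" -> <eos>
])


def toy_model(token_ids):
    """Return next-token logits for the given sequence (uses the last token only)."""
    return TOY_LOGITS[token_ids[-1]]


def generate(prompt_ids, max_new_tokens=10):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):       # stop signal 1: length limit
        logits = toy_model(tokens)        # predict the next token
        next_id = int(np.argmax(logits))  # greedy selection
        if next_id == EOS_ID:             # stop signal 2: end-of-sequence token
            break
        tokens.append(next_id)            # append; the longer sequence is fed back in
    return tokens


output_ids = generate([1])
print(" ".join(VOCAB[i] for i in output_ids))  # the cat sat down
```

Swapping `np.argmax` for a random draw from the softmax probabilities would turn this into sampled generation without changing the loop itself.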

How Training Works

The model learns next token prediction from huge amounts of text. For every position, it predicts the next token and is corrected using cross-entropy loss: the negative log of the probability it assigned to the actual next token, which is small when the model gave that token a high probability and large when it did not. Repeated across billions of tokens, this teaches the model grammar, facts, reasoning patterns, and style. No hand-written labels are needed, because the text itself provides the answers.
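To make the loss concrete, here is a minimal sketch of cross-entropy for a single position, assuming a three-token vocabulary and made-up logits. The helper names `log_softmax` and `cross_entropy` are defined here for illustration; real training averages this quantity over many positions and uses it to update the weights.

```python
import numpy as np


def log_softmax(logits):
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # for numerical stability
    return shifted - np.log(np.exp(shifted).sum())


def cross_entropy(logits, target_id):
    """Negative log-probability assigned to the actual next token."""
    return -log_softmax(logits)[target_id]


logits = [2.0, 1.0, 0.0]              # made-up scores for a 3-token vocabulary
loss_good = cross_entropy(logits, 0)  # actual next token was the model's top choice
loss_bad = cross_entropy(logits, 2)   # actual next token was the model's last choice
print(round(loss_good, 2), round(loss_bad, 2))  # 0.41 2.41
```

The same logits yield a much larger loss when the true token was given low probability, and that difference is the signal training uses to correct the model.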

Why Such a Simple Task Is So Powerful

To predict the next token well across all of human writing, a model has to implicitly learn an enormous amount: language structure, world facts, logic, even reasoning patterns. Next token prediction at scale is what gives rise to the surprising, emergent abilities of large language models. The entire Transformer exists to do this one thing — extremely well.

Code Example

This shows the heart of decoding: logits → softmax → pick (greedily or by sampling).
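A minimal sketch of that pipeline, assuming NumPy. The vocabulary and the logits are made up for illustration; in a real model the logits would come from the output head.

```python
import numpy as np

vocab = ["cat", "dog", "sat", "ran", "the"]    # hypothetical toy vocabulary
logits = np.array([2.0, 1.0, 0.5, 0.2, -1.0])  # made-up scores, one per token


def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()


# logits -> probabilities
probs = softmax(logits)

# Pick greedily: always the highest-probability token.
greedy_id = int(np.argmax(probs))

# Pick by sampling: draw one token at random according to the probabilities.
rng = np.random.default_rng(seed=0)
sampled_id = int(rng.choice(len(vocab), p=probs))

print("greedy :", vocab[greedy_id])
print("sampled:", vocab[sampled_id])
```

The greedy pick is always "cat" for these logits, while the sampled pick depends on the random seed and can land on any token with nonzero probability.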

Summary

A generative Transformer works by predicting the next token, over and over.
The output head turns the final vector into logits (via Linear), then probabilities (via Softmax).
A decoding strategy (greedy, temperature, top-k, top-p) selects the actual token.
Generation is an autoregressive loop: predict → append → repeat.
Trained with cross-entropy loss, this single objective — at scale — produces the remarkable abilities of modern LLMs.

This completes the Transformer series — from the limitations of RNNs, through the architecture, attention, and variants, to how a model finally predicts the next token. Together, these pieces explain how modern Generative AI actually works.