Next Token Prediction
Everything inside a generative Transformer builds up to one deceptively simple task: predict the next token. Given a sequence of tokens, the model produces a probability for every possible next token, picks one, and repeats. This single mechanism, next token prediction, is what powers text generation in all large language models. Master it, and you understand how an LLM actually "writes."
💡 In one line: An LLM generates text by repeatedly predicting the next token and appending it to the sequence.
The Final Step: From Vectors to a Token
After the input passes through all the Transformer blocks, each token has a final, context-rich vector. To predict the next token, the model takes the last token's vector and runs it through an output head:
- Linear layer: projects the vector to the size of the vocabulary, producing one score per possible token. These raw scores are called logits.
- Softmax: converts the logits into a probability distribution over the whole vocabulary (all values between 0 and 1, summing to 1).
- Select: choose a token from that distribution.
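The three steps above can be sketched in plain Python. This is a minimal illustration, not a real model: the three-token vocabulary, the 2-dimensional hidden vector, and the weight values are all made up for the example, whereas a real output head uses a learned matrix with one row per vocabulary token.

```python
import math

def linear(vector, weights, bias):
    """Project a hidden vector to one raw score (logit) per vocabulary token."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy setup (assumed values, for illustration only).
vocab = ["cat", "dog", "mat"]
hidden = [0.5, -1.0]                     # final vector of the last token
weights = [[1.0, 0.0],                   # one row of weights per vocabulary token
           [0.0, 1.0],
           [2.0, 1.0]]
bias = [0.0, 0.0, 0.0]

logits = linear(hidden, weights, bias)   # step 1: linear layer -> [0.5, -1.0, 0.0]
probs = softmax(logits)                  # step 2: softmax
next_token = vocab[probs.index(max(probs))]  # step 3: select (greedy here)

print(logits, probs, next_token)
```

Here the selection step simply takes the highest-probability token; other ways of selecting are possible.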
Logits and Softmax
- Logits are the raw, unbounded scores, one per token in the vocabulary.
- Softmax turns them into probabilities. A token with a higher logit gets a higher probability.
So the model never outputs a word directly. It outputs a probability for every token in its vocabulary.
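For reference, softmax can be written as a formula. The symbols are notation introduced here, not taken from the article: z_i is the logit of token i, V is the vocabulary size, and p_i is the resulting probability.

$$
p_i = \frac{\exp(z_i)}{\sum_{j=1}^{V} \exp(z_j)}
$$

Because the exponential is increasing and every token shares the same denominator, a larger logit always maps to a larger probability, and the p_i sum to 1.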
How a Token is Chosen (Decoding Strategies)
Once we have the probabilities, how do we pick? This is decoding, and the choice shapes the output's style:
| Strategy | How it picks | Effect |
|---|---|---|
| Greedy | Always the highest-probability token | Safe, but can be repetitive |
| Temperature | Divides the logits by a temperature T before softmax, then samples | Low T = focused, high T = more random ("creative") |
| Top-k | Sample from the k most probable tokens | Controlled variety |
| Top-p (nucleus) | Sample from the smallest set of tokens whose probabilities sum to at least p | Adaptive variety |
This is why the same prompt can give different answers, and why a "temperature" setting makes a model more creative or more deterministic.
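The strategies in the table can be sketched as small functions over a list of logits. This is an illustrative sketch under assumptions: the logits are made up for a four-token vocabulary, and the helper names (`top_k_filter`, `top_p_filter`, `sample`) are hypothetical, not any library's API.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T < 1 sharpens, T > 1 flattens)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

def sample(probs, rng):
    """Draw one token index at random, weighted by probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]           # made-up scores for four tokens
rng = random.Random(0)                   # seeded so runs are repeatable

greedy = max(range(len(logits)), key=lambda i: logits[i])   # always the top token
sharp = softmax(logits, temperature=0.5)                    # low T: more peaked
flat = softmax(logits, temperature=2.0)                     # high T: more spread out
top2 = top_k_filter(softmax(logits), k=2)                   # only 2 candidates remain
nucleus = top_p_filter(softmax(logits), p=0.8)              # candidate set adapts to p
sampled = sample(nucleus, rng)
```

In practice these are often combined, for example applying a temperature first and then a top-k or top-p filter before sampling.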
The Generation Loop (Autoregressive)
Generation is a loop (the same one from the GPT article):
- Predict the next token.
- Append it to the sequence.
- Feed the longer sequence back in.
- Predict again, and repeat until a stop signal (such as an end-of-sequence token or a length limit).
One token at a time, the model writes a whole response.
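The loop can be sketched with a toy stand-in for the model. Everything here is assumed for illustration: `toy_model` is a hypothetical lookup table that only looks at the last token (a real LLM conditions on the whole sequence), and `<eos>` is an assumed name for the stop token.

```python
VOCAB = ["<eos>", "the", "cat", "sat"]

# Hypothetical "model": logits over VOCAB, keyed by the last token only.
NEXT_TOKEN_LOGITS = {
    "the": [0.0, 0.0, 3.0, 0.5],
    "cat": [0.0, 0.0, 0.0, 3.0],
    "sat": [3.0, 0.0, 0.0, 0.0],
}

def toy_model(tokens):
    """Return logits for the next token given the sequence so far."""
    return NEXT_TOKEN_LOGITS[tokens[-1]]

def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):                  # length limit as a safety stop
        logits = toy_model(tokens)                   # predict
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
        next_token = VOCAB[next_id]
        if next_token == "<eos>":                    # stop signal
            break
        tokens.append(next_token)                    # append, then feed back in
    return tokens

result = generate(["the"])
print(result)  # ['the', 'cat', 'sat']
```

Swapping the greedy pick for a sampling step would make the same loop produce varied outputs.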
How Training Works
The model learns next token prediction from huge amounts of text. For every position, it predicts the next token and is corrected using cross-entropy loss, which penalizes the model according to how little probability it assigned to the actual next token. Repeated across billions of tokens, this teaches the model grammar, facts, reasoning, and style. No manual labels are needed: the text itself provides the answers.
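A small sketch of that objective, assuming made-up token ids and logits rather than a real model: the targets are the input sequence shifted by one position, and the loss at each position is the negative log of the probability given to the true next token.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_id):
    """Negative log-probability assigned to the actual next token."""
    return -math.log(softmax(logits)[target_id])

# The text supplies its own answers: each token is the target for the one before it.
token_ids = [1, 0, 2]          # hypothetical ids for a three-token snippet
inputs = token_ids[:-1]        # [1, 0]
targets = token_ids[1:]        # [0, 2]

# Assumed model outputs, one list of logits per input position.
logits_per_position = [
    [2.0, 0.0, 0.0],           # confident in token 0, which is the true target
    [0.0, 0.0, 0.0],           # no preference, so the true token gets 1/3
]

losses = [cross_entropy(l, t) for l, t in zip(logits_per_position, targets)]
mean_loss = sum(losses) / len(losses)
print(losses, mean_loss)
```

The confident, correct prediction gets a small loss, while the uniform guess gets a loss of log(3). Training adjusts the weights to push the average loss down.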
Why Such a Simple Task Is So Powerful
To predict the next token well across all of human writing, a model has to implicitly learn an enormous amount: language structure, world facts, logic, even reasoning patterns. Next token prediction at scale is what gives rise to the surprising, emergent abilities of large language models. The entire Transformer exists to do this one thing, extremely well.
Code Example
This shows the heart of decoding: logits → softmax → pick (greedily or by sampling).
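A minimal sketch of that pipeline in plain Python follows. The vocabulary and logits are invented for illustration (imagine the prompt "The cat sat on the"), and the function names are hypothetical rather than taken from any library.

```python
import math
import random

def softmax(logits):
    """Convert raw logits into probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_greedy(probs):
    """Return the index of the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def pick_sample(probs, rng):
    """Return a token index drawn at random, weighted by probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

vocab = ["mat", "sofa", "moon"]     # assumed toy vocabulary
logits = [3.0, 1.5, 0.2]            # assumed scores from the output head

probs = softmax(logits)             # logits -> softmax
greedy_token = vocab[pick_greedy(probs)]                      # pick: greedy
rng = random.Random(42)
sampled_tokens = [vocab[pick_sample(probs, rng)] for _ in range(5)]  # pick: sampling

print(greedy_token)     # mat
print(sampled_tokens)
```

Greedy picking returns "mat" every time, while sampling returns "mat" most often but can also return the less likely tokens.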
Summary
- A generative Transformer works by predicting the next token, over and over.
- The output head turns the final vector into logits (via Linear), then probabilities (via Softmax).
- A decoding strategy (greedy, temperature, top-k, top-p) selects the actual token.
- Generation is an autoregressive loop: predict → append → repeat.
- Trained with cross-entropy loss, this single objective, at scale, produces the remarkable abilities of modern LLMs.
This completes the Transformer series: from the limitations of RNNs, through the architecture, attention, and variants, to how a model finally predicts the next token. Together, these pieces explain how modern Generative AI actually works.