Next Token Prediction

Everything inside a generative Transformer builds up to one deceptively simple task: predict the next token. Given a sequence of tokens, the model produces a probability for every possible next token, picks one, and repeats. This single mechanism, next token prediction, is what powers text generation in all large language models. Master it, and you understand how an LLM actually "writes."

💡 In one line: An LLM generates text by repeatedly predicting the next token and appending it to the sequence.

The Final Step: From Vectors to a Token

After the input passes through all the Transformer blocks, each token has a final, context-rich vector. To predict the next token, the model takes the last token's vector and runs it through an output head:

  1. Linear layer: projects the vector to the size of the vocabulary, producing one score per possible token. These raw scores are called logits.
  2. Softmax: converts the logits into a probability distribution over the whole vocabulary (all values between 0 and 1, summing to 1).
  3. Select: choose a token from that distribution.
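
The first two steps can be sketched in a few lines of NumPy. This is only an illustration under assumed toy sizes: the names `h_last`, `W`, and `b`, the random weights, and the dimensions are made up here and stand in for a trained model's final vector and output layer.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab_size = 8, 5          # toy sizes, assumed for illustration
h_last = rng.normal(size=d_model)   # stand-in for the last token's final vector

# Step 1: linear layer. W has one row per vocabulary entry, so the result
# holds one raw score (logit) per possible next token.
W = rng.normal(size=(vocab_size, d_model))
b = np.zeros(vocab_size)
logits = W @ h_last + b

# Step 2: softmax. Subtracting the max first avoids overflow in exp()
# and leaves the resulting probabilities unchanged.
exp_scores = np.exp(logits - logits.max())
probs = exp_scores / exp_scores.sum()

print(logits.shape)   # (5,)
print(probs.sum())    # 1.0 up to floating-point rounding
```

Step 3, selecting a token from `probs`, is a separate choice made at generation time rather than part of the network itself.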

Logits and Softmax

  • Logits are the raw, unbounded scores, one per token in the vocabulary.
  • Softmax turns them into probabilities. A token with a higher logit gets a higher probability.

So the model never outputs a word directly; it outputs a probability for every token in its vocabulary.
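
For reference, the standard softmax definition can be written as follows. The symbols are notation assumed here, not taken from the article: z_i is the logit of token i, V is the vocabulary size, and p_i is the resulting probability.

```latex
p_i = \frac{\exp(z_i)}{\sum_{j=1}^{V} \exp(z_j)}, \qquad i = 1, \dots, V
```

Because exp is increasing, a higher logit always maps to a higher probability, and dividing by the sum forces the values to add up to 1. As a small worked case, the logits (2, 1, 0) become probabilities of roughly (0.665, 0.245, 0.090).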

How a Token is Chosen (Decoding Strategies)

Once we have the probabilities, how do we pick? This is decoding, and the choice shapes the output's style:

Strategy         How it picks                                         Effect
Greedy           Always the highest-probability token                 Safe, but can be repetitive
Temperature      Divides the logits by T before softmax               Low T = focused, high T = creative
Top-k            Samples from the k most probable tokens              Controlled variety
Top-p (nucleus)  Samples from the smallest set summing to at least p  Adaptive variety

This is why the same prompt can give different answers, and why a "temperature" setting makes a model more creative or more deterministic.
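
The strategies in the table can be sketched as small NumPy functions. This is a simplified illustration, not a production sampler: the function names and the example logits are hypothetical, and real libraries usually apply these filters to logits in batches.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def apply_temperature(logits, temperature):
    # Divide logits by T before softmax: T < 1 sharpens, T > 1 flattens.
    return softmax(logits / temperature)

def top_k_filter(probs, k):
    # Keep the k most probable tokens, zero the rest, renormalize.
    keep = np.argsort(probs)[-k:]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability reaches p.
    order = np.argsort(probs)[::-1]               # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # number of tokens to keep
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])   # hypothetical scores for 4 tokens
probs = softmax(logits)

greedy_token = int(np.argmax(probs))       # greedy: always the top token
rng = np.random.default_rng(0)
sampled_token = int(rng.choice(len(probs), p=top_p_filter(probs, 0.9)))
```

With these example logits, `top_p_filter(probs, 0.9)` keeps the three most probable tokens and drops the last one, so sampling can never return the least likely token.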

The Generation Loop (Autoregressive)

Generation is a loop (the same one from the GPT article):

  1. Predict the next token.
  2. Append it to the sequence.
  3. Feed the longer sequence back in.
  4. Predict again, and repeat until a stop signal (such as an end-of-sequence token or a length limit).

One token at a time, the model writes a whole response.
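
The loop can be sketched as follows. Everything model-related here is hypothetical: `toy_model` is a stand-in that just favors the next entry in a five-token toy vocabulary, so the example runs without a trained network, and greedy selection is used only to keep it short.

```python
import numpy as np

VOCAB = ["<eos>", "the", "cat", "sat", "down"]   # hypothetical toy vocabulary

def toy_model(token_ids):
    """Hypothetical stand-in for a Transformer: returns next-token logits.

    It favors the entry after the last token in VOCAB order and wraps
    around to <eos>, which guarantees that the loop below terminates.
    """
    logits = np.zeros(len(VOCAB))
    next_id = (token_ids[-1] + 1) % len(VOCAB)
    logits[next_id] = 5.0
    return logits

def generate(prompt_ids, max_new_tokens=10, eos_id=0):
    sequence = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_model(sequence)        # 1. predict the next token
        next_id = int(np.argmax(logits))    #    (greedy choice for simplicity)
        sequence.append(next_id)            # 2. append it to the sequence
        if next_id == eos_id:               # stop signal
            break
        # 3./4. the next iteration feeds the longer sequence back in
    return sequence

ids = generate([1])                         # start from "the"
print(" ".join(VOCAB[i] for i in ids))      # the cat sat down <eos>
```

Both stop conditions appear here: the end-of-sequence token ends generation early, and `max_new_tokens` caps the length if it never shows up.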

How Training Works

The model learns next token prediction from huge amounts of text. For every position, it predicts the next token and is corrected using cross-entropy loss: the negative log of the probability it assigned to the actual next token. Repeated across billions of tokens, this teaches the model grammar, facts, reasoning, and style. No manual labels are needed; the text itself provides the answers.
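
A minimal sketch of that loss, assuming the model's logits are already available as a NumPy array. The function names and the tiny example are illustrative; a real training setup would compute this inside a deep learning framework and backpropagate through it.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def next_token_loss(logits, token_ids):
    """Average cross-entropy over all positions.

    logits:    shape (seq_len, vocab_size); row t scores the token at t + 1
    token_ids: shape (seq_len + 1,); the raw text as token ids
    """
    targets = token_ids[1:]                 # the text shifted by one is the label
    log_probs = log_softmax(logits)
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked.mean()

token_ids = np.array([1, 2, 3])             # hypothetical ids for a 3-token text
uniform_logits = np.zeros((2, 4))           # a model with no preference at all
print(next_token_loss(uniform_logits, token_ids))   # ln(4) ≈ 1.386
```

A model that spreads probability evenly over 4 tokens scores ln(4); as it puts more probability on the tokens that actually follow, the loss falls toward 0.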

Why Such a Simple Task Is So Powerful

To predict the next token well across all of human writing, a model has to implicitly learn an enormous amount: language structure, world facts, logic, even reasoning patterns. Next token prediction at scale is what gives rise to the surprising, emergent abilities of large language models. The entire Transformer exists to do this one thing, extremely well.

Code Example
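
A minimal sketch of the decoding step, assuming NumPy. The four-word vocabulary and the logit values are made up for illustration; in a real model the logits would come from the output head.

```python
import numpy as np

vocab = ["cat", "dog", "mat", "sat"]        # hypothetical vocabulary
logits = np.array([1.2, 0.3, 2.5, 0.1])     # hypothetical raw scores

# logits -> softmax
exp_scores = np.exp(logits - logits.max())
probs = exp_scores / exp_scores.sum()

# pick, option 1: greedy (always the most probable token)
greedy = vocab[int(np.argmax(probs))]

# pick, option 2: sampling (draw a token in proportion to its probability)
rng = np.random.default_rng(seed=0)
sampled = vocab[int(rng.choice(len(vocab), p=probs))]

print(greedy)    # mat
print(sampled)   # depends on the seed
```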


This shows the heart of decoding: logits → softmax → pick (greedily or by sampling).

Summary

  • A generative Transformer works by predicting the next token, over and over.
  • The output head turns the final vector into logits (via Linear), then probabilities (via Softmax).
  • A decoding strategy (greedy, temperature, top-k, top-p) selects the actual token.
  • Generation is an autoregressive loop: predict → append → repeat.
  • Trained with cross-entropy loss, this single objective, applied at scale, produces the remarkable abilities of modern LLMs.

This completes the Transformer series: from the limitations of RNNs, through the architecture, attention, and variants, to how a model finally predicts the next token. Together, these pieces explain how modern Generative AI actually works.