Context Window
An LLM can only "see" a limited amount of text at once. The context window is the maximum number of tokens a model can process in a single request, and crucially, it includes both your input and its output. Everything the model knows about your current conversation must fit inside this window. It's one of the most important practical limits when working with LLMs.
💡 In one line: The context window is the model's working memory, the maximum number of tokens (input + output) it can handle in one request.
What is the Context Window?
It's the token budget for a single request, measured in tokens (recall the Tokens & Embeddings article). It holds everything the model considers:
- Your prompt
- The conversation history
- Any documents you paste in
- The model's generated response
All of these share the same budget.
Input + Output Share the Budget
If the window is N tokens, then prompt + response ≤ N. This has a practical consequence:
- A very long prompt leaves less room for the output.
- Reserving a long output leaves less room for input.
You're always budgeting both sides at once.
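The budgeting above comes down to simple arithmetic. Here is a minimal sketch, assuming an illustrative 8,000-token window and a hypothetical helper named `max_input_tokens`; neither comes from any particular model or API.

```python
def max_input_tokens(window: int, reserved_output: int) -> int:
    """Return how many tokens remain for input once output room is reserved."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= reserved_output <= window:
        raise ValueError("reserved_output must be between 0 and window")
    return window - reserved_output


# Illustrative numbers only: a hypothetical 8,000-token window.
WINDOW = 8_000
short_answer = max_input_tokens(WINDOW, reserved_output=500)    # room left for a long prompt
long_answer = max_input_tokens(WINDOW, reserved_output=4_000)   # room left when a long reply is reserved
print(short_answer, long_answer)
```

Reserving more room for the reply directly shrinks what is left for the prompt, history, and pasted documents.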
What Happens When You Exceed It?
When a conversation or document is too long for the window:
- Earlier tokens get truncated, so the model effectively "forgets" the beginning.
- Or the request errors out.
In a long chat, the earliest messages can silently fall out of the window, which is why a model may "forget" something you said much earlier.
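Both outcomes can be sketched over a plain list of tokens. The function name `apply_window` and its `on_overflow` parameter are hypothetical, chosen for illustration; real services decide this behaviour on their side, and it varies between them.

```python
def apply_window(tokens: list[str], window: int, on_overflow: str = "truncate") -> list[str]:
    """Fit a token list into a window by dropping the oldest tokens or by failing."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) <= window:
        return list(tokens)
    if on_overflow == "error":
        raise ValueError(f"{len(tokens)} tokens exceed the {window}-token window")
    if on_overflow == "truncate":
        # Keep only the most recent `window` tokens; the beginning is lost.
        return tokens[-window:]
    raise ValueError("on_overflow must be 'truncate' or 'error'")


# Toy tokens (one word or symbol each) standing in for a real tokenizer's output.
tokens = ["my", "name", "is", "Ada", ".", "what", "is", "my", "name", "?"]
visible = apply_window(tokens, window=6)
print(visible)
```

With a 6-token window the name given at the start is no longer among the visible tokens, which is the "forgetting" described above.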
Why the Size Matters
A larger context window lets a model:
- Read and analyse longer documents at once.
- Hold longer conversations without forgetting.
- Fit more examples in the prompt (better few-shot learning).
A small window forces you to chunk or summarise to fit.
The Cost of Context
Bigger isn't free. Self-attention compares every token with every other token, so compute grows roughly quadratically with context length: doubling the context can quadruple the work. That means longer context costs more memory, compute, latency, and money, which is why very large windows are expensive to offer.
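The quadratic growth is easy to check with arithmetic. This rough sketch counts only the token-to-token comparisons in one full self-attention pass; it ignores everything else a model does, and the function name `attention_pairs` is hypothetical.

```python
def attention_pairs(context_length: int) -> int:
    """Count comparisons when every token is compared with every token."""
    return context_length * context_length


for n in (1_000, 2_000, 4_000):
    print(n, attention_pairs(n))

# Doubling the context length multiplies the comparison count by four.
ratio = attention_pairs(2_000) / attention_pairs(1_000)
print(ratio)
```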
Context Windows Have Grown
Early models handled only a few thousand tokens. Newer ones stretch to tens or hundreds of thousands, and some to millions. The trend has been rapid growth, but a bigger window doesn't automatically mean perfect recall (see below).
"Lost in the Middle"
Even within a large window, models often pay more attention to the start and end of a long context and less to the middle. So stuffing a huge document into the window doesn't guarantee the model uses all of it equally; placement of important information matters.
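One way to act on this is to order passages so the most important ones sit at the edges of the context. The sketch below is an illustrative heuristic rather than a method prescribed here: it assumes the passages arrive sorted from most to least relevant, and the name `reorder_for_edges` is hypothetical.

```python
def reorder_for_edges(ranked: list[str]) -> list[str]:
    """Put the most relevant items at the start and end, the least relevant in the middle.

    `ranked` is assumed to be sorted from most to least relevant.
    """
    front: list[str] = []
    back: list[str] = []
    for i, item in enumerate(ranked):
        # Alternate: even ranks fill the front, odd ranks fill the back.
        if i % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    # Reverse the back half so the second-best item ends up last.
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is most relevant, "E" least
print(reorder_for_edges(ranked))
```

Whether this helps depends on the model and the task, so it is worth measuring rather than assuming.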
Working Within the Limits
Common techniques:
- Chunking: split long documents into smaller pieces.
- RAG (retrieval-augmented generation): retrieve only the relevant chunks to put in context.
- Summarisation: compress earlier conversation to save tokens.
- Sliding window: keep only the most recent context.
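Chunking and the sliding window need no model at all, so they can be sketched directly. Both helpers below are hypothetical, and `sliding_window` counts tokens by splitting on whitespace, which is only a stand-in for a real tokenizer.

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int = 0) -> list[list[str]]:
    """Split tokens into chunks of at most `size`, repeating `overlap` tokens between chunks."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end
    return chunks


def sliding_window(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose combined token count fits the budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):
        cost = len(message.split())  # crude whitespace token count (assumption)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return kept[::-1]  # restore chronological order


chunks = chunk_tokens(list("abcdefghij"), size=4, overlap=1)
recent = sliding_window(
    ["my name is Ada", "nice to meet you", "what is my name"], budget=8
)
print(chunks)
print(recent)
```

The overlap in `chunk_tokens` is a design choice: repeating a few tokens across chunk boundaries reduces the chance that a sentence is cut in a way that loses its meaning.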
Context vs. Knowledge
Don't confuse two kinds of "memory":
- Context window = what's in the prompt right now: temporary working memory, gone after the request.
- Knowledge = what's baked into the weights during training: permanent, but fixed at the training cutoff.
Summary
- The context window is the maximum number of tokens a model handles per request, input and output together.
- Exceed it and either early tokens are dropped (the model "forgets" the start) or the request fails.
- A bigger window enables longer documents and chats, but attention compute grows roughly quadratically with length.
- Even large windows can suffer from "lost in the middle": position matters.
- Manage limits with chunking, RAG, summarisation, and sliding windows, and remember that context (temporary) differs from trained knowledge (permanent).