
Context Window

The maximum amount of text (measured in tokens) that a language model can process in a single request, including both the input prompt and the generated output.

The context window is one of the most important practical constraints of any large language model. It defines the total number of tokens — input plus output — that the model can handle in a single interaction. If you send a 10,000-token document and ask a question, and the model's context window is 16,000 tokens, you have roughly 6,000 tokens left for the response.
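The budgeting arithmetic above can be sketched in a few lines. This is an illustrative helper, not any provider's API; the function name and numbers are assumptions:

```python
# Sketch of token budgeting; all numbers are illustrative.

def remaining_output_budget(context_window: int, prompt_tokens: int) -> int:
    """Tokens left for the model's response after the prompt is counted."""
    if prompt_tokens >= context_window:
        raise ValueError("prompt alone exceeds the context window")
    return context_window - prompt_tokens

# A 10,000-token document in a 16,000-token window leaves 6,000 tokens.
print(remaining_output_budget(16_000, 10_000))  # → 6000
```

In practice you would count `prompt_tokens` with the model's own tokenizer rather than estimating, since token counts vary between models.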

Early models like GPT-3 had context windows of around 2,048 tokens (roughly 1,500 words). Modern frontier models have dramatically expanded this: GPT-4 Turbo supports 128K tokens, Claude 3 offers 200K tokens, and Gemini 1.5 Pro can handle up to 1 million tokens. Larger context windows allow you to process entire codebases, long documents, or extended conversations without losing information.
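When a document does not fit in the window, a common workaround is to split it into window-sized chunks. A minimal sketch, assuming the rough heuristic of about 4 characters per token (real code would use the model's actual tokenizer):

```python
# Split a long text into chunks that fit a token budget.
# Assumes ~4 characters per token, a rough heuristic only.

def chunk_text(text: str, max_tokens: int, chars_per_token: int = 4) -> list[str]:
    max_chars = max_tokens * chars_per_token
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

doc = "x" * 10_000                       # stand-in for a long document
chunks = chunk_text(doc, max_tokens=1_000)
print(len(chunks))  # 10,000 chars / 4,000 chars per chunk → 3 chunks
```

Chunking by characters is crude; splitting on paragraph or section boundaries usually preserves meaning better.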

However, a larger context window does not automatically mean better performance. Models can struggle with information buried in the middle of very long contexts — a phenomenon researchers call "lost in the middle." There are also cost implications: many providers charge per token, so filling a massive context window increases your API bill. The quality of attention paid to each token can degrade as the window grows.
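The cost point is easy to quantify with a back-of-the-envelope estimate. The per-million-token prices below are made-up placeholders, not any provider's actual pricing:

```python
# Rough API cost estimate; prices are illustrative placeholders only.

def request_cost(input_tokens: int, output_tokens: int,
                 usd_per_m_input: float = 3.0,
                 usd_per_m_output: float = 15.0) -> float:
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

# Filling a 128K window costs far more than a short prompt,
# even with the same output length:
print(round(request_cost(128_000, 1_000), 3))  # → 0.399
print(round(request_cost(2_000, 1_000), 3))    # → 0.021
```

At these assumed rates, the large-context request is roughly 19x more expensive per call, which compounds quickly in a chat loop that resends history every turn.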

When choosing a model, consider not just the raw context window size but how your use case interacts with it. Summarizing a single long document requires a large context window, while a chatbot handling short exchanges may not. Some applications use retrieval-augmented generation (RAG) to keep context windows manageable by only injecting the most relevant information.
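The RAG idea mentioned above can be sketched with a toy retrieval step: score stored chunks against the query and inject only the top matches into the prompt. Real systems use embedding similarity; the word-overlap scoring here is a simplified stand-in for illustration:

```python
# Toy RAG retrieval: rank chunks by word overlap with the query and
# keep only the top-k, instead of stuffing everything into the window.

def top_k_chunks(query: str, chunks: list[str], k: int = 2) -> list[str]:
    q_words = set(query.lower().split())
    return sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )[:k]

chunks = [
    "The context window limits total tokens per request.",
    "Bananas are rich in potassium.",
    "RAG retrieves relevant chunks instead of sending everything.",
]
best = top_k_chunks("what limits tokens in the context window?", chunks, k=1)
print(best[0])  # the chunk about the context window ranks highest
```

Only the retrieved chunks are placed in the prompt, so the context window stays small no matter how large the underlying corpus grows.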
