
Throughput

The rate at which an AI model generates output, typically measured in tokens per second. Higher throughput means faster response generation and the ability to serve more concurrent users.

Throughput measures how many tokens a language model can produce per unit of time, usually expressed as tokens per second (tok/s). While latency measures the delay for a single request, throughput captures the system's overall capacity — how much total work it can do. For individual users, throughput determines how quickly a long response appears; for service providers, it determines how many users can be served simultaneously.
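The distinction can be made concrete with a small measurement helper. This is a minimal sketch, not a provider's API: `token_stream` is a stand-in for any iterable that yields generated tokens one at a time, such as a streaming SDK response.

```python
import time

def measure_throughput(token_stream):
    """Return per-request throughput in tokens per second.

    token_stream is assumed to yield one token per iteration,
    e.g. a streaming response from an LLM client library.
    """
    start = time.perf_counter()
    n_tokens = 0
    for _ in token_stream:
        n_tokens += 1
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```

Note that this measures a single request; the latency of the first token and the system's total capacity across concurrent requests are separate questions.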

Individual request throughput for modern LLMs typically ranges from 20 to 150+ tokens per second, depending on model size, hardware, and optimization. Smaller models like GPT-4o-mini or Claude 3 Haiku can generate 100+ tokens per second, while larger models like GPT-4 or Claude 3 Opus are slower at 30-60 tokens per second. Since the average English word is about 1.3 tokens, 50 tok/s translates to roughly 38 words per second — faster than most people can read.
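The tokens-to-words conversion above is simple arithmetic; a sketch using the rough 1.3 tokens-per-word average for English (the constant is an approximation, not an exact figure):

```python
TOKENS_PER_WORD = 1.3  # rough average for English text; varies by tokenizer

def words_per_second(tokens_per_second):
    """Convert generation speed from tokens/s to approximate words/s."""
    return tokens_per_second / TOKENS_PER_WORD

# 50 tok/s works out to about 38 words per second
```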

System-level throughput is measured differently, often as total tokens per second across all concurrent requests. Techniques like continuous batching (dynamically grouping requests to maximize GPU utilization), speculative decoding (using a small model to draft tokens that a large model quickly verifies), and KV-cache sharing can dramatically increase system throughput without changing the underlying model. This is why the same model can feel faster or slower depending on which provider is hosting it.
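A toy simulation illustrates why continuous batching helps. In static batching, a whole batch occupies the GPU until its longest request finishes; with continuous batching, a finished request's slot is refilled immediately. The functions below count decode steps under each policy; they are an idealized sketch (uniform cost per step, no prefill, no memory limits), not a real scheduler.

```python
def static_batch_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request is done."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: refill free slots from the queue every step."""
    queue = list(lengths)
    active = []
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        active = [r - 1 for r in active if r > 1]  # drop finished requests
    return steps

# Mixed request lengths are where the gap shows up:
# static_batch_steps([10, 2, 10, 2], 2) -> 20 steps
# continuous_batch_steps([10, 2, 10, 2], 2) -> 12 steps
```

With uniform request lengths the two policies coincide; the win comes from not stalling short requests behind long ones.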

When evaluating models for production use, throughput directly impacts cost and user experience. Higher throughput means each GPU can serve more requests, reducing cost per token. For batch processing (analyzing thousands of documents, for example), throughput is often more important than latency. Many providers now offer "batch" API endpoints at reduced prices, explicitly trading latency for throughput by processing requests during off-peak hours.
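The cost relationship can be sketched directly: amortize a GPU's hourly price over the tokens it produces. The figures in the example are illustrative assumptions, not real provider pricing.

```python
def cost_per_million_tokens(gpu_cost_per_hour, system_tokens_per_second):
    """Serving cost per million output tokens, given system-level throughput.

    Both inputs are assumptions for illustration: GPU rental price in
    dollars per hour, and aggregate tokens/s across all concurrent requests.
    """
    tokens_per_hour = system_tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# A hypothetical $2/hour GPU sustaining 2,000 tok/s system-wide
# comes out to roughly $0.28 per million tokens; doubling throughput
# on the same hardware halves that cost.
```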
