GPTCrunch

Latency

The time delay between sending a request to an AI model and receiving the response. Low latency is critical for real-time applications like chatbots and coding assistants.

Latency in AI systems refers to the time it takes from when a user sends a prompt to when they receive a response. For language models, latency has two important components: time-to-first-token (TTFT) — how quickly the model starts responding — and total generation time, which depends on how many tokens the response contains and the model's throughput (tokens per second).
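This split can be expressed as a simple back-of-the-envelope model. A minimal sketch (the function name and numbers are illustrative, not from any provider):

```python
def estimate_latency(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Estimate total response time: time-to-first-token plus generation time.

    Generation time is the number of output tokens divided by the model's
    decoding throughput (tokens per second).
    """
    return ttft_s + output_tokens / tokens_per_s

# Example: 0.5 s TTFT, a 300-token answer, 60 tokens/s throughput
total = estimate_latency(0.5, 300, 60.0)  # 0.5 + 5.0 = 5.5 seconds
```

The takeaway: for long responses, throughput dominates total latency, while for short responses TTFT dominates.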

Time-to-first-token is especially important for interactive applications. When a user is chatting with an AI assistant, the perceived responsiveness depends on how quickly the first words appear. TTFT includes network latency, queue wait time at the provider, and the model's prefill computation. Frontier models like GPT-4 and Claude typically have TTFT ranging from 0.3 to 2 seconds depending on prompt length and server load. Smaller models and optimized deployments can achieve sub-100ms TTFT.
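Because TTFT varies with prompt length and server load, it is best measured rather than assumed. One way to measure it is to time the arrival of the first chunk from a streaming response. A sketch, where `stream` stands in for any iterator that yields tokens as a real streaming API client delivers them (the client itself is assumed, not shown):

```python
import time

def measure_ttft(stream):
    """Measure time-to-first-token from a streaming token iterator.

    `stream` is any iterator yielding tokens as the provider sends them
    (a hypothetical stand-in for a real streaming API client).
    Returns (ttft_seconds, full_text).
    """
    start = time.monotonic()
    ttft = None
    parts = []
    for token in stream:
        if ttft is None:  # first token has arrived
            ttft = time.monotonic() - start
        parts.append(token)
    return ttft, "".join(parts)

# Usage with a fake stream standing in for a live API call:
fake_stream = iter(["Hello", ", ", "world"])
ttft, text = measure_ttft(fake_stream)
```

In practice you would run this against the real endpoint many times and look at percentiles (p50, p95), since queueing makes TTFT vary from request to request.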

Several factors affect latency. Model size is the biggest factor: larger models require more computation per token. Prompt length matters because the prefill phase scales with input size. Geographic distance to the API server adds network latency. Server load and batching strategies at the provider also play a role — during peak hours, requests may queue. Some providers offer different latency tiers or dedicated endpoints for latency-sensitive applications.
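These factors can be folded into a rough TTFT decomposition. The sketch below is illustrative only (the function and the linear prefill assumption are mine, not a provider's model); real deployments batch and overlap work, so measured TTFT is what matters in practice:

```python
def estimate_ttft(prompt_tokens: int, prefill_tokens_per_s: float,
                  network_rtt_s: float, queue_s: float = 0.0) -> float:
    """Rough TTFT model: network round trip + queue wait + prefill compute.

    Assumes prefill time grows linearly with prompt length, which is a
    simplification of real attention costs.
    """
    return network_rtt_s + queue_s + prompt_tokens / prefill_tokens_per_s

# Example: 2,000-token prompt, 4,000 tokens/s prefill, 50 ms network RTT,
# 100 ms queue wait -> 0.05 + 0.10 + 0.50 = 0.65 s before the first token
ttft = estimate_ttft(2000, 4000.0, 0.05, queue_s=0.1)
```

The decomposition makes the levers explicit: shorten the prompt, move closer to the server, or pay for a lower-contention tier.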

For developers building AI-powered products, latency requirements should guide model selection. A customer-facing chatbot might need TTFT under 500ms, making smaller or optimized models preferable. A background document processing pipeline can tolerate higher latency in exchange for better accuracy from a larger model. Streaming responses (displaying tokens as they are generated) significantly improves perceived responsiveness even when total generation time is high. When comparing models on GPTCrunch, look at both benchmark scores and provider-reported latency metrics to find the right balance for your application.
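The effect of streaming on perceived responsiveness can be made concrete: with streaming, the user sees output after roughly the TTFT; without it, they wait for the entire generation. A minimal sketch (function name and figures are illustrative):

```python
def perceived_first_output(ttft_s: float, output_tokens: int,
                           tokens_per_s: float, streaming: bool) -> float:
    """Time until the user sees *any* output.

    With streaming, the first token is visible as soon as it is generated;
    without it, the user waits for the whole response to finish.
    """
    if streaming:
        return ttft_s
    return ttft_s + output_tokens / tokens_per_s

# A 500-token answer at 50 tokens/s with 0.8 s TTFT:
# streaming shows text in 0.8 s; non-streaming makes the user wait 10.8 s.
```

The total generation time is identical in both cases; streaming only changes when the user starts reading, which is usually what responsiveness complaints are actually about.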
