LLM API Latency Compared: Speed Benchmarks 2026
Speed matters. A 200ms difference in API response time can mean the difference between a fluid chat experience and a frustrating one. Here's how every major LLM provider compares on real-world latency — and how speed intersects with cost.
Understanding LLM Latency
API latency has three components:
- Time to First Token (TTFT) — how long until the model starts streaming. This is what users perceive as "wait time."
- Inter-Token Latency — the gap between consecutive tokens once streaming starts. Usually reported as its inverse, throughput in tokens per second (tok/s).
- Total Response Time — end-to-end time for a complete response. Depends on output length.
For chat applications, TTFT is the most important metric — users judge responsiveness by how quickly the first word appears.
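These components combine in a simple back-of-envelope model: total response time is roughly TTFT plus output length divided by throughput. A quick sketch (the numbers are illustrative, not benchmarks):

```python
def total_response_time(ttft_s: float, output_tokens: int, tok_per_s: float) -> float:
    """Rough end-to-end estimate: wait for the first token, then stream the rest."""
    return ttft_s + output_tokens / tok_per_s

# Illustrative: 300ms TTFT, a 400-token answer, 100 tok/s throughput
print(total_response_time(0.3, 400, 100))  # ~4.3 seconds end to end
```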
Time to First Token (TTFT) Benchmarks
Measured on a standard 100-token input prompt, US East region, streaming enabled.
Fastest TTFT: Llama 3.1 8B on Together.ai (~150ms) and Gemini 2.0 Flash (~180ms). Budget models consistently win on speed because they have fewer parameters to process.
Output Speed (Tokens per Second)
How fast each model generates tokens after the first one.
Fastest output: Llama 3.1 8B (~150 tok/s) and DeepSeek V4 Flash (~130 tok/s). Open models on optimized infrastructure consistently outperform closed APIs on raw speed.
The Speed vs. Price Tradeoff
Intuition says speed should cost extra. For the current model lineup, the numbers tell a different story:
Key insight: The cheapest models are also the fastest. Budget models have fewer parameters, so they process and generate tokens faster. Premium models trade speed for reasoning quality.
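To see the tradeoff in dollars, here's a small sketch that turns the per-1M-token prices quoted in this article into a cost per request; the 500-in/300-out token counts are hypothetical:

```python
# Price per 1M input/output tokens (from the pricing quoted in this article)
PRICES = {
    "Gemini 2.0 Flash": (0.10, 0.40),
    "GPT-4o mini": (0.15, 0.60),
    "Llama 3.1 8B": (0.18, 0.18),
}

def cost_per_request(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Hypothetical chat turn: 500 tokens in, 300 tokens out
for model in PRICES:
    print(f"{model}: ${cost_per_request(model, 500, 300):.6f}")
```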
Latency by Use Case
Different applications have different speed requirements:
Real-Time Chatbots (TTFT < 500ms needed)
Users expect near-instant responses. Recommended models:
- Gemini 2.0 Flash — 180ms TTFT, $0.10/$0.40 per 1M tokens
- GPT-4o mini — 220ms TTFT, $0.15/$0.60 per 1M tokens
- Mistral Small 4 — 200ms TTFT, $0.10/$0.30 per 1M tokens
Code Generation (TTFT < 1s acceptable)
Developers tolerate longer waits for better code. Recommended models:
- Claude Sonnet 4 — 450ms TTFT, excellent code quality
- GPT-4o — 350ms TTFT, strong all-around
- Gemini 2.5 Pro — 500ms TTFT, huge context window
Background Processing (Speed Less Critical)
Batch jobs, ETL pipelines, scheduled tasks — TTFT doesn't matter. Optimize for cost:
- Gemini 2.0 Flash — $0.10/$0.40 per 1M tokens
- Llama 3.1 8B — $0.18/$0.18 per 1M tokens
- DeepSeek V4 Flash — $0.14/$0.28 per 1M tokens
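For jobs like these, one option is a provider's batch endpoint, which trades latency for a discount. A minimal sketch against OpenAI's Batch API, assuming a prepared requests.jsonl file where each line is one chat completion request:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the .jsonl file of requests
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

# Submit the batch; results arrive within the completion window, not in real time
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```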
How to Measure Your Own Latency
Published benchmarks are useful, but your mileage will vary. Here's how to measure (a runnable sketch follows the list):
- Measure from your server, not the browser — network latency adds noise
- Use streaming mode — TTFT is only meaningful with streaming
- Sample 100+ requests — latency varies by time of day and load
- Test with your actual prompt length — longer inputs increase TTFT
- Track p50, p95, and p99 — average latency hides outliers
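Putting those rules together, here's a minimal sketch using the OpenAI Python SDK with streaming; the model name, prompt, and sample size are assumptions, and the same pattern applies to any streaming chat API:

```python
import statistics
import time

from openai import OpenAI

client = OpenAI()  # run this from your server, not the browser

def measure_ttft(model: str = "gpt-4o-mini", n: int = 100) -> None:
    ttfts = []
    for _ in range(n):  # sample many requests; latency varies with time of day and load
        start = time.perf_counter()
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Say hello."}],  # use your real prompt
            max_tokens=64,
            stream=True,  # TTFT is only meaningful with streaming
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                ttfts.append(time.perf_counter() - start)  # time to first token
                break
    q = statistics.quantiles(ttfts, n=100)  # percentiles; averages hide outliers
    print(f"p50={q[49]:.3f}s p95={q[94]:.3f}s p99={q[98]:.3f}s")

measure_ttft()
```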
Optimizing for Speed
Reducing latency without switching models (a combined sketch follows the list):
- Use the nearest region — OpenAI (US), Anthropic (US), Google (multi-region)
- Keep prompts short — shorter input = faster TTFT
- Set max_tokens — prevents overly long responses
- Stream responses — users see output immediately instead of waiting for completion
- Use prompt caching — Anthropic and OpenAI cache repeated prefixes, reducing latency for follow-up calls
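Several of these tips combine in a single call. A minimal sketch with the Anthropic Python SDK, using streaming, a max_tokens cap, and a cache_control marker on a repeated system prefix; the model id and prompts are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Mark the long, repeated system prompt as cacheable so follow-up calls
# skip re-processing it (Anthropic prompt caching; OpenAI caches prefixes automatically)
with client.messages.stream(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=256,  # cap output length so responses can't run long
    system=[
        {
            "type": "text",
            "text": "You are a helpful support assistant. <long instructions here>",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # stream so users see output immediately
```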
The Bottom Line
For most applications, Gemini 2.0 Flash or GPT-4o mini offer the best speed-to-cost ratio. They're fast enough for real-time chat, cheap enough for high volume, and capable enough for most tasks.
Reserve premium models (Claude Sonnet 4, GPT-4o) for tasks where quality justifies the speed and cost tradeoff. And for background processing, always use the cheapest model — speed doesn't matter when no one is waiting.
Calculate your API cost at any speed tier.
Try the APIpulse Calculator.