LLM API Latency Compared: Speed Benchmarks 2026
Speed matters. A 200ms difference in API response time can mean the difference between a fluid chat experience and a frustrating one. Here's how every major LLM provider compares on real-world latency — and how speed intersects with cost.
Understanding LLM Latency
API latency has three components:
- Time to First Token (TTFT) — how long until the model starts streaming. This is what users perceive as "wait time."
- Inter-Token Latency — the gap between consecutive tokens once streaming starts. Usually reported as its inverse, throughput in tokens per second (tok/s).
- Total Response Time — end-to-end time for a complete response. Depends on output length.
For chat applications, TTFT is the most important metric — users judge responsiveness by how quickly the first word appears.
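These components combine in a simple back-of-envelope model: total response time is roughly TTFT plus output length divided by throughput. A quick sketch (the numbers are illustrative, not benchmarks):

```python
def total_response_time(ttft_s: float, output_tokens: int, tok_per_s: float) -> float:
    """Rough end-to-end estimate: wait for the first token, then stream the rest."""
    return ttft_s + output_tokens / tok_per_s

# Illustrative: 300ms TTFT, a 400-token answer, 100 tok/s throughput
print(total_response_time(0.3, 400, 100))  # ~4.3 seconds end to end
```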
Time to First Token (TTFT) Benchmarks
Measured on a standard 100-token input prompt, US East region, streaming enabled.
Fastest TTFT: Llama 3.1 8B on Together.ai (~150ms) and Gemini 2.0 Flash (~180ms). Budget models consistently win on speed because they have fewer parameters to process.
Output Speed (Tokens per Second)
How fast each model generates tokens after the first one.
Fastest output: Llama 3.1 8B (~150 tok/s) and DeepSeek V4 Flash (~130 tok/s). Open models on optimized infrastructure consistently outperform closed APIs on raw speed.
The Speed vs. Price Tradeoff
Intuition says speed should cost extra. For the current model lineup, the numbers tell a different story:
Key insight: The cheapest models are also the fastest. Budget models have fewer parameters, so they process and generate tokens faster. Premium models trade speed for reasoning quality.
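To see the tradeoff in dollars, here's a small sketch that turns the per-1M-token prices quoted in this article into a cost per request; the 500-in/300-out token counts are hypothetical:

```python
# Price per 1M input/output tokens (from the pricing quoted in this article)
PRICES = {
    "Gemini 2.0 Flash": (0.10, 0.40),
    "GPT-4o mini": (0.15, 0.60),
    "Llama 3.1 8B": (0.18, 0.18),
}

def cost_per_request(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Hypothetical chat turn: 500 tokens in, 300 tokens out
for model in PRICES:
    print(f"{model}: ${cost_per_request(model, 500, 300):.6f}")
```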
Latency by Use Case
Different applications have different speed requirements:
Real-Time Chatbots (TTFT < 500ms needed)
Users expect near-instant responses. Recommended models:
- Gemini 2.0 Flash — 180ms TTFT, $0.10/$0.40 per 1M tokens
- GPT-4o mini — 220ms TTFT, $0.15/$0.60 per 1M tokens
- Mistral Small 4 — 200ms TTFT, $0.10/$0.30 per 1M tokens
Code Generation (TTFT < 1s acceptable)
Developers tolerate longer waits for better code. Recommended models:
- Claude Sonnet 4 — 450ms TTFT, excellent code quality
- GPT-4o — 350ms TTFT, strong all-around
- Gemini 2.5 Pro — 500ms TTFT, huge context window
Background Processing (Speed Less Critical)
Batch jobs, ETL pipelines, scheduled tasks — TTFT doesn't matter. Optimize for cost:
- Gemini 2.0 Flash — $0.10/$0.40 per 1M tokens
- Llama 3.1 8B — $0.18/$0.18 per 1M tokens
- DeepSeek V4 Flash — $0.14/$0.28 per 1M tokens
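For jobs like these, one option is a provider's batch endpoint, which trades latency for a discount. A minimal sketch against OpenAI's Batch API, assuming a prepared requests.jsonl file where each line is one chat completion request:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the .jsonl file of requests
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

# Submit the batch; results arrive within the completion window, not in real time
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```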
How to Measure Your Own Latency
Published benchmarks are useful, but your mileage will vary. Here's how to measure (a runnable sketch follows the list):
- Measure from your server, not the browser — network latency adds noise
- Use streaming mode — TTFT is only meaningful with streaming
- Sample 100+ requests — latency varies by time of day and load
- Test with your actual prompt length — longer inputs increase TTFT
- Track p50, p95, and p99 — average latency hides outliers
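Putting those rules together, here's a minimal sketch using the OpenAI Python SDK with streaming; the model name, prompt, and sample size are assumptions, and the same pattern applies to any streaming chat API:

```python
import statistics
import time

from openai import OpenAI

client = OpenAI()  # run this from your server, not the browser

def measure_ttft(model: str = "gpt-4o-mini", n: int = 100) -> None:
    ttfts = []
    for _ in range(n):  # sample many requests; latency varies with time of day and load
        start = time.perf_counter()
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Say hello."}],  # use your real prompt
            max_tokens=64,
            stream=True,  # TTFT is only meaningful with streaming
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                ttfts.append(time.perf_counter() - start)  # time to first token
                break
    q = statistics.quantiles(ttfts, n=100)  # percentiles; averages hide outliers
    print(f"p50={q[49]:.3f}s p95={q[94]:.3f}s p99={q[98]:.3f}s")

measure_ttft()
```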
Optimizing for Speed
Reducing latency without switching models (a combined sketch follows the list):
- Use the nearest region — OpenAI (US), Anthropic (US), Google (multi-region)
- Keep prompts short — shorter input = faster TTFT
- Set max_tokens — prevents overly long responses
- Stream responses — users see output immediately instead of waiting for completion
- Use prompt caching — Anthropic and OpenAI cache repeated prefixes, reducing latency for follow-up calls
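Several of these tips combine in a single call. A minimal sketch with the Anthropic Python SDK, using streaming, a max_tokens cap, and a cache_control marker on a repeated system prefix; the model id and prompts are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Mark the long, repeated system prompt as cacheable so follow-up calls
# skip re-processing it (Anthropic prompt caching; OpenAI caches prefixes automatically)
with client.messages.stream(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=256,  # cap output length so responses can't run long
    system=[
        {
            "type": "text",
            "text": "You are a helpful support assistant. <long instructions here>",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # stream so users see output immediately
```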
The Bottom Line
For most applications, Gemini 2.0 Flash or GPT-4o mini offer the best speed-to-cost ratio. They're fast enough for real-time chat, cheap enough for high volume, and capable enough for most tasks.
Reserve premium models (Claude Sonnet 4, GPT-4o) for tasks where quality justifies the speed and cost tradeoff. And for background processing, always use the cheapest model — speed doesn't matter when no one is waiting.
Calculate your API cost at any speed tier.
Try the APIpulse Calculator.