AI API Rate Limits Compared: OpenAI vs Anthropic vs Google (2026)
A comprehensive comparison of rate limits across the major LLM API providers — plus practical strategies for handling throttling in production.
Rate limits are the silent killer of AI applications. Your code works perfectly in development, then hits production traffic and starts throwing 429 errors. Understanding each provider's rate limits — and building proper handling — is essential for any production AI application.
In this guide, we compare rate limits across OpenAI, Anthropic, Google, Mistral, and Cohere. We'll cover RPM (requests per minute), TPM (tokens per minute), and daily limits, plus practical code patterns for handling rate limiting gracefully.
Rate Limits by Provider
Rate limits vary significantly between providers and between tiers within each provider. Here's the current landscape as of April 2026.
OpenAI Rate Limits
OpenAI uses a tiered system based on your spending history. New accounts start at Tier 1 and progress to higher tiers as usage increases.
| Tier | RPM | TPM | RPD | Monthly Spend Threshold |
|---|---|---|---|---|
| Tier 1 | 60 | 40,000 | 500 | $0–$50 |
| Tier 2 | 500 | 80,000 | 5,000 | $50–$100 |
| Tier 3 | 5,000 | 300,000 | 10,000 | $100–$500 |
| Tier 4 | 10,000 | 1,000,000 | 30,000 | $500–$5,000 |
| Tier 5 | 10,000 | 10,000,000 | No limit | $5,000+ |
Anthropic Rate Limits
Anthropic also uses a tiered system, but with different thresholds and limits. Their limits are generally more generous for smaller accounts.
| Tier | RPM | TPM | Monthly Spend Threshold |
|---|---|---|---|
| Free | 5 | 25,000 | $0 |
| Tier 1 | 50 | 50,000 | $5+ |
| Tier 2 | 1,000 | 100,000 | $40+ |
| Tier 3 | 2,000 | 200,000 | $200+ |
| Tier 4 | 4,000 | 400,000 | $500+ |
Anthropic also offers prompt caching which can reduce both costs and effective rate limit usage by caching repeated system prompts and context.
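As a rough sketch, here's what that looks like with the @anthropic-ai/sdk client. The model name and system prompt are placeholders, and exact caching details (such as the minimum cacheable prompt size) depend on the model and SDK version:

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// A large, reused block of context, e.g. a policy document or long instructions.
const SHARED_SYSTEM_PROMPT = 'You are a support assistant for Acme Corp. [long policy text here]';

const response = await client.messages.create({
  model: 'claude-sonnet-4-5', // placeholder model name
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: SHARED_SYSTEM_PROMPT,
      // Mark the repeated block as cacheable; cache hits are cheaper and
      // reduce effective rate limit consumption on subsequent requests.
      cache_control: { type: 'ephemeral' }
    }
  ],
  messages: [{ role: 'user', content: 'What is the refund policy?' }]
});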
Google Gemini Rate Limits
Google's approach is different: base limits are higher (especially TPM), but they vary by model rather than by spend tier. The free tier is broadly accessible, though with much lower request limits than pay-as-you-go.
| Tier | Model | RPM | TPM | RPD |
|---|---|---|---|---|
| Free | Gemini 2.0 Flash | 15 | 1,000,000 | 1,500 |
| Free | Gemini 2.5 Pro | 5 | 250,000 | 100 |
| Pay-as-you-go | Gemini 2.0 Flash | 2,000 | 4,000,000 | No limit |
| Pay-as-you-go | Gemini 2.5 Pro | 1,000 | 4,000,000 | No limit |
Mistral Rate Limits
| Tier | RPM | TPM | Notes |
|---|---|---|---|
| Free | 30 | 50,000 | Limited to Mistral Small 4 |
| Paid | 500 | 500,000 | All models |
| Enterprise | Custom | Custom | Contact sales |
Cohere Rate Limits
| Tier | RPM | TPM | Notes |
|---|---|---|---|
| Trial | 100 | 100,000 | Command R only |
| Production | 1,000 | 1,000,000 | All models |
| Enterprise | Custom | Custom | SLA guarantees |
Side-by-Side Comparison
Here's a quick reference comparing paid-tier rate limits across all providers:
| Provider | Paid RPM | Paid TPM | Free Tier? | Batch API? |
|---|---|---|---|---|
| OpenAI | 5,000–10,000 | 300K–10M | No | Yes (50% off) |
| Anthropic | 1,000–4,000 | 100K–400K | Yes (limited) | Yes (50% off) |
| Google | 1,000–2,000 | 4M | Yes (generous) | No |
| Mistral | 500 | 500K | Yes (limited) | No |
| Cohere | 1,000 | 1M | Yes (trial) | No |
How to Handle Rate Limiting in Production
Building robust rate limit handling is essential. Here are the key patterns every production AI application should implement.
1. Exponential Backoff with Jitter
The most important pattern. When you receive a 429 response, wait progressively longer before retrying:
async function callWithRetry(fn, maxRetries = 3) {
for (let i = 0; i < maxRetries; i++) {
try {
return await fn();
} catch (error) {
if (error.status !== 429 || i === maxRetries - 1) throw error;
// Exponential backoff with jitter
const baseDelay = Math.pow(2, i) * 1000;
const jitter = Math.random() * 1000;
const delay = baseDelay + jitter;
console.log(`Rate limited. Retrying in ${Math.round(delay)}ms...`);
await new Promise(r => setTimeout(r, delay));
}
}
}
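Usage is the same no matter which SDK you call: wrap the request in a closure and pass it in. A minimal sketch with the official openai client (the model name is illustrative):

import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const completion = await callWithRetry(() =>
  openai.chat.completions.create({
    model: 'gpt-4.1-mini', // illustrative model name
    messages: [{ role: 'user', content: 'Hello!' }]
  })
);
console.log(completion.choices[0].message.content);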
2. Respect the Retry-After Header
Most providers include a Retry-After header in 429 responses. Always check it:
// Inside your retry loop ("attempt" is the current retry count)
if (error.status === 429) {
  const retryAfter = error.headers?.['retry-after'];
  const delay = retryAfter
    ? parseInt(retryAfter, 10) * 1000   // Retry-After is given in seconds
    : Math.pow(2, attempt) * 1000;      // fall back to exponential backoff
  await new Promise(r => setTimeout(r, delay));
}
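You can also throttle proactively: on successful responses, OpenAI returns quota headers such as x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens, and other providers expose similar headers under their own names. A raw-fetch sketch (model name illustrative):

const res = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'gpt-4.1-mini', // illustrative model name
    messages: [{ role: 'user', content: 'ping' }]
  })
});

// If these get close to zero, slow down before the provider starts returning 429s.
console.log('Requests remaining:', res.headers.get('x-ratelimit-remaining-requests'));
console.log('Tokens remaining:', res.headers.get('x-ratelimit-remaining-tokens'));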
3. Implement a Token Bucket
For high-throughput applications, implement a token bucket to proactively limit requests:
class TokenBucket {
constructor(maxTokens, refillRate) {
this.tokens = maxTokens;
this.maxTokens = maxTokens;
this.refillRate = refillRate; // tokens per second
this.lastRefill = Date.now();
}
async acquire() {
this.refill();
if (this.tokens < 1) {
const waitTime = ((1 - this.tokens) / this.refillRate) * 1000;
await new Promise(r => setTimeout(r, waitTime));
this.refill();
}
this.tokens -= 1;
}
refill() {
const now = Date.now();
const elapsed = (now - this.lastRefill) / 1000;
this.tokens = Math.min(this.maxTokens, this.tokens + elapsed * this.refillRate);
this.lastRefill = now;
}
}
// Usage: 60 RPM = 1 request per second
const bucket = new TokenBucket(60, 1);
async function makeRequest() {
await bucket.acquire();
return callAPI();
}
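Note that most providers track limits per model (and per organization), so in practice you would keep one bucket per model you call, sized from the tables above.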
4. Multi-Provider Fallback
The most resilient approach: fall back to a different provider when one is rate-limited:
const providers = [
{ name: 'OpenAI', call: callOpenAI },
{ name: 'Anthropic', call: callAnthropic },
{ name: 'Google', call: callGoogle }
];
async function callWithFallback(prompt) {
for (const provider of providers) {
try {
return await callWithRetry(() => provider.call(prompt));
} catch (error) {
if (error.status === 429) {
console.log(`${provider.name} rate limited, trying next...`);
continue;
}
throw error;
}
}
throw new Error('All providers rate limited');
}
5. Queue-Based Architecture
For maximum control, use a message queue to manage request rates:
- Redis + Bull: Use a job queue with configurable concurrency
- SQS + Lambda: AWS-native approach with built-in retry
- Cloud Tasks: Google Cloud's managed queue with rate limiting
This approach decouples your API endpoints from the LLM calls, allowing you to smooth out traffic spikes and respect rate limits consistently.
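A minimal sketch of the Redis + Bull option, assuming a local Redis instance and reusing the callWithRetry helper and callAPI placeholder from the earlier examples (queue name and rate numbers are illustrative):

import Queue from 'bull';

// One queue per provider/model, capped at 60 jobs per minute (60 RPM).
const llmQueue = new Queue('llm-requests', 'redis://127.0.0.1:6379', {
  limiter: { max: 60, duration: 60000 }
});

// Worker: Bull only hands out jobs as fast as the limiter allows.
llmQueue.process(async (job) => {
  return callWithRetry(() => callAPI(job.data.prompt));
});

// Producer: API endpoints enqueue work and return immediately.
async function enqueuePrompt(prompt) {
  const job = await llmQueue.add(
    { prompt },
    { attempts: 3, backoff: { type: 'exponential', delay: 2000 } }
  );
  return job.id;
}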
Provider-Specific Optimization Tips
OpenAI
- Use the Batch API for non-real-time workloads (50% cost savings, separate rate limits; see the sketch after this list)
- Upgrade tiers by increasing monthly spend — Tier 5 unlocks 10M TPM
- Use streaming to reduce perceived latency without increasing rate limit usage
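For the Batch API, the flow is: write your requests as a JSONL file, upload it, then create a batch job that completes within 24 hours. A rough sketch with the openai SDK (file name, prompts, and model are illustrative):

import fs from 'fs';
import OpenAI from 'openai';

const client = new OpenAI();
const prompts = ['Summarize document A', 'Summarize document B'];

// 1. One JSON request per line, each with a unique custom_id for matching results later.
const lines = prompts.map((prompt, i) => JSON.stringify({
  custom_id: `req-${i}`,
  method: 'POST',
  url: '/v1/chat/completions',
  body: { model: 'gpt-4.1-mini', messages: [{ role: 'user', content: prompt }] }
}));
fs.writeFileSync('batch.jsonl', lines.join('\n'));

// 2. Upload the JSONL file, then 3. submit the batch job.
const file = await client.files.create({
  file: fs.createReadStream('batch.jsonl'),
  purpose: 'batch'
});
const batch = await client.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h'
});
console.log('Batch submitted:', batch.id); // poll client.batches.retrieve(batch.id) for status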
Anthropic
- Enable prompt caching to reduce effective token usage and rate limit consumption
- Use message batching for bulk processing
- Tier upgrades are based on spending history — consistent usage unlocks higher limits
Google
- The free tier is generous enough for many production workloads
- Gemini 2.0 Flash has higher free-tier limits than Pro — use it for high-volume tasks
- Context caching can reduce token usage for repeated prompts
How to Choose Based on Rate Limits
| Use Case | Recommended Provider | Why |
|---|---|---|
| High-volume chatbot | Google Gemini Flash | 4M TPM paid, generous free tier |
| Enterprise with SLA needs | OpenAI Tier 5 | 10M TPM, 10K RPM, no RPD limit |
| Budget startup | Google Gemini Flash (free) | 1M TPM at zero cost |
| Batch processing | OpenAI or Anthropic | Batch API with separate limits, 50% off |
| RAG pipelines | Cohere | 1M TPM, optimized for retrieval |
Calculate Your API Costs
Use the APIpulse calculator to estimate your monthly spend across all providers, accounting for your expected request volume.
Try the Calculator