AI API Rate Limits Compared: OpenAI vs Anthropic vs Google (2026)

A comprehensive comparison of rate limits across every major LLM API provider — plus practical strategies for handling throttling in production.

Rate limits are the silent killer of AI applications. Your code works perfectly in development, then hits production traffic and starts throwing 429 errors. Understanding each provider's rate limits — and building proper handling — is essential for any production AI application.

In this guide, we compare rate limits across OpenAI, Anthropic, Google, Mistral, and Cohere. We'll cover RPM (requests per minute), TPM (tokens per minute), and RPD (requests per day) limits, plus practical code patterns for handling rate limiting gracefully.

Rate Limits by Provider

Rate limits vary significantly between providers and between tiers within each provider. Here's the current landscape as of April 2026.

OpenAI Rate Limits

OpenAI uses a tiered system based on your spending history. New accounts start at Tier 1 and progress to higher tiers as usage increases.

| Tier | RPM | TPM | RPD | Monthly Spend Threshold |
|------|-----|-----|-----|-------------------------|
| Tier 1 | 60 | 40,000 | 500 | $0–$50 |
| Tier 2 | 500 | 80,000 | 5,000 | $50–$100 |
| Tier 3 | 5,000 | 300,000 | 10,000 | $100–$500 |
| Tier 4 | 10,000 | 1,000,000 | 30,000 | $500–$5,000 |
| Tier 5 | 10,000 | 10,000,000 | No limit | $5,000+ |

Tip: OpenAI's Batch API offers 50% lower costs and separate rate limits — ideal for non-real-time workloads like classification or summarization.
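
If your workload fits that pattern, a batch submission looks roughly like this with the openai Node SDK; the file name, model, and JSONL contents are placeholders, not recommendations:

import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI();

// requests.jsonl holds one request per line, for example:
// {"custom_id": "doc-1", "method": "POST", "url": "/v1/chat/completions",
//  "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Summarize: ..."}]}}
const file = await openai.files.create({
    file: fs.createReadStream("requests.jsonl"),
    purpose: "batch"
});

const batch = await openai.batches.create({
    input_file_id: file.id,
    endpoint: "/v1/chat/completions",
    completion_window: "24h"   // results arrive asynchronously, within 24 hours
});

console.log(batch.id, batch.status);   // poll the batch, then download its output file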

Anthropic Rate Limits

Anthropic also uses a tiered system, but with different thresholds and limits. Their limits are generally more generous for smaller accounts.

| Tier | RPM | TPM | Monthly Spend Threshold |
|------|-----|-----|-------------------------|
| Free | 5 | 25,000 | $0 |
| Tier 1 | 50 | 50,000 | $5+ |
| Tier 2 | 1,000 | 100,000 | $40+ |
| Tier 3 | 2,000 | 200,000 | $200+ |
| Tier 4 | 4,000 | 400,000 | $500+ |

Anthropic also offers prompt caching which can reduce both costs and effective rate limit usage by caching repeated system prompts and context.
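
A minimal sketch of what that looks like with the @anthropic-ai/sdk package; the model name and prompt contents are placeholders:

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Large, frequently repeated prefixes are the best caching candidates
const LONG_SYSTEM_PROMPT = "You are a support agent for Acme Corp. Policies: ...";

const response = await anthropic.messages.create({
    model: "claude-sonnet-4-5",   // placeholder model name
    max_tokens: 1024,
    system: [
        {
            type: "text",
            text: LONG_SYSTEM_PROMPT,
            cache_control: { type: "ephemeral" }   // mark this block for caching
        }
    ],
    messages: [{ role: "user", content: "What is the refund policy?" }]
});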

Google Gemini Rate Limits

Google's approach is different — they offer higher base limits but with per-model variations. Free tier access is generous but comes with lower limits.

| Tier | Model | RPM | TPM | RPD |
|------|-------|-----|-----|-----|
| Free | Gemini 2.0 Flash | 15 | 1,000,000 | 1,500 |
| Free | Gemini 2.5 Pro | 5 | 250,000 | 100 |
| Pay-as-you-go | Gemini 2.0 Flash | 2,000 | 4,000,000 | No limit |
| Pay-as-you-go | Gemini 2.5 Pro | 1,000 | 4,000,000 | No limit |

Tip: Google's free tier is the most generous for low-volume applications. Gemini 2.0 Flash offers 1M TPM for free — enough for many production chatbots.

Mistral Rate Limits

| Tier | RPM | TPM | Notes |
|------|-----|-----|-------|
| Free | 30 | 50,000 | Limited to Mistral Small 4 |
| Paid | 500 | 500,000 | All models |
| Enterprise | Custom | Custom | Contact sales |

Cohere Rate Limits

| Tier | RPM | TPM | Notes |
|------|-----|-----|-------|
| Trial | 100 | 100,000 | Command R only |
| Production | 1,000 | 1,000,000 | All models |
| Enterprise | Custom | Custom | SLA guarantees |

Side-by-Side Comparison

Here's a quick reference comparing paid-tier rate limits across all providers:

| Provider | Paid RPM | Paid TPM | Free Tier? | Batch API? |
|----------|----------|----------|------------|------------|
| OpenAI | 5,000–10,000 | 300K–10M | No | Yes (50% off) |
| Anthropic | 1,000–4,000 | 100K–400K | Yes (limited) | Yes (50% off) |
| Google | 1,000–2,000 | 4M | Yes (generous) | No |
| Mistral | 500 | 500K | Yes (limited) | No |
| Cohere | 1,000 | 1M | Yes (trial) | No |

Warning: Rate limits change frequently. Always check the provider's official documentation for current limits. These numbers are accurate as of April 2026 but may have changed since publication.

How to Handle Rate Limiting in Production

Building robust rate limit handling is essential. Here are the key patterns every production AI application should implement.

1. Exponential Backoff with Jitter

The most important pattern. When you receive a 429 response, wait progressively longer before retrying:

async function callWithRetry(fn, maxRetries = 3) {
    for (let i = 0; i < maxRetries; i++) {
        try {
            return await fn();
        } catch (error) {
            if (error.status !== 429 || i === maxRetries - 1) throw error;

            // Exponential backoff with jitter
            const baseDelay = Math.pow(2, i) * 1000;
            const jitter = Math.random() * 1000;
            const delay = baseDelay + jitter;

            console.log(`Rate limited. Retrying in ${Math.round(delay)}ms...`);
            await new Promise(r => setTimeout(r, delay));
        }
    }
}
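
For example, you might wrap any SDK call in this helper (the openai client, model name, and messages array here are placeholders):

const completion = await callWithRetry(() =>
    openai.chat.completions.create({ model: "gpt-4o-mini", messages })
);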

2. Respect the Retry-After Header

Most providers include a Retry-After header in 429 responses. Always check it:

// Inside your retry loop: `attempt` is the current retry count (see pattern 1)
if (error.status === 429) {
    const retryAfter = error.headers?.['retry-after'];
    const delay = retryAfter
        ? parseInt(retryAfter, 10) * 1000   // the server told us exactly how long to wait
        : Math.pow(2, attempt) * 1000;      // otherwise fall back to exponential backoff
    await new Promise(r => setTimeout(r, delay));
}

3. Implement a Token Bucket

For high-throughput applications, implement a token bucket to proactively limit requests:

class TokenBucket {
    constructor(maxTokens, refillRate) {
        this.tokens = maxTokens;
        this.maxTokens = maxTokens;
        this.refillRate = refillRate; // tokens per second
        this.lastRefill = Date.now();
    }

    async acquire() {
        this.refill();
        if (this.tokens < 1) {
            const waitTime = ((1 - this.tokens) / this.refillRate) * 1000;
            await new Promise(r => setTimeout(r, waitTime));
            this.refill();
        }
        this.tokens -= 1;
    }

    refill() {
        const now = Date.now();
        const elapsed = (now - this.lastRefill) / 1000;
        this.tokens = Math.min(this.maxTokens, this.tokens + elapsed * this.refillRate);
        this.lastRefill = now;
    }
}

// Usage: 60 RPM = 1 request per second
const bucket = new TokenBucket(60, 1);
async function makeRequest() {
    await bucket.acquire();
    return callAPI();
}
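
A request-level bucket only protects your RPM limit. If TPM is the tighter constraint, the same idea extends to a token budget; the sketch below assumes a 40,000 TPM limit and a crude 4-characters-per-token estimate, both of which you would replace with your own figures:

// Extend the bucket so a single acquire can drain many units at once
class TokenBudget extends TokenBucket {
    async acquireMany(n) {
        this.refill();
        if (this.tokens < n) {
            const waitTime = ((n - this.tokens) / this.refillRate) * 1000;
            await new Promise(r => setTimeout(r, waitTime));
            this.refill();
        }
        this.tokens -= n;
    }
}

const rpmBucket = new TokenBucket(60, 1);              // 60 requests per minute
const tpmBucket = new TokenBudget(40000, 40000 / 60);  // 40,000 tokens per minute

async function makeRequestWithBudget(prompt) {
    const estimatedTokens = Math.ceil(prompt.length / 4);  // rough heuristic, not a real tokenizer
    await rpmBucket.acquire();
    await tpmBucket.acquireMany(estimatedTokens);
    return callAPI(prompt);
}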

4. Multi-Provider Fallback

The most resilient approach: fall back to a different provider when one is rate-limited:

const providers = [
    { name: 'OpenAI', call: callOpenAI },
    { name: 'Anthropic', call: callAnthropic },
    { name: 'Google', call: callGoogle }
];

async function callWithFallback(prompt) {
    for (const provider of providers) {
        try {
            return await callWithRetry(() => provider.call(prompt));
        } catch (error) {
            if (error.status === 429) {
                console.log(`${provider.name} rate limited, trying next...`);
                continue;
            }
            throw error;
        }
    }
    throw new Error('All providers rate limited');
}

5. Queue-Based Architecture

For maximum control, use a message queue to manage request rates.

This approach decouples your API endpoints from the LLM calls, allowing you to smooth out traffic spikes and respect rate limits consistently.
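
One way to sketch this is with BullMQ on top of Redis; the queue name, connection details, and the 60-requests-per-minute limiter below are assumptions to adapt to your own provider limits:

import { Queue, Worker } from "bullmq";

const connection = { host: "localhost", port: 6379 };

// Producer: API endpoints enqueue work instead of calling the LLM directly
const llmQueue = new Queue("llm-requests", { connection });
await llmQueue.add("completion", { prompt: "Summarize this support ticket..." });

// Consumer: a single worker drains the queue no faster than the limiter allows
new Worker(
    "llm-requests",
    async (job) => callWithRetry(() => callAPI(job.data.prompt)),
    { connection, limiter: { max: 60, duration: 60_000 } }   // at most 60 jobs per minute
);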

Provider-Specific Optimization Tips

OpenAI

Route non-real-time workloads such as classification and summarization through the Batch API: it costs 50% less and runs against separate rate limits, so it never competes with your interactive traffic. Limits also rise automatically as your spend moves you into higher tiers.

Anthropic

Use prompt caching for repeated system prompts and shared context. Cached tokens reduce both cost and effective TPM consumption.

Google

For low-volume applications, the Gemini 2.0 Flash free tier (15 RPM, 1M TPM) may be all you need. Move to pay-as-you-go once RPM, rather than TPM, becomes your bottleneck.

How to Choose Based on Rate Limits

| Use Case | Recommended Provider | Why |
|----------|----------------------|-----|
| High-volume chatbot | Google Gemini Flash | 4M TPM paid, generous free tier |
| Enterprise with SLA needs | OpenAI Tier 5 | 10M TPM, 10K RPM, no RPD limit |
| Budget startup | Google Gemini Flash (free) | 1M TPM at zero cost |
| Batch processing | OpenAI or Anthropic | Batch API with separate limits, 50% off |
| RAG pipelines | Cohere | 1M TPM, optimized for retrieval |

Calculate Your API Costs

Use the APIpulse calculator to estimate your monthly spend across all providers, accounting for your expected request volume.

