AI API Rate Limits Compared: OpenAI vs Anthropic vs Google (2026)
A comprehensive comparison of rate limits across the major LLM API providers — plus practical strategies for handling throttling in production.
Rate limits are the silent killer of AI applications. Your code works perfectly in development, then hits production traffic and starts throwing 429 errors. Understanding each provider's rate limits — and building proper handling — is essential for any production AI application.
In this guide, we compare rate limits across OpenAI, Anthropic, Google, Mistral, and Cohere. We'll cover RPM (requests per minute), TPM (tokens per minute), and daily limits, plus practical code patterns for handling rate limiting gracefully.
Rate Limits by Provider
Rate limits vary significantly between providers and between tiers within each provider. Here's the current landscape as of April 2026.
OpenAI Rate Limits
OpenAI uses a tiered system based on your spending history. New accounts start at Tier 1 and progress to higher tiers as usage increases.
| Tier | RPM | TPM | RPD | Monthly Spend Threshold |
|---|---|---|---|---|
| Tier 1 | 60 | 40,000 | 500 | $0–$50 |
| Tier 2 | 500 | 80,000 | 5,000 | $50–$100 |
| Tier 3 | 5,000 | 300,000 | 10,000 | $100–$500 |
| Tier 4 | 10,000 | 1,000,000 | 30,000 | $500–$5,000 |
| Tier 5 | 10,000 | 10,000,000 | No limit | $5,000+ |
Anthropic Rate Limits
Anthropic also uses a tiered system, but with different thresholds and limits. Their limits are generally more generous for smaller accounts.
| Tier | RPM | TPM | Monthly Spend Threshold |
|---|---|---|---|
| Free | 5 | 25,000 | $0 |
| Tier 1 | 50 | 50,000 | $5+ |
| Tier 2 | 1,000 | 100,000 | $40+ |
| Tier 3 | 2,000 | 200,000 | $200+ |
| Tier 4 | 4,000 | 400,000 | $500+ |
Anthropic also offers prompt caching which can reduce both costs and effective rate limit usage by caching repeated system prompts and context.
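As a rough sketch, here's what that looks like with the @anthropic-ai/sdk client. The model name and system prompt are placeholders, and exact caching details (such as the minimum cacheable prompt size) depend on the model and SDK version:

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// A large, reused block of context, e.g. a policy document or long instructions.
const SHARED_SYSTEM_PROMPT = 'You are a support assistant for Acme Corp. [long policy text here]';

const response = await client.messages.create({
  model: 'claude-sonnet-4-5', // placeholder model name
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: SHARED_SYSTEM_PROMPT,
      // Mark the repeated block as cacheable; cache hits are cheaper and
      // reduce effective rate limit consumption on subsequent requests.
      cache_control: { type: 'ephemeral' }
    }
  ],
  messages: [{ role: 'user', content: 'What is the refund policy?' }]
});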
Google Gemini Rate Limits
Google's approach is different: base limits are higher (especially TPM), but they vary by model rather than by spend tier. The free tier is broadly accessible, though with much lower request limits than pay-as-you-go.
| Tier | Model | RPM | TPM | RPD |
|---|---|---|---|---|
| Free | Gemini 2.0 Flash | 15 | 1,000,000 | 1,500 |
| Free | Gemini 2.5 Pro | 5 | 250,000 | 100 |
| Pay-as-you-go | Gemini 2.0 Flash | 2,000 | 4,000,000 | No limit |
| Pay-as-you-go | Gemini 2.5 Pro | 1,000 | 4,000,000 | No limit |
Mistral Rate Limits
| Tier | RPM | TPM | Notes |
|---|---|---|---|
| Free | 30 | 50,000 | Limited to Mistral Small 4 |
| Paid | 500 | 500,000 | All models |
| Enterprise | Custom | Custom | Contact sales |
Cohere Rate Limits
| Tier | RPM | TPM | Notes |
|---|---|---|---|
| Trial | 100 | 100,000 | Command R only |
| Production | 1,000 | 1,000,000 | All models |
| Enterprise | Custom | Custom | SLA guarantees |
Side-by-Side Comparison
Here's a quick reference comparing paid-tier rate limits across all providers:
| Provider | Paid RPM | Paid TPM | Free Tier? | Batch API? |
|---|---|---|---|---|
| OpenAI | 5,000–10,000 | 300K–10M | No | Yes (50% off) |
| Anthropic | 1,000–4,000 | 100K–400K | Yes (limited) | Yes (50% off) |
| Google | 1,000–2,000 | 4M | Yes (generous) | No |
| Mistral | 500 | 500K | Yes (limited) | No |
| Cohere | 1,000 | 1M | Yes (trial) | No |
How to Handle Rate Limiting in Production
Building robust rate limit handling is essential. Here are the key patterns every production AI application should implement.
1. Exponential Backoff with Jitter
The most important pattern. When you receive a 429 response, wait progressively longer before retrying:
async function callWithRetry(fn, maxRetries = 3) {
for (let i = 0; i < maxRetries; i++) {
try {
return await fn();
} catch (error) {
if (error.status !== 429 || i === maxRetries - 1) throw error;
// Exponential backoff with jitter
const baseDelay = Math.pow(2, i) * 1000;
const jitter = Math.random() * 1000;
const delay = baseDelay + jitter;
console.log(`Rate limited. Retrying in ${Math.round(delay)}ms...`);
await new Promise(r => setTimeout(r, delay));
}
}
}
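Usage is the same no matter which SDK you call: wrap the request in a closure and pass it in. A minimal sketch with the official openai client (the model name is illustrative):

import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const completion = await callWithRetry(() =>
  openai.chat.completions.create({
    model: 'gpt-4.1-mini', // illustrative model name
    messages: [{ role: 'user', content: 'Hello!' }]
  })
);
console.log(completion.choices[0].message.content);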
2. Respect the Retry-After Header
Most providers include a Retry-After header in 429 responses. Always check it:
// Inside your retry loop ("attempt" is the current retry count)
if (error.status === 429) {
  const retryAfter = error.headers?.['retry-after'];
  const delay = retryAfter
    ? parseInt(retryAfter, 10) * 1000   // Retry-After is given in seconds
    : Math.pow(2, attempt) * 1000;      // fall back to exponential backoff
  await new Promise(r => setTimeout(r, delay));
}
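You can also throttle proactively: on successful responses, OpenAI returns quota headers such as x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens, and other providers expose similar headers under their own names. A raw-fetch sketch (model name illustrative):

const res = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'gpt-4.1-mini', // illustrative model name
    messages: [{ role: 'user', content: 'ping' }]
  })
});

// If these get close to zero, slow down before the provider starts returning 429s.
console.log('Requests remaining:', res.headers.get('x-ratelimit-remaining-requests'));
console.log('Tokens remaining:', res.headers.get('x-ratelimit-remaining-tokens'));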
3. Implement a Token Bucket
For high-throughput applications, implement a token bucket to proactively limit requests:
class TokenBucket {
constructor(maxTokens, refillRate) {
this.tokens = maxTokens;
this.maxTokens = maxTokens;
this.refillRate = refillRate; // tokens per second
this.lastRefill = Date.now();
}
async acquire() {
this.refill();
if (this.tokens < 1) {
const waitTime = ((1 - this.tokens) / this.refillRate) * 1000;
await new Promise(r => setTimeout(r, waitTime));
this.refill();
}
this.tokens -= 1;
}
refill() {
const now = Date.now();
const elapsed = (now - this.lastRefill) / 1000;
this.tokens = Math.min(this.maxTokens, this.tokens + elapsed * this.refillRate);
this.lastRefill = now;
}
}
// Usage: 60 RPM = 1 request per second
const bucket = new TokenBucket(60, 1);
async function makeRequest() {
await bucket.acquire();
return callAPI();
}
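Note that most providers track limits per model (and per organization), so in practice you would keep one bucket per model you call, sized from the tables above.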
4. Multi-Provider Fallback
The most resilient approach: fall back to a different provider when one is rate-limited:
const providers = [
{ name: 'OpenAI', call: callOpenAI },
{ name: 'Anthropic', call: callAnthropic },
{ name: 'Google', call: callGoogle }
];
async function callWithFallback(prompt) {
for (const provider of providers) {
try {
return await callWithRetry(() => provider.call(prompt));
} catch (error) {
if (error.status === 429) {
console.log(`${provider.name} rate limited, trying next...`);
continue;
}
throw error;
}
}
throw new Error('All providers rate limited');
}
5. Queue-Based Architecture
For maximum control, use a message queue to manage request rates:
- Redis + Bull: Use a job queue with configurable concurrency
- SQS + Lambda: AWS-native approach with built-in retry
- Cloud Tasks: Google Cloud's managed queue with rate limiting
This approach decouples your API endpoints from the LLM calls, allowing you to smooth out traffic spikes and respect rate limits consistently.
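A minimal sketch of the Redis + Bull option, assuming a local Redis instance and reusing the callWithRetry helper and callAPI placeholder from the earlier examples (queue name and rate numbers are illustrative):

import Queue from 'bull';

// One queue per provider/model, capped at 60 jobs per minute (60 RPM).
const llmQueue = new Queue('llm-requests', 'redis://127.0.0.1:6379', {
  limiter: { max: 60, duration: 60000 }
});

// Worker: Bull only hands out jobs as fast as the limiter allows.
llmQueue.process(async (job) => {
  return callWithRetry(() => callAPI(job.data.prompt));
});

// Producer: API endpoints enqueue work and return immediately.
async function enqueuePrompt(prompt) {
  const job = await llmQueue.add(
    { prompt },
    { attempts: 3, backoff: { type: 'exponential', delay: 2000 } }
  );
  return job.id;
}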
Provider-Specific Optimization Tips
OpenAI
- Use the Batch API for non-real-time workloads (50% cost savings, separate rate limits; see the sketch after this list)
- Upgrade tiers by increasing monthly spend — Tier 5 unlocks 10M TPM
- Use streaming to reduce perceived latency without increasing rate limit usage
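For the Batch API, the flow is: write your requests as a JSONL file, upload it, then create a batch job that completes within 24 hours. A rough sketch with the openai SDK (file name, prompts, and model are illustrative):

import fs from 'fs';
import OpenAI from 'openai';

const client = new OpenAI();
const prompts = ['Summarize document A', 'Summarize document B'];

// 1. One JSON request per line, each with a unique custom_id for matching results later.
const lines = prompts.map((prompt, i) => JSON.stringify({
  custom_id: `req-${i}`,
  method: 'POST',
  url: '/v1/chat/completions',
  body: { model: 'gpt-4.1-mini', messages: [{ role: 'user', content: prompt }] }
}));
fs.writeFileSync('batch.jsonl', lines.join('\n'));

// 2. Upload the JSONL file, then 3. submit the batch job.
const file = await client.files.create({
  file: fs.createReadStream('batch.jsonl'),
  purpose: 'batch'
});
const batch = await client.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h'
});
console.log('Batch submitted:', batch.id); // poll client.batches.retrieve(batch.id) for status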
Anthropic
- Enable prompt caching to reduce effective token usage and rate limit consumption
- Use message batching for bulk processing
- Tier upgrades are based on spending history — consistent usage unlocks higher limits
Google
- The free tier is generous enough for many production workloads
- Gemini 2.0 Flash has higher free-tier limits than Pro — use it for high-volume tasks
- Context caching can reduce token usage for repeated prompts
How to Choose Based on Rate Limits
| Use Case | Recommended Provider | Why |
|---|---|---|
| High-volume chatbot | Google Gemini Flash | 4M TPM paid, generous free tier |
| Enterprise with SLA needs | OpenAI Tier 5 | 10M TPM, 10K RPM, no RPD limit |
| Budget startup | Google Gemini Flash (free) | 1M TPM at zero cost |
| Batch processing | OpenAI or Anthropic | Batch API with separate limits, 50% off |
| RAG pipelines | Cohere | 1M TPM, optimized for retrieval |
Calculate Your API Costs
Use the APIpulse calculator to estimate your monthly spend across all providers, accounting for your expected request volume.
Try the Calculator