โ† Back to blog

AI API Caching Strategies: Reduce LLM Costs by 60%+

Caching is the highest-ROI cost optimization technique for AI APIs. A well-implemented cache can eliminate 30-70% of your API calls entirely: cache hits cost nothing and return almost instantly. This guide covers three caching strategies with real implementation examples and cost breakdowns.

Why Caching Works So Well for LLMs

Most AI API workloads have significant request overlap. A customer support bot sees the same questions repeatedly. A content generator processes similar prompts. A classification pipeline handles recurring patterns. Every duplicate request that hits your cache instead of the API is pure savings.

A SaaS company processing 15,000 chat requests/day implemented exact-match caching and immediately reduced their API bill from $450/month to $210/month, a 53% reduction with zero quality loss.

The key insight: LLM APIs charge per token, so a response served from cache costs you nothing. Even a 30% cache hit rate means roughly 30% of your costs disappear overnight.
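
The arithmetic is worth making explicit. A minimal sketch (the $450 and 53% figures come from the example above; a uniform cost per request is assumed):

def effective_monthly_cost(monthly_cost, hit_rate):
    # Every cache hit is a request you don't pay for, so the bill
    # scales with the miss rate (1 - hit_rate)
    return monthly_cost * (1 - hit_rate)

print(effective_monthly_cost(450, 0.53))  # -> 211.5, close to the $210/mo above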

Strategy 1: Exact-Match Caching

This is the simplest and most reliable caching approach: store the full prompt and response, and if the exact same request comes in again, return the cached result without calling the API.

How it works

  1. Hash the request (system prompt + user message + model + parameters)
  2. Check if hash exists in your cache store (Redis, SQLite, or even a dictionary)
  3. On hit: return cached response immediately
  4. On miss: call the API, store the response with the hash, return it
A minimal implementation with Redis as the cache store:

import hashlib, json, redis
from openai import OpenAI

r = redis.Redis()
client = OpenAI()

def cached_completion(messages, model="gpt-4o-mini", **kwargs):
    # Create a cache key from the full request
    cache_input = json.dumps({"messages": messages, "model": model, **kwargs}, sort_keys=True)
    cache_key = f"llm:{hashlib.sha256(cache_input.encode()).hexdigest()}"

    # Check cache
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    # Cache miss: call the API
    response = client.chat.completions.create(messages=messages, model=model, **kwargs)
    result = response.model_dump()

    # Store in cache (expire after 24 hours)
    r.setex(cache_key, 86400, json.dumps(result))
    return result

When exact-match caching works best

Exact-match shines when requests repeat verbatim: FAQ bots, templated prompts, and classification pipelines that see the same inputs again and again.

When it falls short

Exact-match requires identical inputs. If users phrase the same question differently ("What's the price?" vs "How much does it cost?"), they won't match. That's where semantic caching comes in.

Exact-match caching impact
  10,000 requests/day, GPT-4o mini   $45/mo
  With 35% exact-match hit rate      $29/mo
  Monthly savings                    $16/mo (35% reduction)

Strategy 2: Semantic Caching

Semantic caching matches requests by meaning, not exact text. "How do I reset my password?" and "I forgot my password, how do I fix it?" would both hit the same cache entry because they're semantically equivalent.

How it works

  1. Generate an embedding vector for each request using a cheap embedding model
  2. Store the embedding + response in a vector database
  3. On a new request, search for the nearest cached embedding (cosine similarity above a threshold)
  4. On hit: return the cached response; on miss: call the API, then store the embedding + response
A sketch using a small in-memory store built on NumPy; in production you'd swap in a real vector DB (Pinecone, Weaviate, pgvector):

import numpy as np
from openai import OpenAI

client = OpenAI()

# Minimal in-memory vector store: parallel lists of embeddings and payloads.
# Replace with a real vector DB for anything beyond a prototype.
_cache_embeddings = []
_cache_entries = []

def get_embedding(text):
    response = client.embeddings.create(input=text, model="text-embedding-3-small")
    return np.array(response.data[0].embedding)

def _search(query_embedding, threshold):
    # OpenAI embeddings are unit-normalized, so a dot product is cosine similarity
    best_score, best_entry = -1.0, None
    for emb, entry in zip(_cache_embeddings, _cache_entries):
        score = float(np.dot(query_embedding, emb))
        if score > best_score:
            best_score, best_entry = score, entry
    return best_entry if best_score >= threshold else None

def semantic_cached_completion(messages, model="gpt-4o-mini",
                               similarity_threshold=0.92, **kwargs):
    # Combine messages into a single query string for embedding
    query_text = " ".join(m["content"] for m in messages)
    query_embedding = get_embedding(query_text)

    # Search for a semantically similar cached query
    hit = _search(query_embedding, similarity_threshold)
    if hit:
        return hit["response"]

    # Cache miss: call the API
    response = client.chat.completions.create(
        messages=messages, model=model, **kwargs
    )
    result = response.model_dump()

    # Store embedding + response for future hits
    _cache_embeddings.append(query_embedding)
    _cache_entries.append({
        "response": result,
        "query": query_text,
        "model": model,
    })
    return result

Semantic caching trade-offs

Factor          Exact-Match              Semantic
Hit rate        20-40%                   40-65%
Quality risk    None (exact response)    Low (similar but not identical query)
Infrastructure  Redis or in-memory       Vector DB (Pinecone, Weaviate, pgvector)
Added latency   <1ms (hash lookup)       5-20ms (embedding + vector search)
Cost to run     Near zero                Embedding cost (~$0.02/1M tokens)
Best for        FAQ bots, templates      Conversational, varied phrasing

Tuning the similarity threshold

The threshold controls how "similar" queries need to be to count as a cache hit. Too low and you'll return wrong answers; too high and you'll miss real matches.

Start at 0.92 and adjust based on your quality metrics. For support bots where accuracy matters, stay above 0.93.
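
One way to pick a threshold is to replay a small labeled set of query pairs and count both failure modes. A rough harness (labeled_pairs and the sweep values are illustrative; reuses get_embedding from above):

import numpy as np

def evaluate_thresholds(labeled_pairs, thresholds=(0.88, 0.90, 0.92, 0.94)):
    # labeled_pairs: list of (query_a, query_b, same_intent) tuples
    for t in thresholds:
        false_hits = missed = 0
        for qa, qb, same_intent in labeled_pairs:
            sim = float(np.dot(get_embedding(qa), get_embedding(qb)))
            if sim >= t and not same_intent:
                false_hits += 1   # would have served a wrong cached answer
            elif sim < t and same_intent:
                missed += 1       # a real match the cache would skip
        print(f"threshold={t}: false_hits={false_hits}, missed={missed}")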

Semantic caching impact (15,000 requests/day)
  Without caching (GPT-4o)               $675/mo
  Exact-match only (35% hit rate)        $439/mo
  Semantic caching (55% hit rate)        $304/mo
  Total savings with semantic caching    $371/mo (55% reduction)

Strategy 3: Provider Prompt Caching

Both OpenAI and Anthropic now offer built-in prompt caching at the API level: the provider caches the shared prefix of your prompts and bills subsequent requests that reuse that prefix at a discount.

OpenAI Prompt Caching

OpenAI's caching is automatic: prompts of 1,024 tokens or more have their prefix cached, and cached input tokens are billed at a 50% discount ($1.25/M instead of $2.50/M on GPT-4o). No code changes are required.
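
You can verify the cache is kicking in from the usage block of a response. A quick check (LONG_SYSTEM_PROMPT is a placeholder for a 1,024+ token shared prefix; client is the OpenAI client from earlier):

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},
        {"role": "user", "content": "What's the return policy?"},
    ],
)
# Non-zero once the prefix has been cached by a prior request
print(resp.usage.prompt_tokens_details.cached_tokens)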

Anthropic Prompt Caching

Anthropic's caching is explicit: you mark the prefix to cache with cache_control breakpoints. Cache writes cost 25% more than normal input tokens, cache reads cost 90% less, and cached prefixes expire after about five minutes of inactivity.

# Anthropic prompt caching example
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant for Acme Corp. [long system prompt...]",
            "cache_control": {"type": "ephemeral"}  # Cache this prefix
        }
    ],
    messages=[{"role": "user", "content": "What's the return policy?"}]
)

When provider caching helps most

The bigger and more stable your shared prefix (long system prompts, few-shot examples, tool definitions) and the higher your request volume, the more you save. Keep the minimum in mind: prefixes under 1,024 tokens aren't cached on OpenAI or on Anthropic's Sonnet-class models.

Provider prompt caching impact (illustrative: 1,000 requests/day, 8,000-token cached system prompt, Claude Sonnet at $3/M input)
  System-prompt input cost without caching   $720/mo
  With cache reads at 90% off ($0.30/M)      $72/mo
  Monthly savings on the prefix              $648/mo (90% reduction)

Combining Strategies for Maximum Savings

The best results come from layering multiple caching approaches. Here's a production architecture that combines all three:

  1. Layer 1: exact-match cache (Redis). Fast, zero-cost lookups. Catches duplicate requests.
  2. Layer 2: semantic cache (vector DB). Catches paraphrased versions of cached queries.
  3. Layer 3: provider prompt caching (automatic or via cache_control). Even on cache misses, the shared prompt prefix is discounted by the provider (50% on OpenAI, 90% on Anthropic).

On a cache miss at Layer 1, you check Layer 2. If both miss, the API call still benefits from provider caching on the prompt prefix. The result: you rarely pay full price for any request.
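
Here's a sketch of the layered lookup, reusing the helpers defined in Strategies 1 and 2 (r, client, get_embedding, _search, and the in-memory store; all assumptions from those blocks carry over):

def layered_completion(messages, model="gpt-4o-mini", **kwargs):
    # Layer 1: exact-match lookup in Redis
    cache_input = json.dumps({"messages": messages, "model": model, **kwargs}, sort_keys=True)
    cache_key = f"llm:{hashlib.sha256(cache_input.encode()).hexdigest()}"
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    # Layer 2: semantic lookup over cached query embeddings
    query_text = " ".join(m["content"] for m in messages)
    query_embedding = get_embedding(query_text)
    hit = _search(query_embedding, 0.92)
    if hit:
        return hit["response"]

    # Layer 3: API call; the provider still discounts the shared prompt prefix
    response = client.chat.completions.create(messages=messages, model=model, **kwargs)
    result = response.model_dump()

    # Populate both cache layers for future requests
    r.setex(cache_key, 86400, json.dumps(result))
    _cache_embeddings.append(query_embedding)
    _cache_entries.append({"response": result, "query": query_text, "model": model})
    return result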

Combined caching impact (illustrative: 20,000 requests/day, GPT-4o, 1,200-token system prompt, counting shared-prefix input tokens)
  No caching                                                      $1,800/mo
  Exact-match only (35% hit rate)                                 $1,170/mo
  Exact + semantic (55% hit rate)                                 $810/mo
  All three layers (55% hits + 50% provider discount on misses)   $405/mo
  Maximum savings                                                 $1,395/mo (78% reduction)

Cache Invalidation: The Hard Part

Caching is easy; invalidating caches correctly is where most teams struggle. Two main strategies:

  • TTL (time-to-live): every entry expires after a fixed window, like the 24-hour setex in the Redis example above. Simple and predictable; stale data ages out on its own.
  • Event-driven invalidation: when the underlying data changes (a price update, a policy change), purge the affected cache entries immediately.

Rule of thumb: Use TTL for most cases. Only build event-driven invalidation if stale responses would cause user-facing issues.
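
A minimal event-driven purge, assuming the Redis setup from Strategy 1 (the llm: key prefix matches the cache keys used there):

def invalidate_llm_cache(pattern="llm:*"):
    # Delete every cached completion matching the pattern.
    # Call this from whatever code path updates the underlying data.
    for key in r.scan_iter(pattern):
        r.delete(key)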

Measuring Cache Performance

Track these metrics to know if your caching is working:

Metric               What it tells you                             Target
Hit rate             % of requests served from cache               30%+ (exact), 50%+ (semantic)
Cost per request     Average API cost divided by total requests    Decreasing over time
Cache latency        Time to serve a cached response               <5ms (exact), <25ms (semantic)
Stale response rate  % of cached responses that were outdated      <1%
Cache size           Storage used by cache entries                 Monitor growth, set eviction policies
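
Hit rate is the easiest to instrument. A simple counter on the same Redis instance (the llm:stats: key names are illustrative):

def record_cache_result(hit: bool):
    # Increment a running counter on every cache lookup
    r.incr("llm:stats:hits" if hit else "llm:stats:misses")

def cache_hit_rate():
    hits = int(r.get("llm:stats:hits") or 0)
    misses = int(r.get("llm:stats:misses") or 0)
    total = hits + misses
    return hits / total if total else 0.0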

See how much you could save with caching.

Enter your current API usage and get a personalized cost projection with and without caching.

Try the APIpulse Calculator

Implementation Checklist

  • Identify which request types have high repetition (FAQ, templates, classification)
  • Start with exact-match caching: it's the easiest win
  • Measure your natural hit rate before optimizing
  • Add semantic caching if exact-match hit rate is below 30%
  • Enable provider prompt caching (automatic on OpenAI; one cache_control line on Anthropic)
  • Set appropriate TTLs based on data freshness requirements
  • Monitor hit rate, cost per request, and stale response rate
  • Implement cache invalidation for data-sensitive workloads
  • Consider combining all three layers for maximum savings
  • Use APIpulse to track your cost-per-request trends



Want to optimize your AI API costs?

APIpulse Pro ($29 one-time) includes saved scenarios, cost report exports, and personalized recommendations that can save you up to 40%.

Get Pro for $29