AI API Caching Strategies: Reduce LLM Costs by 60%+
Caching is the highest-ROI cost optimization technique for AI APIs. A well-implemented cache can eliminate 30-70% of your API calls entirely: zero cost, zero latency penalty on cache hits. This guide covers three caching strategies with real implementation examples and cost breakdowns.
Why Caching Works So Well for LLMs
Most AI API workloads have significant request overlap. A customer support bot sees the same questions repeatedly. A content generator processes similar prompts. A classification pipeline handles recurring patterns. Every duplicate request that hits your cache instead of the API is pure savings.
A SaaS company processing 15,000 chat requests/day implemented exact-match caching and immediately reduced their API bill from $450/month to $210/month, a 53% reduction with zero quality loss.
The key insight: LLM APIs charge per token. If you can serve a response from cache, you pay nothing for that request. Even a 30% cache hit rate means 30% of your costs disappear overnight.
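As a quick back-of-the-envelope check, effective spend scales with the miss rate. The sketch below uses the illustrative figures above and assumes cached and uncached requests cost roughly the same on average:

```python
# Effective monthly spend at different cache hit rates (illustrative figures)
monthly_api_cost = 450.00  # example spend with no caching

for hit_rate in (0.0, 0.30, 0.53, 0.70):
    effective_cost = monthly_api_cost * (1 - hit_rate)
    print(f"hit rate {hit_rate:.0%}: ~${effective_cost:.0f}/month")
```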
Strategy 1: Exact-Match Caching
The simplest and most reliable caching approach. Store the full prompt + response. If the exact same request comes in again, return the cached result without calling the API.
How it works
- Hash the request (system prompt + user message + model + parameters)
- Check if hash exists in your cache store (Redis, SQLite, or even a dictionary)
- On hit: return cached response immediately
- On miss: call the API, store the response with the hash, return it
```python
import hashlib, json, redis
from openai import OpenAI

r = redis.Redis()
client = OpenAI()

def cached_completion(messages, model="gpt-4o-mini", **kwargs):
    # Create a cache key from the full request
    cache_input = json.dumps({"messages": messages, "model": model, **kwargs}, sort_keys=True)
    cache_key = f"llm:{hashlib.sha256(cache_input.encode()).hexdigest()}"

    # Check cache
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    # Cache miss: call the API
    response = client.chat.completions.create(messages=messages, model=model, **kwargs)
    result = response.model_dump()

    # Store in cache (expire after 24 hours)
    r.setex(cache_key, 86400, json.dumps(result))
    return result
```
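A quick usage sketch, assuming a local Redis server is running and OPENAI_API_KEY is set:

```python
messages = [{"role": "user", "content": "What's your refund policy?"}]

first = cached_completion(messages)   # cache miss: calls the API and stores the result
second = cached_completion(messages)  # cache hit: served from Redis at zero API cost
```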
When exact-match caching works best
- FAQ chatbots: same questions asked repeatedly (40-60% hit rates are common)
- Template-based generation: same inputs produce same outputs
- Classification pipelines: identical documents reclassified
- Code completion: same context triggers same suggestions
When it falls short
Exact-match requires identical inputs. If users phrase the same question differently ("What's the price?" vs "How much does it cost?"), they won't match. That's where semantic caching comes in.
Strategy 2: Semantic Caching
Semantic caching matches requests by meaning, not exact text. "How do I reset my password?" and "I forgot my password, how do I fix it?" would both hit the same cache entry because they're semantically equivalent.
How it works
- Generate an embedding vector for each request using a cheap embedding model
- Store the embedding + response in a vector database
- On new requests, search for the nearest embedding (cosine similarity above a threshold)
- On hit: return cached response; On miss: call API, store embedding + response
```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Minimal in-memory vector store: a list of (embedding, payload) pairs.
# In production, swap in a real vector DB (Pinecone, Weaviate, pgvector).
_semantic_cache = []

def get_embedding(text):
    response = client.embeddings.create(input=text, model="text-embedding-3-small")
    return np.array(response.data[0].embedding)

def _find_similar(query_embedding, threshold):
    # Linear scan for the most similar cached entry (fine for small caches)
    best_score, best_payload = -1.0, None
    for embedding, payload in _semantic_cache:
        score = float(np.dot(query_embedding, embedding) /
                      (np.linalg.norm(query_embedding) * np.linalg.norm(embedding)))
        if score > best_score:
            best_score, best_payload = score, payload
    return best_payload if best_score >= threshold else None

def semantic_cached_completion(messages, model="gpt-4o-mini",
                               similarity_threshold=0.92, **kwargs):
    # Combine messages into a single query string for embedding
    query_text = " ".join(m["content"] for m in messages)
    query_embedding = get_embedding(query_text)

    # Search the cache for a semantically similar query
    hit = _find_similar(query_embedding, similarity_threshold)
    if hit:
        return hit["response"]

    # Cache miss: call the API
    response = client.chat.completions.create(
        messages=messages, model=model, **kwargs
    )
    result = response.model_dump()

    # Store embedding + response for future lookups
    _semantic_cache.append((query_embedding, {
        "response": result,
        "query": query_text,
        "model": model,
    }))
    return result
```
Semantic caching trade-offs
| Factor | Exact-Match | Semantic |
|---|---|---|
| Hit rate | 20-40% | 40-65% |
| Quality risk | None (exact response) | Low (similar but not identical query) |
| Infrastructure | Redis or in-memory | Vector DB (Pinecone, Weaviate, pgvector) |
| Added latency | <1ms (hash lookup) | 5-20ms (embedding + vector search) |
| Cost to run | Near zero | Embedding cost (~$0.02/1M tokens) |
| Best for | FAQ bots, templates | Conversational, varied phrasing |
Tuning the similarity threshold
The threshold controls how "similar" queries need to be to count as a cache hit. Too low and you'll return wrong answers; too high and you'll miss real matches.
- 0.95+: Very conservative. Almost exact meaning. Low risk of wrong answers.
- 0.90-0.95: Balanced. Good hit rates with minimal quality risk.
- 0.85-0.90: Aggressive. Higher hit rates but risk of semantically different queries matching.
Start at 0.92 and adjust based on your quality metrics. For support bots where accuracy matters, stay above 0.93.
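One practical way to choose the threshold is to score a small set of hand-labeled query pairs and watch where false hits appear. A minimal sketch, reusing the get_embedding helper from the semantic caching example; the pairs below are made up for illustration:

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-labeled pairs: (query_a, query_b, should these share a cache entry?)
labeled_pairs = [
    ("What's the price?", "How much does it cost?", True),
    ("How do I reset my password?", "I forgot my password, how do I fix it?", True),
    ("How do I cancel my plan?", "How do I upgrade my plan?", False),
]

# Embed each pair once, then sweep candidate thresholds
scored = [(cosine(get_embedding(a), get_embedding(b)), should)
          for a, b, should in labeled_pairs]

for threshold in (0.85, 0.90, 0.92, 0.95):
    false_hits = sum(1 for sim, should in scored if sim >= threshold and not should)
    missed = sum(1 for sim, should in scored if sim < threshold and should)
    print(f"threshold {threshold:.2f}: {false_hits} false hits, {missed} missed matches")
```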
Strategy 3: Provider Prompt Caching
Both OpenAI and Anthropic now offer built-in prompt caching at the API level. The provider caches the prefix of your prompts and discounts the input tokens on subsequent requests that share the same prefix (automatically for OpenAI, via explicit markers for Anthropic).
OpenAI Prompt Caching
- Automatically caches prompt prefixes once the prompt reaches 1,024 tokens on supported models (GPT-4o, GPT-4o mini, and newer)
- Cached input tokens are billed at a reduced rate (50% off the input price for GPT-4o and GPT-4o mini)
- Cache entries expire after 5-10 minutes of inactivity
- No code changes needed: it's automatic for supported models
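To confirm the cache is actually being hit, you can inspect the usage details the API returns. A minimal sketch with the openai Python SDK; the long_system_prompt placeholder stands in for a real prompt of 1,024+ tokens, and the exact usage field names are worth double-checking against your SDK version:

```python
from openai import OpenAI

client = OpenAI()

long_system_prompt = "You are a helpful assistant for Acme Corp. ..."  # placeholder: needs 1,024+ tokens to qualify

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": long_system_prompt},
        {"role": "user", "content": "What's the return policy?"},
    ],
)

usage = response.usage
cached = usage.prompt_tokens_details.cached_tokens  # input tokens billed at the cached rate
print(f"{cached} of {usage.prompt_tokens} input tokens came from the prompt cache")
```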
Anthropic Prompt Caching
- Cache prompt prefixes with explicit cache breakpoints (up to 4 per request); prefixes must meet a model-specific minimum length (1,024 tokens on most models)
- 90% discount on cached input tokens
- Cache TTL is 5 minutes (extended on each hit)
- Requires adding a `cache_control` parameter to mark the cache breakpoint
```python
# Anthropic prompt caching example
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant for Acme Corp. [long system prompt...]",
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    messages=[{"role": "user", "content": "What's the return policy?"}],
)
```
When provider caching helps most
- Long system prompts: if your system prompt is long enough to qualify (roughly 1,000+ tokens), every request after the first pays the discounted cached rate for that prefix
- Multi-turn conversations: the conversation history is the prefix; only new messages are charged at the full input price
- RAG applications: large context documents at the start of the prompt are cached automatically (OpenAI) or via a cache breakpoint (Anthropic)
Combining Strategies for Maximum Savings
The best results come from layering multiple caching approaches. Here's a production architecture that combines all three:
- Layer 1: Exact-match cache (Redis). Fast, zero-cost lookups. Catches duplicate requests.
- Layer 2: Semantic cache (vector DB). Catches paraphrased versions of cached queries.
- Layer 3: Provider prompt caching (automatic). Even on cache misses, the system prompt prefix is billed at the provider's discounted cached rate.
On a cache miss at Layer 1, you check Layer 2. If both miss, the API call still benefits from provider caching on the prompt prefix. The result: you rarely pay full price for any request.
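A sketch of that lookup order is below. The exact_cache_lookup/store and semantic_cache_lookup/store helpers are hypothetical names for the hash-and-get and embedding-search logic shown earlier, and client is the OpenAI client from those examples:

```python
def layered_completion(messages, model="gpt-4o-mini", **kwargs):
    # Layer 1: exact-match lookup (hypothetical helper wrapping the Redis
    # hash-and-get logic from Strategy 1)
    result = exact_cache_lookup(messages, model, **kwargs)
    if result is not None:
        return result

    # Layer 2: semantic lookup (hypothetical helper wrapping the embedding
    # search from Strategy 2)
    result = semantic_cache_lookup(messages)
    if result is not None:
        exact_cache_store(messages, model, result, **kwargs)  # promote to Layer 1
        return result

    # Layer 3: call the API; the provider still discounts any cached prompt prefix
    response = client.chat.completions.create(messages=messages, model=model, **kwargs)
    result = response.model_dump()

    # Populate both local caches so future duplicates and paraphrases hit locally
    exact_cache_store(messages, model, result, **kwargs)
    semantic_cache_store(messages, result, model)
    return result
```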
Cache Invalidation: The Hard Part
Caching is easy. Invalidating caches correctly is where most teams struggle. Strategies:
- Time-based expiration (TTL): the simplest approach. Set 1-24 hour TTLs based on how stale your data can be. Pricing data? 24 hours is fine. Real-time support? 5-15 minutes.
- Event-driven invalidation: when your data changes (new pricing, updated docs), invalidate the affected cache entries. More complex but more precise.
- Version-prefixed keys: include a version number in cache keys, e.g. `v2:hash(prompt)`. Bump the version to invalidate everything at once (see the sketch below).
- Write-through caching: update the cache and the database simultaneously on writes. Ensures consistency but adds write latency.
Rule of thumb: Use TTL for most cases. Only build event-driven invalidation if stale responses would cause user-facing issues.
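A minimal sketch of the version-prefixed pattern on top of the Redis client (r) from Strategy 1; the CACHE_VERSION constant and store_response helper are illustrative:

```python
import hashlib, json

CACHE_VERSION = "v2"  # bump to "v3" to invalidate every existing entry at once

def make_cache_key(cache_input: str) -> str:
    digest = hashlib.sha256(cache_input.encode()).hexdigest()
    return f"llm:{CACHE_VERSION}:{digest}"

def store_response(cache_input: str, result: dict, ttl_seconds: int = 24 * 3600):
    # TTL still applies on top of versioning: long TTLs for stale-tolerant data
    # (pricing FAQs), short TTLs (5-15 minutes) for fast-moving support content.
    r.setex(make_cache_key(cache_input), ttl_seconds, json.dumps(result))
```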
Measuring Cache Performance
Track these metrics to know if your caching is working:
| Metric | What It Tells You | Target |
|---|---|---|
| Hit rate | % of requests served from cache | 30%+ (exact), 50%+ (semantic) |
| Cost per request | Total API spend divided by total requests (cached + uncached) | Decreasing over time |
| Cache latency | Time to serve a cached response | <5ms (exact), <25ms (semantic) |
| Stale response rate | % of cached responses that were outdated | <1% |
| Cache size | Storage used by cache entries | Monitor growth, set eviction policies |
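The first three metrics only need a couple of counters around your completion wrapper. A minimal sketch, not wired to any particular monitoring system; how you estimate per-request cost from token counts depends on your pricing table:

```python
import time
from dataclasses import dataclass, field

@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    api_cost_usd: float = 0.0                        # accumulated spend on cache misses
    cache_latency_ms: list = field(default_factory=list)

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def cost_per_request(self) -> float:
        total = self.hits + self.misses
        return self.api_cost_usd / total if total else 0.0

stats = CacheStats()

def record_hit(started_at: float):
    stats.hits += 1
    stats.cache_latency_ms.append((time.perf_counter() - started_at) * 1000)

def record_miss(estimated_cost_usd: float):
    # estimated_cost_usd would come from the response's token counts and your model's pricing
    stats.misses += 1
    stats.api_cost_usd += estimated_cost_usd
```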
See how much you could save with caching.
Enter your current API usage and get a personalized cost projection with and without caching.
Try the APIpulse Calculator
Implementation Checklist
- Identify which request types have high repetition (FAQ, templates, classification)
- Start with exact-match caching: it's the easiest win
- Measure your natural hit rate before optimizing
- Add semantic caching if exact-match hit rate is below 30%
- Enable provider prompt caching (automatic for OpenAI; a single parameter for Anthropic)
- Set appropriate TTLs based on data freshness requirements
- Monitor hit rate, cost per request, and stale response rate
- Implement cache invalidation for data-sensitive workloads
- Consider combining all three layers for maximum savings
- Use APIpulse to track your cost-per-request trends
Related Reading
- AI API Cost Optimization: A Complete Guide for 2026
- The Complete Guide to AI API Batch Processing
- How to Reduce Your AI API Costs by 40% (Without Losing Quality)
- How to Cut Your AI API Bill in Half
- Multi-Model Routing: Use the Right LLM for Each Task
- AI API Cost Per Request: How Much Does Each LLM Call Actually Cost?
- Cheapest RAG Setup in 2026: Full Cost Breakdown
Want to optimize your AI API costs?
APIpulse Pro ($29 one-time) includes saved scenarios, cost report exports, and personalized recommendations that can save you up to 40%.
Get Pro for $29