The True Cost of RAG: LLM Pricing for Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) has become the standard architecture for building AI applications that need to reference specific documents, knowledge bases, or real-time data. But RAG pipelines have multiple cost components — and most developers only think about the generation step.
Let's break down the total cost of a RAG pipeline and compare pricing across providers.
The RAG Pipeline: 3 Cost Centers
Every RAG query involves three billable steps:
- Embedding: Convert the user's query (and your documents) into vectors
- Vector search: Find the most relevant documents (usually self-hosted or fixed cost)
- Generation: Use the retrieved context to generate an answer
Let's price each component.
1. Embedding Costs
You need embeddings for two things: indexing your documents (one-time) and embedding each user query (per-request).
| Model | Cost per 1M tokens | Dimensions | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.02 | 1536 | Budget, high-volume |
| OpenAI text-embedding-3-large | $0.13 | 3072 | Higher accuracy |
| Cohere embed-v4 | $0.10 | 1024 | Multilingual |
| Google text-embedding-005 | $0.02 | 768 | Google ecosystem |
Key insight: Embedding is cheap. At $0.02 per 1M tokens, embedding 1M tokens of documents costs just $0.02. Even embedding 1,000 queries per day (at 500 tokens each) costs only $0.30/month.
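The arithmetic above is easy to sanity-check in a few lines. This helper is an illustrative sketch using the per-1M-token prices from the table, not a billing API:

```python
# Embedding cost estimator using the per-1M-token prices from the table above.
EMBED_PRICE_PER_1M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def embedding_cost(tokens: int, model: str = "text-embedding-3-small") -> float:
    """USD cost to embed the given number of tokens with the given model."""
    return tokens / 1_000_000 * EMBED_PRICE_PER_1M[model]

# One-time indexing of 1M document tokens:
print(f"${embedding_cost(1_000_000):.2f}")          # $0.02
# 1,000 queries/day at ~500 tokens each, over a 30-day month:
print(f"${embedding_cost(1_000 * 500 * 30):.2f}")   # $0.30
```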
2. Vector Search Costs
Vector search costs depend on your infrastructure:
- Pinecone: Free tier (100K vectors), $70/mo for 1M vectors
- Weaviate Cloud: Free tier (100K vectors), $25/mo for 1M vectors
- Qdrant Cloud: Free tier (1GB), $25/mo for 1GB+
- Self-hosted (pgvector): Free (just your server costs)
For most startups, the free tier is sufficient initially. At scale, vector search typically runs $25-70/month — a roughly fixed cost that grows with the number of stored vectors rather than with query volume.
3. Generation Costs (The Big One)
Generation is where most of your RAG budget goes. A typical RAG query sends:
- System prompt: ~500 tokens
- Retrieved context: 3-5 chunks × ~300 tokens = ~1,200 tokens
- User query: ~100 tokens
- Total input: ~1,800 tokens per query
- Output: ~300-500 tokens per response
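Those token counts translate directly into a per-query price. A minimal sketch, using the Gemini Flash and GPT-4o prices quoted later in this article and assuming Claude Sonnet 4's published list price of $3/$15 per 1M tokens (verify against current provider price lists before budgeting):

```python
# Per-query generation cost. Prices are (input, output) in USD per 1M tokens;
# these are the figures used in this article, not live provider data.
GEN_PRICE = {
    "gemini-2.0-flash": (0.10, 0.40),
    "gpt-4o": (2.50, 10.00),
    "claude-sonnet-4": (3.00, 15.00),
}

def generation_cost(model: str, input_tokens: int = 1_800,
                    output_tokens: int = 400) -> float:
    """USD cost of one RAG generation call with the typical token counts above."""
    in_price, out_price = GEN_PRICE[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

for model in GEN_PRICE:
    print(f"{model}: ${generation_cost(model):.5f}/query")
```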
Total RAG Cost Comparison
Let's calculate the total monthly cost for a RAG application serving 1,000 queries per day (30,000/month).
Monthly RAG Cost — 1,000 queries/day (Budget Setup)
Embedding: text-embedding-3-small ($0.30/mo) | Generation: Gemini 2.0 Flash (~$10.20/mo) | Vector search: free tier ($0) | Total: ~$10.50/month
Monthly RAG Cost — 1,000 queries/day (Quality Setup)
Embedding: text-embedding-3-large (~$1.95/mo) | Generation: GPT-4o (~$255/mo) | Vector search: $25/mo | Total: ~$282/month
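Putting the three cost centers together for a 30,000-query month, under the token assumptions above (`monthly_total` is a hypothetical helper for this article's figures, not a provider API):

```python
# Full monthly pipeline cost: query embedding + generation + vector search.
# Assumes 30,000 queries/month, ~500-token queries, ~1,800 input and
# ~400 output tokens per generation call; prices in USD per 1M tokens.
def monthly_total(embed_price: float, gen_in: float, gen_out: float,
                  vector_fixed: float, queries: int = 30_000,
                  query_tokens: int = 500, input_tokens: int = 1_800,
                  output_tokens: int = 400) -> float:
    embed = queries * query_tokens / 1e6 * embed_price
    gen = queries * (input_tokens / 1e6 * gen_in + output_tokens / 1e6 * gen_out)
    return embed + gen + vector_fixed

budget = monthly_total(0.02, 0.10, 0.40, 0)     # small embeddings, Flash, free tier
quality = monthly_total(0.13, 2.50, 10.00, 25)  # large embeddings, GPT-4o, $25 DB
print(f"Budget: ${budget:.2f}/mo, Quality: ${quality:.2f}/mo")
# Budget: $10.50/mo, Quality: $281.95/mo
```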
Scaling RAG Costs
How generation costs grow with query volume, using the per-query token counts above (~1,800 input, ~400 output) and a 30-day month:
| Queries/Day | Flash (Budget) | GPT-4o (Premium) | Sonnet 4 (Premium) |
|---|---|---|---|
| 100 | $1.02/mo | $25.50/mo | $34.20/mo |
| 1,000 | $10.20/mo | $255/mo | $342/mo |
| 10,000 | $102/mo | $2,550/mo | $3,420/mo |
| 100,000 | $1,020/mo | $25,500/mo | $34,200/mo |
At 100K queries/day, the difference between Flash ($1,020/mo) and Sonnet 4 ($34,200/mo) is over $33,000/month — enough to hire a developer.
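The scaling math can be reproduced in a few lines, under the same assumptions (~1,800 input / ~400 output tokens per query, 30-day months, prices as quoted in this article):

```python
# Monthly generation cost as query volume scales.
# Prices are (input, output) in USD per 1M tokens, taken from this article.
PRICES = {"Flash": (0.10, 0.40), "GPT-4o": (2.50, 10.00), "Sonnet 4": (3.00, 15.00)}

def monthly_generation_cost(queries_per_day: int, in_price: float,
                            out_price: float, input_tokens: int = 1_800,
                            output_tokens: int = 400) -> float:
    per_query = input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price
    return queries_per_day * 30 * per_query

for qpd in (100, 1_000, 10_000, 100_000):
    row = " | ".join(
        f"{name}: ${monthly_generation_cost(qpd, *p):,.2f}"
        for name, p in PRICES.items()
    )
    print(f"{qpd:>7,}/day -> {row}")
```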
How to Reduce RAG Costs
- Use budget models for generation: Gemini Flash or GPT-4o mini handle most RAG queries well — the retrieved context does the heavy lifting
- Optimize chunk size: Smaller chunks = fewer tokens per query. Aim for 200-400 tokens per chunk
- Limit retrieved chunks: 3-5 chunks is usually enough. More chunks = more input tokens
- Cache common queries: If the same question gets asked repeatedly, cache the response
- Compress context: Summarize retrieved chunks before sending to the LLM
- Use hybrid search: Combine vector search with keyword search to improve relevance and reduce the number of chunks needed
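Of these, caching repeated queries is often the quickest win. A minimal exact-match cache sketch — real systems would add TTLs or semantic similarity matching, and `fake_llm` here is a stand-in for the actual generation call:

```python
# Minimal exact-match response cache for repeated RAG queries.
# Illustrative sketch: production systems would normalize queries more
# aggressively or match on embedding similarity.
import hashlib

_cache: dict[str, str] = {}

def cached_answer(query: str, generate) -> str:
    """Return a cached answer when an equivalent query was seen before."""
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = generate(query)  # the expensive LLM call
    return _cache[key]

calls = 0
def fake_llm(q: str) -> str:
    global calls
    calls += 1
    return f"answer to: {q}"

cached_answer("What is RAG?", fake_llm)
cached_answer("what is rag? ", fake_llm)  # normalizes to the same cache key
print(calls)  # prints 1: the second call was served from cache
```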
Recommended RAG Stack by Budget
Startup (< $10/month)
- Embedding: OpenAI text-embedding-3-small ($0.02/1M)
- Vector DB: Pinecone free tier or pgvector
- Generation: Gemini 2.0 Flash ($0.10/$0.40)
Growth ($50-200/month)
- Embedding: OpenAI text-embedding-3-large ($0.13/1M)
- Vector DB: Pinecone or Qdrant ($25-70/mo)
- Generation: GPT-4o ($2.50/$10.00) for quality, Flash for volume
Enterprise ($500+/month)
- Embedding: Custom fine-tuned embeddings
- Vector DB: Dedicated cluster with replication
- Generation: Claude Sonnet 4 for complex queries, Flash for simple ones
- Caching: Redis cache for common queries
Calculate your RAG pipeline costs. Enter your exact usage and see what each model would cost.
Try the APIpulse Calculator or Compare Models Side-by-Side