The True Cost of RAG: LLM Pricing for Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) has become the standard architecture for building AI applications that need to reference specific documents, knowledge bases, or real-time data. But RAG pipelines have multiple cost components — and most developers only think about the generation step.
Let's break down the total cost of a RAG pipeline and compare pricing across providers.
The RAG Pipeline: 3 Cost Centers
Every RAG query involves three billable steps:
- Embedding: Convert the user's query (and your documents) into vectors
- Vector search: Find the most relevant documents (usually self-hosted or fixed cost)
- Generation: Use the retrieved context to generate an answer
Let's price each component.
1. Embedding Costs
You need embeddings for two things: indexing your documents (one-time) and embedding each user query (per-request).
| Model | Cost per 1M tokens | Dimensions | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.02 | 1536 | Budget, high-volume |
| OpenAI text-embedding-3-large | $0.13 | 3072 | Higher accuracy |
| Cohere embed-v4 | $0.10 | 1024 | Multilingual |
| Google text-embedding-005 | $0.02 | 768 | Google ecosystem |
Key insight: Embedding is cheap. At $0.02 per 1M tokens, embedding 1M tokens of documents costs just $0.02. Even embedding 1,000 queries per day (at 500 tokens each) costs only $0.30/month.
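The arithmetic above is easy to sanity-check in a few lines. This helper is an illustrative sketch using the per-1M-token prices from the table, not a billing API:

```python
# Embedding cost estimator using the per-1M-token prices from the table above.
EMBED_PRICE_PER_1M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def embedding_cost(tokens: int, model: str = "text-embedding-3-small") -> float:
    """USD cost to embed the given number of tokens with the given model."""
    return tokens / 1_000_000 * EMBED_PRICE_PER_1M[model]

# One-time indexing of 1M document tokens:
print(f"${embedding_cost(1_000_000):.2f}")          # $0.02
# 1,000 queries/day at ~500 tokens each, over a 30-day month:
print(f"${embedding_cost(1_000 * 500 * 30):.2f}")   # $0.30
```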
2. Vector Search Costs
Vector search costs depend on your infrastructure:
- Pinecone: Free tier (100K vectors), $70/mo for 1M vectors
- Weaviate Cloud: Free tier (100K vectors), $25/mo for 1M vectors
- Qdrant Cloud: Free tier (1GB), $25/mo for 1GB+
- Self-hosted (pgvector): Free (just your server costs)
For most startups, the free tier is sufficient initially. At scale, vector search typically runs $25-70/month — a roughly fixed cost that grows with the number of stored vectors rather than with query volume.
3. Generation Costs (The Big One)
Generation is where most of your RAG budget goes. A typical RAG query sends:
- System prompt: ~500 tokens
- Retrieved context: 3-5 chunks × ~300 tokens = ~1,200 tokens
- User query: ~100 tokens
- Total input: ~1,800 tokens per query
- Output: ~300-500 tokens per response
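Those token counts translate directly into a per-query price. A minimal sketch, using the Gemini Flash and GPT-4o prices quoted later in this article and assuming Claude Sonnet 4's published list price of $3/$15 per 1M tokens (verify against current provider price lists before budgeting):

```python
# Per-query generation cost. Prices are (input, output) in USD per 1M tokens;
# these are the figures used in this article, not live provider data.
GEN_PRICE = {
    "gemini-2.0-flash": (0.10, 0.40),
    "gpt-4o": (2.50, 10.00),
    "claude-sonnet-4": (3.00, 15.00),
}

def generation_cost(model: str, input_tokens: int = 1_800,
                    output_tokens: int = 400) -> float:
    """USD cost of one RAG generation call with the typical token counts above."""
    in_price, out_price = GEN_PRICE[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

for model in GEN_PRICE:
    print(f"{model}: ${generation_cost(model):.5f}/query")
```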
Total RAG Cost Comparison
Let's calculate the total monthly cost for a RAG application serving 1,000 queries per day (30,000/month).
Monthly RAG Cost — 1,000 queries/day (Budget Setup)
Embedding: text-embedding-3-small ($0.30/mo) | Generation: Gemini 2.0 Flash (~$10.20/mo) | Vector search: free tier ($0) | Total: ~$10.50/month
Monthly RAG Cost — 1,000 queries/day (Quality Setup)
Embedding: text-embedding-3-large (~$1.95/mo) | Generation: GPT-4o (~$255/mo) | Vector search: $25/mo | Total: ~$282/month
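Putting the three cost centers together for a 30,000-query month, under the token assumptions above (`monthly_total` is a hypothetical helper for this article's figures, not a provider API):

```python
# Full monthly pipeline cost: query embedding + generation + vector search.
# Assumes 30,000 queries/month, ~500-token queries, ~1,800 input and
# ~400 output tokens per generation call; prices in USD per 1M tokens.
def monthly_total(embed_price: float, gen_in: float, gen_out: float,
                  vector_fixed: float, queries: int = 30_000,
                  query_tokens: int = 500, input_tokens: int = 1_800,
                  output_tokens: int = 400) -> float:
    embed = queries * query_tokens / 1e6 * embed_price
    gen = queries * (input_tokens / 1e6 * gen_in + output_tokens / 1e6 * gen_out)
    return embed + gen + vector_fixed

budget = monthly_total(0.02, 0.10, 0.40, 0)     # small embeddings, Flash, free tier
quality = monthly_total(0.13, 2.50, 10.00, 25)  # large embeddings, GPT-4o, $25 DB
print(f"Budget: ${budget:.2f}/mo, Quality: ${quality:.2f}/mo")
# Budget: $10.50/mo, Quality: $281.95/mo
```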
Scaling RAG Costs
How generation costs grow with query volume, using the per-query token counts above (~1,800 input, ~400 output) and a 30-day month:
| Queries/Day | Flash (Budget) | GPT-4o (Premium) | Sonnet 4 (Premium) |
|---|---|---|---|
| 100 | $1.02/mo | $25.50/mo | $34.20/mo |
| 1,000 | $10.20/mo | $255/mo | $342/mo |
| 10,000 | $102/mo | $2,550/mo | $3,420/mo |
| 100,000 | $1,020/mo | $25,500/mo | $34,200/mo |
At 100K queries/day, the difference between Flash ($1,020/mo) and Sonnet 4 ($34,200/mo) is over $33,000/month — enough to hire a developer.
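The scaling math can be reproduced in a few lines, under the same assumptions (~1,800 input / ~400 output tokens per query, 30-day months, prices as quoted in this article):

```python
# Monthly generation cost as query volume scales.
# Prices are (input, output) in USD per 1M tokens, taken from this article.
PRICES = {"Flash": (0.10, 0.40), "GPT-4o": (2.50, 10.00), "Sonnet 4": (3.00, 15.00)}

def monthly_generation_cost(queries_per_day: int, in_price: float,
                            out_price: float, input_tokens: int = 1_800,
                            output_tokens: int = 400) -> float:
    per_query = input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price
    return queries_per_day * 30 * per_query

for qpd in (100, 1_000, 10_000, 100_000):
    row = " | ".join(
        f"{name}: ${monthly_generation_cost(qpd, *p):,.2f}"
        for name, p in PRICES.items()
    )
    print(f"{qpd:>7,}/day -> {row}")
```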
How to Reduce RAG Costs
- Use budget models for generation: Gemini Flash or GPT-4o mini handle most RAG queries well — the retrieved context does the heavy lifting
- Optimize chunk size: Smaller chunks = fewer tokens per query. Aim for 200-400 tokens per chunk
- Limit retrieved chunks: 3-5 chunks is usually enough. More chunks = more input tokens
- Cache common queries: If the same question gets asked repeatedly, cache the response
- Compress context: Summarize retrieved chunks before sending to the LLM
- Use hybrid search: Combine vector search with keyword search to improve relevance and reduce the number of chunks needed
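Of these, caching repeated queries is often the quickest win. A minimal exact-match cache sketch — real systems would add TTLs or semantic similarity matching, and `fake_llm` here is a stand-in for the actual generation call:

```python
# Minimal exact-match response cache for repeated RAG queries.
# Illustrative sketch: production systems would normalize queries more
# aggressively or match on embedding similarity.
import hashlib

_cache: dict[str, str] = {}

def cached_answer(query: str, generate) -> str:
    """Return a cached answer when an equivalent query was seen before."""
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = generate(query)  # the expensive LLM call
    return _cache[key]

calls = 0
def fake_llm(q: str) -> str:
    global calls
    calls += 1
    return f"answer to: {q}"

cached_answer("What is RAG?", fake_llm)
cached_answer("what is rag? ", fake_llm)  # normalizes to the same cache key
print(calls)  # prints 1: the second call was served from cache
```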
Recommended RAG Stack by Budget
Startup (< $10/month)
- Embedding: OpenAI text-embedding-3-small ($0.02/1M)
- Vector DB: Pinecone free tier or pgvector
- Generation: Gemini 2.0 Flash ($0.10/$0.40)
Growth ($50-200/month)
- Embedding: OpenAI text-embedding-3-large ($0.13/1M)
- Vector DB: Pinecone or Qdrant ($25-70/mo)
- Generation: GPT-4o ($2.50/$10.00) for quality, Flash for volume
Enterprise ($500+/month)
- Embedding: Custom fine-tuned embeddings
- Vector DB: Dedicated cluster with replication
- Generation: Claude Sonnet 4 for complex queries, Flash for simple ones
- Caching: Redis cache for common queries
Calculate your RAG pipeline costs. Enter your exact usage and see what each model would cost.
Try the APIpulse Calculator or Compare Models Side-by-Side