
The True Cost of RAG: LLM Pricing for Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) has become the standard architecture for building AI applications that need to reference specific documents, knowledge bases, or real-time data. But RAG pipelines have multiple cost components — and most developers only think about the generation step.

Let's break down the total cost of a RAG pipeline and compare pricing across providers.

The RAG Pipeline: 3 Cost Centers

User Query → Embedding → Vector Search → Context Assembly → Generation → Response
                $$$           $$                                $$$$

Every RAG query involves three billable steps:

  1. Embedding: Convert the user's query (and your documents) into vectors
  2. Vector search: Find the most relevant documents (usually self-hosted or fixed cost)
  3. Generation: Use the retrieved context to generate an answer

Let's price each component.

1. Embedding Costs

You need embeddings for two things: indexing your documents (one-time) and embedding each user query (per-request).

| Model | Cost per 1M tokens | Dimensions | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.02 | 1536 | Budget, high-volume |
| OpenAI text-embedding-3-large | $0.13 | 3072 | Higher accuracy |
| Cohere embed-v4 | $0.10 | 1024 | Multilingual |
| Google text-embedding-005 | $0.02 | 768 | Google ecosystem |

Key insight: Embedding is cheap. At $0.02 per 1M tokens, embedding 1M tokens of documents costs just $0.02. Even embedding 1,000 queries per day (at 500 tokens each) costs only $0.30/month.
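As a sanity check on those numbers, here is a minimal Python sketch of the embedding cost math, with the price hard-coded from the table above:

```python
def embedding_cost(total_tokens: int, price_per_million: float = 0.02) -> float:
    """Cost in dollars; default price is text-embedding-3-small ($0.02 / 1M tokens)."""
    return total_tokens / 1_000_000 * price_per_million

# One-time indexing of 1M tokens of documents: $0.02
index_cost = embedding_cost(1_000_000)

# 1,000 queries/day x 500 tokens each, over 30 days: $0.30/month
monthly_query_cost = embedding_cost(1_000 * 500 * 30)
```

Swap in a different `price_per_million` to price the other models in the table.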

2. Vector Search Costs

Vector search costs depend on your infrastructure. Self-hosting an open-source option such as pgvector or Chroma costs only the compute you already pay for, while managed services such as Pinecone, Weaviate, or Qdrant offer free starter tiers and paid plans roughly in the $25-70/month range.

For most startups, the free tier is sufficient initially. At scale, vector search typically costs $25-70/month, a fixed cost regardless of query volume.

3. Generation Costs (The Big One)

Generation is where most of your RAG budget goes. A typical RAG query sends:

  1. A system prompt with instructions
  2. The retrieved context (typically 3-5 chunks of 200-400 tokens each)
  3. The user's question

You then pay again for the generated answer as output tokens. The result: each query can consume a few thousand input tokens even when the question itself is one sentence long.
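Per-query generation cost follows a simple shape: input and output tokens are billed at different rates. A sketch with illustrative token counts and prices (the exact assumptions behind the tables below aren't spelled out, so treat these specific numbers as placeholders):

```python
def generation_cost(tokens_in: int, tokens_out: int,
                    price_in: float, price_out: float) -> float:
    """Per-query cost in dollars; prices are $ per 1M tokens."""
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

# Illustrative query: ~1,500 input tokens (system prompt + 4 retrieved
# chunks + question) and 300 output tokens, at assumed budget-model
# prices of $0.15/1M input and $0.60/1M output.
per_query = generation_cost(1_500, 300, 0.15, 0.60)
monthly = per_query * 1_000 * 30  # 1,000 queries/day
```

Notice that input tokens dominate: the retrieved context is usually several times larger than the answer.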
Total RAG Cost Comparison

Let's calculate the total monthly cost for a RAG application serving 1,000 queries per day (30,000/month).

Monthly RAG Cost — 1,000 queries/day (Budget Setup)

Embedding: text-embedding-3-small | Generation: Budget model | Vector search: Free tier

| Component | Monthly Cost |
|---|---|
| Embedding (30K queries × 500 tokens) | $0.30 |
| Vector search (free tier) | $0.00 |
| Generation — Gemini 2.0 Flash | $1.08 |
| Generation — GPT-4o mini | $1.62 |
| Generation — Claude Haiku 4.5 | $10.80 |

Monthly RAG Cost — 1,000 queries/day (Quality Setup)

Embedding: text-embedding-3-large | Generation: Premium model | Vector search: $25/mo

| Component | Monthly Cost |
|---|---|
| Embedding (30K queries × 500 tokens) | $1.95 |
| Vector search | $25.00 |
| Generation — GPT-4o | $40.50 |
| Generation — Claude Sonnet 4 | $67.50 |
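Both setups follow the same formula: per-query costs scale with volume, while vector search is a flat monthly fee. A sketch of the total (the per-query figures below are back-of-envelope values implied by the numbers above, not quotes from any provider):

```python
def monthly_rag_cost(queries_per_day: int,
                     embed_per_query: float,
                     gen_per_query: float,
                     vector_search_monthly: float) -> float:
    """Total monthly dollars: variable per-query costs plus a flat search fee."""
    return queries_per_day * 30 * (embed_per_query + gen_per_query) + vector_search_monthly

# Budget setup at 1,000 queries/day: $0.30/mo of query embeddings,
# $1.08/mo of Gemini Flash generation, free-tier vector search.
budget_total = monthly_rag_cost(1_000, 0.30 / 30_000, 1.08 / 30_000, 0.0)
```

Because the search fee is flat, it dominates at low volume and becomes negligible at high volume, which is exactly the pattern the scaling table below shows for the variable costs.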

Scaling RAG Costs

How costs grow with query volume:

| Queries/Day | Flash (Budget) | GPT-4o (Premium) | Sonnet 4 (Premium) |
|---|---|---|---|
| 100 | $0.11/mo | $4.05/mo | $6.75/mo |
| 1,000 | $1.08/mo | $40.50/mo | $67.50/mo |
| 10,000 | $10.80/mo | $405/mo | $675/mo |
| 100,000 | $108/mo | $4,050/mo | $6,750/mo |

At 100K queries/day, the difference between Flash ($108/mo) and Sonnet 4 ($6,750/mo) is $6,642/month — enough to hire a developer.

How to Reduce RAG Costs

  1. Use budget models for generation: Gemini Flash or GPT-4o mini handle most RAG queries well — the retrieved context does the heavy lifting
  2. Optimize chunk size: Smaller chunks = fewer tokens per query. Aim for 200-400 tokens per chunk
  3. Limit retrieved chunks: 3-5 chunks is usually enough. More chunks = more input tokens
  4. Cache common queries: If the same question gets asked repeatedly, cache the response
  5. Compress context: Summarize retrieved chunks before sending to the LLM
  6. Use hybrid search: Combine vector search with keyword search to improve relevance and reduce the number of chunks needed
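Point 4 is the easiest win to implement. A minimal in-memory cache sketch (a real deployment would likely use Redis or similar with a TTL; `fake_llm` below is a stand-in for a billable provider call):

```python
import hashlib

_cache: dict[str, str] = {}

def cached_answer(query: str, generate) -> str:
    """Return a cached response when the (normalized) query was seen before."""
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = generate(query)  # the billable call happens only on a miss
    return _cache[key]

calls = 0
def fake_llm(query: str) -> str:
    global calls
    calls += 1
    return f"answer to: {query}"

cached_answer("What is RAG?", fake_llm)
cached_answer("  what is rag?", fake_llm)  # normalizes to the same key: cache hit
# calls == 1: the second request cost nothing.
```

Even light normalization (trimming whitespace, lowercasing) meaningfully raises the hit rate for FAQ-style traffic.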

Recommended RAG Stack by Budget

Startup (< $10/month)

Gemini 2.0 Flash or GPT-4o mini for generation, text-embedding-3-small for embeddings, and a free-tier vector database. Roughly $1-2/month at 1,000 queries/day.

Growth ($50-200/month)

GPT-4o or Claude Sonnet 4 for generation, text-embedding-3-large for embeddings, and a paid vector search plan ($25-70/month).

Enterprise ($500+/month)

Premium generation models at high query volume, with caching, context compression, and hybrid search to keep per-query costs under control.

Calculate your RAG pipeline costs. Enter your exact usage and see what each model would cost.

Try the APIpulse Calculator or Compare Models Side-by-Side