Cheapest AI API by Use Case: Chatbots, Code Gen, RAG & More

42 models compared across 7 real-world workloads. Stop guessing — find the cheapest AI API for your exact use case with per-request cost breakdowns.

"What's the cheapest AI API?" is the wrong question. The right question is: "What's the cheapest AI API for what I'm building?"

A chatbot that sends short messages and gets short replies has completely different cost drivers than a RAG system that processes 10,000 tokens of context per query. The cheapest model for one workload can be 10× more expensive for another.

This guide breaks down the cheapest AI API for 7 common use cases — with real per-request costs calculated from current pricing data across all 42 models.

💡 Key insight: Output tokens cost more than input tokens on nearly every model in this guide, typically 2-8× more. The cheapest API for your use case depends on your input/output ratio, not just the per-token price.

🤖 Chatbots & Conversational AI

🏆 Winner: DeepSeek V4 Flash

Typical workload: 500 input tokens (system prompt + history) → 200 output tokens (reply) per turn. 10 turns per conversation.

| Model | Input | Output | Cost/Conversation | Savings vs GPT-5 |
| --- | --- | --- | --- | --- |
| DeepSeek V4 Flash | $0.14/M | $0.28/M | $0.0013 | ↓ 95% |
| Llama 3.1 8B | $0.10/M | $0.10/M | $0.0007 | ↓ 97% |
| Gemini 2.5 Flash-Lite | $0.10/M | $0.40/M | $0.0013 | ↓ 95% |
| GPT-5 mini | $0.25/M | $2.00/M | $0.0053 | ↓ 80% |
| Claude Haiku 4.5 | $1.00/M | $5.00/M | $0.0150 | ↓ 43% |
| GPT-5 | $1.25/M | $10.00/M | $0.0263 | baseline |

Verdict: For chatbots, DeepSeek V4 Flash delivers solid conversation quality at about $0.0013/conversation (5,000 input + 2,000 output tokens over 10 turns), 95% cheaper than GPT-5. Llama 3.1 8B is even cheaper but with noticeably lower response quality for complex conversations. If you need GPT-5-level quality, DeepSeek V4 Pro ($0.435/M input, $0.87/M output, about $0.0039/conversation) is still 85% cheaper.

When to spend more: Customer-facing chatbots handling sensitive topics (healthcare, finance) benefit from Claude Haiku 4.5 or GPT-5 mini's better instruction-following and safety guardrails.

💻 Code Generation & Completion

🏆 Winner: DeepSeek V4 Pro

Typical workload: 2,000 input tokens (prompt + context) → 4,000 output tokens (generated code). Code generation is output-heavy — output tokens dominate cost.

| Model | Input | Output | Cost/Request | Savings vs Codex |
| --- | --- | --- | --- | --- |
| DeepSeek V4 Pro | $0.435/M | $0.87/M | $0.0044 | ↓ 93% |
| DeepSeek V4 Flash | $0.14/M | $0.28/M | $0.0014 | ↓ 98% |
| Mistral Large 3 | $0.50/M | $1.50/M | $0.0070 | ↓ 88% |
| Gemini 3 Flash | $0.50/M | $3.00/M | $0.0130 | ↓ 78% |
| Claude Sonnet 4.6 | $3.00/M | $15.00/M | $0.0660 | ↑ 11% |
| GPT-5.3 Codex | $1.75/M | $14.00/M | $0.0595 | baseline |

Verdict: DeepSeek V4 Pro is the clear winner for code generation: 93% cheaper than GPT-5.3 Codex with comparable code quality for most languages. Its output token pricing ($0.87/M) is exceptionally low for code workloads where you're generating thousands of tokens per request.

When to spend more: Complex multi-file refactoring or code requiring deep reasoning about large codebases benefits from Claude Sonnet 4.6's superior context handling. For simple completions and boilerplate, DeepSeek V4 Flash at $0.0014/request is hard to beat.

📚 RAG (Retrieval-Augmented Generation)

🏆 Winner: DeepSeek V4 Pro (quality) / Gemini 2.5 Flash-Lite (volume)

Typical workload: 10,000 input tokens (retrieved context) → 1,000 output tokens (answer). RAG is input-heavy — large context windows, shorter outputs.

| Model | Input | Output | Cost/Query | Savings vs GPT-5 |
| --- | --- | --- | --- | --- |
| Gemini 2.5 Flash-Lite | $0.10/M | $0.40/M | $0.0014 | ↓ 94% |
| DeepSeek V4 Flash | $0.14/M | $0.28/M | $0.0017 | ↓ 93% |
| DeepSeek V4 Pro | $0.435/M | $0.87/M | $0.0052 | ↓ 77% |
| Gemini 3 Flash | $0.50/M | $3.00/M | $0.0080 | ↓ 64% |
| Claude Haiku 4.5 | $1.00/M | $5.00/M | $0.0150 | ↓ 33% |
| GPT-5 | $1.25/M | $10.00/M | $0.0225 | baseline |

Verdict: RAG's input-heavy nature makes cheap input tokens critical. Gemini 2.5 Flash-Lite ($0.10/M input) is 94% cheaper than GPT-5 for RAG queries. If you need higher answer quality, DeepSeek V4 Pro at $0.0052/query is still 77% cheaper than GPT-5, with better reasoning on complex retrieved context.

Pro tip: For high-volume RAG (10K+ queries/day), consider a tiered approach: route simple factual queries to Flash-Lite and complex analytical queries to DeepSeek V4 Pro. If roughly 80% of your queries are simple, the blended cost lands about 90% below GPT-5 while maintaining quality where it matters.
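
The tiered approach can be sketched in a few lines of Python. This is a minimal illustration, not a production router: the keyword heuristic in `route`, the model label strings, and the 80/20 simple-to-complex traffic split are all assumptions made for the example. Prices and token counts come from the RAG workload above.

```python
# Prices in USD per 1M tokens (input, output), taken from the RAG table.
# The keys are illustrative labels, not official API model IDs.
PRICES = {
    "gemini-2.5-flash-lite": (0.10, 0.40),
    "deepseek-v4-pro": (0.435, 0.87),
    "gpt-5": (1.25, 10.00),
}
CHEAP, STRONG, BASELINE = "gemini-2.5-flash-lite", "deepseek-v4-pro", "gpt-5"


def query_cost(model: str, input_tokens: int = 10_000, output_tokens: int = 1_000) -> float:
    """Cost in USD of one RAG query at the workload size used in this section."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def route(query: str) -> str:
    """Hypothetical heuristic: long or analytical questions go to the stronger model."""
    analytical_words = ("why", "compare", "analyze", "explain", "trade-off")
    text = query.lower()
    if len(text.split()) > 25 or any(word in text for word in analytical_words):
        return STRONG
    return CHEAP


def blended_cost(simple_share: float) -> float:
    """Average cost per query when `simple_share` of traffic goes to the cheap model."""
    return simple_share * query_cost(CHEAP) + (1 - simple_share) * query_cost(STRONG)


if __name__ == "__main__":
    per_query = blended_cost(0.80)  # assumed 80% simple / 20% complex split
    savings = 1 - per_query / query_cost(BASELINE)
    print(f"blended: ${per_query:.4f}/query, {savings:.0%} cheaper than GPT-5")
```

Under the assumed 80/20 split the blended cost works out to about $0.0022 per query. In practice the split is something you would measure from your own traffic, and a real router would likely use a classifier or retrieval signals rather than keywords.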

📝 Text Summarization

🏆 Winner: Gemini 2.5 Flash-Lite

Typical workload: 5,000 input tokens (document) → 300 output tokens (summary). Summarization is the most input-heavy common workload.

| Model | Input | Output | Cost/Document | Savings vs GPT-5 |
| --- | --- | --- | --- | --- |
| Gemini 2.5 Flash-Lite | $0.10/M | $0.40/M | $0.0006 | ↓ 93% |
| DeepSeek V4 Flash | $0.14/M | $0.28/M | $0.0008 | ↓ 92% |
| Llama 3.1 8B | $0.10/M | $0.10/M | $0.0005 | ↓ 94% |
| Mistral Small 4 | $0.10/M | $0.30/M | $0.0006 | ↓ 94% |
| Claude Haiku 4.5 | $1.00/M | $5.00/M | $0.0065 | ↓ 30% |
| GPT-5 | $1.25/M | $10.00/M | $0.0093 | baseline |

Verdict: Summarization is input-dominated, making cheap input tokens everything. Gemini 2.5 Flash-Lite at $0.10/M input is 93% cheaper than GPT-5. For summarizing 1,000 documents/day, you're looking at about $0.62/day vs $9.25/day, saving roughly $260/month on a single workload.

Quality note: For simple extractive summaries (pull key points), Flash-Lite is excellent. For abstractive summaries requiring deep understanding (rephrase, synthesize, analyze), Claude Haiku 4.5 produces noticeably better results at roughly 10× the cost of Flash-Lite, which is still about 30% cheaper than GPT-5.

🔢 Embeddings & Vector Search

🏆 Winner: OpenAI text-embedding-3-small

Typical workload: 500 input tokens per document, no output tokens. Pure embedding generation for vector databases, semantic search, and classification.

| Model | Price | Cost/1M Docs | Notes |
| --- | --- | --- | --- |
| OpenAI text-embedding-3-small | $0.02/M tokens | $10 | 1536 dimensions, great quality |
| Cohere embed-english-v3.0 | $0.10/M tokens | $50 | 1024 dimensions, excellent for search |
| Mistral Small 4 (as embedder) | $0.10/M input | $50 | Can do embed + generate in one call |
| Voyage AI voyage-3 | $0.06/M tokens | $30 | 1024 dimensions, strong retrieval |

Verdict: OpenAI's embedding model is the cheapest dedicated option at $0.02/M tokens. For a database of 1M documents (500 tokens each), embedding costs just $10. The real cost of embeddings is usually the vector database hosting, not the embedding API.
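
The embedding arithmetic is simple enough to check yourself. The sketch below uses the per-million-token prices of the three dedicated embedding models from the table; the function name and the dictionary are made up for this example.

```python
def embedding_cost(num_docs: int, tokens_per_doc: int, price_per_million: float) -> float:
    """One-time cost in USD to embed a corpus (input tokens only, no output)."""
    return num_docs * tokens_per_doc * price_per_million / 1_000_000


# USD per 1M tokens, from the table above.
EMBEDDING_PRICES = {
    "text-embedding-3-small": 0.02,
    "voyage-3": 0.06,
    "embed-english-v3.0": 0.10,
}

if __name__ == "__main__":
    for model, price in EMBEDDING_PRICES.items():
        cost = embedding_cost(num_docs=1_000_000, tokens_per_doc=500, price_per_million=price)
        print(f"{model}: ${cost:,.2f} per 1M documents")
```

Re-embedding after a model change or a chunking change costs the same amount again, so it is worth multiplying by however many times you expect to rebuild the index.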

Pro tip: If you're already using a chat/completion model for RAG, some providers (Mistral, Cohere) offer embedding and generation under the same API and account, which can simplify your stack: one SDK, one set of credentials, one bill.

✍️ Content Generation (Marketing, Copywriting)

🏆 Winner: DeepSeek V4 Pro

Typical workload: 500 input tokens (brief) → 2,000 output tokens (article/copy). Content generation is output-heavy with high creative requirements.

| Model | Input | Output | Cost/Piece | Quality Rating |
| --- | --- | --- | --- | --- |
| DeepSeek V4 Pro | $0.435/M | $0.87/M | $0.0020 | ⭐⭐⭐⭐ |
| Gemini 3 Flash | $0.50/M | $3.00/M | $0.0063 | ⭐⭐⭐⭐ |
| Claude Sonnet 4.6 | $3.00/M | $15.00/M | $0.0315 | ⭐⭐⭐⭐⭐ |
| GPT-5 | $1.25/M | $10.00/M | $0.0206 | ⭐⭐⭐⭐½ |
| Claude Opus 4.8 | $5.00/M | $25.00/M | $0.0525 | ⭐⭐⭐⭐⭐ |

Verdict: DeepSeek V4 Pro at about $0.002 per piece of content is 90% cheaper than GPT-5 and produces surprisingly good marketing copy. For brand-sensitive content where tone and voice matter most, Claude Sonnet 4.6 is worth the roughly 16× premium: its writing quality is noticeably more natural and engaging.

Volume math: If you're generating 100 pieces of content/month, DeepSeek V4 Pro costs about $0.20. Claude Sonnet 4.6 costs $3.15. GPT-5 costs about $2.06. The quality difference between DeepSeek and GPT-5 is much smaller than the price difference.

🔍 Data Extraction & Structured Output

🏆 Winner: Gemini 3 Flash

Typical workload: 3,000 input tokens (document) → 500 output tokens (extracted JSON/data). Moderately input-heavy by token count, but the deciding factor is reliable structured output formatting.

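| Model | Input | Output | Cost/Extraction | JSON Reliability |
| --- | --- | --- | --- | --- |
| Gemini 3 Flash | $0.50/M | $3.00/M | $0.0030 | ⭐⭐⭐⭐⭐ |
| DeepSeek V4 Pro | $0.435/M | $0.87/M | $0.0017 | ⭐⭐⭐⭐ |
| GPT-5 mini | $0.25/M | $2.00/M | $0.0018 | ⭐⭐⭐⭐⭐ |
| Claude Haiku 4.5 | $1.00/M | $5.00/M | $0.0055 | ⭐⭐⭐⭐⭐ |
| GPT-5 | $1.25/M | $10.00/M | $0.0088 | ⭐⭐⭐⭐⭐ |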

Verdict: DeepSeek V4 Pro is cheapest per-extraction ($0.0017) but Gemini 3 Flash ($0.003) has better structured output reliability — critical when you're parsing extracted JSON in production. GPT-5 mini ($0.0018) offers excellent JSON reliability at near-DeepSeek prices.

Reliability tip: For data extraction, JSON validity matters more than raw cost. A model that produces invalid JSON 5% of the time costs you in retry logic, error handling, and downstream failures. Pay the small premium for models with proven structured output (Gemini 3 Flash, GPT-5 mini, Claude Haiku 4.5).
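
To make the retry overhead concrete, here is a minimal sketch of the validate-and-retry loop that less reliable models force you to write. `call_model` is a hypothetical callable standing in for whatever client you use (it takes a prompt and returns the raw response text), and the "schema check" is just a required-keys test; a real pipeline would likely validate against a full schema.

```python
import json
from typing import Callable, Iterable


def extract_json(
    call_model: Callable[[str], str],  # hypothetical: prompt in, raw text out
    prompt: str,
    required_keys: Iterable[str],
    max_attempts: int = 3,
) -> tuple[dict, int]:
    """Call the model until it returns a JSON object containing the required keys.

    Returns the parsed object and the number of attempts used. Every attempt
    is a billed API call, which is where the hidden cost comes from.
    """
    required = set(required_keys)
    for attempt in range(1, max_attempts + 1):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # invalid JSON: pay for another call
        if isinstance(data, dict) and required.issubset(data):
            return data, attempt
    raise ValueError(f"no valid JSON after {max_attempts} attempts")
```

Logging the returned attempt count per model gives you a measured retry rate, which is more useful than a star rating when comparing extraction costs.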

The Input/Output Ratio Rule

The biggest mistake developers make when choosing an AI API: comparing models by per-token price without considering their workload's input/output ratio.

Here's why it matters:

🧮 Quick formula: Monthly cost = (monthly input tokens × input price) + (monthly output tokens × output price). A model with $0.10 input / $10.00 output is NOT cheaper than $1.00 input / $1.00 output if your workload is 50/50. Do the math.

Cost Comparison by Monthly Volume

Here's what your monthly bill looks like across 3 common workload profiles, at different volumes:

🤖 Chatbot (500 in / 200 out per conversation, 10 turns)

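| Volume | DeepSeek V4 Flash | GPT-5 mini | GPT-5 | Savings (V4 Flash vs GPT-5) |
| --- | --- | --- | --- | --- |
| 1K conversations/mo | $1.26 | $5.25 | $26.25 | $24.99/mo |
| 10K conversations/mo | $12.60 | $52.50 | $262.50 | $249.90/mo |
| 100K conversations/mo | $126.00 | $525.00 | $2,625.00 | $2,499.00/mo |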

💻 Code Gen (2K in / 4K out per request)

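| Volume | DeepSeek V4 Pro | Claude Sonnet 4.6 | GPT-5.3 Codex | Savings (V4 Pro vs Codex) |
| --- | --- | --- | --- | --- |
| 1K requests/mo | $4.35 | $66.00 | $59.50 | $55.15/mo |
| 10K requests/mo | $43.50 | $660.00 | $595.00 | $551.50/mo |
| 50K requests/mo | $217.50 | $3,300.00 | $2,975.00 | $2,757.50/mo |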

📚 RAG (10K in / 1K out per query)

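| Volume | Flash-Lite | DeepSeek V4 Pro | GPT-5 | Savings (Flash-Lite vs GPT-5) |
| --- | --- | --- | --- | --- |
| 1K queries/mo | $1.40 | $5.22 | $22.50 | $21.10/mo |
| 10K queries/mo | $14.00 | $52.20 | $225.00 | $211.00/mo |
| 100K queries/mo | $140.00 | $522.00 | $2,250.00 | $2,110.00/mo |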

Find the cheapest AI API for your exact use case

Don't guess — calculate. The APIpulse Recommendation Engine analyzes your use case, quality needs, and volume to recommend the top 3 models with projected monthly costs.


The Hidden Costs Nobody Talks About

Per-token pricing is only part of the equation. These hidden costs can dwarf your API bill:

1. Latency vs Throughput Tradeoff

Cheaper models often have higher latency. If your chatbot takes 5 seconds to respond instead of 1 second, you lose users. Factor in the cost of lost conversions when choosing "the cheapest" option.

2. Retry Costs

Models with lower structured output reliability (some open-weight models) require JSON retry logic. A 5% retry rate on 100K requests/month = 5,000 extra API calls. That's a hidden 5% cost increase.
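
The 5,000 figure is a first-order estimate: it assumes each failed request is retried once and the retry always succeeds. If retries can fail at the same rate, the expected overhead is slightly higher. The sketch below computes both, assuming failures are independent and every retry costs the same as the original call; the function name is made up for this example.

```python
def extra_calls(requests: int, retry_rate: float, retries_can_fail: bool = True) -> float:
    """Expected number of additional API calls caused by retries.

    With retries_can_fail=False each failure is retried exactly once.
    With retries_can_fail=True retries fail at the same rate, so the expected
    number of calls per success is 1 / (1 - retry_rate).
    """
    if not 0 <= retry_rate < 1:
        raise ValueError("retry_rate must be in [0, 1)")
    if retries_can_fail:
        return requests * retry_rate / (1 - retry_rate)
    return requests * retry_rate


if __name__ == "__main__":
    print(extra_calls(100_000, 0.05, retries_can_fail=False))
    print(round(extra_calls(100_000, 0.05)))
```

At a 5% failure rate the difference is small (about 5,263 extra calls instead of 5,000), but it grows quickly: at 20% the overhead is 25%, not 20%.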

3. Context Window Waste

You pay per token, not per context-window size, so an unused 1M-token window costs nothing by itself. The waste comes from paying a long-context model's higher per-token rates when you only send 10K tokens and a cheaper model with a smaller window would suffice.

4. Prompt Engineering Overhead

Cheaper models often need more detailed prompts to match quality. If your engineers spend 10 extra hours/month tweaking prompts to save $50 on API costs, you're losing money.
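
A quick break-even check makes this trade-off explicit. The $75/hour engineering rate below is purely an assumption for illustration; substitute your own loaded cost. The function names are invented for this example.

```python
def net_monthly_savings(api_savings: float, extra_hours: float, hourly_rate: float) -> float:
    """API savings minus the engineering time spent to achieve them (USD/month)."""
    return api_savings - extra_hours * hourly_rate


def break_even_hourly_rate(api_savings: float, extra_hours: float) -> float:
    """Hourly rate above which the switch loses money."""
    return api_savings / extra_hours


if __name__ == "__main__":
    # Example from the text: 10 extra hours/month to save $50 on API costs.
    print(net_monthly_savings(api_savings=50, extra_hours=10, hourly_rate=75))  # assumed rate
    print(break_even_hourly_rate(api_savings=50, extra_hours=10))
```

In the example from the text, the switch only pays off if engineering time costs less than $5/hour.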

How to Switch Models (Without Breaking Everything)

Found a cheaper model? Here's how to switch safely:

  1. A/B test first — Route 10% of traffic to the new model, compare quality metrics (user ratings, task completion, error rates).
  2. Use the same prompt — Most modern models handle similar prompt formats. Test with your existing prompts before rewriting.
  3. Monitor output distribution — If the new model produces longer/shorter outputs, your downstream systems might break.
  4. Keep a fallback — Route failed requests to your original model. The cost of a failed request far exceeds the savings from a cheaper model.
  5. Track per-model costs separately — Use APIpulse's calculator to model costs before switching, then verify against real usage.
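
Steps 1, 4, and 5 can be combined in a small wrapper. The sketch below is one possible shape, not a definitive implementation: the model names are placeholders, `call_model` is a hypothetical callable standing in for your API client, and hashing the user ID is just one way to get a stable 10% split so each user always sees the same model.

```python
import hashlib
from collections import defaultdict
from typing import Callable

BASELINE, CANDIDATE = "current-model", "cheaper-model"  # placeholder names

# Number of requests served by each model, for per-model cost tracking.
request_counts: dict[str, int] = defaultdict(int)


def assign_model(user_id: str, candidate_percent: int = 10) -> str:
    """Deterministically place about candidate_percent% of users on the candidate."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return CANDIDATE if bucket < candidate_percent else BASELINE


def complete(user_id: str, prompt: str,
             call_model: Callable[[str, str], str]) -> tuple[str, str]:
    """Run the prompt on the assigned model, falling back to the baseline on failure.

    call_model(model, prompt) is a hypothetical client function that returns
    the reply text or raises on error.
    """
    model = assign_model(user_id)
    try:
        reply = call_model(model, prompt)
    except Exception:
        if model == BASELINE:
            raise  # nothing left to fall back to
        model = BASELINE
        reply = call_model(model, prompt)
    request_counts[model] += 1
    return model, reply
```

Returning the model name alongside the reply lets you tag quality metrics and token usage by model, which is what makes the A/B comparison and the cost verification possible.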