AI API Glossary
Every term you need to understand LLM pricing, from tokens to fine-tuning. Click any term for a clear definition and real-world examples.
A
API (Application Programming Interface)
A way for software to communicate with an LLM service. You send a request (your prompt) via HTTP and receive a response (the model's output). API pricing is based on the tokens processed in each request.
B
Batch API
A bulk-processing endpoint that accepts many requests and processes them asynchronously, typically at a 50% discount. Ideal for non-time-sensitive workloads like data processing and content generation.
C
Context Window
The maximum number of tokens a model can process in a single request – input and output combined. Larger windows allow processing longer documents but cost more per request.
Credits
Pre-paid balance applied against API usage. Most providers offer initial credits ($5–10) for new accounts. Credits expire after a set period (usually 3–12 months) and are consumed before charging your payment method.
E
Embedding
A numerical vector representation of text, used for semantic search, classification, and RAG systems. Embedding models are separate from generative models and are much cheaper – typically $0.02–0.10 per 1M tokens.
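As a sketch of how embeddings power semantic search: texts are compared by the cosine similarity of their vectors. The three-dimensional vectors below are toy values for illustration (real embedding models return hundreds to thousands of dimensions).

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity between two embedding vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dimensional "embeddings" (illustrative values only).
doc = [0.2, 0.9, 0.1]
query = [0.25, 0.85, 0.05]
unrelated = [0.9, 0.05, 0.4]

# The query vector is closer to the related document than to the unrelated one.
print(cosine_similarity(query, doc) > cosine_similarity(query, unrelated))  # True
```

In a real semantic-search system, you embed every document once, store the vectors in a vector database, and at query time return the documents with the highest similarity to the query embedding.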
Endpoint
The URL you send API requests to. Each model or feature has its own endpoint. For example, OpenAI's chat completions endpoint is api.openai.com/v1/chat/completions.
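A minimal sketch of building a request to that endpoint with Python's standard library; the API key is a placeholder and the model name is illustrative:

```python
import json
import urllib.request

API_KEY = "sk-..."  # placeholder; a real key is required to actually send this

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Assemble a POST request for OpenAI's chat completions endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("gpt-4o-mini", "Say hello in one word.")
# urllib.request.urlopen(req) would send it; skipped here since it needs a real key.
```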
F
Foundation Model
A large pre-trained model that can be adapted to many tasks. Examples: GPT-4o, Claude Sonnet 4, Gemini 2.5 Pro. Foundation models are trained on vast datasets and perform reasoning, coding, analysis, and generation without task-specific training.
Learn more: 2026 Flagship LLM Showdown →
Fine-Tuning
Training a model on your specific data to improve performance for your use case. Has upfront training costs and creates a custom model with its own inference pricing. For most applications, prompt engineering + RAG is more cost-effective.
Learn more: Open Source vs Commercial LLMs →
Free Tier
A usage allowance provided at no cost, typically with rate limits. Google Gemini offers the most generous free tier – unlimited requests with rate limiting. OpenAI and Anthropic offer $5 credits for new accounts.
Learn more: AI API Free Tiers Compared →
Function Calling / Tool Use
The ability of a model to generate structured JSON output that triggers external functions or API calls. This enables LLMs to interact with databases, APIs, and services. Strongest in OpenAI and Anthropic models.
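A minimal sketch of the pattern, with a hypothetical get_weather tool. The tool is described in the JSON Schema style most providers accept, and the model's structured output is dispatched to a local function (the exact request/response shape varies by provider):

```python
import json

# A tool definition in the JSON Schema style used by most providers.
weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> str:
    # Stand-in for a real weather API call.
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch(tool_call_json: str) -> str:
    """Run the local function named in the model's tool-call output."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["arguments"])

# What a model's structured tool-call output might look like:
result = dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}')
print(result)  # Sunny in Paris
```

In a full agent loop, the function's return value is sent back to the model as a tool result so it can compose a final answer.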
Learn more: How to Build an AI Agent on a Budget →
I
Input Tokens
Tokens in your prompt – the text you send to the model. Includes the system prompt, user message, conversation history, and any context. Input tokens are typically cheaper than output tokens.
L
Latency
The time between sending a request and receiving the first token of response (TTFT). Lower latency = faster perceived response. Budget models are typically faster than premium models.
M
Max Tokens (Output Limit)
A parameter you set to limit how many tokens the model can generate. Setting this too low truncates responses; setting it too high wastes money. Typical values: 256–4,096 tokens.
Model Routing
Automatically selecting the cheapest model that can handle each specific task. Simple queries go to budget models; complex reasoning goes to premium models. Can reduce costs by 60–80%.
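A minimal routing sketch with hypothetical model names and a crude keyword heuristic; production routers usually classify tasks with a cheap model or explicit task labels rather than string matching:

```python
# Hypothetical model identifiers - substitute real model names.
BUDGET_MODEL = "gemini-flash"
PREMIUM_MODEL = "claude-sonnet"

# Crude signals that a prompt needs stronger reasoning.
COMPLEX_HINTS = ("prove", "debug", "analyze", "step by step", "refactor")

def choose_model(prompt: str) -> str:
    """Route long or reasoning-heavy prompts to the premium model."""
    text = prompt.lower()
    if len(prompt) > 2000 or any(hint in text for hint in COMPLEX_HINTS):
        return PREMIUM_MODEL
    return BUDGET_MODEL

print(choose_model("Translate 'hello' to French"))             # gemini-flash
print(choose_model("Debug this race condition step by step"))  # claude-sonnet
```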
Model Tier
A classification of models by capability and price. Premium (Claude 4 Opus, GPT-5): maximum quality. Budget (Gemini Flash, GPT-4o mini): minimal cost. Mid-tier (Claude Sonnet 4, GPT-4o): balanced.
Compare all models by tier →
Multimodal
Models that process multiple input types – text, images, audio, and sometimes video. GPT-4o and Gemini 2.5 Pro are fully multimodal. Image inputs cost more than text (typically 2–5x per token).
O
Open Source Model
A model whose weights are publicly available, allowing self-hosting or use through third-party APIs. Examples: Llama (Meta), Mistral. Open models can be cheaper at scale but require infrastructure to run.
Output Tokens
Tokens the model generates in response. Typically 3–10x more expensive than input tokens because generation requires more compute. The model generates one token at a time, sequentially.
P
Per-Token Pricing
The standard billing model for LLM APIs. You're charged based on the number of tokens processed – both input and output. Prices are quoted per 1 million tokens.
Prompt Caching
Storing frequently used prompt prefixes (like system prompts) to avoid re-processing them. Cached tokens are charged at a 50–90% discount. OpenAI applies caching automatically; Anthropic requires marking cache breakpoints in the request.
Prompt Engineering
The practice of designing inputs to get better, cheaper outputs from LLMs. Techniques include clear instructions, few-shot examples, structured output formats, and chain-of-thought reasoning. Can reduce token usage by 30–50%.
Learn more: How to Reduce AI API Costs by 40% →
R
RAG (Retrieval-Augmented Generation)
A technique combining document retrieval with LLM generation. Relevant documents are retrieved from a vector database and included in the prompt, letting the model answer questions about specific data without fine-tuning.
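A toy sketch of the retrieve-then-prompt flow. Real systems retrieve by embedding similarity from a vector database; plain word overlap is used here only to keep the example self-contained, and the documents are invented:

```python
# Invented knowledge-base documents for illustration.
DOCS = [
    "Refunds are processed within 5 business days.",
    "Our office is open Monday to Friday, 9am to 5pm.",
    "Premium plans include priority support.",
]

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k docs sharing the most words with the question.
    Real RAG uses embedding similarity instead of word overlap."""
    q_words = set(question.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str) -> str:
    """Inject the retrieved context into the prompt sent to the model."""
    context = "\n".join(retrieve(question, DOCS))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How long do refunds take?"))
```

Because only the relevant snippet is injected, the prompt stays small even when the underlying document collection is huge, which is why RAG is usually cheaper than stuffing everything into the context window.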
Rate Limit
The maximum number of requests or tokens you can send within a time window (usually per minute or per day). Exceeding rate limits returns a 429 error. Limits vary by model, tier, and spending level.
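A common way to handle 429s is to retry with exponential backoff plus jitter. A sketch, using a simulated rate-limited call in place of a real HTTP client:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the 429 error your HTTP client would raise."""

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """Retry fn() on rate-limit errors, doubling the wait each attempt."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Jitter spreads out retries from concurrent clients.
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)

# Simulate an API that returns 429 twice before succeeding.
calls = {"n": 0}
def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"

print(call_with_backoff(flaky_request, base_delay=0.01))  # ok
```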
Learn more: AI API Rate Limits Compared →
S
Streaming
Sending response tokens to the client as they're generated, rather than waiting for the complete response. Doesn't reduce cost but dramatically improves perceived performance. Most providers support SSE (Server-Sent Events).
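A sketch of consuming such a stream: each SSE line carries a small JSON chunk with a token delta. The event shape below mimics OpenAI's chat-completions stream but is illustrative; other providers differ:

```python
import json

def parse_sse(lines):
    """Yield text deltas from Server-Sent Event lines (OpenAI-style shape)."""
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":  # stream-termination sentinel
            return
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

# Fake stream, shaped like OpenAI's chat-completions SSE events.
stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
print("".join(parse_sse(stream)))  # Hello!
```

In a real client you would print (or forward) each delta as it arrives, which is what makes streamed responses feel fast.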
Learn more: LLM API Latency Benchmarks →
Structured Output
Forcing the model to respond in a specific format (JSON, XML, etc.). Reduces post-processing costs and improves reliability. Some providers have dedicated structured output modes.
System Prompt
A special instruction sent before the user's message that defines the model's behavior, role, and constraints. Included in input token count and persists across the conversation. Sent with every request – can be expensive.
T
Temperature
A parameter (0.0–2.0) controlling randomness. Lower values (0.0–0.3) = deterministic, focused outputs. Higher values (0.7–1.0) = creative, varied outputs. Doesn't affect pricing but impacts output length and quality.
Throughput
The number of tokens generated per second (tok/s). Higher throughput = faster complete responses. Important for batch processing and high-volume applications.
Learn more: LLM API Latency Benchmarks →
Token
The fundamental unit of text processing. Roughly 4 characters or 0.75 words in English. Tokens can be whole words, parts of words, or individual characters. Exact tokenization varies by model.
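The ~4-characters-per-token rule of thumb makes a quick estimator; for exact counts, use the provider's tokenizer (e.g. tiktoken for OpenAI models):

```python
def estimate_tokens(text: str) -> int:
    """Rough English-text estimate: about 4 characters per token.
    Use the provider's tokenizer for exact counts."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # 11
```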
Token Budget
A maximum token limit you set per request or per day to control costs. Prevents runaway costs from infinite loops or unexpectedly long responses. Combine with max_tokens and daily spending limits.
Cost Per 1M Tokens
The standard unit for comparing API prices. Instead of quoting per-token prices (tiny fractions of a cent), providers quote cost per 1 million tokens. To calculate: (tokens ÷ 1,000,000) × price per 1M.
Cost Per Request
The total cost of a single API call. With prices quoted per 1M tokens: (input_tokens × input_price + output_tokens × output_price) ÷ 1,000,000.
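The formula as a small helper; the prices in the example are illustrative, not a quote from any provider:

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_price: float, output_price: float) -> float:
    """Cost in dollars for one API call; prices are per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Illustrative prices: $2.50 input / $10.00 output per 1M tokens.
print(cost_per_request(1_000, 500, 2.50, 10.00))  # 0.0075
```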
Monthly Burn Rate
Your projected monthly API spend based on current usage. Calculated as: average cost per request × requests per day × 30. Monitoring burn rate helps you stay within budget.
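The same arithmetic as a helper, with illustrative numbers:

```python
def monthly_burn_rate(avg_cost_per_request: float, requests_per_day: int) -> float:
    """Projected monthly spend in dollars, assuming a 30-day month."""
    return avg_cost_per_request * requests_per_day * 30

# e.g. $0.0075 per request at 1,000 requests/day:
print(round(monthly_burn_rate(0.0075, 1_000), 2))  # 225.0
```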
Calculate your exact costs. Enter your usage patterns into our calculator to see your monthly burn rate and find the cheapest model for your needs.
Try the APIpulse Calculator or View Full Pricing Index