LLM API Glossary: Every Term You Need to Know (2026)
Whether you're evaluating AI API providers for the first time or optimizing costs for an existing application, understanding the terminology is essential. This glossary covers every term you'll encounter in LLM API pricing, from basic concepts like tokens to advanced topics like batching and streaming.
Table of Contents
Pricing & Billing Terms
Token & Input Terms
Model & Architecture Terms
Performance & Limits Terms
Feature Terms
Cost Optimization Terms

Pricing & Billing Terms
Per-Token Pricing
The standard billing model for LLM APIs. You're charged based on the number of tokens processed, both input (what you send) and output (what the model generates). Prices are typically quoted per 1 million tokens.
Input Tokens
Tokens in your prompt, i.e. the text you send to the model. This includes your system prompt, user message, conversation history, and any context. Input tokens are almost always cheaper than output tokens.
Output Tokens
Tokens the model generates in response. Output tokens are typically 3-10x more expensive than input tokens because generation requires more compute. The model generates one token at a time, sequentially.
Cost Per 1M Tokens
The standard unit for comparing API prices. Instead of quoting per-token prices (which would be tiny fractions of a cent), providers quote cost per 1 million tokens. To calculate your actual cost: (tokens used ÷ 1,000,000) × price per 1M.
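The same arithmetic as a small Python helper (the token count and the $3.00 price are made-up numbers for illustration):

```python
# Convert a per-1M-token price into an actual charge.
def cost(tokens: int, price_per_million: float) -> float:
    return tokens / 1_000_000 * price_per_million

# Example: 45,000 tokens at a hypothetical $3.00 per 1M tokens.
print(cost(45_000, 3.00))  # 0.135 -> about 13.5 cents
```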
Free Tier
A usage allowance provided at no cost, typically with rate limits. Google Gemini offers one of the most generous free tiers: free usage capped by rate limits rather than by a token allowance. OpenAI and Anthropic offer $5 credits for new accounts.
Rate Limit
The maximum number of requests or tokens you can send within a time window (usually per minute or per day). Exceeding rate limits returns a 429 error. Limits vary by model, tier, and spending level.
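A minimal retry sketch for handling 429s, assuming a generic HTTP endpoint (the URL, headers, and payload below are placeholders, not any specific provider's API):

```python
import time
import requests

def post_with_backoff(url, headers, payload, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.post(url, headers=headers, json=payload, timeout=60)
        if resp.status_code != 429:
            return resp
        # Honor a Retry-After header (in seconds) if present;
        # otherwise back off exponentially: 1s, 2s, 4s, ...
        wait = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError("still rate limited after retries")
```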
Credits
Pre-paid balance applied against API usage. Most providers offer initial credits ($5-10) for new accounts. Credits expire after a set period (usually 3-12 months) and are consumed before charging your payment method.
Token & Input Terms
Token
The fundamental unit of text processing in LLMs. A token is roughly 4 characters or 0.75 words in English. Tokens can be whole words, parts of words, or individual characters. The exact tokenization varies by model (GPT uses tiktoken, Claude uses its own tokenizer).
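You can count tokens locally before sending a request. A sketch using OpenAI's tiktoken library (the encoding name varies by model generation; cl100k_base is the GPT-4-era one):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Tokens can be whole words, parts of words, or individual characters."
tokens = enc.encode(text)
print(len(tokens), "tokens for", len(text), "characters")
```

Other providers ship their own counters (Anthropic exposes a token-counting endpoint), so treat one tokenizer's count as an estimate for another's.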
Context Window
The maximum number of tokens a model can process in a single request, including both input and output. This limits how much text you can send and receive. Larger context windows allow processing longer documents but cost more.
System Prompt
A special instruction sent before the user's message that defines the model's behavior, role, and constraints. System prompts count as input tokens and persist across the conversation. They cost the same per token as any other input, but because they're resent with every request, a long system prompt adds a recurring cost to every turn.
Conversation History
The accumulated messages in a chat session. Each request typically includes the full conversation history, so per-request input grows with every turn and cumulative spend grows even faster. A 20-message conversation might use 10,000+ input tokens per request.
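A common mitigation is trimming history to a sliding window. A minimal sketch (the keep_last value is arbitrary; real apps often summarize older turns instead of dropping them):

```python
# Keep the system prompt plus only the most recent exchanges,
# so input tokens stay bounded as the conversation grows.
def trim_history(messages: list, keep_last: int = 10) -> list:
    system, rest = messages[:1], messages[1:]  # assumes messages[0] is the system prompt
    return system + rest[-keep_last:]
```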
Max Tokens (Output Limit)
A parameter you set to limit how many tokens the model can generate in its response. Setting this too low truncates responses; setting it too high wastes money if the model generates unnecessary text. Typical values: 256-4,096 tokens.
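A rough sketch of where the parameter lives in a request (field names differ slightly by provider, e.g. max_tokens vs. max_output_tokens, and the model name here is a placeholder):

```python
payload = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "Summarize this in two sentences."}],
    "max_tokens": 512,  # hard ceiling on generated tokens; also caps worst-case output cost
}
```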
Embedding
A numerical vector representation of text, used for semantic search, classification, and RAG systems. Embedding models are separate from generative models and have their own pricing (typically much cheaper). Embeddings are generated once and reused.
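A toy illustration of how embeddings get compared: cosine similarity between two vectors (real embeddings have hundreds or thousands of dimensions; these 3-D vectors are made up):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine([0.1, 0.9, 0.2], [0.15, 0.85, 0.3]))  # ~0.99, i.e. very similar
```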
Model & Architecture Terms
Foundation Model
A large pre-trained model that can be adapted to many tasks. Examples: GPT-4o, Claude Sonnet 4, Gemini 2.5 Pro. Foundation models are trained on vast datasets and can perform reasoning, coding, analysis, and generation without task-specific training.
Model Tier
A classification of models by capability and price. Premium models (Claude 4 Opus, GPT-5, Gemini 2.5 Pro) offer maximum quality at higher cost. Budget models (Gemini 2.0 Flash, GPT-4o mini, Mistral Small 4) offer good quality at minimal cost. Mid-tier models (Claude Sonnet 4, GPT-4o) balance both.
Open Source Model
A model whose weights are publicly available, allowing self-hosting or use through third-party APIs. Examples: Llama 3.1 (Meta), Mistral (Mistral AI). Open source models can be cheaper at scale but require infrastructure to run.
Temperature
A parameter (0.0-2.0) controlling randomness in responses. Lower values (0.0-0.3) produce more deterministic, focused outputs. Higher values (0.7-1.0) produce more creative, varied outputs. Temperature doesn't affect pricing but impacts output length and quality.
Function Calling / Tool Use
The ability of a model to generate structured JSON output that triggers external functions or API calls. This enables LLMs to interact with databases, APIs, and services. Not all models support this; it's strongest in OpenAI and Anthropic models.
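A sketch of what a tool definition looks like in the JSON-schema style most providers use (exact wrapper fields differ between OpenAI and Anthropic, and get_weather is a hypothetical function):

```python
weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Oslo'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}
# The model replies with structured arguments such as {"city": "Oslo"};
# your code then calls the real function and sends the result back.
```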
Multimodal
Models that can process multiple types of input: text, images, audio, and sometimes video. GPT-4o and Gemini 2.5 Pro are fully multimodal. Image inputs cost more than text (typically 2-5x per token).
Performance & Limits Terms
Latency
The time between sending a request and receiving the first token of response (Time to First Token, or TTFT). Lower latency = faster perceived response. Budget models are typically faster than premium models. Streaming reduces perceived latency by showing tokens as they generate.
Throughput
The number of tokens generated per second (tokens/sec). Higher throughput = faster complete responses. Important for batch processing and high-volume applications. Throughput varies by model, load, and response length.
Streaming
Sending response tokens to the client as they're generated, rather than waiting for the complete response. Streaming doesn't reduce cost (you still pay for all tokens) but dramatically improves perceived performance. Most providers support Server-Sent Events (SSE) for streaming.
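A sketch of consuming an SSE stream with plain requests; the URL is a placeholder, and the chunk format (JSON payloads after a "data: " prefix, with a [DONE] sentinel) follows the common convention but varies by provider:

```python
import requests

with requests.post("https://api.example.com/v1/chat",
                   json={"stream": True}, stream=True, timeout=120) as resp:
    for line in resp.iter_lines():
        if line and line.startswith(b"data: "):
            chunk = line[len(b"data: "):]
            if chunk == b"[DONE]":
                break
            print(chunk.decode(), flush=True)  # render tokens as they arrive
```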
Batch API
A bulk processing endpoint that accepts multiple requests and processes them asynchronously, typically at a 50% discount. Ideal for non-time-sensitive workloads like data processing, content generation, and analysis. OpenAI and Anthropic both offer batch APIs.
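A sketch of preparing a batch input file in the JSONL style OpenAI's Batch API uses (one self-describing request per line; check your provider's docs for the current schema):

```python
import json

docs = ["Summarize document A.", "Summarize document B."]
with open("batch.jsonl", "w") as f:
    for i, text in enumerate(docs):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": text}],
                "max_tokens": 256,
            },
        }) + "\n")
```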
Timeout
The maximum time a provider will wait for a response before returning an error. Longer responses and higher max_tokens settings increase the chance of timeouts. Typical timeout: 60-120 seconds. Set appropriate timeouts in your application.
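A sketch of setting split timeouts with httpx (values are illustrative): connects should fail fast, while long generations need a generous read timeout:

```python
import httpx

timeout = httpx.Timeout(connect=5.0, read=120.0, write=10.0, pool=5.0)
client = httpx.Client(timeout=timeout)
```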
Feature Terms
RAG (Retrieval-Augmented Generation)
A technique that combines document retrieval with LLM generation. Relevant documents are retrieved from a vector database and included in the prompt, allowing the model to answer questions about specific data without fine-tuning. Costs include: embedding, vector search, and generation.
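The pipeline in sketch form; embed(), vector_search(), and generate() are assumed helpers standing in for your embedding model, vector database, and LLM call:

```python
def answer_with_rag(question: str, top_k: int = 3) -> str:
    query_vec = embed(question)               # 1. embed the question (embedding cost)
    docs = vector_search(query_vec, k=top_k)  # 2. retrieve relevant chunks (search cost)
    context = "\n\n".join(d["text"] for d in docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)                   # 3. generate the answer (input + output cost)
```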
Fine-Tuning
Training a model on your specific data to improve performance for your use case. Fine-tuning has upfront training costs and creates a custom model with its own inference pricing. For most applications, prompt engineering + RAG is more cost-effective than fine-tuning.
Prompt Engineering
The practice of designing inputs to get better, cheaper outputs from LLMs. Techniques include: clear instructions, few-shot examples, structured output formats, and chain-of-thought reasoning. Good prompt engineering can reduce token usage by 30-50%.
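For instance, a compact few-shot prompt: two labeled examples steer the model toward terse, parseable answers, keeping both input and output tokens low (the reviews are invented):

```python
prompt = """Classify sentiment as POS or NEG.

Review: "Battery life is fantastic." -> POS
Review: "Screen cracked in a week." -> NEG
Review: "Setup took five minutes and it just works." ->"""
```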
Structured Output
Forcing the model to respond in a specific format (JSON, XML, etc.). Structured output reduces post-processing costs and improves reliability. Some providers (OpenAI, Anthropic) have dedicated structured output modes that are more reliable.
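Even with a dedicated structured-output mode, it's worth validating client-side. A minimal sketch that checks a reply against an expected shape (the field names are arbitrary examples):

```python
import json

def parse_reply(raw: str) -> dict:
    data = json.loads(raw)  # raises if the model returned non-JSON
    for key in ("title", "summary", "tags"):
        if key not in data:
            raise ValueError(f"missing field: {key}")
    return data
```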
Safety / Content Filtering
Built-in moderation that blocks harmful, explicit, or policy-violating content. Most providers include content filtering at no extra cost. Filters can occasionally produce false positives, blocking legitimate requests. Some providers allow adjusting filter sensitivity.
Cost Optimization Terms
Prompt Caching
Storing frequently-used prompt prefixes (like system prompts) to avoid re-processing them. Cached tokens are charged at a discount (typically 50-90% off). OpenAI applies prompt caching automatically for repeated prefixes; Anthropic's caching is opt-in, marked with cache breakpoints in the request.
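Caching only works if the prefix is identical across calls, so structure requests with the stable part first. A sketch (LONG_SYSTEM_PROMPT is a stand-in for several thousand tokens of instructions or reference material):

```python
LONG_SYSTEM_PROMPT = "..."  # imagine a long, never-changing block of instructions

STABLE_PREFIX = [{"role": "system", "content": LONG_SYSTEM_PROMPT}]

def build_messages(question: str) -> list:
    # Stable prefix first (cacheable), variable user content last.
    return STABLE_PREFIX + [{"role": "user", "content": question}]
```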
Model Routing
Automatically selecting the cheapest model that can handle each specific task. Simple queries go to budget models; complex reasoning goes to premium models. Can reduce costs by 60-80% compared to using a single premium model for everything.
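A deliberately crude routing sketch based on prompt length and keywords; the model names are placeholders, and production routers often use a small classifier model instead of heuristics:

```python
def pick_model(prompt: str) -> str:
    hard_signals = ("prove", "analyze", "step by step", "refactor")
    if len(prompt) > 2_000 or any(s in prompt.lower() for s in hard_signals):
        return "premium-model"  # complex reasoning -> expensive model
    return "budget-model"       # simple queries -> cheap, fast model
```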
Token Budget
A maximum token limit you set per request or per day to control costs. Setting token budgets prevents runaway costs from infinite loops or unexpectedly long responses. Combine with max_tokens and daily spending limits for full cost control.
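A sketch of a daily budget guard; the in-memory counter is an assumption (use persistent storage in production):

```python
DAILY_LIMIT = 2_000_000  # tokens per day
used_today = 0

def reserve_tokens(estimated_tokens: int) -> None:
    global used_today
    if used_today + estimated_tokens > DAILY_LIMIT:
        raise RuntimeError("daily token budget exceeded")
    used_today += estimated_tokens
```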
Cost Per Request
The total cost of a single API call, including input tokens, output tokens, and any additional features (function calling, image processing). Calculated as: (input_tokens × input_price + output_tokens × output_price) ÷ 1,000,000.
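The formula as code, with made-up prices ($3 / $15 per 1M tokens):

```python
def cost_per_request(input_tokens, output_tokens, input_price, output_price):
    # Prices are per 1M tokens, matching how providers quote them.
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

print(cost_per_request(1_200, 400, 3.0, 15.0))  # 0.0096 -> about 1 cent
```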
Monthly Burn Rate
Your projected monthly API spend based on current usage patterns. Calculated as: average cost per request ร requests per day ร 30. Monitoring burn rate helps you stay within budget and identify cost spikes early.
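Continuing the hypothetical numbers from the cost-per-request example above:

```python
avg_cost_per_request = 0.0096  # from the example above
requests_per_day = 5_000
print(f"${avg_cost_per_request * requests_per_day * 30:,.2f}/month")  # $1,440.00/month
```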
Calculate your exact costs. Enter your usage patterns into our calculator to see your monthly burn rate and find the cheapest model for your needs.
Try the APIpulse Calculator or View Full Pricing Index