LLM API Glossary: Every Term You Need to Know (2026)
Whether you're evaluating AI API providers for the first time or optimizing costs for an existing application, understanding the terminology is essential. This glossary covers every term you'll encounter in LLM API pricing, from basic concepts like tokens to advanced topics like batching and streaming.
Table of Contents
Pricing & Billing Terms
Token & Input Terms
Model & Architecture Terms
Performance & Limits Terms
Feature Terms
Cost Optimization Terms

Pricing & Billing Terms
Per-Token Pricing
The standard billing model for LLM APIs. You're charged based on the number of tokens processed, both input (what you send) and output (what the model generates). Prices are typically quoted per 1 million tokens.
Input Tokens
Tokens in your prompt, i.e. the text you send to the model. This includes your system prompt, user message, conversation history, and any context. Input tokens are almost always cheaper than output tokens.
Output Tokens
Tokens the model generates in response. Output tokens are typically 3-10x more expensive than input tokens because generation requires more compute. The model generates one token at a time, sequentially.
Cost Per 1M Tokens
The standard unit for comparing API prices. Instead of quoting per-token prices (which would be tiny fractions of a cent), providers quote cost per 1 million tokens. To calculate your actual cost: (tokens used ÷ 1,000,000) × price per 1M.
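The same arithmetic as a small Python helper (the token count and the $3.00 price are made-up numbers for illustration):

```python
# Convert a per-1M-token price into an actual charge.
def cost(tokens: int, price_per_million: float) -> float:
    return tokens / 1_000_000 * price_per_million

# Example: 45,000 tokens at a hypothetical $3.00 per 1M tokens.
print(cost(45_000, 3.00))  # 0.135 -> about 13.5 cents
```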
Free Tier
A usage allowance provided at no cost, typically with rate limits. Google Gemini offers one of the most generous free tiers: free usage capped by rate limits rather than by a token allowance. OpenAI and Anthropic offer $5 credits for new accounts.
Rate Limit
The maximum number of requests or tokens you can send within a time window (usually per minute or per day). Exceeding rate limits returns a 429 error. Limits vary by model, tier, and spending level.
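A minimal retry sketch for handling 429s, assuming a generic HTTP endpoint (the URL, headers, and payload below are placeholders, not any specific provider's API):

```python
import time
import requests

def post_with_backoff(url, headers, payload, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.post(url, headers=headers, json=payload, timeout=60)
        if resp.status_code != 429:
            return resp
        # Honor a Retry-After header (in seconds) if present;
        # otherwise back off exponentially: 1s, 2s, 4s, ...
        wait = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError("still rate limited after retries")
```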
Credits
Pre-paid balance applied against API usage. Most providers offer initial credits ($5-10) for new accounts. Credits expire after a set period (usually 3-12 months) and are consumed before charging your payment method.
Token & Input Terms
Token
The fundamental unit of text processing in LLMs. A token is roughly 4 characters or 0.75 words in English. Tokens can be whole words, parts of words, or individual characters. The exact tokenization varies by model (GPT uses tiktoken, Claude uses its own tokenizer).
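You can count tokens locally before sending a request. A sketch using OpenAI's tiktoken library (the encoding name varies by model generation; cl100k_base is the GPT-4-era one):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Tokens can be whole words, parts of words, or individual characters."
tokens = enc.encode(text)
print(len(tokens), "tokens for", len(text), "characters")
```

Other providers ship their own counters (Anthropic exposes a token-counting endpoint), so treat one tokenizer's count as an estimate for another's.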
Context Window
The maximum number of tokens a model can process in a single request, including both input and output. This limits how much text you can send and receive. Larger context windows allow processing longer documents but cost more.
System Prompt
A special instruction sent before the user's message that defines the model's behavior, role, and constraints. System prompts count as input tokens and persist across the conversation. They cost the same per token as any other input, but because they're resent with every request, a long system prompt adds a recurring cost to every turn.
Conversation History
The accumulated messages in a chat session. Each request typically includes the full conversation history, so per-request input grows with every turn and cumulative spend grows even faster. A 20-message conversation might use 10,000+ input tokens per request.
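A common mitigation is trimming history to a sliding window. A minimal sketch (the keep_last value is arbitrary; real apps often summarize older turns instead of dropping them):

```python
# Keep the system prompt plus only the most recent exchanges,
# so input tokens stay bounded as the conversation grows.
def trim_history(messages: list, keep_last: int = 10) -> list:
    system, rest = messages[:1], messages[1:]  # assumes messages[0] is the system prompt
    return system + rest[-keep_last:]
```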
Max Tokens (Output Limit)
A parameter you set to limit how many tokens the model can generate in its response. Setting this too low truncates responses; setting it too high wastes money if the model generates unnecessary text. Typical values: 256-4,096 tokens.
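A rough sketch of where the parameter lives in a request (field names differ slightly by provider, e.g. max_tokens vs. max_output_tokens, and the model name here is a placeholder):

```python
payload = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "Summarize this in two sentences."}],
    "max_tokens": 512,  # hard ceiling on generated tokens; also caps worst-case output cost
}
```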
Embedding
A numerical vector representation of text, used for semantic search, classification, and RAG systems. Embedding models are separate from generative models and have their own pricing (typically much cheaper). Embeddings are generated once and reused.
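A toy illustration of how embeddings get compared: cosine similarity between two vectors (real embeddings have hundreds or thousands of dimensions; these 3-D vectors are made up):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine([0.1, 0.9, 0.2], [0.15, 0.85, 0.3]))  # ~0.99, i.e. very similar
```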
Model & Architecture Terms
Foundation Model
A large pre-trained model that can be adapted to many tasks. Examples: GPT-4o, Claude Sonnet 4, Gemini 2.5 Pro. Foundation models are trained on vast datasets and can perform reasoning, coding, analysis, and generation without task-specific training.
Model Tier
A classification of models by capability and price. Premium models (Claude 4 Opus, GPT-5, Gemini 2.5 Pro) offer maximum quality at higher cost. Budget models (Gemini 2.0 Flash, GPT-4o mini, Mistral Small 4) offer good quality at minimal cost. Mid-tier models (Claude Sonnet 4, GPT-4o) balance both.
Open Source Model
A model whose weights are publicly available, allowing self-hosting or use through third-party APIs. Examples: Llama 3.1 (Meta), Mistral (Mistral AI). Open source models can be cheaper at scale but require infrastructure to run.
Temperature
A parameter (0.0-2.0) controlling randomness in responses. Lower values (0.0-0.3) produce more deterministic, focused outputs. Higher values (0.7-1.0) produce more creative, varied outputs. Temperature doesn't affect pricing but impacts output length and quality.
Function Calling / Tool Use
The ability of a model to generate structured JSON output that triggers external functions or API calls. This enables LLMs to interact with databases, APIs, and services. Not all models support this; it's strongest in OpenAI and Anthropic models.
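A sketch of what a tool definition looks like in the JSON-schema style most providers use (exact wrapper fields differ between OpenAI and Anthropic, and get_weather is a hypothetical function):

```python
weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Oslo'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}
# The model replies with structured arguments such as {"city": "Oslo"};
# your code then calls the real function and sends the result back.
```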
Multimodal
Models that can process multiple types of input: text, images, audio, and sometimes video. GPT-4o and Gemini 2.5 Pro are fully multimodal. Image inputs cost more than text (typically 2-5x per token).
Performance & Limits Terms
Latency
The time between sending a request and receiving the first token of response (Time to First Token, or TTFT). Lower latency = faster perceived response. Budget models are typically faster than premium models. Streaming reduces perceived latency by showing tokens as they generate.
Throughput
The number of tokens generated per second (tokens/sec). Higher throughput = faster complete responses. Important for batch processing and high-volume applications. Throughput varies by model, load, and response length.
Streaming
Sending response tokens to the client as they're generated, rather than waiting for the complete response. Streaming doesn't reduce cost (you still pay for all tokens) but dramatically improves perceived performance. Most providers support Server-Sent Events (SSE) for streaming.
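A sketch of consuming an SSE stream with plain requests; the URL is a placeholder, and the chunk format (JSON payloads after a "data: " prefix, with a [DONE] sentinel) follows the common convention but varies by provider:

```python
import requests

with requests.post("https://api.example.com/v1/chat",
                   json={"stream": True}, stream=True, timeout=120) as resp:
    for line in resp.iter_lines():
        if line and line.startswith(b"data: "):
            chunk = line[len(b"data: "):]
            if chunk == b"[DONE]":
                break
            print(chunk.decode(), flush=True)  # render tokens as they arrive
```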
Batch API
A bulk processing endpoint that accepts multiple requests and processes them asynchronously, typically at a 50% discount. Ideal for non-time-sensitive workloads like data processing, content generation, and analysis. OpenAI and Anthropic both offer batch APIs.
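A sketch of preparing a batch input file in the JSONL style OpenAI's Batch API uses (one self-describing request per line; check your provider's docs for the current schema):

```python
import json

docs = ["Summarize document A.", "Summarize document B."]
with open("batch.jsonl", "w") as f:
    for i, text in enumerate(docs):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": text}],
                "max_tokens": 256,
            },
        }) + "\n")
```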
Timeout
The maximum time a provider will wait for a response before returning an error. Longer responses and higher max_tokens settings increase the chance of timeouts. Typical timeout: 60-120 seconds. Set appropriate timeouts in your application.
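A sketch of setting split timeouts with httpx (values are illustrative): connects should fail fast, while long generations need a generous read timeout:

```python
import httpx

timeout = httpx.Timeout(connect=5.0, read=120.0, write=10.0, pool=5.0)
client = httpx.Client(timeout=timeout)
```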
Feature Terms
RAG (Retrieval-Augmented Generation)
A technique that combines document retrieval with LLM generation. Relevant documents are retrieved from a vector database and included in the prompt, allowing the model to answer questions about specific data without fine-tuning. Costs include: embedding, vector search, and generation.
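The pipeline in sketch form; embed(), vector_search(), and generate() are assumed helpers standing in for your embedding model, vector database, and LLM call:

```python
def answer_with_rag(question: str, top_k: int = 3) -> str:
    query_vec = embed(question)               # 1. embed the question (embedding cost)
    docs = vector_search(query_vec, k=top_k)  # 2. retrieve relevant chunks (search cost)
    context = "\n\n".join(d["text"] for d in docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)                   # 3. generate the answer (input + output cost)
```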
Fine-Tuning
Training a model on your specific data to improve performance for your use case. Fine-tuning has upfront training costs and creates a custom model with its own inference pricing. For most applications, prompt engineering + RAG is more cost-effective than fine-tuning.
Prompt Engineering
The practice of designing inputs to get better, cheaper outputs from LLMs. Techniques include: clear instructions, few-shot examples, structured output formats, and chain-of-thought reasoning. Good prompt engineering can reduce token usage by 30-50%.
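For instance, a compact few-shot prompt: two labeled examples steer the model toward terse, parseable answers, keeping both input and output tokens low (the reviews are invented):

```python
prompt = """Classify sentiment as POS or NEG.

Review: "Battery life is fantastic." -> POS
Review: "Screen cracked in a week." -> NEG
Review: "Setup took five minutes and it just works." ->"""
```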
Structured Output
Forcing the model to respond in a specific format (JSON, XML, etc.). Structured output reduces post-processing costs and improves reliability. Some providers (OpenAI, Anthropic) have dedicated structured output modes that are more reliable.
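Even with a dedicated structured-output mode, it's worth validating client-side. A minimal sketch that checks a reply against an expected shape (the field names are arbitrary examples):

```python
import json

def parse_reply(raw: str) -> dict:
    data = json.loads(raw)  # raises if the model returned non-JSON
    for key in ("title", "summary", "tags"):
        if key not in data:
            raise ValueError(f"missing field: {key}")
    return data
```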
Safety / Content Filtering
Built-in moderation that blocks harmful, explicit, or policy-violating content. Most providers include content filtering at no extra cost. Filters can occasionally produce false positives, blocking legitimate requests. Some providers allow adjusting filter sensitivity.
Cost Optimization Terms
Prompt Caching
Storing frequently-used prompt prefixes (like system prompts) to avoid re-processing them. Cached tokens are charged at a discount (typically 50-90% off). OpenAI applies prompt caching automatically for repeated prefixes; Anthropic's caching is opt-in, marked with cache breakpoints in the request.
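Caching only works if the prefix is identical across calls, so structure requests with the stable part first. A sketch (LONG_SYSTEM_PROMPT is a stand-in for several thousand tokens of instructions or reference material):

```python
LONG_SYSTEM_PROMPT = "..."  # imagine a long, never-changing block of instructions

STABLE_PREFIX = [{"role": "system", "content": LONG_SYSTEM_PROMPT}]

def build_messages(question: str) -> list:
    # Stable prefix first (cacheable), variable user content last.
    return STABLE_PREFIX + [{"role": "user", "content": question}]
```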
Model Routing
Automatically selecting the cheapest model that can handle each specific task. Simple queries go to budget models; complex reasoning goes to premium models. Can reduce costs by 60-80% compared to using a single premium model for everything.
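A deliberately crude routing sketch based on prompt length and keywords; the model names are placeholders, and production routers often use a small classifier model instead of heuristics:

```python
def pick_model(prompt: str) -> str:
    hard_signals = ("prove", "analyze", "step by step", "refactor")
    if len(prompt) > 2_000 or any(s in prompt.lower() for s in hard_signals):
        return "premium-model"  # complex reasoning -> expensive model
    return "budget-model"       # simple queries -> cheap, fast model
```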
Token Budget
A maximum token limit you set per request or per day to control costs. Setting token budgets prevents runaway costs from infinite loops or unexpectedly long responses. Combine with max_tokens and daily spending limits for full cost control.
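A sketch of a daily budget guard; the in-memory counter is an assumption (use persistent storage in production):

```python
DAILY_LIMIT = 2_000_000  # tokens per day
used_today = 0

def reserve_tokens(estimated_tokens: int) -> None:
    global used_today
    if used_today + estimated_tokens > DAILY_LIMIT:
        raise RuntimeError("daily token budget exceeded")
    used_today += estimated_tokens
```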
Cost Per Request
The total cost of a single API call, including input tokens, output tokens, and any additional features (function calling, image processing). Calculated as: (input_tokens × input_price + output_tokens × output_price) ÷ 1,000,000.
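The formula as code, with made-up prices ($3 / $15 per 1M tokens):

```python
def cost_per_request(input_tokens, output_tokens, input_price, output_price):
    # Prices are per 1M tokens, matching how providers quote them.
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

print(cost_per_request(1_200, 400, 3.0, 15.0))  # 0.0096 -> about 1 cent
```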
Monthly Burn Rate
Your projected monthly API spend based on current usage patterns. Calculated as: average cost per request ร requests per day ร 30. Monitoring burn rate helps you stay within budget and identify cost spikes early.
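Continuing the hypothetical numbers from the cost-per-request example above:

```python
avg_cost_per_request = 0.0096  # from the example above
requests_per_day = 5_000
print(f"${avg_cost_per_request * requests_per_day * 30:,.2f}/month")  # $1,440.00/month
```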
Calculate your exact costs. Enter your usage patterns into our calculator to see your monthly burn rate and find the cheapest model for your needs.
Try the APIpulse Calculator or View Full Pricing Index