AI API Glossary
Every term you need to understand LLM pricing, from tokens to fine-tuning. Click any term for a clear definition and real-world examples.
A
API (Application Programming Interface)
A way for software to communicate with an LLM service. You send a request (your prompt) via HTTP and receive a response (the model's output). API pricing is based on the tokens processed in each request.
B
Batch API
A bulk-processing endpoint that accepts many requests and processes them asynchronously, typically at a 50% discount. Ideal for non-time-sensitive workloads like data processing and content generation.
C
Context Window
The maximum number of tokens a model can process in a single request – input and output combined. Larger windows allow processing longer documents but cost more per request.
Credits
Pre-paid balance applied against API usage. Most providers offer initial credits ($5–10) for new accounts. Credits expire after a set period (usually 3–12 months) and are consumed before charging your payment method.
E
Embedding
A numerical vector representation of text, used for semantic search, classification, and RAG systems. Embedding models are separate from generative models and are much cheaper – typically $0.02–0.10 per 1M tokens.
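As a sketch of how embeddings power semantic search: texts are compared by the cosine similarity of their vectors. The three-dimensional vectors below are toy values for illustration (real embedding models return hundreds to thousands of dimensions).

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity between two embedding vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dimensional "embeddings" (illustrative values only).
doc = [0.2, 0.9, 0.1]
query = [0.25, 0.85, 0.05]
unrelated = [0.9, 0.05, 0.4]

# The query vector is closer to the related document than to the unrelated one.
print(cosine_similarity(query, doc) > cosine_similarity(query, unrelated))  # True
```

In a real semantic-search system, you embed every document once, store the vectors in a vector database, and at query time return the documents with the highest similarity to the query embedding.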
Endpoint
The URL you send API requests to. Each model or feature has its own endpoint. For example, OpenAI's chat completions endpoint is api.openai.com/v1/chat/completions.
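A minimal sketch of building a request to that endpoint with Python's standard library; the API key is a placeholder and the model name is illustrative:

```python
import json
import urllib.request

API_KEY = "sk-..."  # placeholder; a real key is required to actually send this

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Assemble a POST request for OpenAI's chat completions endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("gpt-4o-mini", "Say hello in one word.")
# urllib.request.urlopen(req) would send it; skipped here since it needs a real key.
```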
F
Foundation Model
A large pre-trained model that can be adapted to many tasks. Examples: GPT-4o, Claude Sonnet 4, Gemini 2.5 Pro. Foundation models are trained on vast datasets and perform reasoning, coding, analysis, and generation without task-specific training.
Learn more: 2026 Flagship LLM Showdown →
Fine-Tuning
Training a model on your specific data to improve performance for your use case. Has upfront training costs and creates a custom model with its own inference pricing. For most applications, prompt engineering + RAG is more cost-effective.
Learn more: Open Source vs Commercial LLMs →
Free Tier
A usage allowance provided at no cost, typically with rate limits. Google Gemini offers the most generous free tier – unlimited requests with rate limiting. OpenAI and Anthropic offer $5 credits for new accounts.
Learn more: AI API Free Tiers Compared →
Function Calling / Tool Use
The ability of a model to generate structured JSON output that triggers external functions or API calls. This enables LLMs to interact with databases, APIs, and services. Strongest in OpenAI and Anthropic models.
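A minimal sketch of the pattern, with a hypothetical get_weather tool. The tool is described in the JSON Schema style most providers accept, and the model's structured output is dispatched to a local function (the exact request/response shape varies by provider):

```python
import json

# A tool definition in the JSON Schema style used by most providers.
weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> str:
    # Stand-in for a real weather API call.
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch(tool_call_json: str) -> str:
    """Run the local function named in the model's tool-call output."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["arguments"])

# What a model's structured tool-call output might look like:
result = dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}')
print(result)  # Sunny in Paris
```

In a full agent loop, the function's return value is sent back to the model as a tool result so it can compose a final answer.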
Learn more: How to Build an AI Agent on a Budget →
I
Input Tokens
Tokens in your prompt – the text you send to the model. Includes the system prompt, user message, conversation history, and any context. Input tokens are typically cheaper than output tokens.
L
Latency
The time between sending a request and receiving the first token of response (TTFT). Lower latency = faster perceived response. Budget models are typically faster than premium models.
M
Max Tokens (Output Limit)
A parameter you set to limit how many tokens the model can generate. Setting this too low truncates responses; setting it too high wastes money. Typical values: 256–4,096 tokens.
Model Routing
Automatically selecting the cheapest model that can handle each specific task. Simple queries go to budget models; complex reasoning goes to premium models. Can reduce costs by 60–80%.
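A minimal routing sketch with hypothetical model names and a crude keyword heuristic; production routers usually classify tasks with a cheap model or explicit task labels rather than string matching:

```python
# Hypothetical model identifiers - substitute real model names.
BUDGET_MODEL = "gemini-flash"
PREMIUM_MODEL = "claude-sonnet"

# Crude signals that a prompt needs stronger reasoning.
COMPLEX_HINTS = ("prove", "debug", "analyze", "step by step", "refactor")

def choose_model(prompt: str) -> str:
    """Route long or reasoning-heavy prompts to the premium model."""
    text = prompt.lower()
    if len(prompt) > 2000 or any(hint in text for hint in COMPLEX_HINTS):
        return PREMIUM_MODEL
    return BUDGET_MODEL

print(choose_model("Translate 'hello' to French"))             # gemini-flash
print(choose_model("Debug this race condition step by step"))  # claude-sonnet
```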
Model Tier
A classification of models by capability and price. Premium (Claude 4 Opus, GPT-5): maximum quality. Budget (Gemini Flash, GPT-4o mini): minimal cost. Mid-tier (Claude Sonnet 4, GPT-4o): balanced.
Compare all models by tier →
Multimodal
Models that process multiple input types – text, images, audio, and sometimes video. GPT-4o and Gemini 2.5 Pro are fully multimodal. Image inputs cost more than text (typically 2–5x per token).
O
Open Source Model
A model whose weights are publicly available, allowing self-hosting or use through third-party APIs. Examples: Llama (Meta), Mistral. Open models can be cheaper at scale but require infrastructure to run.
Output Tokens
Tokens the model generates in response. Typically 3–10x more expensive than input tokens because generation requires more compute. The model generates one token at a time, sequentially.
P
Per-Token Pricing
The standard billing model for LLM APIs. You're charged based on the number of tokens processed – both input and output. Prices are quoted per 1 million tokens.
Prompt Caching
Storing frequently used prompt prefixes (like system prompts) to avoid re-processing them. Cached tokens are charged at a 50–90% discount. OpenAI applies caching automatically; Anthropic requires marking cache breakpoints in the request.
Prompt Engineering
The practice of designing inputs to get better, cheaper outputs from LLMs. Techniques include clear instructions, few-shot examples, structured output formats, and chain-of-thought reasoning. Can reduce token usage by 30–50%.
Learn more: How to Reduce AI API Costs by 40% →
R
RAG (Retrieval-Augmented Generation)
A technique combining document retrieval with LLM generation. Relevant documents are retrieved from a vector database and included in the prompt, letting the model answer questions about specific data without fine-tuning.
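A toy sketch of the retrieve-then-prompt flow. Real systems retrieve by embedding similarity from a vector database; plain word overlap is used here only to keep the example self-contained, and the documents are invented:

```python
# Invented knowledge-base documents for illustration.
DOCS = [
    "Refunds are processed within 5 business days.",
    "Our office is open Monday to Friday, 9am to 5pm.",
    "Premium plans include priority support.",
]

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k docs sharing the most words with the question.
    Real RAG uses embedding similarity instead of word overlap."""
    q_words = set(question.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str) -> str:
    """Inject the retrieved context into the prompt sent to the model."""
    context = "\n".join(retrieve(question, DOCS))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How long do refunds take?"))
```

Because only the relevant snippet is injected, the prompt stays small even when the underlying document collection is huge, which is why RAG is usually cheaper than stuffing everything into the context window.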
Rate Limit
The maximum number of requests or tokens you can send within a time window (usually per minute or per day). Exceeding rate limits returns a 429 error. Limits vary by model, tier, and spending level.
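A common way to handle 429s is to retry with exponential backoff plus jitter. A sketch, using a simulated rate-limited call in place of a real HTTP client:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the 429 error your HTTP client would raise."""

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """Retry fn() on rate-limit errors, doubling the wait each attempt."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Jitter spreads out retries from concurrent clients.
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)

# Simulate an API that returns 429 twice before succeeding.
calls = {"n": 0}
def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"

print(call_with_backoff(flaky_request, base_delay=0.01))  # ok
```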
Learn more: AI API Rate Limits Compared →
S
Streaming
Sending response tokens to the client as they're generated, rather than waiting for the complete response. Doesn't reduce cost but dramatically improves perceived performance. Most providers support SSE (Server-Sent Events).
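A sketch of consuming such a stream: each SSE line carries a small JSON chunk with a token delta. The event shape below mimics OpenAI's chat-completions stream but is illustrative; other providers differ:

```python
import json

def parse_sse(lines):
    """Yield text deltas from Server-Sent Event lines (OpenAI-style shape)."""
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":  # stream-termination sentinel
            return
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

# Fake stream, shaped like OpenAI's chat-completions SSE events.
stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
print("".join(parse_sse(stream)))  # Hello!
```

In a real client you would print (or forward) each delta as it arrives, which is what makes streamed responses feel fast.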
Learn more: LLM API Latency Benchmarks →
Structured Output
Forcing the model to respond in a specific format (JSON, XML, etc.). Reduces post-processing costs and improves reliability. Some providers have dedicated structured output modes.
System Prompt
A special instruction sent before the user's message that defines the model's behavior, role, and constraints. Included in input token count and persists across the conversation. Sent with every request – can be expensive.
T
Temperature
A parameter (0.0–2.0) controlling randomness. Lower values (0.0–0.3) = deterministic, focused outputs. Higher values (0.7–1.0) = creative, varied outputs. Doesn't affect pricing but impacts output length and quality.
Throughput
The number of tokens generated per second (tok/s). Higher throughput = faster complete responses. Important for batch processing and high-volume applications.
Learn more: LLM API Latency Benchmarks →
Token
The fundamental unit of text processing. Roughly 4 characters or 0.75 words in English. Tokens can be whole words, parts of words, or individual characters. Exact tokenization varies by model.
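The ~4-characters-per-token rule of thumb makes a quick estimator; for exact counts, use the provider's tokenizer (e.g. tiktoken for OpenAI models):

```python
def estimate_tokens(text: str) -> int:
    """Rough English-text estimate: about 4 characters per token.
    Use the provider's tokenizer for exact counts."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))  # 11
```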
Token Budget
A maximum token limit you set per request or per day to control costs. Prevents runaway costs from infinite loops or unexpectedly long responses. Combine with max_tokens and daily spending limits.
Cost Per 1M Tokens
The standard unit for comparing API prices. Instead of quoting per-token prices (tiny fractions of a cent), providers quote cost per 1 million tokens. To calculate: (tokens ÷ 1,000,000) × price per 1M.
Cost Per Request
The total cost of a single API call. With prices quoted per 1M tokens: (input_tokens × input_price + output_tokens × output_price) ÷ 1,000,000.
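The formula as a small helper; the prices in the example are illustrative, not a quote from any provider:

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_price: float, output_price: float) -> float:
    """Cost in dollars for one API call; prices are per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Illustrative prices: $2.50 input / $10.00 output per 1M tokens.
print(cost_per_request(1_000, 500, 2.50, 10.00))  # 0.0075
```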
Monthly Burn Rate
Your projected monthly API spend based on current usage. Calculated as: average cost per request × requests per day × 30. Monitoring burn rate helps you stay within budget.
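The same arithmetic as a helper, with illustrative numbers:

```python
def monthly_burn_rate(avg_cost_per_request: float, requests_per_day: int) -> float:
    """Projected monthly spend in dollars, assuming a 30-day month."""
    return avg_cost_per_request * requests_per_day * 30

# e.g. $0.0075 per request at 1,000 requests/day:
print(round(monthly_burn_rate(0.0075, 1_000), 2))  # 225.0
```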
Calculate your exact costs. Enter your usage patterns into our calculator to see your monthly burn rate and find the cheapest model for your needs.
Try the APIpulse Calculator or View Full Pricing Index