AI Model Benchmark Scores
Compare 34 LLMs by MMLU, HumanEval, MATH, and Arena Elo scores. Find the best model for your use case, ranked by published benchmark results.
MMLU Scores — All Models
Tiers: Premium, Mid, Budget
| # | Model | Tier | MMLU | HumanEval | MATH | Arena Elo | Composite | Cost $/1M |
|---|---|---|---|---|---|---|---|---|
Head-to-Head Benchmark Comparison
Understanding AI Model Benchmarks
Benchmark scores provide a standardized way to compare AI model capabilities. Here's what each benchmark measures:
- MMLU (Massive Multitask Language Understanding): Tests knowledge across 57 subjects including math, science, law, history, and humanities. Higher = broader general knowledge. Scale: 0-100.
- HumanEval (Code Generation): Measures ability to generate correct Python functions from docstrings. Higher = better coding ability. Scale: 0-100 (pass@1).
- MATH (Mathematical Reasoning): Tests ability to solve competition-level math problems. Higher = stronger mathematical reasoning. Scale: 0-100.
- Arena Elo (Chatbot Arena): Elo rating derived from blind pairwise human comparisons in the Chatbot Arena. Higher = humans prefer this model's responses more often. Scale: 1000-1400+.
- Composite Score: Weighted average of all four benchmarks, normalized to 0-100. Useful for overall model ranking.
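To make the scales above concrete, here is a minimal Python sketch of how a composite score could be computed. The page does not publish its weights or its normalization method, so the equal weights, the 1000-1400 Elo bounds used for rescaling, and every name in the code (`Scores`, `normalize_elo`, `composite`, `elo_win_probability`) are assumptions made for illustration. The win-probability helper uses the standard Elo expected-score formula, which may differ from the rating model Chatbot Arena itself uses.

```python
from dataclasses import dataclass

# Assumed bounds for rescaling Arena Elo to 0-100; the page only states "1000-1400+".
ELO_MIN, ELO_MAX = 1000.0, 1400.0

# Hypothetical equal weights; the actual weighting is not published on the page.
WEIGHTS = {"mmlu": 0.25, "humaneval": 0.25, "math": 0.25, "arena_elo": 0.25}


@dataclass
class Scores:
    mmlu: float       # 0-100
    humaneval: float  # 0-100 (pass@1)
    math: float       # 0-100
    arena_elo: float  # roughly 1000-1400+


def normalize_elo(elo: float) -> float:
    """Min-max rescale an Elo rating to 0-100, clamping values outside the assumed bounds."""
    scaled = (elo - ELO_MIN) / (ELO_MAX - ELO_MIN) * 100.0
    return max(0.0, min(100.0, scaled))


def composite(scores: Scores, weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted average of the four benchmarks after putting each on a 0-100 scale."""
    parts = {
        "mmlu": scores.mmlu,
        "humaneval": scores.humaneval,
        "math": scores.math,
        "arena_elo": normalize_elo(scores.arena_elo),
    }
    total_weight = sum(weights[name] for name in parts)
    return sum(weights[name] * parts[name] for name in parts) / total_weight


def elo_win_probability(elo_a: float, elo_b: float) -> float:
    """Standard Elo expected score: chance that model A is preferred over model B."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


# Made-up scores for illustration only; these do not describe any real model.
example = Scores(mmlu=80.0, humaneval=70.0, math=50.0, arena_elo=1200.0)
print(composite(example))                 # 62.5
print(elo_win_probability(1300, 1200))    # about 0.64
```

Under these assumptions, a 100-point Elo gap corresponds to the higher-rated model being preferred in roughly 64% of head-to-head votes, which is why small Elo differences between neighboring models should not be over-read.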
Important: Benchmark scores are estimates based on published results and community data. Actual performance varies by task, prompt, and use case. Always test with your specific workload before committing to a model.