AI Benchmark Leaderboard: MMLU, HumanEval, MATH
Data Notice: Data points and API pricing referenced in this piece rely on the most recently available information and may include projections or prior-period numbers. Confirm current details with AI providers before subscribing.
Benchmarks are the standardized tests of the AI world. They provide comparable scores across models, helping you evaluate capabilities objectively. This leaderboard tracks the most important benchmarks and keeps the scores current.
Frontier Model MMLU Scores (General Knowledge)
The assessments in this leaderboard draw on public benchmark results and our own testing methodology. Individual results will vary by task requirements.
The Major Benchmarks Explained
MMLU (Massive Multitask Language Understanding): 57 subjects from elementary to professional level. Measures broad knowledge and reasoning. Think of it as a comprehensive general knowledge exam.
HumanEval: Coding benchmark that tests the ability to write correct Python functions from docstring descriptions; a solution counts only if it passes the problem's unit tests (see the sketch after this list). Measures practical programming ability.
MATH: Competition-level math problems. Covers algebra, geometry, number theory, and calculus. One of the hardest benchmarks.
GSM8K: Grade-school math word problems. Easier than MATH but still requires multi-step reasoning.
GPQA (Graduate-Level Google-Proof QA): Expert-level science questions designed to be hard even for domain experts. Measures deep scientific reasoning.
SWE-bench: Tests the ability to resolve real GitHub issues. The most practical coding benchmark, measuring real-world development skills.
MT-Bench: Multi-turn conversation quality scored on a 1-10 scale by GPT-4 as judge. Measures conversational ability across categories.
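To make the coding benchmarks concrete, here is a minimal sketch of how a HumanEval-style pass@1 check works: the model's completion is executed against hidden unit tests, and the problem counts as solved only if every test passes. The problem, completion, and tests below are illustrative stand-ins, not actual HumanEval items, and a real harness would sandbox the execution.

```python
# Illustrative HumanEval-style pass@1 check (not a real HumanEval problem).
# A problem is "solved" only if the model's completion passes every hidden test.

problem = {
    "prompt": 'def running_max(xs):\n    """Return a list where element i is the max of xs[:i+1]."""\n',
    "tests": [
        ([1, 3, 2, 5, 4], [1, 3, 3, 5, 5]),
        ([], []),
        ([-2, -5, -1], [-2, -2, -1]),
    ],
}

# Pretend this string came back from the model under evaluation.
model_completion = '''
def running_max(xs):
    out, best = [], float("-inf")
    for x in xs:
        best = max(best, x)
        out.append(best)
    return out
'''

def passes_all_tests(completion: str, tests) -> bool:
    namespace = {}
    exec(completion, namespace)  # real harnesses run this in a sandbox with a timeout
    fn = namespace["running_max"]
    return all(fn(list(inp)) == expected for inp, expected in tests)

print("pass@1:", passes_all_tests(model_completion, problem["tests"]))
```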
Comprehensive Leaderboard
Frontier Models
| Model | MMLU | HumanEval | MATH | GSM8K | GPQA | SWE-bench | MT-Bench (/10) |
|---|---|---|---|---|---|---|---|
| o3 | 91.2% | 92.7% | 88.9% | 97.2% | 73.4% | 48.2% | 9.1 |
| Claude Opus 4 | 89.4% | 90.2% | 78.3% | 95.1% | 65.1% | 51.5% | 9.2 |
| Gemini Ultra | 90.1% | 84.5% | 76.8% | 94.5% | 64.3% | 38.4% | 8.8 |
| GPT-4o | 88.7% | 87.1% | 74.6% | 93.8% | 61.8% | 42.8% | 9.0 |
Mid-Tier Models
| Model | MMLU | HumanEval | MATH | GSM8K | GPQA | SWE-bench | MT-Bench (/10) |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4 | 86.1% | 85.8% | 71.2% | 92.4% | 58.5% | 46.3% | 8.9 |
| o3-mini | 85.8% | 84.3% | 80.5% | 95.8% | 60.2% | 35.1% | 8.5 |
| Gemini Pro | 83.7% | 75.3% | 68.2% | 89.5% | 54.7% | 30.2% | 8.4 |
| Mistral Large | 84.0% | 75.6% | 58.1% | 87.2% | 48.9% | 28.5% | 8.5 |
Budget Models
| Model | MMLU | HumanEval | MATH | GSM8K | GPQA | SWE-bench | MT-Bench (/10) |
|---|---|---|---|---|---|---|---|
| GPT-4o mini | 82.0% | 78.5% | 62.4% | 88.1% | 47.2% | 28.3% | 8.1 |
| Claude Haiku 4 | 79.5% | 74.2% | 55.8% | 85.3% | 42.1% | 22.1% | 7.8 |
| Gemini Flash | 78.8% | 71.5% | 53.2% | 84.7% | 40.5% | 18.5% | 7.6 |
Open-Source Models
| Model | MMLU | HumanEval | MATH | GSM8K | GPQA | SWE-bench | MT-Bench (/10) |
|---|---|---|---|---|---|---|---|
| Llama 3 405B | 86.1% | 81.2% | 68.4% | 91.0% | 55.7% | 32.1% | 8.9 |
| Llama 3 70B | 82.0% | 72.3% | 55.2% | 86.5% | 48.3% | 25.4% | 8.4 |
| Mixtral 8x22B | 77.8% | 65.4% | 48.3% | 82.6% | 40.1% | 18.2% | 8.1 |
| Mistral 7B | 64.2% | 48.9% | 30.5% | 67.8% | 28.4% | 8.5% | 7.4 |
| Llama 3 8B | 68.4% | 55.1% | 35.7% | 72.1% | 30.2% | 12.3% | 7.6 |
All scores are approximate and based on publicly reported results. Testing methodologies vary across evaluators.
Category Leaders
| Category | Best Model | Score | Runner-Up |
|---|---|---|---|
| General Knowledge (MMLU) | o3 | 91.2% | Gemini Ultra (90.1%) |
| Coding (HumanEval) | o3 | 92.7% | Claude Opus 4 (90.2%) |
| Math (MATH) | o3 | 88.9% | o3-mini (80.5%) |
| Grade School Math (GSM8K) | o3 | 97.2% | o3-mini (95.8%) |
| Science (GPQA) | o3 | 73.4% | Claude Opus 4 (65.1%) |
| Real-World Coding (SWE-bench) | Claude Opus 4 | 51.5% | o3 (48.2%) |
| Conversation (MT-Bench) | Claude Opus 4 | 9.2 | o3 (9.1) |
| Best Open Source | Llama 3 405B | 86.1% MMLU | Llama 3 70B (82.0% MMLU) |
How to Interpret Benchmarks
Benchmarks are useful but imperfect. They provide standardized comparison points but do not capture everything that matters in real-world use. A model that scores 2% higher on MMLU may not feel meaningfully better in practice.
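Sampling noise is part of the reason small gaps can be illusory: a benchmark score is an accuracy estimated from a finite set of questions, and on smaller test sets a few-point difference can sit inside the margin of error. A rough sketch, using assumed question counts that you should verify for the benchmark you care about:

```python
import math

def score_margin(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a benchmark accuracy,
    modeling each question as an independent pass/fail trial."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_questions)
    return z * standard_error

# Question counts below are assumptions for illustration, not official figures.
for name, accuracy, n in [("MMLU", 0.90, 14_000), ("GPQA", 0.65, 450), ("SWE-bench", 0.50, 500)]:
    print(f"{name}: {accuracy:.0%} +/- {score_margin(accuracy, n):.1%}")
```

On the smaller test sets, two models a couple of points apart may be statistically indistinguishable.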
Watch for these issues:
- Contamination. If benchmark questions appear in training data, scores are inflated.
- Overfitting. Models can be optimized for specific benchmarks at the expense of general performance.
- Task mismatch. Benchmarks may not represent your actual use case. A model that leads on MATH may not be best for your business writing needs.
- Methodology differences. Different evaluators may use different prompting strategies, yielding different scores for the same model.
The best evaluation is testing models on your own tasks. Benchmarks narrow the field; personal testing makes the final decision.
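One way to put that advice into practice is a small harness that scores each candidate model on a handful of your own prompts. The sketch below is provider-agnostic: `call_model` is a placeholder you would replace with a real API client, and the tasks and keyword-based graders are purely illustrative.

```python
from typing import Callable

# Your own tasks: a prompt plus a simple pass/fail grader.
# Keyword grading is a stand-in; use whatever check matches your real work.
TASKS = [
    {"prompt": "Summarize in one sentence: 'Q3 revenue rose 12% while churn fell to 4%.'",
     "grade": lambda out: "12%" in out and "4%" in out},
    {"prompt": "Write a Python one-liner that reverses a string s.",
     "grade": lambda out: "[::-1]" in out},
]

def evaluate(model_name: str, call_model: Callable[[str, str], str]) -> float:
    """Return the fraction of TASKS a model passes."""
    passed = sum(1 for task in TASKS if task["grade"](call_model(model_name, task["prompt"])))
    return passed / len(TASKS)

# Placeholder client -- swap in real API calls for the models you are comparing.
def call_model(model_name: str, prompt: str) -> str:
    return f"[{model_name} response to: {prompt}]"

for model in ["model-a", "model-b"]:
    print(model, f"{evaluate(model, call_model):.0%}")
```

Even a dozen such tasks drawn from your actual workload will often separate models more usefully than a one-point benchmark gap.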
Benchmark Trends
- Reasoning models dominate quantitative benchmarks. o3’s lead on MATH and GPQA shows the power of deliberate reasoning, but at the cost of speed and price.
- SWE-bench diverges from HumanEval. Real-world coding (SWE-bench) and function-writing (HumanEval) measure different skills. Claude Opus 4 leads on SWE-bench despite trailing o3 on HumanEval.
- Open-source is closing the gap. Llama 3 405B matches or exceeds GPT-4-level scores from just two years ago.
- MMLU is plateauing. With top models scoring 88-91%, MMLU is approaching ceiling effects. Harder benchmarks (GPQA, MATH) better differentiate frontier models.
Key Takeaways
- o3 leads on most quantitative benchmarks (MMLU, HumanEval, MATH, GPQA) thanks to its reasoning approach.
- Claude Opus 4 leads on real-world coding (SWE-bench) and conversation quality (MT-Bench).
- Llama 3 405B is the strongest open-source model, competitive with last year’s frontier closed models.
- Benchmarks are useful for narrowing options but should not be the sole basis for model selection.
- The gap between budget models and frontier models is significant but narrowing.
Next Steps
- Test models on your own tasks rather than relying on benchmarks alone: AI Model Playground: Side-by-Side Comparison.
- Compare models across all dimensions: Complete Guide to AI Models in 2026: Which One Should You Use?.
- Take the model selector quiz for personalized recommendations: AI Model Selector Quiz: Which Model Fits Your Use Case?.
- Compare pricing alongside performance: AI API Pricing Comparison: Cost Per Million Tokens.
This article is published for informational purposes and is based on publicly available benchmarks and our own testing. AI model capabilities for this topic change frequently — verify current features and pricing with providers.