AI Benchmark Leaderboard: MMLU, HumanEval, MATH

Updated 2026-03-10

Benchmarks are the standardized tests of the AI world. They provide comparable scores across models, helping you evaluate capabilities objectively. This leaderboard tracks scores on the most important benchmarks and is updated as new results are published.

AI model comparisons are based on publicly available benchmarks and editorial testing. Results may vary by use case.

The Major Benchmarks Explained

MMLU (Massive Multitask Language Understanding): 57 subjects from elementary to professional level. Measures broad knowledge and reasoning. Think of it as a comprehensive general knowledge exam.
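
Scoring MMLU is straightforward because every item is four-way multiple choice: grade the predicted letter against the answer key and average over all items. A minimal sketch in Python, using a made-up question rather than a real MMLU item:

```python
# MMLU-style scoring sketch: four-way multiple choice graded by accuracy.
# The item below is made up for illustration, not drawn from MMLU itself.
item = {
    "question": "Which gas makes up most of Earth's atmosphere?",
    "choices": {"A": "Oxygen", "B": "Nitrogen", "C": "Argon", "D": "CO2"},
    "answer": "B",
}

def grade(predicted_letter: str, item: dict) -> bool:
    """Compare the model's chosen letter against the answer key."""
    return predicted_letter.strip().upper() == item["answer"]

print(grade("b", item))  # True; the MMLU score is mean accuracy over items
```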

HumanEval: Coding benchmark that tests the ability to write correct Python functions from docstring descriptions; solutions are graded by running them against unit tests. Measures practical programming ability.
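
To make this concrete, here is a minimal sketch of how a HumanEval-style harness grades a completion: the generated function is executed, the task's unit tests run against it, and the problem counts as solved only if nothing raises. The task is modeled on HumanEval's first problem; a real harness would sandbox execution, enforce timeouts, and sample multiple completions to report pass@k.

```python
# Sketch of HumanEval-style scoring: run the model's completion against
# unit tests; the problem is solved only if every assertion passes.
# (Illustrative; real harnesses sandbox execution and enforce timeouts.)

candidate_code = """
def has_close_elements(numbers, threshold):
    return any(abs(a - b) < threshold
               for i, a in enumerate(numbers)
               for b in numbers[i + 1:])
"""

test_code = """
assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
"""

def passes_tests(solution: str, tests: str) -> bool:
    namespace: dict = {}
    try:
        exec(solution, namespace)   # define the candidate function
        exec(tests, namespace)      # run the task's unit tests
        return True
    except Exception:
        return False

print(passes_tests(candidate_code, test_code))  # True -> counts toward pass@1
```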

MATH: Competition-level math problems. Covers algebra, geometry, number theory, counting and probability, and precalculus. One of the hardest benchmarks.

GSM8K: Grade-school math word problems. Easier than MATH but still requires multi-step reasoning.
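
Because GSM8K answers are single numbers, a common grading approach (sketched below; exact extraction rules vary by evaluator) is to pull the final number out of the model's worked solution and compare it with the gold answer. The word problem here is illustrative, not an actual GSM8K item.

```python
import re

def extract_final_number(response: str) -> str | None:
    """Pull the last number out of a model's worked solution."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", response)
    return matches[-1].replace(",", "") if matches else None

# Illustrative multi-step word problem in the GSM8K style (not a real item).
model_output = (
    "Each box holds 12 pencils, so 4 boxes hold 4 * 12 = 48 pencils. "
    "After giving away 9, there are 48 - 9 = 39 pencils left. "
    "The answer is 39."
)
print(extract_final_number(model_output) == "39")  # True -> scored correct
```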

GPQA (Graduate-Level Google-Proof Q&A): Expert-level biology, physics, and chemistry questions written so that even skilled non-experts cannot reliably answer them with unrestricted web access. Measures deep scientific reasoning.

SWE-bench: Tests the ability to resolve real GitHub issues. The most practical coding benchmark, measuring real-world development skills.

MT-Bench: Multi-turn conversation quality scored by GPT-4 as judge. Measures conversational ability across categories.
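
A rough sketch of how this LLM-as-judge scoring works: wrap the question and answer in a judging prompt, ask a strong model for a 1-10 rating, and parse the verdict. The template below is simplified from the real MT-Bench prompt, and call_judge stands in for whatever API client you actually use.

```python
import re

# Sketch of MT-Bench-style LLM-as-judge scoring. `call_judge` is a
# placeholder for a real model call (e.g. via an OpenAI or Anthropic SDK);
# the prompt is a simplified version of the MT-Bench judging template.

JUDGE_TEMPLATE = """Please act as an impartial judge and evaluate the quality
of the assistant's answer to the user's question. Rate the answer on a scale
of 1 to 10 and end with your verdict in the exact format: Rating: [[N]]

[Question]
{question}

[Assistant's Answer]
{answer}
"""

def parse_rating(judge_output: str) -> float | None:
    match = re.search(r"Rating:\s*\[\[(\d+(?:\.\d+)?)\]\]", judge_output)
    return float(match.group(1)) if match else None

def score_turn(question: str, answer: str, call_judge) -> float | None:
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    return parse_rating(call_judge(prompt))

# Stub judge for demonstration; swap in a real model call.
print(score_turn("Explain recursion.", "Recursion is ...",
                 call_judge=lambda p: "Rating: [[8]]"))  # -> 8.0
```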

Comprehensive Leaderboard

Frontier Models

Model           | MMLU  | HumanEval | MATH  | GSM8K | GPQA  | SWE-bench | MT-Bench
----------------|-------|-----------|-------|-------|-------|-----------|---------
o3              | 91.2% | 92.7%     | 88.9% | 97.2% | 73.4% | 48.2%     | 9.1
Claude Opus 4   | 89.4% | 90.2%     | 78.3% | 95.1% | 65.1% | 51.5%     | 9.2
Gemini Ultra    | 90.1% | 84.5%     | 76.8% | 94.5% | 64.3% | 38.4%     | 8.8
GPT-4o          | 88.7% | 87.1%     | 74.6% | 93.8% | 61.8% | 42.8%     | 9.0

Mid-Tier Models

Model           | MMLU  | HumanEval | MATH  | GSM8K | GPQA  | SWE-bench | MT-Bench
----------------|-------|-----------|-------|-------|-------|-----------|---------
Claude Sonnet 4 | 86.1% | 85.8%     | 71.2% | 92.4% | 58.5% | 46.3%     | 8.9
o3-mini         | 85.8% | 84.3%     | 80.5% | 95.8% | 60.2% | 35.1%     | 8.5
Gemini Pro      | 83.7% | 75.3%     | 68.2% | 89.5% | 54.7% | 30.2%     | 8.4
Mistral Large   | 84.0% | 75.6%     | 58.1% | 87.2% | 48.9% | 28.5%     | 8.5

Budget Models

Model           | MMLU  | HumanEval | MATH  | GSM8K | GPQA  | SWE-bench | MT-Bench
----------------|-------|-----------|-------|-------|-------|-----------|---------
GPT-4o mini     | 82.0% | 78.5%     | 62.4% | 88.1% | 47.2% | 28.3%     | 8.1
Claude Haiku 4  | 79.5% | 74.2%     | 55.8% | 85.3% | 42.1% | 22.1%     | 7.8
Gemini Flash    | 78.8% | 71.5%     | 53.2% | 84.7% | 40.5% | 18.5%     | 7.6

Open-Source Models

Model           | MMLU  | HumanEval | MATH  | GSM8K | GPQA  | SWE-bench | MT-Bench
----------------|-------|-----------|-------|-------|-------|-----------|---------
Llama 3 405B    | 86.1% | 81.2%     | 68.4% | 91.0% | 55.7% | 32.1%     | 8.9
Llama 3 70B     | 82.0% | 72.3%     | 55.2% | 86.5% | 48.3% | 25.4%     | 8.4
Mixtral 8x22B   | 77.8% | 65.4%     | 48.3% | 82.6% | 40.1% | 18.2%     | 8.1
Mistral 7B      | 64.2% | 48.9%     | 30.5% | 67.8% | 28.4% | 8.5%      | 7.4
Llama 3 8B      | 68.4% | 55.1%     | 35.7% | 72.1% | 30.2% | 12.3%     | 7.6

All scores are approximate and based on publicly reported results. Testing methodologies vary across evaluators.
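
With that caveat in mind, one crude way to compare models across the whole suite is to rescale every column to 0-1 (MT-Bench is on a 10-point scale, the rest are percentages) and take an unweighted mean. The sketch below does this for the frontier-tier rows above; equal weighting is only one of many defensible choices.

```python
# Rough cross-benchmark aggregate over the frontier-tier scores above.
# Percentages are scaled to 0-1 and MT-Bench (10-point scale) to 0-1,
# then averaged with equal weight -- one of many defensible choices.

SCORES = {  # MMLU, HumanEval, MATH, GSM8K, GPQA, SWE-bench, MT-Bench
    "o3":            [91.2, 92.7, 88.9, 97.2, 73.4, 48.2, 9.1],
    "Claude Opus 4": [89.4, 90.2, 78.3, 95.1, 65.1, 51.5, 9.2],
    "Gemini Ultra":  [90.1, 84.5, 76.8, 94.5, 64.3, 38.4, 8.8],
    "GPT-4o":        [88.7, 87.1, 74.6, 93.8, 61.8, 42.8, 9.0],
}

def aggregate(row: list[float]) -> float:
    """Unweighted mean of all benchmarks, each rescaled to 0-1."""
    normalized = [v / 100 for v in row[:-1]] + [row[-1] / 10]
    return sum(normalized) / len(normalized)

for model, row in sorted(SCORES.items(), key=lambda kv: -aggregate(kv[1])):
    print(f"{model:<14} {aggregate(row):.3f}")
```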

Category Leaders

Category                      | Best Model    | Score   | Runner-Up
------------------------------|---------------|---------|----------------------
General Knowledge (MMLU)      | o3            | 91.2%   | Gemini Ultra (90.1%)
Coding (HumanEval)            | o3            | 92.7%   | Claude Opus 4 (90.2%)
Math (MATH)                   | o3            | 88.9%   | Claude Opus 4 (78.3%)
Grade School Math (GSM8K)     | o3            | 97.2%   | Claude Opus 4 (95.1%)
Science (GPQA)                | o3            | 73.4%   | Claude Opus 4 (65.1%)
Real-World Coding (SWE-bench) | Claude Opus 4 | 51.5%   | o3 (48.2%)
Conversation (MT-Bench)       | Claude Opus 4 | 9.2     | o3 (9.1)
Best Open Source              | Llama 3 405B  | Various | Llama 3 70B

How to Interpret Benchmarks

Benchmarks are useful but imperfect. They provide standardized comparison points but do not capture everything that matters in real-world use. A model that scores 2% higher on MMLU may not feel meaningfully better in practice.

Watch for these issues:

  • Contamination. If benchmark questions appear in training data, scores are inflated.
  • Overfitting. Models can be optimized for specific benchmarks at the expense of general performance.
  • Task mismatch. Benchmarks may not represent your actual use case. A model that leads on MATH may not be best for your business writing needs.
  • Methodology differences. Different evaluators may use different prompting strategies, yielding different scores for the same model.

The best evaluation is testing models on your own tasks. Benchmarks narrow the field; personal testing makes the final decision.
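
A minimal harness for that kind of personal testing might look like the sketch below: each candidate model is wrapped in a plain callable, every prompt is paired with a task-specific pass/fail check, and the output is a per-model pass rate. The models dict and the checks are placeholders for your own stack, not any particular SDK.

```python
# Sketch of a personal eval harness: run your own prompts through each
# candidate model and grade the outputs with task-specific checks.
# `models` maps names to callables wrapping whichever SDKs you actually use.
from typing import Callable

def run_eval(models: dict[str, Callable[[str], str]],
             cases: list[tuple[str, Callable[[str], bool]]]) -> dict[str, float]:
    results = {}
    for name, ask in models.items():
        passed = sum(1 for prompt, check in cases if check(ask(prompt)))
        results[name] = passed / len(cases)
    return results

# Example: checks tailored to your tasks (stub model shown for demonstration).
cases = [
    ("Summarize our refund policy in one sentence.",
     lambda out: len(out.split(".")) <= 2),          # brevity check
    ("List three risks of caching user data.",
     lambda out: out.count("\n") >= 2),              # structure check
]
print(run_eval({"stub": lambda p: "One sentence."}, cases))  # {'stub': 0.5}
```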

What the Scores Reveal

  • Reasoning models dominate quantitative benchmarks. o3’s lead on MATH and GPQA shows the payoff of deliberate reasoning, though it comes at the cost of higher latency and price.
  • SWE-bench diverges from HumanEval. Real-world coding (SWE-bench) and function-writing (HumanEval) measure different skills. Claude Opus 4 leads on SWE-bench despite trailing o3 on HumanEval.
  • Open-source is closing the gap. Llama 3 405B matches or exceeds GPT-4-level scores from just two years ago.
  • MMLU is plateauing. With top models scoring 88-91%, MMLU is approaching ceiling effects. Harder benchmarks (GPQA, MATH) better differentiate frontier models.

Key Takeaways

  • o3 leads on most quantitative benchmarks (MMLU, HumanEval, MATH, GPQA) thanks to its reasoning approach.
  • Claude Opus 4 leads on real-world coding (SWE-bench) and conversation quality (MT-Bench).
  • Llama 3 405B is the strongest open-source model, competitive with last year’s frontier closed models.
  • Benchmarks are useful for narrowing options but should not be the sole basis for model selection.
  • The gap between budget models and frontier models is significant but narrowing.

This content is for informational purposes only and reflects independently researched comparisons. AI model capabilities change frequently — verify current specs with providers. Not professional advice.