AI Benchmark Leaderboard: MMLU, HumanEval, MATH

Updated 2026-03-10

Benchmarks are the standardized tests of the AI world. They provide comparable scores across models, helping you evaluate capabilities objectively. This leaderboard tracks scores on the most important benchmarks and is updated as new results are published.

AI model comparisons are based on publicly available benchmarks and editorial testing. Results may vary by use case.

The Major Benchmarks Explained

MMLU (Massive Multitask Language Understanding): 57 subjects from elementary to professional level. Measures broad knowledge and reasoning. Think of it as a comprehensive general knowledge exam.
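
Scoring MMLU is straightforward because every item is four-way multiple choice: grade the predicted letter against the answer key and average over all items. A minimal sketch in Python, using a made-up question rather than a real MMLU item:

```python
# MMLU-style scoring sketch: four-way multiple choice graded by accuracy.
# The item below is made up for illustration, not drawn from MMLU itself.
item = {
    "question": "Which gas makes up most of Earth's atmosphere?",
    "choices": {"A": "Oxygen", "B": "Nitrogen", "C": "Argon", "D": "CO2"},
    "answer": "B",
}

def grade(predicted_letter: str, item: dict) -> bool:
    """Compare the model's chosen letter against the answer key."""
    return predicted_letter.strip().upper() == item["answer"]

print(grade("b", item))  # True; the MMLU score is mean accuracy over items
```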

HumanEval: Coding benchmark that tests the ability to write correct Python functions from docstring descriptions; solutions are graded by running them against unit tests. Measures practical programming ability.
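
To make this concrete, here is a minimal sketch of how a HumanEval-style harness grades a completion: the generated function is executed, the task's unit tests run against it, and the problem counts as solved only if nothing raises. The task is modeled on HumanEval's first problem; a real harness would sandbox execution, enforce timeouts, and sample multiple completions to report pass@k.

```python
# Sketch of HumanEval-style scoring: run the model's completion against
# unit tests; the problem is solved only if every assertion passes.
# (Illustrative; real harnesses sandbox execution and enforce timeouts.)

candidate_code = """
def has_close_elements(numbers, threshold):
    return any(abs(a - b) < threshold
               for i, a in enumerate(numbers)
               for b in numbers[i + 1:])
"""

test_code = """
assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
"""

def passes_tests(solution: str, tests: str) -> bool:
    namespace: dict = {}
    try:
        exec(solution, namespace)   # define the candidate function
        exec(tests, namespace)      # run the task's unit tests
        return True
    except Exception:
        return False

print(passes_tests(candidate_code, test_code))  # True -> counts toward pass@1
```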

MATH: Competition-level math problems. Covers algebra, geometry, number theory, counting and probability, and precalculus. One of the hardest benchmarks.

GSM8K: Grade-school math word problems. Easier than MATH but still requires multi-step reasoning.
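
Because GSM8K answers are single numbers, a common grading approach (sketched below; exact extraction rules vary by evaluator) is to pull the final number out of the model's worked solution and compare it with the gold answer. The word problem here is illustrative, not an actual GSM8K item.

```python
import re

def extract_final_number(response: str) -> str | None:
    """Pull the last number out of a model's worked solution."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", response)
    return matches[-1].replace(",", "") if matches else None

# Illustrative multi-step word problem in the GSM8K style (not a real item).
model_output = (
    "Each box holds 12 pencils, so 4 boxes hold 4 * 12 = 48 pencils. "
    "After giving away 9, there are 48 - 9 = 39 pencils left. "
    "The answer is 39."
)
print(extract_final_number(model_output) == "39")  # True -> scored correct
```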

GPQA (Graduate-Level Google-Proof Q&A): Expert-level biology, physics, and chemistry questions written so that even skilled non-experts cannot reliably answer them with unrestricted web access. Measures deep scientific reasoning.

SWE-bench: Tests the ability to resolve real GitHub issues. The most practical coding benchmark, measuring real-world development skills.

MT-Bench: Multi-turn conversation quality scored by GPT-4 as judge. Measures conversational ability across categories.
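
A rough sketch of how this LLM-as-judge scoring works: wrap the question and answer in a judging prompt, ask a strong model for a 1-10 rating, and parse the verdict. The template below is simplified from the real MT-Bench prompt, and call_judge stands in for whatever API client you actually use.

```python
import re

# Sketch of MT-Bench-style LLM-as-judge scoring. `call_judge` is a
# placeholder for a real model call (e.g. via an OpenAI or Anthropic SDK);
# the prompt is a simplified version of the MT-Bench judging template.

JUDGE_TEMPLATE = """Please act as an impartial judge and evaluate the quality
of the assistant's answer to the user's question. Rate the answer on a scale
of 1 to 10 and end with your verdict in the exact format: Rating: [[N]]

[Question]
{question}

[Assistant's Answer]
{answer}
"""

def parse_rating(judge_output: str) -> float | None:
    match = re.search(r"Rating:\s*\[\[(\d+(?:\.\d+)?)\]\]", judge_output)
    return float(match.group(1)) if match else None

def score_turn(question: str, answer: str, call_judge) -> float | None:
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    return parse_rating(call_judge(prompt))

# Stub judge for demonstration; swap in a real model call.
print(score_turn("Explain recursion.", "Recursion is ...",
                 call_judge=lambda p: "Rating: [[8]]"))  # -> 8.0
```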

Comprehensive Leaderboard

Frontier Models

Model           | MMLU  | HumanEval | MATH  | GSM8K | GPQA  | SWE-bench | MT-Bench
----------------|-------|-----------|-------|-------|-------|-----------|---------
o3              | 91.2% | 92.7%     | 88.9% | 97.2% | 73.4% | 48.2%     | 9.1
Claude Opus 4   | 89.4% | 90.2%     | 78.3% | 95.1% | 65.1% | 51.5%     | 9.2
Gemini Ultra    | 90.1% | 84.5%     | 76.8% | 94.5% | 64.3% | 38.4%     | 8.8
GPT-4o          | 88.7% | 87.1%     | 74.6% | 93.8% | 61.8% | 42.8%     | 9.0

Mid-Tier Models

Model           | MMLU  | HumanEval | MATH  | GSM8K | GPQA  | SWE-bench | MT-Bench
----------------|-------|-----------|-------|-------|-------|-----------|---------
Claude Sonnet 4 | 86.1% | 85.8%     | 71.2% | 92.4% | 58.5% | 46.3%     | 8.9
o3-mini         | 85.8% | 84.3%     | 80.5% | 95.8% | 60.2% | 35.1%     | 8.5
Gemini Pro      | 83.7% | 75.3%     | 68.2% | 89.5% | 54.7% | 30.2%     | 8.4
Mistral Large   | 84.0% | 75.6%     | 58.1% | 87.2% | 48.9% | 28.5%     | 8.5

Budget Models

Model           | MMLU  | HumanEval | MATH  | GSM8K | GPQA  | SWE-bench | MT-Bench
----------------|-------|-----------|-------|-------|-------|-----------|---------
GPT-4o mini     | 82.0% | 78.5%     | 62.4% | 88.1% | 47.2% | 28.3%     | 8.1
Claude Haiku 4  | 79.5% | 74.2%     | 55.8% | 85.3% | 42.1% | 22.1%     | 7.8
Gemini Flash    | 78.8% | 71.5%     | 53.2% | 84.7% | 40.5% | 18.5%     | 7.6

Open-Source Models

Model           | MMLU  | HumanEval | MATH  | GSM8K | GPQA  | SWE-bench | MT-Bench
----------------|-------|-----------|-------|-------|-------|-----------|---------
Llama 3 405B    | 86.1% | 81.2%     | 68.4% | 91.0% | 55.7% | 32.1%     | 8.9
Llama 3 70B     | 82.0% | 72.3%     | 55.2% | 86.5% | 48.3% | 25.4%     | 8.4
Mixtral 8x22B   | 77.8% | 65.4%     | 48.3% | 82.6% | 40.1% | 18.2%     | 8.1
Mistral 7B      | 64.2% | 48.9%     | 30.5% | 67.8% | 28.4% | 8.5%      | 7.4
Llama 3 8B      | 68.4% | 55.1%     | 35.7% | 72.1% | 30.2% | 12.3%     | 7.6

All scores are approximate and based on publicly reported results. Testing methodologies vary across evaluators.
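
With that caveat in mind, one crude way to compare models across the whole suite is to rescale every column to 0-1 (MT-Bench is on a 10-point scale, the rest are percentages) and take an unweighted mean. The sketch below does this for the frontier-tier rows above; equal weighting is only one of many defensible choices.

```python
# Rough cross-benchmark aggregate over the frontier-tier scores above.
# Percentages are scaled to 0-1 and MT-Bench (10-point scale) to 0-1,
# then averaged with equal weight -- one of many defensible choices.

SCORES = {  # MMLU, HumanEval, MATH, GSM8K, GPQA, SWE-bench, MT-Bench
    "o3":            [91.2, 92.7, 88.9, 97.2, 73.4, 48.2, 9.1],
    "Claude Opus 4": [89.4, 90.2, 78.3, 95.1, 65.1, 51.5, 9.2],
    "Gemini Ultra":  [90.1, 84.5, 76.8, 94.5, 64.3, 38.4, 8.8],
    "GPT-4o":        [88.7, 87.1, 74.6, 93.8, 61.8, 42.8, 9.0],
}

def aggregate(row: list[float]) -> float:
    """Unweighted mean of all benchmarks, each rescaled to 0-1."""
    normalized = [v / 100 for v in row[:-1]] + [row[-1] / 10]
    return sum(normalized) / len(normalized)

for model, row in sorted(SCORES.items(), key=lambda kv: -aggregate(kv[1])):
    print(f"{model:<14} {aggregate(row):.3f}")
```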

Category Leaders

Category                      | Best Model    | Score   | Runner-Up
------------------------------|---------------|---------|----------------------
General Knowledge (MMLU)      | o3            | 91.2%   | Gemini Ultra (90.1%)
Coding (HumanEval)            | o3            | 92.7%   | Claude Opus 4 (90.2%)
Math (MATH)                   | o3            | 88.9%   | Claude Opus 4 (78.3%)
Grade School Math (GSM8K)     | o3            | 97.2%   | Claude Opus 4 (95.1%)
Science (GPQA)                | o3            | 73.4%   | Claude Opus 4 (65.1%)
Real-World Coding (SWE-bench) | Claude Opus 4 | 51.5%   | o3 (48.2%)
Conversation (MT-Bench)       | Claude Opus 4 | 9.2     | o3 (9.1)
Best Open Source              | Llama 3 405B  | Various | Llama 3 70B

How to Interpret Benchmarks

Benchmarks are useful but imperfect. They provide standardized comparison points but do not capture everything that matters in real-world use. A model that scores 2% higher on MMLU may not feel meaningfully better in practice.

Watch for these issues:

  • Contamination. If benchmark questions appear in training data, scores are inflated.
  • Overfitting. Models can be optimized for specific benchmarks at the expense of general performance.
  • Task mismatch. Benchmarks may not represent your actual use case. A model that leads on MATH may not be best for your business writing needs.
  • Methodology differences. Different evaluators may use different prompting strategies, yielding different scores for the same model.

The best evaluation is testing models on your own tasks. Benchmarks narrow the field; personal testing makes the final decision.
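
A minimal harness for that kind of personal testing might look like the sketch below: each candidate model is wrapped in a plain callable, every prompt is paired with a task-specific pass/fail check, and the output is a per-model pass rate. The models dict and the checks are placeholders for your own stack, not any particular SDK.

```python
# Sketch of a personal eval harness: run your own prompts through each
# candidate model and grade the outputs with task-specific checks.
# `models` maps names to callables wrapping whichever SDKs you actually use.
from typing import Callable

def run_eval(models: dict[str, Callable[[str], str]],
             cases: list[tuple[str, Callable[[str], bool]]]) -> dict[str, float]:
    results = {}
    for name, ask in models.items():
        passed = sum(1 for prompt, check in cases if check(ask(prompt)))
        results[name] = passed / len(cases)
    return results

# Example: checks tailored to your tasks (stub model shown for demonstration).
cases = [
    ("Summarize our refund policy in one sentence.",
     lambda out: len(out.split(".")) <= 2),          # brevity check
    ("List three risks of caching user data.",
     lambda out: out.count("\n") >= 2),              # structure check
]
print(run_eval({"stub": lambda p: "One sentence."}, cases))  # {'stub': 0.5}
```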

What the Scores Reveal

  • Reasoning models dominate quantitative benchmarks. o3’s lead on MATH and GPQA shows the payoff of deliberate reasoning, though it comes at the cost of higher latency and price.
  • SWE-bench diverges from HumanEval. Real-world coding (SWE-bench) and function-writing (HumanEval) measure different skills. Claude Opus 4 leads on SWE-bench despite trailing o3 on HumanEval.
  • Open-source is closing the gap. Llama 3 405B matches or exceeds GPT-4-level scores from just two years ago.
  • MMLU is plateauing. With top models scoring 88-91%, MMLU is approaching ceiling effects. Harder benchmarks (GPQA, MATH) better differentiate frontier models.

Key Takeaways

  • o3 leads on most quantitative benchmarks (MMLU, HumanEval, MATH, GPQA) thanks to its reasoning approach.
  • Claude Opus 4 leads on real-world coding (SWE-bench) and conversation quality (MT-Bench).
  • Llama 3 405B is the strongest open-source model, competitive with last year’s frontier closed models.
  • Benchmarks are useful for narrowing options but should not be the sole basis for model selection.
  • The gap between budget models and frontier models is significant but narrowing.

This content is for informational purposes only and reflects independently researched comparisons. AI model capabilities change frequently — verify current specs with providers. Not professional advice.