Every time a new AI model launches, the announcement leads with benchmark scores. "93.2% on MMLU!" "92.0% on HumanEval!" These numbers shape purchasing decisions, drive media coverage, and influence billions of dollars in AI investment. But most people using these numbers to make decisions do not know what the benchmarks actually measure, how they are administered, or why they can be misleading. This guide fixes that.

MMLU: Massive Multitask Language Understanding

What It Measures MMLU is a multiple-choice exam covering 57 academic subjects, from abstract algebra to world religions. It includes 14,042 questions organized into four difficulty levels: elementary, high school, college, and professional. The benchmark tests breadth of knowledge and the ability to apply that knowledge to answer structured questions.

How It Works The model receives a question and four answer choices (A, B, C, D). It selects one. The score is the percentage of correct answers. Most evaluations report the overall average across all 57 subjects, though subject-level breakdowns are more informative.
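The scoring described above is simple enough to sketch. The snippet below computes the overall average and the more informative per-subject breakdown; the record fields ("subject", "gold", "pred") are illustrative names, not the official harness format.

```python
from collections import defaultdict

def score_mmlu(records):
    """records: iterable of dicts with 'subject', 'gold', 'pred' keys,
    where gold/pred are answer letters like 'A'..'D'."""
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for r in records:
        per_subject[r["subject"]][1] += 1
        if r["pred"] == r["gold"]:
            per_subject[r["subject"]][0] += 1
    total = sum(t for _, t in per_subject.values())
    correct = sum(c for c, _ in per_subject.values())
    overall = correct / total
    breakdown = {s: c / t for s, (c, t) in per_subject.items()}
    return overall, breakdown
```

Note that the overall number reported in announcements is usually this single average, which can hide large swings between subjects.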

What It Tells You A high MMLU score means the model has broad factual knowledge and can reason about academic topics at a college level. It is a good proxy for "general intelligence" in the traditional sense -- knowing things across many domains.

Limitations MMLU is a multiple-choice test. It does not measure the ability to generate coherent long-form text, write code, follow complex instructions, or engage in multi-turn reasoning. A model can score well on MMLU by being a good pattern matcher on structured questions without being genuinely useful for open-ended tasks.

MMLU is also approaching saturation. Top models score 88-92%, leaving little room for meaningful differentiation. A 1-2% difference in MMLU score is unlikely to translate to a noticeable difference in real-world performance.

Variant: MMLU-Pro MMLU-Pro is a harder version with 10 answer choices instead of 4 and more complex reasoning requirements. It offers better discrimination between frontier models. Current top scores are in the 70-75% range, giving more room for meaningful comparisons.

HumanEval: Code Generation

What It Measures HumanEval is a benchmark of 164 Python programming problems. Each problem includes a function signature, a docstring description, and a set of unit tests. The model must generate a correct function implementation.

How It Works The model receives a function signature and docstring, then generates the function body. The generated code is executed against the hidden unit tests. The score (pass@1) is the percentage of problems where the first generated solution passes all tests.
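The pass@1 metric generalizes to pass@k: sample n completions per problem, count the c that pass all tests, and estimate the probability that at least one of k samples would pass. This is the unbiased estimator introduced alongside HumanEval:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples passes,
    given n samples of which c passed. With n=1, pass@1 is just c/n."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging pass_at_k over all 164 problems gives the headline score.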

What It Tells You HumanEval measures the ability to write correct, self-contained Python functions from specifications. It is a good indicator of basic code generation capability.

Limitations HumanEval problems are relatively simple -- they are similar to LeetCode easy-to-medium problems. They test isolated function generation, not the ability to navigate a large codebase, understand architecture, debug existing code, or implement multi-file changes. A model can score 90%+ on HumanEval while struggling with real-world software engineering tasks.

Better Alternative: SWE-bench SWE-bench Verified tests models on real GitHub issues from popular open source projects. The model must understand the codebase, identify the relevant files, and generate a correct patch. This is dramatically harder and more representative of real coding work. Top scores on SWE-bench Verified are in the 40-50% range, suggesting significant room for improvement. We track SWE-bench scores on our leaderboard.

GSM8K: Grade School Math

What It Measures GSM8K contains 8,500 grade-school-level math word problems. Despite the "grade school" label, these problems require multi-step reasoning -- typically 2-8 sequential arithmetic operations embedded in a word problem.

How It Works The model receives a word problem and must produce the correct numerical answer. Most evaluations use chain-of-thought prompting, where the model shows its reasoning steps before providing the final answer.

What It Tells You GSM8K measures basic quantitative reasoning and the ability to decompose a word problem into sequential mathematical steps. It is a good baseline test for mathematical capability.

Limitations GSM8K problems are genuinely elementary. Top models score 95%+ on this benchmark, making it nearly saturated. The problems do not involve advanced mathematics -- no calculus, linear algebra, or statistics. A model scoring 95% on GSM8K may still struggle with college-level math.

Better Alternative: MATH The MATH benchmark contains 12,500 competition-level mathematics problems across seven subjects (prealgebra, algebra, number theory, counting and probability, geometry, intermediate algebra, and precalculus). Top scores range from 70-95% depending on the model, making it a much better discriminator. DeepSeek-R1 and o1 significantly outperform standard models on MATH thanks to their extended chain-of-thought reasoning.

GPQA Diamond: Graduate-Level Reasoning

What It Measures GPQA (Graduate-Level Google-Proof Questions and Answers) Diamond is a set of 198 extremely difficult questions written by PhD-level domain experts in physics, chemistry, and biology. These questions are designed to be "Google-proof" -- you cannot find the answers through simple search.

How It Works Multiple-choice format, but the questions require deep domain expertise and multi-step reasoning to answer correctly. In the original GPQA study, PhD-level experts achieved only about 65% accuracy within their own domain, and skilled non-experts scored around 34% even with unrestricted web access.

What It Tells You GPQA Diamond is one of the best benchmarks for measuring genuine deep reasoning capability. A high score indicates the model can engage in expert-level analysis, not just pattern matching on common knowledge.

Limitations The dataset is small (198 questions), which means individual scores can be noisy. It also focuses on STEM subjects, so it does not assess reasoning in humanities, social sciences, or practical domains.
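The noise concern can be made concrete with the standard error of a binomial proportion: for accuracy p on n questions, the rough 95% margin is 1.96 * sqrt(p * (1 - p) / n). A minimal sketch:

```python
from math import sqrt

def accuracy_margin(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an accuracy estimate p on n
    questions (normal approximation to the binomial)."""
    return z * sqrt(p * (1.0 - p) / n)
```

At p = 0.6 on GPQA Diamond's 198 questions the margin works out to roughly +/- 7 percentage points, so two models a few points apart may be statistically indistinguishable.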

MBPP: Mostly Basic Python Programming

What It Measures MBPP contains 974 crowd-sourced Python programming problems, ranging from simple string manipulation to more complex algorithmic tasks. It complements HumanEval with a larger and more diverse set of problems.

How It Works Similar to HumanEval: the model receives a description and must generate working Python code that passes the associated test cases.

What It Tells You MBPP is a broader test of Python programming ability than HumanEval. The larger problem set and wider difficulty range make it a more reliable indicator of general coding proficiency.

ARC-Challenge: Common-Sense Reasoning

What It Measures The AI2 Reasoning Challenge (ARC) tests scientific reasoning with questions drawn from grade 3-9 standardized science exams. The "Challenge" set contains only questions that both a retrieval-based algorithm and a word co-occurrence algorithm answer incorrectly.

What It Tells You ARC-Challenge measures common-sense and scientific reasoning ability. Unlike MMLU, which tests factual knowledge, ARC-Challenge requires applying knowledge to novel scenarios.

MT-Bench: Multi-Turn Conversation

What It Measures MT-Bench evaluates models on multi-turn conversations across 80 carefully designed two-turn prompts spanning eight categories: writing, roleplay, extraction, reasoning, math, coding, STEM knowledge, and humanities/social science knowledge. A strong model (typically GPT-4) judges the quality of responses on a 1-10 scale.

What It Tells You MT-Bench is one of the best benchmarks for conversational AI quality. It measures how well a model handles follow-up questions, maintains context, and produces helpful responses across diverse topics.

Limitations MT-Bench uses an AI model as a judge, which introduces bias toward outputs that are stylistically similar to the judge model. Models trained to produce GPT-4-like outputs may score disproportionately well.

Why Benchmarks Do Not Tell the Whole Story

The Overfitting Problem Model developers know which benchmarks they will be evaluated on. There is strong incentive to optimize for benchmark performance, which can come at the expense of broader capabilities. A model fine-tuned to maximize MMLU scores may sacrifice performance on tasks that MMLU does not measure.

The Evaluation Gap Benchmarks test isolated capabilities. Real-world applications combine multiple capabilities: understanding context, following instructions, generating structured output, handling edge cases, recovering from errors, and maintaining consistency across long interactions. No single benchmark captures this.

The Contamination Problem As models are trained on increasingly large portions of the internet, there is a risk that benchmark questions appear in training data. This inflates scores without improving genuine capability. Researchers attempt to mitigate this through data decontamination, but it is an ongoing challenge.
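One common decontamination technique is n-gram overlap: flag a benchmark question if it shares a long word sequence with any training document. The sketch below is illustrative; production pipelines operate on far larger corpora with hashing and fuzzier matching.

```python
def ngrams(text: str, n: int = 8):
    """Set of word n-grams in a text, lowercased."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(question: str, training_docs, n: int = 8) -> bool:
    """True if the question shares any n-gram with a training document."""
    q = ngrams(question, n)
    return any(q & ngrams(doc, n) for doc in training_docs)
```

Exact-match filters like this miss paraphrased leakage, which is part of why contamination remains an open problem rather than a solved one.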

What Actually Matters For selecting a model, benchmarks are a useful starting filter, not a final decision. After using benchmarks to narrow your candidates to 2-3 models, test those models on your actual data with your actual prompts. The model that performs best on your task may not be the one with the highest benchmark scores.

How to Use Benchmarks Effectively

1. Use benchmarks to establish a shortlist, not to make a final decision.
2. Look at task-relevant benchmarks: HumanEval/SWE-bench for coding, MATH for quantitative work, MT-Bench for conversation.
3. Ignore small differences (1-2%) -- they are within noise.
4. Pay attention to trends across multiple benchmarks, not individual scores.
5. Check whether the benchmark is saturated (scores above 90-95%) -- if so, it is no longer meaningfully differentiating models.
6. Always validate with your own evaluation set.
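The checklist above can be sketched as a toy filter: drop saturated benchmarks, rank models on the task-relevant ones, and treat score gaps inside the noise band as ties. The data shape and thresholds here are illustrative.

```python
SATURATION = 0.95  # rule 5: ignore benchmarks everyone aces
NOISE = 0.02       # rule 3: 1-2% gaps are within noise

def shortlist(scores, relevant, top_n=3):
    """scores: {model: {benchmark: fraction}}; relevant: benchmarks
    that matter for your task. Returns a shortlist of model names."""
    usable = [b for b in relevant
              if max(s.get(b, 0.0) for s in scores.values()) < SATURATION]
    def avg(model):
        vals = [scores[model][b] for b in usable if b in scores[model]]
        return sum(vals) / len(vals) if vals else 0.0
    ranked = sorted(scores, key=avg, reverse=True)
    leader = avg(ranked[0])
    # keep everyone within the noise band of the leader, capped at top_n
    return [m for m in ranked if leader - avg(m) <= NOISE][:top_n]
```

The output is a starting point for rule 6: run the shortlisted models against your own evaluation set before committing.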

Explore benchmark scores for every model in our directory on our leaderboard page, and use our comparison tool to see models head-to-head on the metrics that matter for your use case.