MMLU for Large Language Models: What It Measures and What It Misses
For years, if you wanted to know how smart a large language model was, you looked at its score on the Massive Multitask Language Understanding (MMLU) benchmark. It was the gold standard. The definitive report card. But as of mid-2026, that report card has problems. Scores are hitting ceilings, questions are leaking into training data, and the test itself contains errors. So, what does MMLU actually tell us about modern AI? And more importantly, what critical capabilities is it failing to measure?
To understand where we stand, we need to look past the single-digit percentage points that dominate marketing slides. We need to dissect the anatomy of the benchmark, see why it worked in 2020, and why it’s struggling to keep up with models like Claude Opus 4.5 or Gemini 3 Pro today.
The Origin Story: Why MMLU Became the King
Back in September 2020, researchers led by Dan Hendrycks at UC Berkeley released MMLU. Before this, evaluating AI was messy. You had narrow tests for specific tasks: translation here, sentiment analysis there. There was no unified way to say, "Can this model do general academic work?"
MMLU changed that. It packaged 15,908 multiple-choice questions across 57 subjects. These weren't random trivia. They covered everything from elementary math and US history to professional law and medical diagnosis. The goal was simple: mimic the breadth of human education. If a model could ace these, it arguably possessed broad, expert-level knowledge.
When GPT-3 175B first took the test, it scored 43.9%. Remember, random guessing on a four-option question gets you 25%. So GPT-3 was better than luck, but far from an expert. Human experts, by comparison, score around 89.8%. That gap of nearly 46 percentage points gave researchers a clear mountain to climb. For the next few years, every new model release was measured against this trajectory.
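The arithmetic behind those numbers is simple enough to sketch. The snippet below is a minimal illustration, not the official MMLU scoring harness: the function names are hypothetical, and the scores are the figures quoted above.

```python
def accuracy(predictions, answers):
    """Fraction of questions where the predicted letter matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


CHANCE = 1 / 4        # random guessing on four options
GPT3_SCORE = 0.439    # figures quoted in the article
HUMAN_EXPERT = 0.898


def points_above_chance(score, chance=CHANCE):
    """How many percentage points a score sits above random guessing."""
    return (score - chance) * 100


# Gap between GPT-3 and human experts, in percentage points.
gap_to_experts = (HUMAN_EXPERT - GPT3_SCORE) * 100

print(f"GPT-3 above chance: {points_above_chance(GPT3_SCORE):.1f} points")
print(f"GPT-3 below experts: {gap_to_experts:.1f} points")
```

Reading a score relative to the 25% floor, rather than relative to zero, makes it clearer how little headroom GPT-3 actually had over guessing.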
| Model | Release Year | MMLU Score | Context |
|---|---|---|---|
| GPT-3 175B | 2020 | 43.9% | Baseline; significantly below human expert level. |
| GPT-4 | 2023 | 86.4% | Closed the gap dramatically; near-human performance. |
| Claude 3 Opus | 2024 | 86.8% | Competitive frontier model; strong in humanities. |
| Gemini Ultra | 2024 | 83.7% | Strong multimodal capabilities; solid general knowledge. |
| Frontier Models (2025) | 2025 | >90% | Saturation point reached; scores no longer differentiate well. |
What MMLU Actually Measures
Let’s be clear about what a high MMLU score proves. It proves that a model has ingested a massive amount of factual information and can retrieve it when prompted. It shows that the model understands the structure of academic exams. It demonstrates cross-domain generalization: the ability to switch from solving a calculus problem to analyzing a Shakespearean sonnet without breaking a sweat.
The benchmark is structured in four difficulty tiers:
- Elementary & Middle School: Basic arithmetic, geography, biology.
- High School: Calculus, advanced chemistry, literature.
- College: Specialized undergraduate knowledge.
- Professional: Medical diagnosis, legal reasoning, scientific research.
This tiered approach was brilliant because it allowed us to see *where* a model struggled. Early models, for instance, bombed on professional law and moral scenarios. They were "near random" on these topics. This told us that scaling up parameters improved rote memorization faster than it improved nuanced ethical or logical reasoning. MMLU successfully highlighted these blind spots for years.
The Cracks in the Foundation: What MMLU Misses
But here is the problem: As models got smarter, MMLU got dumber. Or rather, its utility evaporated. By 2025, the original MMLU was being phased out in many serious evaluations. Why? Three major reasons.
1. Data Contamination (The Cheating Problem)
MMLU is public. It has been downloaded over 100 million times. When companies train their next-generation models, they scrape the internet. If MMLU questions and answers are on the internet, the model sees them during training. It doesn’t solve the problem; it memorizes the answer key.
When a model scores 92% on MMLU today, is it because it understands quantum physics? Or did it just recognize the question from its training data? We often don’t know. This "leakage" makes high scores suspicious. It turns a test of reasoning into a test of memory. This is why researchers are moving toward closed datasets or constantly rotating questions.
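One common heuristic for spotting leakage is to look for long word sequences that a benchmark question shares with a training document. The sketch below assumes you have access to the training text, which outside evaluators usually do not; the function names and the 8-token window are illustrative choices, not a standard.

```python
import re


def ngrams(text, n):
    """Set of lowercase word n-grams in a piece of text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(question, corpus_doc, n=8):
    """Share of the question's n-grams that also appear in corpus_doc.

    A ratio near 1.0 suggests the question appears almost verbatim in the
    document; 0.0 means no shared n-gram (or a question shorter than n tokens).
    """
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0
    return len(question_grams & ngrams(corpus_doc, n)) / len(question_grams)


question = "What is the capital of France and which river runs through it"
leaked = "Forum post: what is the capital of France and which river runs through it? Answer: Paris"
clean = "The Seine is a river in northern France that flows through Paris"

print(overlap_ratio(question, leaked))  # high overlap: likely leaked
print(overlap_ratio(question, clean))   # no shared 8-gram
```

Exact-match overlap misses paraphrased or translated copies of a question, so a low ratio is weak evidence of a clean test, not proof.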
2. The Error Rate (The Broken Test)
Here is a shocking fact: Approximately 6.5% of MMLU questions contain errors. This was revealed in audits leading to the creation of MMLU-Redux. Some questions have ambiguous wording. Others have mislabeled correct answers. Some options are logically flawed.
If the test itself is wrong, then the effective ceiling isn’t 100%. It’s closer to 93.5%. So when two models score 89% and 90%, the difference might not be skill. It might be which model happened to match the answer key on a broken question. This noise makes it hard to trust small margins between top-tier models.
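A quick back-of-the-envelope calculation shows why a one-point margin is hard to trust. This sketch uses the article's own figures and assumes, as a simplification, that the 6.5% error rate applies evenly across all 15,908 questions and that a flawed question never credits a correct answer.

```python
TOTAL_QUESTIONS = 15_908   # question count quoted in the article
ERROR_RATE = 0.065         # approximate share of flawed questions

# Number of questions expected to be flawed.
flawed = TOTAL_QUESTIONS * ERROR_RATE

# Number of questions that separate two models one percentage point apart.
one_point = TOTAL_QUESTIONS * 0.01

# Effective ceiling if flawed questions can never be answered "correctly".
ceiling = 1 - ERROR_RATE

print(f"Flawed questions:       ~{flawed:.0f}")
print(f"One-point margin:       ~{one_point:.0f} questions")
print(f"Flawed pool vs. margin: {flawed / one_point:.1f}x")
print(f"Effective ceiling:      {ceiling:.1%}")
```

Under these assumptions the pool of flawed questions is several times larger than the number of questions that decide a one-point lead, which is the sense in which the margin is noise.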
3. The Format Limitation (Multiple Choice Blindness)
MMLU uses four-option multiple choice. This format forces a model to pick the "best" answer, but it hides the process. Did the model reason through the problem step-by-step? Or did it use a linguistic shortcut?
More importantly, real-world AI tasks aren’t multiple choice. You don’t ask a customer service bot, "Is the user angry? A) Yes B) No." You ask it to resolve a complaint. MMLU misses open-ended generation quality, long-horizon planning, and safety alignment. A model can ace a medical exam but still hallucinate dangerous treatment advice in a free-form chat. MMLU simply doesn’t test that.
The Successors: MMLU-Pro and Beyond
Because of these flaws, the community built better tools. The most significant is MMLU-Pro, developed by researchers at the University of Waterloo.
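MMLU-Pro is harder. It focuses on proficient-level, reasoning-intensive tasks. It expands each question from four answer options to ten, which cuts the random-guess baseline from 25% to 10%. It also uses Chain-of-Thought (CoT) prompting, forcing the model to explain its steps before answering. Together, these changes filter out lucky guesses and superficial pattern matching.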
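In practice, CoT evaluation means asking for the reasoning first and then parsing a final answer out of free text. The sketch below shows one way to do that. The prompt wording, the "The answer is (X)" convention, and the function names are assumptions for illustration, not the official MMLU-Pro harness; the A to J letter range assumes at most ten options.

```python
import re
import string


def build_cot_prompt(question, options):
    """Format a multiple-choice item so the model reasons before answering."""
    letters = string.ascii_uppercase
    lines = [f"Question: {question}", "Options:"]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append('Think step by step, then finish with "The answer is (X)".')
    return "\n".join(lines)


# Capital letter A-J, optionally preceded by "(", and not followed by more letters.
ANSWER_RE = re.compile(r"answer is \(?([A-J])\b")


def extract_answer(completion):
    """Return the last stated answer letter, or None if nothing parseable."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


prompt = build_cot_prompt("What is 2 + 2?", ["3", "4", "5"])
completion = "2 + 2 equals 4, which is option B. The answer is (B)."
print(prompt)
print(extract_answer(completion))
```

Taking the last match lets a model correct itself mid-reasoning, and returning `None` for an unparseable completion means it is scored as wrong instead of being replaced by a guess.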
The drop-off in scores tells the story. While models hit >90% on original MMLU, they struggled on MMLU-Pro. For example, GPT-4o scored 72.6% on MMLU-Pro initially. By early 2026, top models like Google Gemini 3 Pro (~90.1%) and Anthropic Claude Opus 4.5 (~89.5%) have pushed scores up, but the gap remains wider than on the original test. This suggests that while factual recall is saturated, deep reasoning is still evolving.
Other benchmarks like MMMU address the modality gap by adding images, while HELM (Holistic Evaluation of Language Models) at Stanford continues to use MMLU as part of a broader suite, acknowledging its historical value but supplementing it with other metrics.
How to Interpret Benchmarks in 2026
So, if you’re an engineer, a manager, or an investor looking at AI reports today, how should you read these numbers?
- Ignore Single-Number Scores: A headline saying "Model X beats Model Y on MMLU" is misleading if the margin is less than one percentage point. Given the error rate and contamination, differences that small are within the noise.
- Look for Reasoning Benchmarks: Prioritize scores on MMLU-Pro, GPQA, or live coding benchmarks. These test *how* the model thinks, not just what it knows.
- Demand Domain Breakdowns: Don’t accept an average. Ask for performance in your specific vertical. A model might be great at law but terrible at code. MMLU’s per-subject breakdowns are useful, but only if you trust the data integrity.
- Test for Contamination: Reputable evaluators now run "held-out" sets or use newly created questions to ensure the model hasn’t seen the test before.
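For the domain-breakdown point, the computation is worth seeing once, because a single average can hide a weak vertical. The sketch below assumes evaluation results are available as `(subject, predicted, gold)` tuples; that record layout and the function names are hypothetical.

```python
from collections import defaultdict


def per_subject_accuracy(records):
    """Accuracy per subject from (subject, predicted, gold) tuples."""
    totals = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, predicted, gold in records:
        totals[subject][0] += predicted == gold
        totals[subject][1] += 1
    return {subject: c / t for subject, (c, t) in totals.items()}


def micro_and_macro(records):
    """Return (micro average, macro average, per-subject breakdown).

    Micro weights every question equally; macro weights every subject equally.
    """
    records = list(records)
    by_subject = per_subject_accuracy(records)
    micro = sum(p == g for _, p, g in records) / len(records)
    macro = sum(by_subject.values()) / len(by_subject)
    return micro, macro, by_subject


records = [
    ("law", "A", "A"), ("law", "B", "B"), ("law", "C", "C"), ("law", "D", "D"),
    ("code", "A", "B"), ("code", "A", "A"),
]
micro, macro, by_subject = micro_and_macro(records)
print(f"micro={micro:.2f} macro={macro:.2f} breakdown={by_subject}")
```

In this toy data the headline micro average looks healthy because law has more questions, while the breakdown shows the model gets only half of the code questions right.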
MMLU was the hero we needed in 2020. It gave us a common language to discuss AI progress. But heroes age. Today, it serves as a baseline: a check to see if a model has basic literacy. To find out if a model is truly intelligent, safe, and useful, we need to look deeper, beyond the multiple-choice bubble sheet.
Why is MMLU considered outdated in 2026?
MMLU is considered outdated primarily due to data contamination (models memorizing answers from training data), a known 6.5% error rate in its questions, and score saturation where top models exceed 90%, making it difficult to distinguish between them. Newer benchmarks like MMLU-Pro offer more rigorous testing of reasoning capabilities.
What is the difference between MMLU and MMLU-Pro?
MMLU focuses on broad factual knowledge via multiple-choice questions. MMLU-Pro is a more challenging derivative that emphasizes complex reasoning, uses Chain-of-Thought prompting, and includes fewer, higher-quality questions designed to reduce ambiguity and test deeper understanding rather than rote memorization.
Does a high MMLU score guarantee a model is safe?
No. MMLU measures academic knowledge and exam-style problem-solving. It does not evaluate safety alignment, bias mitigation, or behavior in open-ended, real-world interactions. A model can have high factual accuracy but still generate harmful or biased content in unstructured conversations.
What is data contamination in AI benchmarks?
Data contamination occurs when benchmark questions and answers are included in a model's training dataset. This allows the model to memorize correct answers rather than deriving them through reasoning, leading to inflated scores that do not reflect true generalization abilities.
Who created the MMLU benchmark?
MMLU was created by Dan Hendrycks and his team at UC Berkeley in 2020. It was designed to provide a comprehensive evaluation of large language models across 57 diverse subjects ranging from elementary education to professional expertise.
- Jul, 3 2026
- Collin Pace