Gemini Deep Think earned a gold medal at the 2025 International Mathematical Olympiad, working end to end in natural language within the 4.5-hour time limit. On ClockBench, the same model read analog clocks correctly 50.1% of the time. Humans score around 90%.
The Stanford 2026 AI Index, published this month, is full of contrasts like this. On OSWorld, a benchmark of computer-use tasks across operating systems, the best model jumped from 12% to 66% in about a year, within striking distance of the human baseline of 72%. It also means the system fails roughly a third of the time on tasks a person would handle routinely.
Stanford calls this a "jagged frontier," borrowing from a 2023 Harvard study by Dell'Acqua et al. The concept isn't new. What's striking in this year's data is that the jaggedness shows no clear sign of shrinking.
Whether Scale Fills the Valleys
A lot of AI investment carries an implicit assumption: train on more data, add more compute, and the valleys fill in alongside the peaks. A 2024 paper in Nature by Zhou et al. suggests otherwise. Larger, more instructable models fail on easy problems too, with more confidence. The authors found that scaled-up models produce "apparently sensible yet wrong" answers that human supervisors frequently overlook. They later noted the same patterns in models released after publication, including OpenAI's o1 and Anthropic's Claude 3.5 Sonnet.
For anyone designing human-in-the-loop systems, this is where the ground shifts. If the model's errors on simple tasks look plausible enough that reviewers wave them through, the oversight layer stops functioning as oversight. The gold medal performance is genuine. So is the quiet confidence on the wrong answer.
The Practitioners Aren't Waiting
Princeton researchers evaluated 12 frontier models over 18 months and found that despite rapid capability improvements, reliability "barely budged." The four dimensions they measured turned out to be independent of raw capability. A highly capable system can be unreliable. A less capable system can be reliable within its narrower range.
Practitioners seem to be absorbing this. In the Stack Overflow 2025 survey, trust in AI accuracy fell from 40% to 29% year over year, while usage climbed to 84%. Top frustration: outputs that are "almost right, but not quite." Two-thirds of developers reported that fixing near-miss AI-generated code is eating into their time.
Databricks found that organizations using evaluation tools move nearly six times more AI systems to production than those that don't. Their recommendation: use public benchmarks early to sanity-check capabilities, then build your own evaluations to determine whether a system is actually ready to ship.
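What that second step looks like in practice is less standardized than the advice sounds. As a rough illustration only (not Databricks' tooling; every name below is a hypothetical placeholder), a domain-specific evaluation can be little more than a harness that replays your own tasks repeatedly and reports worst-case consistency alongside the average:

```python
# A minimal sketch of a "build your own evaluation" harness. Assumptions:
# call_model is whatever function wraps your deployed system, and each
# EvalCase carries a domain-specific pass/fail check you trust.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # domain-specific pass/fail judgment

def run_eval(call_model: Callable[[str], str],
             cases: list[EvalCase],
             trials: int = 5) -> dict:
    """Score each case over repeated trials; keep worst-case numbers, not just the mean."""
    per_case = []
    for case in cases:
        passes = sum(case.check(call_model(case.prompt)) for _ in range(trials))
        per_case.append(passes / trials)
    return {
        "mean_pass_rate": sum(per_case) / len(per_case),
        "worst_case_pass_rate": min(per_case),
        "share_always_passing": sum(r == 1.0 for r in per_case) / len(per_case),
    }

# Hypothetical usage: gate a release on your own task distribution,
# e.g. ship only if run_eval(my_model, my_cases)["share_always_passing"]
# clears whatever bar your application actually needs.
```

The worst-case and always-passing numbers are the ones a public leaderboard won't give you, and they're usually closer to what decides whether a system survives contact with production.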
Two Curves
Most of the market has been treating peak performance and deployment readiness as points on the same curve, separated by time. The Stanford data makes a case that they're different curves, with different slopes.
And this creates a real problem for enterprise teams evaluating systems right now. Two vendors can post similar benchmark scores while delivering very different production reliability. Benchmarks reward peak performance on curated tasks. Consistency across the messy distribution of inputs a system actually encounters is a separate measurement, and one that rarely shows up in a procurement process. Pick the higher-scoring model and you may end up with the less deployable one. The metrics that would distinguish between them, the kind Databricks and Princeton are pointing toward, aren't yet standard.
The gold medal is real. The clock-reading score is also real. At run ten thousand, which number predicts what happens?
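One way to make that question concrete is to treat a score as a per-run success probability and ask what it implies at deployment volume. The rates below are placeholders for the arithmetic, not the scores cited above, and the independence assumption is generous, since real traffic tends to correlate failures:

```python
# Back-of-the-envelope arithmetic: how per-run success compounds over
# 10,000 runs. The rates are illustrative assumptions, not measured data.
def expected_failures(success_rate: float, runs: int = 10_000) -> float:
    """Expected number of failed runs out of `runs` independent attempts."""
    return runs * (1 - success_rate)

def chance_of_zero_failures(success_rate: float, runs: int = 10_000) -> float:
    """Probability that every one of `runs` independent attempts succeeds."""
    return success_rate ** runs

for rate in (0.90, 0.99, 0.999):
    print(f"{rate:.1%} per run -> ~{expected_failures(rate):,.0f} expected failures, "
          f"P(all 10,000 succeed) = {chance_of_zero_failures(rate):.1e}")
```

The headline benchmark number tells you whether success is possible; the per-run consistency number tells you how many of those ten thousand runs come back wrong.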
Things to follow up on...
- Reliability has its own science: Princeton's Kapoor and Narayanan published an interactive dashboard tracking four reliability dimensions across frontier models, arguing that the AI industry lacks even a shared definition of what "reliable" means.
- Benchmark rot is accelerating: Stanford found that evaluations designed to last years are saturating in months, with popular benchmarks like GSM8K carrying invalid-question rates as high as 42%.
- The lab-to-production gap, quantified: AWS research documented a 37% performance drop when multi-agent systems move from controlled benchmarks to real-world deployment, with 50x cost variation for similar accuracy levels.
- Governance predicts production, not capability: Databricks' survey of 20,000+ organizations found that companies using AI governance tools get over 12 times more AI projects into production than those without them.

