A Chinese Open-Weight Model Beat Claude on Cybersecurity Benchmarks. Then the Scaffolding Beat Everyone.

Start with the number nobody's tweeting about. Semgrep's structured evaluation harness, a pipeline that needs no trillion-parameter brain, beat both GLM-5.2 and Claude Code by 15 to 20 points on the same IDOR detection task. If you're a security team deciding where to spend money, that's the finding: the orchestration layer matters more than which model sits inside it. Infrastructure people have been saying this for a while, and now there's a benchmark to point at.

The distillation allegations are stacking into a pattern. DeepSeek faced similar accusations in February. GLM-5.1 claimed the top SWE-Bench Pro spot in April with results that were never independently replicated. Vercel's Guillermo Rauch pointed out that on SWE-Marathon, GLM-5.2 scores 13.0 to Opus 4.8's 26.0. Strong on one benchmark, middling on another. But the weights are MIT-licensed and have already been downloaded thousands of times, so the provenance question is almost academic at this point.

Which makes the export control conversation especially uncomfortable. Two weeks ago the US government pulled Claude Fable 5 offline globally because real-time nationality filtering was impractical. Anthropic withheld Mythos over cybersecurity concerns. And now an open-weight Chinese model is posting competitive security numbers under a license that makes its distribution effectively irreversible. TechTimes ran "AI Export Controls Fail Their First Real Test" as a headline. Hard to argue with that one.

Alex Stamos estimated six months until open-weight models caught frontier on vulnerability-finding. It's been about two and a half. Happy almost-Fourth.
