Benchmarks like this one are designed to thoroughly test the model across several iterations. 15% is a MASSIVE discrepancy.
Come on Anthropic, admit what you're doing already and let us access your best models unhindered, even if it costs us more. At the moment we all just feel short-changed.
This is genuinely very helpful. I'm planning a MacBook Pro purchase with local inference in mind, and now I can see I'll have to aim for a slightly higher memory option because the Gemma A4 26B MoE is not all that!
I upgraded my M4 Pro 24GB to an M5 Pro 48GB yesterday. The same Gemma 4 MoE model (4bit, don't remember which version) runs about 8x faster on the M5 Pro and loads into memory about 2x faster.
You don't know if it's the newer chip or the extra RAM: if someone already has 48GB, they might not see much benefit. You changed two things at once.
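If you wanted to separate the two, you could run the exact same quantized GGUF on both machines and time only the generation. A rough sketch of that measurement, assuming llama-cpp-python and a made-up model filename:

```python
# Rough tokens/sec check: run the identical quantized file on each machine
# so the hardware is the only variable. The model path here is hypothetical.
import time
from llama_cpp import Llama

llm = Llama(model_path="gemma-moe-4bit.gguf", n_ctx=2048, verbose=False)

# Warm-up generation so load/first-run overhead doesn't pollute the timing.
llm("Warm-up.", max_tokens=8)

start = time.perf_counter()
out = llm("Explain KV caching in one paragraph.", max_tokens=256, temperature=0)
elapsed = time.perf_counter() - start

n = out["usage"]["completion_tokens"]
print(f"{n} tokens in {elapsed:.2f}s -> {n / elapsed:.1f} tok/s")
```

If the 48GB machine only wins on models that wouldn't fit comfortably in 24GB, it was the RAM; if it also wins on a model that fits easily in both, it's the chip.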
I've had this thought myself too. Going off on a slight tangent: I think there's loads of useful knowledge in domains like these that maps amazingly well onto AI agent system design, but the gap between the fields' knowledge bases is so huge that the benefit never really surfaces.
(Speaking from the perspective of someone who simultaneously loves high-performance compute and agentic AI haha)
I will always maintain that the best benchmark is just trying it out for yourself.
The clearest parallel for me is all the people posting about how some open-source model has "achieved X on Y benchmark, beating out Opus 4.6!"
It's all show and everyone cheats.
They shouldn't be surprised at the thousands moving to Codex every day.
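And "trying it out for yourself" doesn't have to be heavyweight: keep a fixed set of your own prompts and replay them against every candidate model. A minimal sketch, again assuming llama-cpp-python, with made-up prompts and model path:

```python
# Minimal personal eval: replay a fixed prompt set against a candidate model
# and dump the answers for side-by-side reading. Paths and prompts are made up.
import json
from llama_cpp import Llama

PROMPTS = [
    "Refactor this recursive Python function to be iterative: ...",
    "Summarize the tradeoffs of MoE vs. dense models in three sentences.",
]

llm = Llama(model_path="candidate-q4.gguf", n_ctx=4096, verbose=False)

results = []
for p in PROMPTS:
    out = llm(p, max_tokens=512, temperature=0)  # deterministic, for fair comparison
    results.append({"prompt": p, "answer": out["choices"][0]["text"]})

with open("eval-candidate.json", "w") as f:
    json.dump(results, f, indent=2)
```

Diff two of those JSON files on prompts from your actual work and you'll learn more than from any leaderboard.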