Fascinating case showing how LLM promoters will happily take "verified" benchmarks at their word.
It's easy to publish "$NEWMODEL received an X% bump on SWE-bench Verified!!!!".
Proper research means interrogating the traces, like these researchers did (the Gist shows Claude 4 Sonnet): https://gist.github.com/jacobkahn/bd77c69d34040a9e9b10d56baa...
Commentary: https://x.com/bwasti/status/1963288443452051582, https://x.com/tmkadamcz/status/1963996138044096969