This feels a bit too optimistic; in practice it often gets stuck going down a rabbit hole (and burns through your requests/tokens while doing it!).
Even when I tested it on a clean assessment (albeit with Cursor in this case) - https://jamesmcm.github.io/blog/claude-data-engineer/ - it did very well in agent mode, but the questions it got wrong were worrying, because they're exactly the sort of mistakes a human might not notice either.
That said, I do think you could get a lot more accuracy by having the agent check and run its own answers, and then sending its diff to a very strong LLM like o3 or Gemini 2.5 Pro for review - it's just a bit expensive to do that at the moment.
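The diff-review step itself is cheap to wire up; here's a minimal sketch, assuming the OpenAI Python SDK with an API key in the environment - the model name and review prompt are just placeholders for whatever strong reviewer you have access to:

```python
# Rough sketch: have a second, stronger model review the agent's diff before
# you accept it. Assumes the OpenAI Python SDK and OPENAI_API_KEY is set;
# the model name is a placeholder for whatever reviewer model you trust.
import subprocess
from openai import OpenAI

client = OpenAI()

# The diff the coding agent just produced in your working tree.
diff = subprocess.run(
    ["git", "diff"], capture_output=True, text=True, check=True
).stdout

review = client.chat.completions.create(
    model="o3",  # placeholder: any strong model you have API access to
    messages=[{
        "role": "user",
        "content": (
            "Act as a strict code reviewer. Flag bugs, missed edge cases, "
            "and changes that don't match the stated task.\n\n" + diff
        ),
    }],
)

print(review.choices[0].message.content)
```

The expensive part isn't the plumbing, it's paying top-tier model prices for every diff on a big codebase.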
The main issue on real projects is that just gathering enough context to even approach a problem, and then building and running the tests, is very difficult when you have 100k+ lines of code and a clean build plus test run takes 15 minutes. And it feels like we're still years away from having all of the above, plus a context window large enough to make this a non-issue, at a reasonable price.