99% pass rate? Maybe that’s super impressive because it’s a stress test, but if 1% of my code fails to compile I think I’d be in deep shit.
Also - one of the main arguments of vibe coding advocates is that you just need to check the result several times and tell the AI assistant what needs fixing. Isn’t a compiler test suite ideal for such a workflow? Why couldn’t they just feed the test failures back to the model and tell it to fix them, iterating again and again until they get it to work 100%?
Maybe they did, and that’s how they got to 99%. The remaining issues may be so intricate that the LLM just can’t solve them, no matter how many test cases you give it.
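For what it's worth, the loop being discussed can be sketched in a few lines of Python. This is only a toy illustration, not how anyone actually ran their experiment: `run_tests`, `ask_model_to_fix`, and the stub functions below are all hypothetical names, and the "model" is a stand-in that can remove "easy" bugs but never the "hard" one. It shows how a feed-the-failures-back loop can make progress and then stall short of 100%.

```python
from typing import Callable, List, Tuple


def iterate_until_green(
    run_tests: Callable[[str], List[str]],
    ask_model_to_fix: Callable[[str, List[str]], str],
    source: str,
    max_rounds: int = 10,
) -> Tuple[str, List[str]]:
    """Feed failing tests back to a (hypothetical) model until the suite
    passes, the round budget runs out, or a round makes no progress."""
    failures = run_tests(source)
    for _ in range(max_rounds):
        if not failures:
            break  # everything passes
        candidate = ask_model_to_fix(source, failures)
        new_failures = run_tests(candidate)
        if len(new_failures) >= len(failures):
            break  # plateau: the attempted fix did not reduce failures
        source, failures = candidate, new_failures
    return source, failures


# Toy stand-ins (assumptions for illustration only).
def toy_run_tests(src: str) -> List[str]:
    # Every token starting with "bug_" counts as one failing test.
    return [tok for tok in src.split() if tok.startswith("bug_")]


def toy_model(src: str, failures: List[str]) -> str:
    # The stub "model" can fix one easy bug per round and no hard ones.
    for failure in failures:
        if "easy" in failure:
            return " ".join(tok for tok in src.split() if tok != failure)
    return src


final_src, remaining = iterate_until_green(
    toy_run_tests, toy_model, "bug_easy1 bug_easy2 bug_hard"
)
print(remaining)
```

In this toy run the two easy failures get fixed and the loop stops with the hard one still failing, which is the scenario the reply above is guessing at.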