HN Reader

My bytecode optimizer beats Copilot by 2x

Big takeaway for me: the win isn’t better prompts, it’s semantic guarantees. By proving at the bytecode level that the pixel loop is side-effect-free, you can safely split it into long-lived workers and use an order-preserving queue. It's an aggressive transform copilots won’t attempt because they can’t verify invariants. That difference in guarantees (deterministic analysis vs. probabilistic suggestion) explains the 2× gap more than anything else.

6 months ago by farkin88
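To make farkin88's point concrete, here is a minimal Python sketch of the transform described: once a per-row pixel function is known to be side-effect-free, the loop can be farmed out to a pool of workers and the results collected in input order. This is not the article's optimizer. The names `mandelbrot_row`, `render_sequential`, `render_parallel`, and the image dimensions are hypothetical, and threads are used only to keep the sketch self-contained (in CPython a real speedup on CPU-bound work would need processes).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical image size, chosen small so the sketch runs instantly.
WIDTH, HEIGHT, MAX_ITER = 16, 8, 50

def mandelbrot_row(y):
    """Pure function: reads only its argument and constants, writes no shared state."""
    row = []
    for x in range(WIDTH):
        c = complex(-2.0 + 3.0 * x / WIDTH, -1.0 + 2.0 * y / HEIGHT)
        z = 0j
        n = 0
        while abs(z) <= 2.0 and n < MAX_ITER:
            z = z * z + c
            n += 1
        row.append(n)
    return row

def render_sequential():
    # The original loop: rows computed one after another.
    return [mandelbrot_row(y) for y in range(HEIGHT)]

def render_parallel(workers=4):
    # Executor.map returns results in input order, even when rows
    # finish out of order, so it plays the role of the order-preserving queue.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(mandelbrot_row, range(HEIGHT)))
```

The split is only safe because `mandelbrot_row` is pure. If it mutated shared state, the parallel and sequential versions could diverge, which is exactly the invariant a static analysis can prove and a suggestion engine cannot.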

Anything built to purpose (by a competent dev) will usually beat out a general purpose tool. I remember burntsushi being surprised that my purpose-built unicode segmentation code so dramatically outperformed the unicode segmentation he had in bytestring which was based on regular expressions, but personally I would be surprised if it were any different.

6 months ago by dhosek

I'm totally not surprised by this. It would be strange if, at this point, we couldn't find anything that a specialized tool could do better.

But rest assured that the LLM folks are watching, and learning from this, so the issue will probably be resolved in the next version. Of course without thanking/crediting the author of the article.

6 months ago by amelius

I have a breach parser that I wrote to search through over 3 billion rows of compressed data (by parsing I simply mean searching for a particular substring). I've tried multiple LLMs to make it faster (currently it does so in <45 seconds on an M3 Pro Mac), but none have been able to do that yet.

https://github.com/44za12/breach-parse-rs

Feel free to drop ideas if any.

6 months ago by 44za12
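For readers unfamiliar with the task 44za12 describes, a naive baseline looks roughly like the sketch below. It is not the linked repository's implementation. It assumes gzip-compressed, line-oriented input, and `count_matches` is a hypothetical name.

```python
import gzip

def count_matches(path, needle):
    """Stream a gzip file and count the lines that contain `needle` (bytes)."""
    hits = 0
    # Opening in binary mode avoids decoding every row to text.
    with gzip.open(path, "rb") as fh:
        for line in fh:
            if needle in line:
                hits += 1
    return hits
```

A baseline like this is bounded by single-threaded decompression, so any serious speedup has to come from parallelising across files or chunks rather than from tweaking the substring test.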

Whew, compilers are still better than LLMs.

6 months ago by cr125rider

It is very likely that an LLM will be able to plagiarize https://ispc.github.io/example.html and steal ready-to-use optimal code for Mandelbrot, while specialized optimizers are locked within a domain. Not to mention that the author is producing graphics: the task should be solved on the GPU in the first place.

6 months ago by Lockal

I certainly expect a human to do better here but if you wanna show it, giving a one line prompt to 2nd best LLMs to one-shot it isn't really the way to do it. Use Opus and o3, and give it to an agent that can measure things and try more than once.

6 months ago by furyofantares

It's unsurprising to me that the author got this outcome. However, instead of just prompting to optimize the code, I suspect they would have gotten much stronger results from the models if they'd prompted them to write an optimizer.

6 months ago by bastawhiz

This is cool. I wonder if your VM could work in conjunction with an LLM? Have you tried making this optimizer available as an MCP, or maybe some of the calculated invariants could be exposed as well?

6 months ago by gsoltis

The author used Copilot in Rider; tbh the Rider one is one of the worst. Which LLM was used? Copilot in VS Code and VS allows you to select it.

I suspect these benchmarks were run on the default model, ChatGPT 4o, which is now more than a year old.

6 months ago by Kuinox

Wouldn't the real test be to run all the code through the bytecode optimizer and see who's faster then?

I mean, this is sort of the same as testing the LLM output against the -O3 compiler optimization flag while compiling their programs with no optimizations. Actually, if I read TFA correctly, this is exactly what they're doing, am I wrong?

Or maybe I am wrong and they're testing their VM against compiled code, dunno?

6 months ago by UncleEntity

Hopefully I'm not sounding too pedantic in mentioning this, but LLMs are still deterministic if you're using the same prompt, seed, temperature, etc. (sometimes this even requires the same hardware).

6 months ago by lawlessone
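lawlessone's point can be shown with a toy sampler: with the logits, temperature, and seed all fixed, temperature sampling replays the same token sequence every time. This is only a sketch, not how any particular LLM runtime works. `sample_tokens` and the example logits are hypothetical, and real deployments can still diverge across hardware because floating-point reduction order differs.

```python
import math
import random

def sample_tokens(logits, temperature, seed, n=10):
    """Draw n token ids from softmax(logits / temperature) using a seeded RNG."""
    rng = random.Random(seed)  # private generator, unaffected by global state
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    ids = range(len(logits))
    return [rng.choices(ids, weights=weights)[0] for _ in range(n)]

# Same inputs and same seed give the same sequence on every call.
first = sample_tokens([1.0, 2.0, 0.5], temperature=0.8, seed=42)
second = sample_tokens([1.0, 2.0, 0.5], temperature=0.8, seed=42)
```

The output is "random" only in the sense that it depends on the seed. Fix the seed and the sampler is an ordinary deterministic function of its inputs.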