Wouldn't the real test be to run all the code through the bytecode optimizer and see who's faster then?
I mean, this is sort of the same as testing the LLM output against the -O3 compiler optimization flag while compiling their programs with no optimizations. Actually, if I read TFA correctly, this is exactly what they're doing. Am I wrong?
Or maybe I am wrong and they're testing their VM against compiled code, dunno?