The comment about statistics that I wanted to reply to has disappeared. That commenter said:
> I stand firm in my belief that unless you can prove how CLT applies to your input distributions, you should not assume normality. And if you don't know what you are doing, stop reporting means.
I agree. My research group stopped using Hyperfine because it ranks benchmarked commands by mean runtime and offers standard deviation as a stand-in for a confidence measure. Neither is appropriate for heavy-tailed, skewed, or otherwise non-normal distributions.
It's easy to demonstrate that most empirical runtime distributions are not normal. I wrote BestGuess [0] because we needed a better benchmarking tool. Its analysis reports skew, kurtosis, and the Anderson-Darling distance from normal, so you can see how normal (or not) your distribution is. It ranks benchmark results using non-parametric methods. And, unlike many tools, it saves all of the raw data, making it easy to re-analyze later.
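To make the non-normality point concrete, here's a rough C sketch (not BestGuess's actual code) that computes sample skewness and excess kurtosis from a handful of runtimes. Both should be near zero for normally distributed data; the timings here are hypothetical, with the kind of heavy right tail we see in practice.

```c
/* Sketch only: sample skewness and excess kurtosis of a set of runtimes.
 * Values far from 0 suggest the sample is not normal, so a mean/stddev
 * summary can mislead. Compile with -lm. */
#include <math.h>
#include <stdio.h>

static void moments(const double *x, int n, double *skew, double *kurt) {
    double mean = 0.0, m2 = 0.0, m3 = 0.0, m4 = 0.0;
    for (int i = 0; i < n; i++) mean += x[i];
    mean /= n;
    for (int i = 0; i < n; i++) {
        double d = x[i] - mean;
        m2 += d * d; m3 += d * d * d; m4 += d * d * d * d;
    }
    m2 /= n; m3 /= n; m4 /= n;
    *skew = m3 / pow(m2, 1.5);     /* 0 for a symmetric distribution */
    *kurt = m4 / (m2 * m2) - 3.0;  /* 0 for a normal distribution    */
}

int main(void) {
    /* hypothetical runtimes in seconds, with one slow outlier */
    double t[] = {0.101, 0.102, 0.103, 0.104, 0.105, 0.103, 0.102, 0.350};
    double skew, kurt;
    moments(t, sizeof t / sizeof t[0], &skew, &kurt);
    printf("skew = %.2f, excess kurtosis = %.2f\n", skew, kurt);
    return 0;
}
```

Even these two moments are often enough to show that a mean is being dragged around by the tail; the Anderson-Darling statistic gives a more formal distance from normal.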
My team also discovered that Hyperfine's measurements are a bit off: it reports longer run times than other tools, including BestGuess. I believe this is due to its approach, which is to call getrusage(), fork/exec the program being measured, then call getrusage() again. The difference in user and system times is reported as the time used by the benchmarked command, but unfortunately that difference also includes cycles spent in the Rust code that manages the process (after the fork but before the exec).
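For illustration, here is roughly what that approach looks like in C (Hyperfine itself is Rust, so this is only the shape of it as I understand it). The difference of two getrusage(RUSAGE_CHILDREN) calls picks up whatever the child does between fork() and exec(), not just the benchmarked command; "ls -l" is just an example command.

```c
/* Sketch of the getrusage-before/after approach described above.
 * Any work the child does between fork() and exec() is charged to
 * RUSAGE_CHILDREN and so ends up in the reported difference. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

static double tv_sec(struct timeval tv) {
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void) {
    struct rusage before, after;
    getrusage(RUSAGE_CHILDREN, &before);

    pid_t pid = fork();
    if (pid == 0) {
        /* ... any bookkeeping done here is still measured ... */
        execlp("ls", "ls", "-l", (char *)NULL);
        _exit(127);
    }
    waitpid(pid, NULL, 0);

    getrusage(RUSAGE_CHILDREN, &after);
    printf("user %.6f s, sys %.6f s\n",
           tv_sec(after.ru_utime) - tv_sec(before.ru_utime),
           tv_sec(after.ru_stime) - tv_sec(before.ru_stime));
    return 0;
}
```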
BestGuess avoids external libraries (we can see all the relevant code), does almost nothing after the fork, and uses wait4() to collect measurements. That one call to wait4() gives us the OS's own accounting of what the benchmarked command used.
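A minimal sketch of that measurement path (again, not BestGuess's exact code): wait4() hands back the kernel's rusage accounting for that one child, so nothing the parent does is mixed into the numbers.

```c
/* Sketch: fork, exec immediately, and let wait4() report the child's
 * resource usage as accounted by the kernel. */
#define _DEFAULT_SOURCE  /* for wait4() on glibc */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        execlp("ls", "ls", "-l", (char *)NULL);  /* exec right away */
        _exit(127);
    }

    int status;
    struct rusage usage;
    wait4(pid, &status, 0, &usage);  /* OS accounting for this child only */

    printf("user %ld.%06ld s, sys %ld.%06ld s, max RSS %ld\n",
           (long)usage.ru_utime.tv_sec, (long)usage.ru_utime.tv_usec,
           (long)usage.ru_stime.tv_sec, (long)usage.ru_stime.tv_usec,
           usage.ru_maxrss);  /* ru_maxrss units differ by platform */
    return 0;
}
```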
While BestGuess is still a work in progress (not yet at version 1.0), my team has started using it regularly. I plan to continue its development, and I'll write it up soon at [1].
[0] https://gitlab.com/JamieTheRiveter/bestguess
[1] https://jamiejennings.com