I would ignore this benchmark because it’s not going to predict anything for real world code.
In real world code, your caches and the CPU’s pipeline are influenced by some complex combination of what happens at the call site and what else the program is doing. So, a particular kind of call will perform better or worse than another kind of call depending on what else is happening.
The version of this benchmark that would have had predictive power is if you compared different kinds of call across a sufficiently diverse sampling of large programs that used those calls and also did other interesting things.