Somehow his taxonomy of optimizations doesn't have a category for things like "unroll a loop", "fuse loops", "block your array accesses", "avoid false sharing", or "strength-reduce your operations". He has a "use a better algorithm" category, but his example is switching from bubble sort to selection sort, a much more disruptive kind of change than micro-optimizations like those, which are usually considered to leave you with "the same algorithm".
This is rather puzzling, given that he starts off with "Premature optimisation might be the root of all evil, but", quoting Knuth; in context, Knuth was talking primarily about precisely such micro-optimizations: https://dl.acm.org/doi/10.1145/356635.356640 p. 267 (p. 7/41).
Specifically, the optimizations Knuth was exhibiting in that section of the paper are:
- speeding up a sequential array search by copying the search key to an empty spot at the end of the array, so that the loop only requires one termination test (are keys equal?) rather than two (are keys equal? is index in bounds?), speeding it up on typical machines of the time by 50% (see the sketch below);
- unrolling that version of the loop 2× in a way that required gotos, speeding it up by a further 12%.
None of the optimization techniques being applied there fit into any of Tratt’s categories!
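For concreteness, here's roughly what those two tricks look like transplanted into Python. This is purely illustrative: in CPython the interpreter overhead swamps the saved comparison, and Knuth's unrolled version used gotos, which I'm faking here with breaks.

    def find_sentinel(xs, key):
        # Plant the key in the empty slot past the end, so the loop
        # needs only one test per element (equality) instead of two
        # (equality plus a bounds check).
        xs.append(key)
        i = 0
        while xs[i] != key:
            i += 1
        xs.pop()  # restore the array
        return i if i < len(xs) else -1  # i == len(xs): we only hit the sentinel

    def find_sentinel_unrolled(xs, key):
        # The same loop unrolled 2x: half as many iterations, so half
        # as many backward branches.
        xs.append(key)
        i = 0
        while True:
            if xs[i] == key:
                break
            if xs[i + 1] == key:
                i += 1
                break
            i += 2
        xs.pop()
        return i if i < len(xs) else -1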
Now, I can't claim to be any sort of optimization expert, but I think these kinds of micro-optimizations are still important, though, as Knuth pointed out at the time, not for all code, and often the drawbacks aren't worth it. Sometimes compilers will save you: a C compiler might unroll a loop for you, for example. But often it won't, and it certainly isn't going to pull off something like the sentinel trick, because it doesn't know you aren't reading the array from another thread.
How much this matters depends on what context you're working in; arguably, if your hotspots are in Python, even (as Tratt says) in PyPy, you have bigger opportunities for optimization than manually unrolling a loop to reduce loop termination tests. But I've found that there's often something to be gained from that sort of grungy tweaking; I've frequently been able to double the speed of Numpy code, for example, by specifying the output array for each Numpy operation, thus reducing the cache footprint. The code looks like shit afterwards, sadly.
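To illustrate the kind of thing I mean (a toy sketch; the function names are made up, but the out= parameter is real and nearly every Numpy ufunc takes it):

    import numpy as np

    def axpy_pretty(a, x, y):
        # Allocates a fresh array for a * x and another for the sum.
        return a * x + y

    def axpy_grungy(a, x, y, out):
        # The same arithmetic, but both operations write into one
        # caller-supplied buffer, so nothing is allocated and the
        # working set stays small.
        np.multiply(x, a, out=out)  # out = a * x
        np.add(out, y, out=out)     # out = a * x + y
        return out

    x = np.random.rand(1_000_000)
    y = np.random.rand(1_000_000)
    buf = np.empty_like(x)
    axpy_grungy(2.0, x, y, buf)

Whether that actually doubles anything depends on your array sizes and on how hot the cache already is; the uglification, though, is guaranteed.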