Utilization is not a lie; it is a measurement of a well-defined quantity. The problem is that people make assumptions to extrapolate capacity models from it, and that is where reality diverges from expectations.
Hyperthreading (SMT) and Turbo (clock scaling) are only part of the variables causing non-linearity; several other resources are shared across cores and "run out" as load increases, such as memory bandwidth, interconnect capacity, and processor caches. Some bottlenecks can even come from software, like spinlocks, which have a non-linear impact on utilization.
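As a toy illustration of the SMT effect alone (the core count and the ~30% throughput yield from the second hardware thread below are assumptions for illustration, not measurements), consider what "50% utilization" implies for the headroom you actually have left:

```python
# Toy model of the SMT effect alone: what "50% CPU utilization" implies for headroom.
# Assumption (not a measurement): a physical core delivers 1.0 units of throughput
# with one hardware thread busy and SMT_YIELD units with both threads busy.
PHYSICAL_CORES = 32
SMT_YIELD = 1.3  # assumed ~30% gain from running the second hardware thread

def throughput(busy_logical_cpus: int) -> float:
    """Throughput when the scheduler fills one thread per core before doubling up."""
    cores_with_two = max(busy_logical_cpus - PHYSICAL_CORES, 0)
    cores_with_one = min(busy_logical_cpus, PHYSICAL_CORES) - cores_with_two
    return cores_with_one * 1.0 + cores_with_two * SMT_YIELD

half = throughput(PHYSICAL_CORES)      # 50% of logical CPUs busy (one thread per core)
full = throughput(2 * PHYSICAL_CORES)  # 100% of logical CPUs busy
print(f"throughput at  50% utilization: {half:.1f}")
print(f"throughput at 100% utilization: {full:.1f}")
print(f"real headroom above 50%: {100 * (full / half - 1):.0f}% (not the 100% the metric suggests)")
```

Turbo works roughly the other way around: the first few busy cores run at higher clocks, so the early percentage points of utilization deliver more throughput than the last ones.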
Furthermore, most CPU utilization metrics average over very long windows, from several seconds to a minute, but what really matters for the performance of a latency-sensitive server happens on a time scale of tens to hundreds of milliseconds, and a multi-second average cannot distinguish bursty behavior from smooth behavior. The latter likely has much more capacity to scale up.
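A minimal sketch of the averaging problem (the two traffic shapes are made up for illustration): two load patterns with roughly the same 60-second average utilization, but very different behavior at the 100 ms scale.

```python
import numpy as np

# 60 seconds of per-100ms utilization samples; both series average to ~50%.
rng = np.random.default_rng(0)
n = 600  # 600 windows of 100 ms = 60 s

smooth = np.clip(rng.normal(0.50, 0.05, n), 0, 1)    # steady ~50% load
bursty = np.where(rng.random(n) < 0.5, 0.98, 0.02)   # alternating saturation and idle

for name, series in [("smooth", smooth), ("bursty", bursty)]:
    print(f"{name}: 60s average = {series.mean():.2f}, "
          f"100ms windows above 90% = {np.mean(series > 0.9):.0%}")
```

Both series show ~50% on a per-minute dashboard, but the bursty one is already saturating about half of its 100 ms windows and has far less real headroom for latency-sensitive traffic.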
Unfortunately, the suggested approach is not that accurate either, because it hinges on two inherently unstable concepts:
> Benchmark how much work your server can do before having errors or unacceptable latency.
Measuring this is extremely noisy, because you are trying to detect the point where the server starts becoming unstable. Even in a very simple queueing theory model, the derivatives explode close to saturation, so any nondeterministic noise is extremely amplified.
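To make the amplification concrete, here is the textbook M/M/1 mean response time, T = 1/(mu - lambda), with an assumed service rate and an assumed measurement noise: the same small error in the measured load barely moves the prediction at half load, but swings it wildly near saturation.

```python
# M/M/1 mean response time: T = 1 / (mu - lam).
# Near saturation (lam -> mu), dT/dlam = 1 / (mu - lam)**2 blows up, so small
# noise in the measured arrival rate produces huge swings in predicted latency.
mu = 1000.0   # assumed service rate: 1000 req/s
noise = 5.0   # assumed +/- 0.5% error in the measured arrival rate

def response_time(lam: float) -> float:
    return 1.0 / (mu - lam)

for lam in (500.0, 900.0, 990.0):
    low, high = response_time(lam - noise), response_time(lam + noise)
    print(f"lam={lam:6.1f} req/s  T={response_time(lam)*1e3:7.2f} ms  "
          f"spread from +/-5 req/s noise: {(high - low)*1e3:7.2f} ms")
```

The same +/-5 req/s of noise moves the prediction by a few hundredths of a millisecond at 50% load, but by well over 100 ms at 99% load, which is exactly the region the benchmark is trying to pin down.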
> Report how much work your server is currently doing.
There is rarely a stable definition of "work". Is it RPS? Request cost can vary even throughout the day. Is it instructions? Same problem: the typical IPC can vary.
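A small sketch of why RPS is a slippery unit of work (the request mix and per-request CPU costs are invented for illustration): the same request rate can represent very different CPU demand depending on the traffic mix.

```python
# Same RPS, different "work": CPU demand depends on the request mix.
# The per-request CPU costs and the mixes below are assumptions for illustration.
COST_MS = {"cache_hit": 0.2, "cache_miss": 5.0}  # CPU-ms per request

def cpu_cores_needed(rps: float, mix: dict[str, float]) -> float:
    """CPU cores of demand implied by a request rate and a traffic mix."""
    ms_per_req = sum(COST_MS[kind] * share for kind, share in mix.items())
    return rps * ms_per_req / 1000.0

daytime_mix = {"cache_hit": 0.95, "cache_miss": 0.05}    # mostly warm cache
nighttime_mix = {"cache_hit": 0.70, "cache_miss": 0.30}  # colder cache, heavier requests

for name, mix in [("daytime", daytime_mix), ("nighttime", nighttime_mix)]:
    print(f"{name}: 10k RPS needs ~{cpu_cores_needed(10_000, mix):.1f} cores")
```

The same 10k RPS translates into roughly four times the CPU demand under the second mix, so "requests" alone tell you very little about how much work the server is doing.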
Ultimately, the confidence intervals you get from the load-testing approach might be as large as those you get from building an empirical model from utilization measurements, as long as you measure your utilization correctly.
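For completeness, a minimal sketch of what such an empirical model could look like (the data points and the functional form are assumptions; the form just mirrors the queueing behavior above): fit observed latency against observed utilization and read off where the model crosses your latency budget.

```python
import numpy as np

# Hypothetical (utilization, p99 latency in ms) observations from production.
util = np.array([0.30, 0.45, 0.55, 0.65, 0.75, 0.82])
p99 = np.array([12.0, 13.5, 15.0, 18.0, 25.0, 35.0])

# Assumed model: latency ~ a / (1 - u / u_sat).
# Grid-search u_sat, solve a by least squares through the origin.
best = None
for u_sat in np.linspace(0.85, 1.2, 200):
    x = 1.0 / (1.0 - util / u_sat)
    a = (x @ p99) / (x @ x)
    err = np.sum((a * x - p99) ** 2)
    if best is None or err < best[0]:
        best = (err, a, u_sat)

_, a, u_sat = best
budget_ms = 50.0
max_util = u_sat * (1.0 - a / budget_ms)  # utilization where the model hits the budget
print(f"fitted: latency ~ {a:.1f} / (1 - u/{u_sat:.2f}); "
      f"stay below ~{max_util:.0%} utilization for a {budget_ms:.0f} ms budget")
```

This is no more rigorous than a load test, but it is built from data the server produces all the time instead of from a one-off benchmark at the edge of instability.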