HN Reader

AVX-512: First Impressions on Performance and Programmability

125

(a) You mention that the NVidia docs push people to use libraries, etc. to really get high performance CUDA kernels, rather than writing them themselves. My argument would be that SIMD is exactly the same - they're something really that are perfect if you're writing a BLAS implementation but are too low level for most developers thinking about a problem to make use of.

(b) You show a problem where autovectorisation fails because of branching, and jump straight to intrinsics as the solution which you basically say are ugly. Looking at the intrinsic code, you're using a mask to deal with the branching. But there's a middle ground - almost always you would want to try restructuring the problem, e.g. splitting up loops and adding masks where there's conditions - i.e. lean into the SIMD paradigm. This would also be the same advice in CUDA.

(c) As you've found, GCC actually performs quite poorly for x86-64 optimisations compared to Intel. It's not always clear cut though, the Intel Compiler for e.g. sacrifices IEEE 764 float precision and go down to ~14 digits of precision in it's defaults, because it sets the flag `-fp-model=fast -fma`. This is true of both the legacy and new Intel compiler. If you switch to `-fp-model=strict` then you may find that the results are closer.

(d) AVX512 is quite hardware specific. Some processors execute these instructions much better than others. It's really a collection of extensions, and you get frequency downclocking that's better/worse on different processors as these instructions are executed.

19 days agoby physicsguy

I found this a weird article.

If you wish to see some speedups using AVX512, without limiting yourself to C or C++, you might want to try ISPC (https://ispc.github.io/index.html).

You'll get sane aliasing rules from the perspective of performance, multi-target binaries with dynamic dispatching and a lot more control over the code generated.

19 days agoby nnevatie

> CUDA architects [...] happily exposed every ugly details of underlying hardware to the programmer if that allows a bit more performance.

After spending more than a decade dancing around all the underlying x86 hidden stuff for low-level optimization, I appreciate CUDA a lot. Everything is there under your total control. No more one size fits all. Higher barrier of entry but no surprises and less time spent debugging to figure out what landmine your code stepped into.

19 days agoby alecco

> In CPU world there is a desire to shield programmers from those low-level details, but I think there are two interesting forces at play now-a-days that’ll change it soon. On one hand, Dennard Scaling (aka free lunch) is long gone, hardware landscape is getting increasingly fragmented and specialized out of necessity, software abstractions are getting leakier, forcing developers to be aware of the lowest levels of abstraction, hardware, for good performance.

The problem is that not all programming languages expose SIMD, and even if they do it is only a portable subset, additionally the kind of skills that are required to be able to use SIMD properly isn't something everyone is confortable doing.

I certainly am not, still managed to get around with MMX and early SSE, can manage shading languages, and that is about it.

24 days agoby pjmlp

> The answer, if it’s not obvious from my tone already:), is 8%.

Not if the data is small and in cache.

> The performant route with AVX-512 would probably include the instruction vpconflictd, but I couldn’t really find any elegant way to use it.

I think the best way to do this is duplicate sum_r and count 16 times, so each pane has a seperate accumulation bucket and there can't be any conflicts. After the loop, you quickly do a sum reduction for each of the 16 buckets.

24 days agoby camel-cdr

If you have the opportunity, try out a zen5. Significant improvements.

See also https://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teard...

19 days agoby ecesena

>On one hand, Dennard Scaling (aka free lunch) is long gone, hardware landscape is getting increasingly fragmented and specialized out of necessity, software abstractions are getting leakier, forcing developers to be aware of the lowest levels of abstraction, hardware, for good performance.

There are lots of people using Javascript frameworks to build slow desktop and mobile software.

19 days agoby DeathArrow

Initial example takes array pointers without the __restrict__ keyword/extension so compiler might assume they could be aliased to same address space and will code defensively.

Would be interesting to see if auto vec performs better with that addition.

19 days agoby chillitom

What I get in these article is that the original intent on C language stands true.

Use C as a common platform denominator without crazy optimizations (like tcc). If you need performance, specialize, C gives you the tools to call assembly (or use compiler some intrinsic or even inline assembly).

Complex compiler doing crazy optimizations, in my opinion, is not worth it.

24 days agoby fithisux

Isn’t k-means memory bandwidth bound? What was the arithmetic intensity of the final code?

19 days agoby saagarjha

A factoid - earlier batches of Alder Lake 12th gen consumer CPU has a rare AVX512 _FP16_ extension. Afaik it was very very fast.

19 days agoby BoredomIsFun

For those who want to get started with SIMD programming in Rust, here is a great resource: https://kerkour.com/introduction-rust-simd

19 days agoby randomint64