Somehow his taxonomy of optimizations doesn't have a category for things like "unroll a loop", "fuse loops", "block your array accesses", "avoid false sharing", or "strength-reduce your operations". He has a "use a better algorithm" category, but his example is switching from bubble sort to selection sort, a much more disruptive kind of change than micro-optimizations like those, which are usually considered to leave you with "the same algorithm".
This is rather puzzling, given that he starts off with "Premature optimisation might be the root of all evil, but", quoting Knuth; in context, Knuth was talking primarily about precisely such micro-optimizations: https://dl.acm.org/doi/10.1145/356635.356640 p. 267 (p. 7/41).
Specifically, the optimizations Knuth was exhibiting in that section of the paper are:
- speeding up a sequential array search by copying the search key to an empty spot at the end of the array, so that the loop only requires one termination test (are keys equal?) rather than two (are keys equal? is index in bounds?), speeding it up on typical machines of the time by 50% (see the sketch below);
- unrolling that version of the loop 2× in a way that required gotos, speeding it up by a further 12%.
None of the optimization techniques being applied there fit into any of Tratt’s categories!
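For concreteness, here's roughly what those two tricks look like transplanted into Python. This is purely illustrative: in CPython the interpreter overhead swamps the saved comparison, and Knuth's unrolled version used gotos, which I'm faking here with breaks.

    def find_sentinel(xs, key):
        # Plant the key in the empty slot past the end, so the loop
        # needs only one test per element (equality) instead of two
        # (equality plus a bounds check).
        xs.append(key)
        i = 0
        while xs[i] != key:
            i += 1
        xs.pop()  # restore the array
        return i if i < len(xs) else -1  # i == len(xs): we only hit the sentinel

    def find_sentinel_unrolled(xs, key):
        # The same loop unrolled 2x: half as many iterations, so half
        # as many backward branches.
        xs.append(key)
        i = 0
        while True:
            if xs[i] == key:
                break
            if xs[i + 1] == key:
                i += 1
                break
            i += 2
        xs.pop()
        return i if i < len(xs) else -1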
Now, I can't claim to be any sort of optimization expert, but I think these kinds of micro-optimizations are still important, though, as Knuth pointed out at the time, not for all code, and often the drawbacks aren't worth it. Sometimes compilers will save you: a C compiler might unroll a loop for you, for example. But often it won't, and it certainly isn't going to pull off something like the sentinel trick, because it doesn't know you aren't reading the array from another thread.
How much this matters depends on what context you're working in; arguably, if your hotspots are in Python, even (as Tratt says) in PyPy, you have bigger opportunities for optimization than manually unrolling a loop to reduce loop termination tests. But I've found that there's often something to be gained from that sort of grungy tweaking; I've frequently been able to double the speed of Numpy code, for example, by specifying the output array for each Numpy operation, thus reducing the cache footprint. The code looks like shit afterwards, sadly.
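To illustrate the kind of thing I mean (a toy sketch; the function names are made up, but the out= parameter is real and nearly every Numpy ufunc takes it):

    import numpy as np

    def axpy_pretty(a, x, y):
        # Allocates a fresh array for a * x and another for the sum.
        return a * x + y

    def axpy_grungy(a, x, y, out):
        # The same arithmetic, but both operations write into one
        # caller-supplied buffer, so nothing is allocated and the
        # working set stays small.
        np.multiply(x, a, out=out)  # out = a * x
        np.add(out, y, out=out)     # out = a * x + y
        return out

    x = np.random.rand(1_000_000)
    y = np.random.rand(1_000_000)
    buf = np.empty_like(x)
    axpy_grungy(2.0, x, y, buf)

Whether that actually doubles anything depends on your array sizes and on how hot the cache already is; the uglification, though, is guaranteed.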