There's a fun approach where do the computation as a tree in parallel. You do a little masking and shifting to add all the even-numbered bits to all the odd-numbered bits, and come up with a set of (assuming we're working on a 64-bit value) 32 partial sums of 2 bits each. Then you add pairs of those to get 16 partial sums of 4 bits each, and so forth until you get to the top. This requires six sums, plus shifts and masks for each one.
I don't know if compilers are able to detect this and compile it down to a single instruction, though.