

One can easily check if a number is a power of 2: clear the lowest 1 bit (see above) and check if the result is 0. There is a trivial optimization to this: you can do it as (x & (x-1)) instead of (x & ~(x & ~(x-1))). You can prove it by applying De Morgan's laws and the law of distributivity and throwing away the (x & ~x) term; you can also simply observe that it works for the same reason that x & ~(x-1) returns the lowest set bit ;).
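In code, the optimized test might look like this (a minimal sketch; the method name is made up for illustration, and note that x & (x-1) is also 0 for x == 0, so that case needs a guard):

```java
public class PowerOfTwo {
    // x & (x - 1) clears the lowest set bit, so the result is 0
    // exactly when x had at most one bit set.
    static boolean isPowerOfTwo(int x) {
        return x > 0 && (x & (x - 1)) == 0;
    }

    public static void main(String[] args) {
        System.out.println(isPowerOfTwo(64)); // true
        System.out.println(isPowerOfTwo(96)); // false
    }
}
```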
I remember reading about that trick in a magazine article more than 10 years ago. I also think I saw that same trick used in one of tomek's submissions a couple of years back, but I cannot remember which one it was.
Thanks for the great article! 

In two's complement, doesn't x & -x work? 

I also want to add that there is actually a pretty O(log N) solution to the problem of counting the number of bits set (N stands for the number of bits in the number). It works like the following (let's assume we have a 4-bit number A):
B = A;
B = (B & 0101b) + ((B & 1010b) >> 1);
B = (B & 0011b) + ((B & 1100b) >> 2);
So B will be the number of bits that are set to 1 in A.
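Written out as code for the 4-bit case (0101b = 0x5, 0011b = 0x3; the class and method names are made up for illustration):

```java
public class ParallelCount4 {
    // Counts the set bits in the low 4 bits of a, using pairwise sums:
    // first sum adjacent 1-bit fields, then adjacent 2-bit fields.
    static int countBits4(int a) {
        int b = a & 0xF;
        b = (b & 0x5) + ((b & 0xA) >> 1); // masks 0101b and 1010b
        b = (b & 0x3) + ((b & 0xC) >> 2); // masks 0011b and 1100b
        return b;
    }

    public static void main(String[] args) {
        System.out.println(countBits4(0b1011)); // 3
    }
}
```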
There are really plenty of such tricks in a book called "Hacker's Delight" or something like that :)
Anyway, the article is pretty good. 

Yes, I've used a variation of that bit counting method before when time was critical and bit counting was happening in critical loops. 

It depends on what you count as an operation... seems like you would have to at least *read in* the number, and that's O(N), where N is the number of bits in a number... 

I'm not sure what you mean, but I think the technique is more about parallelism: "reading in" the number might be O(n), but likely with a very low constant. For instance, if "reading in" means pushing a 32-bit value onto the stack, then it's probably a single operation, and passing n ints is passing n*32 bits, but the time taken is n (or n*32/32?). Counting bits on a 32-bit integer iteratively might take proportional to 32 operations, but using this parallel addition method, it might take a grand total of about 10 or 12 bitwise operations with no branching. 

Wouldn't something like num[x & mask] + num[x >> half_num_bits] be faster? 

I agree. This is the fastest according to this link.
Also I think that method can be made faster if you use a union; then you don't need to do (x >> half_num_bits).
bmerry, thank you for the great article! In particular, I am very happy to find out about the __builtin_ctz builtin. 
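For what it's worth, Java exposes the same operations in the standard library: Integer.numberOfTrailingZeros corresponds to GCC's __builtin_ctz, and Integer.bitCount to __builtin_popcount. A quick sketch:

```java
public class BuiltinEquivalents {
    public static void main(String[] args) {
        int x = 0b10100; // 20

        // Java equivalent of __builtin_ctz: index of the lowest set bit.
        System.out.println(Integer.numberOfTrailingZeros(x)); // 2

        // Java equivalent of __builtin_popcount: number of set bits.
        System.out.println(Integer.bitCount(x)); // 2
    }
}
```

On modern JVMs these are intrinsified, so they typically compile down to single machine instructions where the hardware supports them.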

I am not sure what you mean here, could you explain? 

I guess Cosmin.ro assumes the number of bits set in x is cached in num[x] (for low values of x). 

Well, it highly depends on the architecture, as methods using precomputed tables are likely to perform slower than parallel computation on systems with slow memory access (which is not too unlikely). I am not sure, but it may happen that at least Precomp_16 will fail to perform well on the i686 architecture under real conditions. The thing is that the tests were artificial, if I understood correctly, i.e. the routine was repeatedly called in a loop on some sample data, without performing any other operations, especially ones that require memory access. As you need to store 64K of precomputed data, which is not accessed sequentially or so, it is very likely to cause cache misses in the real world (when doing other things besides just counting bits). It of course highly depends on the processor's internal cache size, but that size is surely comparable with 64K. Resolving a cache miss is not too quick, and so Precomp_16 is likely to behave much worse in some situations. The worst thing is actually that it has an undefined execution time, i.e. the execution time depends on some random conditions, and we get unpredictable behaviour. That is not too good. 

Some replies to several people's comments:
In two's complement, doesn't x & -x work?
Not quite. Negation in 2's complement is by inverting and adding 1. So it will tell you if x is one less than a power of 2.
I also want to add that there is actually a pretty O(log N) solution to the problem of counting the number of bits set (N stands for the number of bits in the number).
This is the exercise I referred to under "Cute tricks". I did some experiments of my own and they're consistent with those at the link dimkadimon posted (thanks for the link). These sexy-looking routines seem to have a high constant factor and are more for showing off; particularly those that use a mod tend to get blown out of the water. GCC seems to use an 8-bit table; the link indicates that a 16-bit table is faster, although depending on the application it might cause more L1 misses.
but using this parallel addition method, it might take a grand total of about 10 or 12 bitwise operations with no branching
It actually ends up being quite a lot of instructions (for 32-bit). You've got two masks, a shift and an addition per iteration, plus you have to have some copies because x86 instructions are in-place. 

Although I agree that Precomp_16 may not be optimal for all machines and all applications, I still think it has the best average running time. In other words, given a random machine and a random application you are probably best off using it. Memory access is usually faster (especially if it's only 64K) than most other operations. 

"Not quite. Negation in 2's complement is by inverting and adding 1." Isn't adding 1 after inverting the same as subtracting one before inverting? Just to make sure, I wrote this program, which prints nothing:
public static void main(String[] args)
{
    int i = 0;
    do
    {
        int lastbit1 = i & -i;
        int lastbit2 = i & ~(i - 1);
        if (lastbit1 != lastbit2)
            System.out.println("Inconsistent for " + i);
        i++;
    }
    while (i != 0);
}
Edit: And about the number of operations for counting bits in parallel, you are right that I underestimated the number: a basic unoptimized version has 19 operators (and definitely some copying around underneath, but so does iteratively counting bits). It's definitely slower than a lookup table if it needs to be done frequently enough to even matter, but my point is that it's still faster than the iterative method (at least in the average case), even if you only count bits that are set and don't check bits that aren't. My feeble attempt at benchmarking (I call it feeble because the profiler is being absolutely unhelpful) indicates that this:
int bitcount = 0;
while (i != 0)
{
    bitcount++;
    i ^= i & -i;
}
return bitcount;
takes about 4 times longer than this:
bitcounts = (i & 0x55555555) + ((i >> 1) & 0x55555555);
bitcounts = (bitcounts & 0x33333333) + ((bitcounts >> 2) & 0x33333333);
bitcounts = (bitcounts & 0x0F0F0F0F) + ((bitcounts >> 4) & 0x0F0F0F0F);
bitcounts = (bitcounts & 0x00FF00FF) + ((bitcounts >> 8) & 0x00FF00FF);
bitcounts = (bitcounts & 0x0000FFFF) + (bitcounts >> 16);
return bitcounts;
which takes about about 10% longer than this:
if (nums == null)
{
    nums = new int[256];
    for (int j = 1; j < 256; j++)
        nums[j] = nums[j ^ (j & -j)] + 1;
}
return nums[i & 255] + nums[(i >> 8) & 255] + nums[(i >> 16) & 255] + nums[(i >> 24) & 255];
which takes about 55% longer than a 16-bit version of the last function, when called about 200 million times (hopefully with a minimal amount of overhead, especially since the functions seem to be getting inlined).
Edit again: This is, of course, if all you're doing is a ton of bit counting. Aside from the cache space that might be hogged by a large bitcount table in a more practical program, there's the issue that you might not be counting bits 200 million times in a normal, otherwise efficient program :) 
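For reference, a sketch of what the 16-bit table variant mentioned above might look like (the class, field and method names are made up for illustration; the table is filled using the recurrence bits(v) = bits(v >> 1) + (v & 1)):

```java
public class PopCount16 {
    // 64K-entry table: counts16[v] holds the number of set bits in v.
    private static final int[] counts16 = new int[1 << 16];
    static {
        for (int j = 1; j < (1 << 16); j++)
            counts16[j] = counts16[j >> 1] + (j & 1);
    }

    // Split the 32-bit value into two 16-bit halves and look each up.
    // Note the unsigned shift (>>>) so negative ints are handled correctly.
    static int bitCount(int i) {
        return counts16[i & 0xFFFF] + counts16[i >>> 16];
    }

    public static void main(String[] args) {
        System.out.println(bitCount(0xF0F0F0F0)); // 16
    }
}
```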

"Memory access is usually faster (especially if its only 64K) than most other operations."
Are you kidding? Memory access is usually much slower then other operations except if you are keeping in mind division operations and more complex ones. Memory speed is much lower than CPU speed, it have been so for years. Also 64k is not really so small amount, as e.g. intel Celeron processors often were made with only 128k of L2 cache, and so 64k is a half of cache. If the bit counting is combined with some other memory hungry operation it will really introduce a slowdown because of cache misses (note that it can also increase the number of cache misses for the other part of code).
Though 8bit precomputation probably will really outperform parallel computation in any modern x86 system, as its precomputation table is really small enough to fit into cache. I think that GCC uses 8bit precomputation because it is likely to perform much more stable as it is not likely to produce many cache misses. 
