
Finally, I solved the problem by learning and using SSE instructions. These are the benchmarks on my Centrino laptop (mean values, in clock cycles).
//13000 cycles (!!)
powf(f,0.5)
//63 cycles
sqrtf(f)
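//30 cycles, as precise as sqrtf
inline float fast_sqrt(float f) {
    // %0 holds the address of f in any general-purpose register,
    // so this also assembles correctly on x86-64 (no hard-coded eax)
    asm("movss (%0), %%xmm0;"        // load f into xmm0
        "sqrtss %%xmm0, %%xmm0;"     // scalar single-precision sqrt
        "movss %%xmm0, (%0);"        // store the result back into f
        ::"r"(&f):"xmm0","memory");
    return f;
}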
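//5 cycles (about 11 bits of precision)
inline float very_fast_sqrt(float f) {
    // sqrt(f) is approximated as 1 / (1 / sqrt(f))
    asm("movss (%0), %%xmm0;"        // load f into xmm0
        "rsqrtss %%xmm0, %%xmm0;"    // approximate 1/sqrt(f)
        "rcpss %%xmm0, %%xmm0;"      // approximate reciprocal of that
        "movss %%xmm0, (%0);"        // store the result back into f
        ::"r"(&f):"xmm0","memory");
    return f;
}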
Playing with the parallel (packed) instructions, it's also possible to perform up to four operations such as sqrt at the *same* time. Just what I needed. I hope this will be useful for everybody!
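For anyone who wants to try the four-at-a-time variant, here is a minimal sketch using the SSE intrinsics from xmmintrin.h instead of inline asm. The function names sqrt4 and very_fast_sqrt4 are made up for this example, and I have not benchmarked this version, so the cycle counts above do not necessarily apply to it.

```c
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_sqrt_ps, ...

// Hypothetical helper: full-precision square roots of four floats (sqrtps).
static inline void sqrt4(const float in[4], float out[4]) {
    __m128 v = _mm_loadu_ps(in);            // unaligned load of 4 floats
    _mm_storeu_ps(out, _mm_sqrt_ps(v));     // 4 square roots, then store
}

// Hypothetical helper: approximate version, rsqrtps followed by rcpps,
// the packed counterpart of very_fast_sqrt.
static inline void very_fast_sqrt4(const float in[4], float out[4]) {
    __m128 v = _mm_loadu_ps(in);            // unaligned load of 4 floats
    __m128 r = _mm_rcp_ps(_mm_rsqrt_ps(v)); // 1 / (1 / sqrt(x)), approximate
    _mm_storeu_ps(out, r);
}
```

The unaligned load and store are used so the arrays need no special alignment. If the data is 16-byte aligned, _mm_load_ps and _mm_store_ps can be used instead.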