flat assembler
Message board for the users of flat assembler.
Index
> Main > Cool Integer Square Root algo Goto page 1, 2 Next 
Author 

r22
I was thinking about int approximations of square roots, and thought about the current guess and check methods and how they would fair speed wise with a 64bit integer.
Code: IntSqrtSSE: mov eax,[esp+4] cvtsi2sd xmm0,eax sqrtsd xmm0,xmm0 cvtsd2si eax,xmm0 ret 4 Is very fast, BUT you can't CVT 64bit integers into double FP values. Using the FPU would just make the function slow. Also because it's SIGNED integer the highest value you can do is 7FFFFFFFh. Here's a square root for all unsigned 32bit values. Code: IntSqrt: push esi push edi mov esi,[esp+12] ; IN number to find sqr root of xor ecx,ecx ;zero out xor eax,eax ; mov edi,10h ;loop counter , use 32 for 64bit int mov edx,esi .LP: shr edx,1eh shl eax,1 lea ecx,[edx+ecx*4] lea edx,[eax+eax+1] ;2x + 1 part of the binomial expand of (x+1)^2 shl esi,2 cmp edx,ecx ja .SKIP sub ecx,edx inc eax .SKIP: mov edx,esi ;unroll 15% speed boost shr edx,1eh shl eax,1 lea ecx,[edx+ecx*4] lea edx,[eax+eax+1] shl esi,2 cmp edx,ecx ja .SKIP1 sub ecx,edx inc eax .SKIP1: mov edx,esi sub edi,2 jnz .LP pop edi pop esi ret 4 Surprisingly it's only 2x slower than the SSE version (after the 15% speed increase from the unroll). Also in 64bit world the function can be easily modified to get the int sqrt of 64bit integers. I usually like to comment my code, but unless your familiar with the ambiguous math involved it really wouldn't help much. It relies on the bond and bit patterns between square root, and remaining numbers. Good Link: http://www.embedded.com/98/9802fe2.htm With the extra registers on a 64bit system this function has a lot of room for optimization. 

15 Aug 2005, 03:40 

r22
In 1FFFFFFh calls to each function
Test done on a p4 3.2ghz 1gb ram I used 65535 as the test number IntSqrt finished in 1100ms isqrt_eax_p6 finished in 2900ms When 4billion is used as the test number IntSqrt results are the same 1100ms isqrt_eax_p6 takes 5600ms Using SHLD and BSR are probably to blame for the poor performance, since both instructions are considered slow by pentium optimization manuals. 

15 Aug 2005, 20:43 

revolution
Thanks for that, later I will try on my Pent M and see if I get similar results.
Since you are using a P4 then perhaps you can try changing the "inc eax" to "add eax,1". The opt manual also mentions about avoiding the inc instruction. I would be interested to see the result. 

16 Aug 2005, 01:40 

r22
The add eax,1 had minimal to no affect on the speed.
But messing with the unroll I found that 4 unrolls gives the greatest reward (as opposed to 2). Code: IntSqrt4x: ;fastest 4x unrolls push esi push edi mov esi,[esp+12] xor ecx,ecx xor eax,eax mov edi,10h mov edx,esi .LP: shr edx,1eh shl eax,1 lea ecx,[edx+ecx*4] lea edx,[eax+eax+1] shl esi,2 cmp edx,ecx ja .SKIP sub ecx,edx add eax,1 .SKIP: mov edx,esi ;unroll shr edx,1eh shl eax,1 lea ecx,[edx+ecx*4] lea edx,[eax+eax+1] shl esi,2 cmp edx,ecx ja .SKIP1 sub ecx,edx add eax,1 .SKIP1: mov edx,esi ;unroll 2 shr edx,1eh shl eax,1 lea ecx,[edx+ecx*4] lea edx,[eax+eax+1] shl esi,2 cmp edx,ecx ja .SKIP2 sub ecx,edx add eax,1 .SKIP2: mov edx,esi ;unroll 3 shr edx,1eh shl eax,1 lea ecx,[edx+ecx*4] lea edx,[eax+eax+1] shl esi,2 cmp edx,ecx ja .SKIP3 sub ecx,edx add eax,1 .SKIP3: sub edi,4 mov edx,esi jnz .LP pop edi pop esi ret 4 950ms for 1FFFFFFh calls 150ms faster than the 2unroll version +10% Figuring out a clever way to avoid this branch would yeild the big speed gain. But the subecx,edx make it complicated. Code: cmp edx,ecx ja .SKIP3 sub ecx,edx add eax,1 .SKIP3: MOV ebx, edx SUB ebx, ecx ADC eax,0 ;; :S Can't think of an efficient way. 

16 Aug 2005, 01:50 

THEWizardGenius
Oh, you optimization fanatics Isn't there any such thing as fast enough?


16 Aug 2005, 02:31 

revolution
My results:
Code: Input: 65536 Tests: 01ffffffh isqrt_eax: 2634ms isqrt_eax_p6: 2173ms IntSqrt: 2364ms Input: 4000000000 Tests: 01ffffffh isqrt_eax: 4466ms isqrt_eax_p6: 3595ms IntSqrt: 2364ms It is curious that the times are very different. My PentM is 1300MHz but can outperform the P43200 in certain tasks. I think the two major sources of ineffiency in my code are SHLD, and ADC. The P4 performance with these two instructions is terrible. That must be why using "ja .SKIP" can still give good performance over anything that uses ADC. But one thing to remember, the input values do not change and the loop count is 16 so the branch predictor will do a good job with "ja .SKIP". Perhaps for a more 'normal' situation where the input values are less predicable the performance will suffer somewhat? 

16 Aug 2005, 03:05 

revolution
Quote: Isn't there any such thing as fast enough? 

16 Aug 2005, 03:06 

revolution
Still can't beat the FPU in the PentM
Code: isqrt_eax_fpu: movd xmm0,eax movq qword[test_value],xmm0 fild qword[test_value] fsqrt fistp qword[test_output] mov eax,[test_output] ret Code: Input: 65536 Tests: 01ffffffh isqrt_eax: 2634ms isqrt_eax_p6: 2173ms IntSqrt: 2364ms isqrt_eax_fpu: 280ms Input: 4000000000 Tests: 01ffffffh isqrt_eax: 4466ms isqrt_eax_p6: 3595ms IntSqrt: 2364ms isqrt_eax_fpu: 1512ms 

16 Aug 2005, 09:14 

MCD
r22 wrote: Using SHLD and BSR are probably to blame for the poor performance, since both instructions are considered slow by pentium optimization manuals. Actually, revolution's code is quiet optimized, except no unrolled loop and the slow SHLD in its loop. But you could rewrite this instruction if you had 1 another register available. Code: isqrt_eax: ;for plain 386's and up bsr edi,eax setnz cl dec cl shr edi,cl xor ebx,ebx and edi,2 xor edx,edx lea cl,[edi+2] ror eax,cl .digit:;shld ebx,eax,2 shl ebx,2 mov ebp,eax shr ebp,322 or ebx,ebp lea ecx,[edx*4+1] cmp ebx,ecx cmc sbb esi,esi adc edx,edx and ecx,esi shl eax,2 sub ebx,ecx sub edi,2 jns .digit mov eax,edx ret isqrt_eax_p6: ;for processors with CMOV bsr edi,eax cmovz edi,eax xor ebx,ebx and edi,2 xor edx,edx lea cl,[edi+2] ror eax,cl .digit:;shld ebx,eax,2 shl ebx,2 mov esi,eax shr esi,322 lea ecx,[edx*4+1] or ebx,esi mov esi,ebx sub ebx,ecx cmovb ebx,esi cmc adc edx,edx shl eax,2 sub edi,2 jns .digit mov eax,edx ret _________________ MCD  the inevitable return of the Mad Computer Doggy __/ .+~ .  Last edited by MCD on 18 Aug 2005, 11:42; edited 1 time in total 

16 Aug 2005, 13:07 

Matrix
Code: sqrt: ; returns: ax=Square root ; bx,cx,dx,di = undefined mov bx,128 ; initial guess mov di,bx mov cx,ax .top: mov ax,cx xor dx,dx div bx add ax,bx shr ax,1 cmp ax,bx je .finished cmp ax,di mov di,bx mov bx,ax jne .top cmp ax,di jL .finished mov ax,di .finished: ret sqrt32: ; returns: eax=Square root mov ebx,32768 ; initial guess mov edi,ebx mov ecx,eax .top: mov eax,ecx xor edx,edx div ebx add eax,ebx shr eax,1 cmp eax,ebx je .finished cmp eax,edi mov edi,ebx mov ebx,eax jne .top cmp eax,edi jl .finished mov eax,edi .finished: ret 

16 Aug 2005, 13:23 

revolution
Quote: Actually, reverends code is ... 

16 Aug 2005, 14:47 

revolution
Quote: and the slow SHLD in its loop 

16 Aug 2005, 14:51 

Reverend
revolution wrote:


16 Aug 2005, 15:02 

r22
My initial interest of the integer sqrt was for 64bit integers. The algorithm would take 32 iterstions, which would make it run 2x+ slower.
Very true about the FPU method being the fastest. After looking at the documentation you CAN load a 64bit integer onto the FPU, But is it SIGNED or UNSIGNED? If signed then 7FFFFFFFFFFFFFFFh is it's max input. [edist missed 4 F's ] After looking at Revolution's benchmarking of the FPU code it seems it takes much longer if the integer is large so I'd think a 64bit integer FSQRT would be very costly. It's a shame CVTSI2SD has to be signed or SSE would be the fastest viable 32bit solution. Interesting stuff FYI: I tried coding the whole algo in SSE, not optimized but it runs like crap since there's no way to make it parallel. The following runs very poorly Code: IntSqrt2: mov eax,[esp+4] mov ecx,10h movd xmm0,eax ;esi pxor xmm1,xmm1 ;ecx pxor xmm2,xmm2 ;eax ; movd xmm3,ecx ;edi movd xmm4,eax ;edx pxor xmm5,xmm5 ;free ; pxor xmm6,xmm6 ;free2 ; pxor xmm7,xmm7 ;free3 .LP: pslld xmm2,1 psrld xmm4,1eh paddd xmm1,xmm1 ;2x paddd xmm1,xmm1 ;4x paddd xmm1,xmm4 ;+rem movq xmm5,xmm2 paddd xmm5,xmm5 ;2n paddd xmm5,dqword[SqrtOne]; +1 movq xmm4,xmm5 pslld xmm0,2 pcmpgtd xmm5,xmm1 pmovmskb eax,xmm5 test eax,1000b jnz .Above psubd xmm1,xmm4 ; paddd xmm2,dqword[SqrtOne];++ .Above: movq xmm4,xmm0 dec ecx jnz .LP movd eax,xmm2 ret 4 align 16 SqrtOne dq 1,0 I guess FPU wins until x8664. Last edited by r22 on 16 Aug 2005, 22:39; edited 1 time in total 

16 Aug 2005, 16:15 

revolution
Quote: But is it SIGNED or UNSIGNED? Quote: 7FFFFFFFFFFFh is it's max Quote: I'd think a 64bit integer FSQRT would be very costly 

16 Aug 2005, 16:55 

MCD
Quote: But also the SBB and ADC kill the speed. And another problem is that revolution's aproach is harder to scale up to 64bit integers, unless you require the code to support 64bit long mode instructions. So after all, it's maybe not such a good one. If only newton's iteration for square roots didn't have a divide in it like newton for reciprokes have _________________ MCD  the inevitable return of the Mad Computer Doggy __/ .+~ .  Last edited by MCD on 18 Aug 2005, 11:44; edited 1 time in total 

17 Aug 2005, 14:13 

Madis731
but can't the divisions be rounded with some multiply algo?


17 Aug 2005, 15:17 

r22
MADIS
The Newton iteration can probably be used for straight integers BUT Forumla Guess = (Guess  (Guess  Number)/Guess) / 2 The divide by Guess can't be fudged using a magic number multiply because we wont know what guess will be. The divide by 2 can be simplified by (SHR Result, 1) Because of the Divid by Guess even with 710 iterations you wouldnt gain any performance because the DIV opcode is one of the slowest. 

18 Aug 2005, 19:33 

r22
As for the FPU integer square root, we all assumed it worked fine but we didn't really test it.
The integer square root for 65535 should be 255 but the FPU code returns 256 because the decimal is .999 (>.5) so it (by default) rounds up, so you would have to change the control word to get an accurate reading or subtract .5 before the FISTP either way it would cost more time than the benchmark shows. IN CONCLUSION (the horse has been beaten thoroughly now) Code: ;push number \ call isqrtSSE isqrtSSE: mov eax,[esp+4] movsd xmm1,qword[sign2unsign] ; used if bit 31 is set mov ecx,eax and eax,7FFFFFFFh cvtsi2sd xmm0,eax shl ecx,1 jnc .skip addsd xmm0,xmm1 ;make a 32bit UNSIGNED double .skip: sqrtsd xmm0,xmm0 cvttsd2si eax,xmm0 ; notice the 2 T's for truncate ret 4 align 16 sign2unsign dq 2147483648.0 ;call by push 0 \ push number \ call isqrtFPU isqrtFPU: ; returns rounded results fild qword[esp+4] fsqrt fistp dword[esp+4] mov eax,[esp+4] ret 8 isqrtSSE runs 220% faster than isqrtFPU on a static number benchmark. After realizing that my isqrtSSE has a branch and the FPU version didnt I ran another benchmark using RANDOM 32bit numbers so branch prediction would be compromized. isqrtSSE runs 25% faster than isqrtFPU on a random number benchmark. Quote:
I formally apologize for lying, SSE is better. 

18 Aug 2005, 20:06 

Goto page 1, 2 Next < Last Thread  Next Thread > 
Forum Rules:

Copyright © 19992020, Tomasz Grysztar. Also on YouTube, Twitter.
Website powered by rwasa.