flat assembler
Message board for the users of flat assembler.

Index > Main > Cool Integer Square Root algo

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
r22



Joined: 27 Dec 2004
Posts: 805
r22 15 Aug 2005, 03:40
I was thinking about int approximations of square roots, and thought about the current guess and check methods and how they would fair speed wise with a 64bit integer.


Code:
IntSqrtSSE:
   mov eax,[esp+4]
   cvtsi2sd xmm0,eax
   sqrtsd xmm0,xmm0
   cvtsd2si eax,xmm0
   ret 4
    

Is very fast, BUT you can't CVT 64bit integers into double FP values. Using the FPU would just make the function slow. Also because it's SIGNED integer the highest value you can do is 7FFFFFFFh.

Here's a square root for all unsigned 32bit values.

Code:
IntSqrt:
   push esi
   push edi
   mov esi,[esp+12] ; IN number to find sqr root of
   xor ecx,ecx        ;zero out
   xor eax,eax        ;
   mov edi,10h       ;loop counter , use 32 for 64bit int
   mov edx,esi
.LP:
   shr edx,1eh
   shl eax,1
   lea ecx,[edx+ecx*4]
   lea edx,[eax+eax+1] ;2x + 1 part of the binomial expand of (x+1)^2
   shl esi,2
   cmp edx,ecx
   ja .SKIP
   sub ecx,edx
   inc eax
.SKIP:
   mov edx,esi
;unroll        15% speed boost
   shr edx,1eh
   shl eax,1
   lea ecx,[edx+ecx*4]
   lea edx,[eax+eax+1]
   shl esi,2
   cmp edx,ecx
   ja .SKIP1
   sub ecx,edx
   inc eax
.SKIP1:
   mov edx,esi
   sub edi,2
   jnz .LP
   pop edi
   pop esi
   ret 4
    

Surprisingly it's only 2x slower than the SSE version (after the 15% speed increase from the unroll).

Also in 64bit world the function can be easily modified to get the int sqrt of 64bit integers.

I usually like to comment my code, but unless your familiar with the ambiguous math involved it really wouldn't help much. It relies on the bond and bit patterns between square root, and remaining numbers.
Good Link: http://www.embedded.com/98/9802fe2.htm

With the extra registers on a 64bit system this function has a lot of room for optimization.
Post 15 Aug 2005, 03:40
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20632
Location: In your JS exploiting you and your system
revolution 15 Aug 2005, 08:12
I have come across this before. Here is the code I was using:
Code:
isqrt_eax: ;for plain 386's and up
        xor     ebx,ebx
        xor     edx,edx
        bsr     edi,eax
        setnz   cl
        sub     cl,1
        shr     edi,cl
        and     edi,-2
        lea     ecx,[edi+2]
        ror     eax,cl
.digit: shld    ebx,eax,2
        lea     ecx,[edx*4+1]
        cmp     ebx,ecx
        cmc
        sbb     esi,esi
        adc     edx,edx
        and     ecx,esi
        sub     ebx,ecx
        shl     eax,2
        sub     edi,2
        jns     .digit
        mov     eax,edx
        ret

isqrt_eax_p6: ;for processors with CMOV
        xor     ebx,ebx
        xor     edx,edx
        bsr     edi,eax
        cmovz   edi,eax
        and     edi,-2
        lea     ecx,[edi+2]
        ror     eax,cl
.digit: shld    ebx,eax,2
        lea     ecx,[edx*4+1]
        mov     esi,ebx
        sub     ebx,ecx
        cmovb   ebx,esi
        cmc
        adc     edx,edx
        shl     eax,2
        sub     edi,2
        jns     .digit
        mov     eax,edx
        ret    
Maybe you can compare the relative speeds?
Post 15 Aug 2005, 08:12
View user's profile Send private message Visit poster's website Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 15 Aug 2005, 20:43
In 1FFFFFFh calls to each function
Test done on a p4 3.2ghz 1gb ram

I used 65535 as the test number
IntSqrt finished in 1100ms
isqrt_eax_p6 finished in 2900ms

When 4billion is used as the test number
IntSqrt results are the same 1100ms
isqrt_eax_p6 takes 5600ms

Using SHLD and BSR are probably to blame for the poor performance, since both instructions are considered slow by pentium optimization manuals.
Post 15 Aug 2005, 20:43
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20632
Location: In your JS exploiting you and your system
revolution 16 Aug 2005, 01:40
Thanks for that, later I will try on my Pent M and see if I get similar results.

Since you are using a P4 then perhaps you can try changing the "inc eax" to "add eax,1". The opt manual also mentions about avoiding the inc instruction. I would be interested to see the result.
Post 16 Aug 2005, 01:40
View user's profile Send private message Visit poster's website Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 16 Aug 2005, 01:50
The add eax,1 had minimal to no affect on the speed.
But messing with the unroll I found that 4 unrolls gives the greatest reward (as opposed to 2).

Code:
IntSqrt4x:   ;fastest 4x unrolls
   push esi
   push edi
   mov esi,[esp+12]
   xor ecx,ecx
   xor eax,eax
   mov edi,10h
   mov edx,esi
.LP:
   shr edx,1eh
   shl eax,1
   lea ecx,[edx+ecx*4]
   lea edx,[eax+eax+1]
   shl esi,2
   cmp edx,ecx
   ja .SKIP
   sub ecx,edx
   add eax,1
.SKIP:
   mov edx,esi
;unroll
   shr edx,1eh
   shl eax,1
   lea ecx,[edx+ecx*4]
   lea edx,[eax+eax+1]
   shl esi,2
   cmp edx,ecx
   ja .SKIP1
   sub ecx,edx
   add eax,1
.SKIP1:
   mov edx,esi
;unroll 2
   shr edx,1eh
   shl eax,1
   lea ecx,[edx+ecx*4]
   lea edx,[eax+eax+1]
   shl esi,2
   cmp edx,ecx
   ja .SKIP2
   sub ecx,edx
   add eax,1
.SKIP2:
   mov edx,esi
;unroll 3
   shr edx,1eh
   shl eax,1
   lea ecx,[edx+ecx*4]
   lea edx,[eax+eax+1]
   shl esi,2
   cmp edx,ecx
   ja .SKIP3
   sub ecx,edx
   add eax,1
.SKIP3:
   sub edi,4
   mov edx,esi
   jnz .LP
   pop edi
   pop esi
   ret 4
    


950ms for 1FFFFFFh calls
150ms faster than the 2unroll version +10%

Figuring out a clever way to avoid this branch would yeild the big speed gain. But the subecx,edx make it complicated.
Code:
   cmp edx,ecx
   ja .SKIP3
   sub ecx,edx
   add eax,1
.SKIP3:
    


MOV ebx, edx
SUB ebx, ecx
ADC eax,0
;; :S Can't think of an efficient way.
Post 16 Aug 2005, 01:50
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
THEWizardGenius



Joined: 14 Jan 2005
Posts: 382
Location: California, USA
THEWizardGenius 16 Aug 2005, 02:31
Oh, you optimization fanatics Razz Isn't there any such thing as fast enough? Smile
Post 16 Aug 2005, 02:31
View user's profile Send private message AIM Address Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20632
Location: In your JS exploiting you and your system
revolution 16 Aug 2005, 03:05
My results:
Code:
Input: 65536 Tests: 01ffffffh

isqrt_eax: 2634ms
isqrt_eax_p6: 2173ms
IntSqrt: 2364ms

Input: 4000000000 Tests: 01ffffffh
isqrt_eax: 4466ms
isqrt_eax_p6: 3595ms
IntSqrt: 2364ms
    

It is curious that the times are very different. My Pent-M is 1300MHz but can outperform the P4-3200 in certain tasks.

I think the two major sources of ineffiency in my code are SHLD, and ADC. The P4 performance with these two instructions is terrible. That must be why using "ja .SKIP" can still give good performance over anything that uses ADC. But one thing to remember, the input values do not change and the loop count is 16 so the branch predictor will do a good job with "ja .SKIP". Perhaps for a more 'normal' situation where the input values are less predicable the performance will suffer somewhat?
Post 16 Aug 2005, 03:05
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20632
Location: In your JS exploiting you and your system
revolution 16 Aug 2005, 03:06
Quote:
Isn't there any such thing as fast enough?
Of course not, computers will never be fast enough! Sad
Post 16 Aug 2005, 03:06
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20632
Location: In your JS exploiting you and your system
revolution 16 Aug 2005, 09:14
Still can't beat the FPU in the Pent-M
Code:
isqrt_eax_fpu:
        movd    xmm0,eax
        movq    qword[test_value],xmm0
        fild    qword[test_value]
        fsqrt
        fistp   qword[test_output]
        mov     eax,[test_output]
        ret    
Code:
Input: 65536 Tests: 01ffffffh 

isqrt_eax: 2634ms 
isqrt_eax_p6: 2173ms 
IntSqrt: 2364ms 
isqrt_eax_fpu: 280ms

Input: 4000000000 Tests: 01ffffffh 
isqrt_eax: 4466ms 
isqrt_eax_p6: 3595ms 
IntSqrt: 2364ms
isqrt_eax_fpu: 1512ms    
Post 16 Aug 2005, 09:14
View user's profile Send private message Visit poster's website Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 16 Aug 2005, 13:07
r22 wrote:
Using SHLD and BSR are probably to blame for the poor performance, since both instructions are considered slow by pentium optimization manuals.
Well, my AMD Athlon XP+ gives the same kind of result. Don't ask me why, but SHLD/SHRD is f***ing slow Sad

Actually, revolution's code is quiet optimized, except no unrolled loop and the slow SHLD in its loop. But you could rewrite this instruction if you had 1 another register available.
Code:
isqrt_eax: ;for plain 386's and up 
        bsr     edi,eax 
        setnz   cl 
        dec     cl
        shr     edi,cl 
        xor     ebx,ebx 
        and     edi,-2 
        xor     edx,edx 
        lea     cl,[edi+2] 
        ror     eax,cl 
.digit:;shld    ebx,eax,2
        shl     ebx,2
        mov     ebp,eax
        shr     ebp,32-2
        or      ebx,ebp
        lea     ecx,[edx*4+1] 
        cmp     ebx,ecx 
        cmc 
        sbb     esi,esi 
        adc     edx,edx 
        and     ecx,esi 
        shl     eax,2 
        sub     ebx,ecx 
        sub     edi,2 
        jns     .digit 
        mov     eax,edx 
        ret 

isqrt_eax_p6: ;for processors with CMOV 
        bsr     edi,eax
        cmovz   edi,eax
        xor     ebx,ebx
        and     edi,-2
        xor     edx,edx
        lea     cl,[edi+2]
        ror     eax,cl
.digit:;shld    ebx,eax,2
        shl     ebx,2
        mov     esi,eax
        shr     esi,32-2
        lea     ecx,[edx*4+1]
        or      ebx,esi
        mov     esi,ebx
        sub     ebx,ecx
        cmovb   ebx,esi
        cmc
        adc     edx,edx
        shl     eax,2
        sub     edi,2
        jns     .digit
        mov     eax,edx
        ret
    

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||


Last edited by MCD on 18 Aug 2005, 11:42; edited 1 time in total
Post 16 Aug 2005, 13:07
View user's profile Send private message Reply with quote
Matrix



Joined: 04 Sep 2004
Posts: 1166
Location: Overflow
Matrix 16 Aug 2005, 13:23
Code:
sqrt:   ; returns: ax=Square root ; bx,cx,dx,di = undefined
mov bx,128 ; initial guess
mov di,bx
mov cx,ax
.top:
mov ax,cx
xor dx,dx
div bx
add ax,bx
shr ax,1
cmp ax,bx
je .finished
cmp ax,di
mov di,bx
mov bx,ax
jne .top
cmp ax,di
jL .finished
mov ax,di
.finished:
ret

sqrt32: ; returns: eax=Square root
mov ebx,32768 ; initial guess
mov edi,ebx
mov ecx,eax
.top:
mov eax,ecx
xor edx,edx
div ebx
add eax,ebx
shr eax,1
cmp eax,ebx
je .finished
cmp eax,edi
mov edi,ebx
mov ebx,eax
jne .top
cmp eax,edi
jl .finished
mov eax,edi
.finished:
ret
    
Post 16 Aug 2005, 13:23
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20632
Location: In your JS exploiting you and your system
revolution 16 Aug 2005, 14:47
Quote:
Actually, reverends code is ...
Umm, where is the code from reverend? Did I miss something?
Post 16 Aug 2005, 14:47
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20632
Location: In your JS exploiting you and your system
revolution 16 Aug 2005, 14:51
Quote:
and the slow SHLD in its loop
But also the SBB and ADC kill the speed.
Post 16 Aug 2005, 14:51
View user's profile Send private message Visit poster's website Reply with quote
Reverend



Joined: 24 Aug 2004
Posts: 408
Location: Poland
Reverend 16 Aug 2005, 15:02
revolution wrote:
Quote:
Actually, reverends code is ...
Umm, where is the code from reverend? Did I miss something?
I also got confused when I read this Laughing He obviously meant revolution's code. But in fact that's a great nobilation for me Smile as I am not good at optimizing
Post 16 Aug 2005, 15:02
View user's profile Send private message Visit poster's website Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 16 Aug 2005, 16:15
My initial interest of the integer sqrt was for 64bit integers. The algorithm would take 32 iterstions, which would make it run 2x+ slower.

Very true about the FPU method being the fastest. After looking at the documentation you CAN load a 64bit integer onto the FPU, But is it SIGNED or UNSIGNED? If signed then 7FFFFFFFFFFFFFFFh is it's max input.
[edist missed 4 F's Razz]
After looking at Revolution's benchmarking of the FPU code it seems it takes much longer if the integer is large so I'd think a 64bit integer FSQRT would be very costly. It's a shame CVTSI2SD has to be signed or SSE would be the fastest viable 32bit solution.

Interesting stuff

FYI: I tried coding the whole algo in SSE, not optimized but it runs like crap since there's no way to make it parallel.
The following runs very poorly
Code:
IntSqrt2:
   mov eax,[esp+4]
   mov ecx,10h
   movd xmm0,eax  ;esi
   pxor xmm1,xmm1 ;ecx
   pxor xmm2,xmm2 ;eax
 ;  movd xmm3,ecx  ;edi
   movd xmm4,eax  ;edx
   pxor xmm5,xmm5 ;free
  ; pxor xmm6,xmm6 ;free2
  ; pxor xmm7,xmm7 ;free3
.LP:
   pslld xmm2,1
   psrld xmm4,1eh
   paddd xmm1,xmm1 ;2x
   paddd xmm1,xmm1 ;4x
   paddd xmm1,xmm4 ;+rem
   movq xmm5,xmm2
   paddd xmm5,xmm5 ;2n
   paddd xmm5,dqword[SqrtOne]; +1
   movq xmm4,xmm5
   pslld xmm0,2
   pcmpgtd xmm5,xmm1
   pmovmskb eax,xmm5
   test eax,1000b
   jnz .Above
   psubd xmm1,xmm4 ;-
   paddd xmm2,dqword[SqrtOne];++
 .Above:
   movq xmm4,xmm0
   dec ecx
   jnz .LP
   movd eax,xmm2
   ret 4
align 16
SqrtOne dq 1,0
    


I guess FPU wins until x86-64.


Last edited by r22 on 16 Aug 2005, 22:39; edited 1 time in total
Post 16 Aug 2005, 16:15
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20632
Location: In your JS exploiting you and your system
revolution 16 Aug 2005, 16:55
Quote:
But is it SIGNED or UNSIGNED?
Signed.
Quote:
7FFFFFFFFFFFh is it's max
Correct (but only correct with more "F"s, I assume you meant to put 15 letter "F"s).
Quote:
I'd think a 64bit integer FSQRT would be very costly
The same cost as 4e9 because it still only calculates the same number of bits precision, the absolute size of the number does not matter. 65536 is fast because it only needs 1 bit of precision to get the right answer, just like 2^32, 2^34, 2^36, ... also only need 1 bit of precision for the correct answer.
Post 16 Aug 2005, 16:55
View user's profile Send private message Visit poster's website Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 17 Aug 2005, 14:13
Quote:
But also the SBB and ADC kill the speed.
Well, that's not SO much taking speed, I think. And by revolution's (I actually meant revolution, not Reverend, sorry for that mistake Embarassed ) code is quiet optimized, I meant that without taking into account the data sizzling with MMX and SSE too.
And another problem is that revolution's aproach is harder to scale up to 64bit integers, unless you require the code to support 64bit long mode instructions. Sad So after all, it's maybe not such a good one.

If only newton's iteration for square roots didn't have a divide in it Evil or Very Mad Crying or Very sad
like newton for reciprokes have

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||


Last edited by MCD on 18 Aug 2005, 11:44; edited 1 time in total
Post 17 Aug 2005, 14:13
View user's profile Send private message Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 17 Aug 2005, 15:17
but can't the divisions be rounded with some multiply algo?
Post 17 Aug 2005, 15:17
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 18 Aug 2005, 19:33
MADIS
The Newton iteration can probably be used for straight integers BUT
Forumla

Guess = (Guess - (Guess - Number)/Guess) / 2
The divide by Guess can't be fudged using a magic number multiply because we wont know what guess will be.
The divide by 2 can be simplified by (SHR Result, 1)

Because of the Divid by Guess even with 7-10 iterations you wouldnt gain any performance because the DIV opcode is one of the slowest.
Post 18 Aug 2005, 19:33
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 18 Aug 2005, 20:06
As for the FPU integer square root, we all assumed it worked fine but we didn't really test it.

The integer square root for 65535 should be 255 but the FPU code returns 256 because the decimal is .999 (>.5) so it (by default) rounds up, so you would have to change the control word to get an accurate reading or subtract .5 before the FISTP either way it would cost more time than the benchmark shows.

IN CONCLUSION (the horse has been beaten thoroughly now)
Code:
;push number \ call isqrtSSE
isqrtSSE:
   mov eax,[esp+4]
   movsd xmm1,qword[sign2unsign] ; used if bit 31 is set
   mov ecx,eax
   and eax,7FFFFFFFh
   cvtsi2sd xmm0,eax
   shl ecx,1
   jnc .skip
   addsd xmm0,xmm1  ;make a 32bit UNSIGNED double
 .skip:
   sqrtsd xmm0,xmm0
   cvttsd2si eax,xmm0 ; notice the 2 T's for truncate
   ret 4
 align 16
 sign2unsign dq 2147483648.0


;call by push 0 \ push number \ call isqrtFPU
isqrtFPU: ; returns rounded results
   fild qword[esp+4]
   fsqrt
   fistp dword[esp+4]
   mov eax,[esp+4]
   ret 8 
    


isqrtSSE runs 220% faster than isqrtFPU on a static number benchmark.

After realizing that my isqrtSSE has a branch and the FPU version didnt I ran another benchmark using RANDOM 32bit numbers so branch prediction would be compromized.
isqrtSSE runs 25% faster than isqrtFPU on a random number benchmark.

Quote:

I guess FPU wins until x86-64.

I formally apologize for lying, SSE is better.
Post 18 Aug 2005, 20:06
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.