flat assembler
Message board for the users of flat assembler.

 Index > Main > Cool Integer Square Root algo Goto page 1, 2  Next
Author
r22

Joined: 27 Dec 2004
Posts: 805
r22 15 Aug 2005, 03:40
I was thinking about int approximations of square roots, and thought about the current guess and check methods and how they would fair speed wise with a 64bit integer.

Code:
```IntSqrtSSE:
mov eax,[esp+4]
cvtsi2sd xmm0,eax
sqrtsd xmm0,xmm0
cvtsd2si eax,xmm0
ret 4
```

Is very fast, BUT you can't CVT 64bit integers into double FP values. Using the FPU would just make the function slow. Also because it's SIGNED integer the highest value you can do is 7FFFFFFFh.

Here's a square root for all unsigned 32bit values.

Code:
```IntSqrt:
push esi
push edi
mov esi,[esp+12] ; IN number to find sqr root of
xor ecx,ecx        ;zero out
xor eax,eax        ;
mov edi,10h       ;loop counter , use 32 for 64bit int
mov edx,esi
.LP:
shr edx,1eh
shl eax,1
lea ecx,[edx+ecx*4]
lea edx,[eax+eax+1] ;2x + 1 part of the binomial expand of (x+1)^2
shl esi,2
cmp edx,ecx
ja .SKIP
sub ecx,edx
inc eax
.SKIP:
mov edx,esi
;unroll        15% speed boost
shr edx,1eh
shl eax,1
lea ecx,[edx+ecx*4]
lea edx,[eax+eax+1]
shl esi,2
cmp edx,ecx
ja .SKIP1
sub ecx,edx
inc eax
.SKIP1:
mov edx,esi
sub edi,2
jnz .LP
pop edi
pop esi
ret 4
```

Surprisingly it's only 2x slower than the SSE version (after the 15% speed increase from the unroll).

Also in 64bit world the function can be easily modified to get the int sqrt of 64bit integers.

I usually like to comment my code, but unless your familiar with the ambiguous math involved it really wouldn't help much. It relies on the bond and bit patterns between square root, and remaining numbers.

With the extra registers on a 64bit system this function has a lot of room for optimization.
15 Aug 2005, 03:40
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20170
revolution 15 Aug 2005, 08:12
I have come across this before. Here is the code I was using:
Code:
```isqrt_eax: ;for plain 386's and up
xor     ebx,ebx
xor     edx,edx
bsr     edi,eax
setnz   cl
sub     cl,1
shr     edi,cl
and     edi,-2
lea     ecx,[edi+2]
ror     eax,cl
.digit: shld    ebx,eax,2
lea     ecx,[edx*4+1]
cmp     ebx,ecx
cmc
sbb     esi,esi
and     ecx,esi
sub     ebx,ecx
shl     eax,2
sub     edi,2
jns     .digit
mov     eax,edx
ret

isqrt_eax_p6: ;for processors with CMOV
xor     ebx,ebx
xor     edx,edx
bsr     edi,eax
cmovz   edi,eax
and     edi,-2
lea     ecx,[edi+2]
ror     eax,cl
.digit: shld    ebx,eax,2
lea     ecx,[edx*4+1]
mov     esi,ebx
sub     ebx,ecx
cmovb   ebx,esi
cmc
shl     eax,2
sub     edi,2
jns     .digit
mov     eax,edx
ret    ```
Maybe you can compare the relative speeds?
15 Aug 2005, 08:12
r22

Joined: 27 Dec 2004
Posts: 805
r22 15 Aug 2005, 20:43
In 1FFFFFFh calls to each function
Test done on a p4 3.2ghz 1gb ram

I used 65535 as the test number
IntSqrt finished in 1100ms
isqrt_eax_p6 finished in 2900ms

When 4billion is used as the test number
IntSqrt results are the same 1100ms
isqrt_eax_p6 takes 5600ms

Using SHLD and BSR are probably to blame for the poor performance, since both instructions are considered slow by pentium optimization manuals.
15 Aug 2005, 20:43
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20170
revolution 16 Aug 2005, 01:40
Thanks for that, later I will try on my Pent M and see if I get similar results.

Since you are using a P4 then perhaps you can try changing the "inc eax" to "add eax,1". The opt manual also mentions about avoiding the inc instruction. I would be interested to see the result.
16 Aug 2005, 01:40
r22

Joined: 27 Dec 2004
Posts: 805
r22 16 Aug 2005, 01:50
But messing with the unroll I found that 4 unrolls gives the greatest reward (as opposed to 2).

Code:
```IntSqrt4x:   ;fastest 4x unrolls
push esi
push edi
mov esi,[esp+12]
xor ecx,ecx
xor eax,eax
mov edi,10h
mov edx,esi
.LP:
shr edx,1eh
shl eax,1
lea ecx,[edx+ecx*4]
lea edx,[eax+eax+1]
shl esi,2
cmp edx,ecx
ja .SKIP
sub ecx,edx
.SKIP:
mov edx,esi
;unroll
shr edx,1eh
shl eax,1
lea ecx,[edx+ecx*4]
lea edx,[eax+eax+1]
shl esi,2
cmp edx,ecx
ja .SKIP1
sub ecx,edx
.SKIP1:
mov edx,esi
;unroll 2
shr edx,1eh
shl eax,1
lea ecx,[edx+ecx*4]
lea edx,[eax+eax+1]
shl esi,2
cmp edx,ecx
ja .SKIP2
sub ecx,edx
.SKIP2:
mov edx,esi
;unroll 3
shr edx,1eh
shl eax,1
lea ecx,[edx+ecx*4]
lea edx,[eax+eax+1]
shl esi,2
cmp edx,ecx
ja .SKIP3
sub ecx,edx
.SKIP3:
sub edi,4
mov edx,esi
jnz .LP
pop edi
pop esi
ret 4
```

950ms for 1FFFFFFh calls
150ms faster than the 2unroll version +10%

Figuring out a clever way to avoid this branch would yeild the big speed gain. But the subecx,edx make it complicated.
Code:
```   cmp edx,ecx
ja .SKIP3
sub ecx,edx
.SKIP3:
```

MOV ebx, edx
SUB ebx, ecx
;; :S Can't think of an efficient way.
16 Aug 2005, 01:50
THEWizardGenius

Joined: 14 Jan 2005
Posts: 382
Location: California, USA
THEWizardGenius 16 Aug 2005, 02:31
Oh, you optimization fanatics Isn't there any such thing as fast enough?
16 Aug 2005, 02:31
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20170
revolution 16 Aug 2005, 03:05
My results:
Code:
```Input: 65536 Tests: 01ffffffh

isqrt_eax: 2634ms
isqrt_eax_p6: 2173ms
IntSqrt: 2364ms

Input: 4000000000 Tests: 01ffffffh
isqrt_eax: 4466ms
isqrt_eax_p6: 3595ms
IntSqrt: 2364ms
```

It is curious that the times are very different. My Pent-M is 1300MHz but can outperform the P4-3200 in certain tasks.

I think the two major sources of ineffiency in my code are SHLD, and ADC. The P4 performance with these two instructions is terrible. That must be why using "ja .SKIP" can still give good performance over anything that uses ADC. But one thing to remember, the input values do not change and the loop count is 16 so the branch predictor will do a good job with "ja .SKIP". Perhaps for a more 'normal' situation where the input values are less predicable the performance will suffer somewhat?
16 Aug 2005, 03:05
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20170
revolution 16 Aug 2005, 03:06
Quote:
Isn't there any such thing as fast enough?
Of course not, computers will never be fast enough!
16 Aug 2005, 03:06
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20170
revolution 16 Aug 2005, 09:14
Still can't beat the FPU in the Pent-M
Code:
```isqrt_eax_fpu:
movd    xmm0,eax
movq    qword[test_value],xmm0
fild    qword[test_value]
fsqrt
fistp   qword[test_output]
mov     eax,[test_output]
ret    ```
Code:
```Input: 65536 Tests: 01ffffffh

isqrt_eax: 2634ms
isqrt_eax_p6: 2173ms
IntSqrt: 2364ms
isqrt_eax_fpu: 280ms

Input: 4000000000 Tests: 01ffffffh
isqrt_eax: 4466ms
isqrt_eax_p6: 3595ms
IntSqrt: 2364ms
isqrt_eax_fpu: 1512ms    ```
16 Aug 2005, 09:14
MCD

Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 16 Aug 2005, 13:07
r22 wrote:
Using SHLD and BSR are probably to blame for the poor performance, since both instructions are considered slow by pentium optimization manuals.
Well, my AMD Athlon XP+ gives the same kind of result. Don't ask me why, but SHLD/SHRD is f***ing slow

Actually, revolution's code is quiet optimized, except no unrolled loop and the slow SHLD in its loop. But you could rewrite this instruction if you had 1 another register available.
Code:
```isqrt_eax: ;for plain 386's and up
bsr     edi,eax
setnz   cl
dec     cl
shr     edi,cl
xor     ebx,ebx
and     edi,-2
xor     edx,edx
lea     cl,[edi+2]
ror     eax,cl
.digit:;shld    ebx,eax,2
shl     ebx,2
mov     ebp,eax
shr     ebp,32-2
or      ebx,ebp
lea     ecx,[edx*4+1]
cmp     ebx,ecx
cmc
sbb     esi,esi
and     ecx,esi
shl     eax,2
sub     ebx,ecx
sub     edi,2
jns     .digit
mov     eax,edx
ret

isqrt_eax_p6: ;for processors with CMOV
bsr     edi,eax
cmovz   edi,eax
xor     ebx,ebx
and     edi,-2
xor     edx,edx
lea     cl,[edi+2]
ror     eax,cl
.digit:;shld    ebx,eax,2
shl     ebx,2
mov     esi,eax
shr     esi,32-2
lea     ecx,[edx*4+1]
or      ebx,esi
mov     esi,ebx
sub     ebx,ecx
cmovb   ebx,esi
cmc
shl     eax,2
sub     edi,2
jns     .digit
mov     eax,edx
ret
```

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||

Last edited by MCD on 18 Aug 2005, 11:42; edited 1 time in total
16 Aug 2005, 13:07
Matrix

Joined: 04 Sep 2004
Posts: 1166
Location: Overflow
Matrix 16 Aug 2005, 13:23
Code:
```sqrt:   ; returns: ax=Square root ; bx,cx,dx,di = undefined
mov bx,128 ; initial guess
mov di,bx
mov cx,ax
.top:
mov ax,cx
xor dx,dx
div bx
shr ax,1
cmp ax,bx
je .finished
cmp ax,di
mov di,bx
mov bx,ax
jne .top
cmp ax,di
jL .finished
mov ax,di
.finished:
ret

sqrt32: ; returns: eax=Square root
mov ebx,32768 ; initial guess
mov edi,ebx
mov ecx,eax
.top:
mov eax,ecx
xor edx,edx
div ebx
shr eax,1
cmp eax,ebx
je .finished
cmp eax,edi
mov edi,ebx
mov ebx,eax
jne .top
cmp eax,edi
jl .finished
mov eax,edi
.finished:
ret
```
16 Aug 2005, 13:23
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20170
revolution 16 Aug 2005, 14:47
Quote:
Actually, reverends code is ...
Umm, where is the code from reverend? Did I miss something?
16 Aug 2005, 14:47
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20170
revolution 16 Aug 2005, 14:51
Quote:
and the slow SHLD in its loop
But also the SBB and ADC kill the speed.
16 Aug 2005, 14:51
Reverend

Joined: 24 Aug 2004
Posts: 408
Location: Poland
Reverend 16 Aug 2005, 15:02
revolution wrote:
Quote:
Actually, reverends code is ...
Umm, where is the code from reverend? Did I miss something?
I also got confused when I read this He obviously meant revolution's code. But in fact that's a great nobilation for me as I am not good at optimizing
16 Aug 2005, 15:02
r22

Joined: 27 Dec 2004
Posts: 805
r22 16 Aug 2005, 16:15
My initial interest of the integer sqrt was for 64bit integers. The algorithm would take 32 iterstions, which would make it run 2x+ slower.

Very true about the FPU method being the fastest. After looking at the documentation you CAN load a 64bit integer onto the FPU, But is it SIGNED or UNSIGNED? If signed then 7FFFFFFFFFFFFFFFh is it's max input.
[edist missed 4 F's ]
After looking at Revolution's benchmarking of the FPU code it seems it takes much longer if the integer is large so I'd think a 64bit integer FSQRT would be very costly. It's a shame CVTSI2SD has to be signed or SSE would be the fastest viable 32bit solution.

Interesting stuff

FYI: I tried coding the whole algo in SSE, not optimized but it runs like crap since there's no way to make it parallel.
The following runs very poorly
Code:
```IntSqrt2:
mov eax,[esp+4]
mov ecx,10h
movd xmm0,eax  ;esi
pxor xmm1,xmm1 ;ecx
pxor xmm2,xmm2 ;eax
;  movd xmm3,ecx  ;edi
movd xmm4,eax  ;edx
pxor xmm5,xmm5 ;free
; pxor xmm6,xmm6 ;free2
; pxor xmm7,xmm7 ;free3
.LP:
pslld xmm2,1
psrld xmm4,1eh
movq xmm5,xmm2
movq xmm4,xmm5
pslld xmm0,2
pcmpgtd xmm5,xmm1
pmovmskb eax,xmm5
test eax,1000b
jnz .Above
psubd xmm1,xmm4 ;-
.Above:
movq xmm4,xmm0
dec ecx
jnz .LP
movd eax,xmm2
ret 4
align 16
SqrtOne dq 1,0
```

I guess FPU wins until x86-64.

Last edited by r22 on 16 Aug 2005, 22:39; edited 1 time in total
16 Aug 2005, 16:15
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20170
revolution 16 Aug 2005, 16:55
Quote:
But is it SIGNED or UNSIGNED?
Signed.
Quote:
7FFFFFFFFFFFh is it's max
Correct (but only correct with more "F"s, I assume you meant to put 15 letter "F"s).
Quote:
I'd think a 64bit integer FSQRT would be very costly
The same cost as 4e9 because it still only calculates the same number of bits precision, the absolute size of the number does not matter. 65536 is fast because it only needs 1 bit of precision to get the right answer, just like 2^32, 2^34, 2^36, ... also only need 1 bit of precision for the correct answer.
16 Aug 2005, 16:55
MCD

Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 17 Aug 2005, 14:13
Quote:
But also the SBB and ADC kill the speed.
Well, that's not SO much taking speed, I think. And by revolution's (I actually meant revolution, not Reverend, sorry for that mistake ) code is quiet optimized, I meant that without taking into account the data sizzling with MMX and SSE too.
And another problem is that revolution's aproach is harder to scale up to 64bit integers, unless you require the code to support 64bit long mode instructions. So after all, it's maybe not such a good one.

If only newton's iteration for square roots didn't have a divide in it
like newton for reciprokes have

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||

Last edited by MCD on 18 Aug 2005, 11:44; edited 1 time in total
17 Aug 2005, 14:13

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
but can't the divisions be rounded with some multiply algo?
17 Aug 2005, 15:17
r22

Joined: 27 Dec 2004
Posts: 805
r22 18 Aug 2005, 19:33
The Newton iteration can probably be used for straight integers BUT
Forumla

Guess = (Guess - (Guess - Number)/Guess) / 2
The divide by Guess can't be fudged using a magic number multiply because we wont know what guess will be.
The divide by 2 can be simplified by (SHR Result, 1)

Because of the Divid by Guess even with 7-10 iterations you wouldnt gain any performance because the DIV opcode is one of the slowest.
18 Aug 2005, 19:33
r22

Joined: 27 Dec 2004
Posts: 805
r22 18 Aug 2005, 20:06
As for the FPU integer square root, we all assumed it worked fine but we didn't really test it.

The integer square root for 65535 should be 255 but the FPU code returns 256 because the decimal is .999 (>.5) so it (by default) rounds up, so you would have to change the control word to get an accurate reading or subtract .5 before the FISTP either way it would cost more time than the benchmark shows.

IN CONCLUSION (the horse has been beaten thoroughly now)
Code:
```;push number \ call isqrtSSE
isqrtSSE:
mov eax,[esp+4]
movsd xmm1,qword[sign2unsign] ; used if bit 31 is set
mov ecx,eax
and eax,7FFFFFFFh
cvtsi2sd xmm0,eax
shl ecx,1
jnc .skip
addsd xmm0,xmm1  ;make a 32bit UNSIGNED double
.skip:
sqrtsd xmm0,xmm0
cvttsd2si eax,xmm0 ; notice the 2 T's for truncate
ret 4
align 16
sign2unsign dq 2147483648.0

;call by push 0 \ push number \ call isqrtFPU
isqrtFPU: ; returns rounded results
fild qword[esp+4]
fsqrt
fistp dword[esp+4]
mov eax,[esp+4]
ret 8
```

isqrtSSE runs 220% faster than isqrtFPU on a static number benchmark.

After realizing that my isqrtSSE has a branch and the FPU version didnt I ran another benchmark using RANDOM 32bit numbers so branch prediction would be compromized.
isqrtSSE runs 25% faster than isqrtFPU on a random number benchmark.

Quote:

I guess FPU wins until x86-64.

I formally apologize for lying, SSE is better.
18 Aug 2005, 20:06
 Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First

 Jump to: Select a forum Official----------------AssemblyPeripheria General----------------MainTutorials and ExamplesDOSWindowsLinuxUnixMenuetOS Specific----------------MacroinstructionsOS ConstructionIDE DevelopmentProjects and IdeasNon-x86 architecturesHigh Level LanguagesProgramming Language DesignCompiler Internals Other----------------FeedbackHeapTest Area
Goto page 1, 2  Next

Forum Rules:
 You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot vote in polls in this forumYou cannot attach files in this forumYou can download files in this forum