flat assembler
Message board for the users of flat assembler.

Index > Main > Cool Integer Square Root algo

Goto page Previous  1, 2
Author
Thread Post new topic Reply to topic
smiddy



Joined: 31 Oct 2004
Posts: 557
smiddy 19 Aug 2005, 00:37
r22 wrote:
IN CONCLUSION (the horse has been beaten thoroughly now)


It was a bad horse to begin with! Laughing

Seriously, you never cease to amaze me. Nice code man!
Post 19 Aug 2005, 00:37
View user's profile Send private message Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 19 Aug 2005, 12:37
Good r22, but another small optimization:
Code:
;push number \ call isqrtSSE 
isqrtSSE: 
   mov eax,[esp+4] 
   movsd xmm1,qword[sign2unsign] ; used if bit 31 is set 
   btr eax,31 ; Changed, see below
   cvtsi2sd xmm0,eax 
   jnc .skip 
   addsd xmm0,xmm1  ;make a 32bit UNSIGNED double 
 .skip: 
   sqrtsd xmm0,xmm0 
   cvttsd2si eax,xmm0 ; notice the 2 T's for truncate 
   ret 4 
 align 16 
 sign2unsign dq 2147483648.0
    

On my AMD Athlon XP, the "BTx" instructions are as fast as plain binary logical instructions. (Don't confuse those with the Bit Scan instructions, these are somewhat slower )

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||
Post 19 Aug 2005, 12:37
View user's profile Send private message Reply with quote
El Tangas



Joined: 11 Oct 2003
Posts: 120
Location: Sunset Empire
El Tangas 19 Aug 2005, 17:52
My turn. I think we can remove the branch with some huh... branch removing code. I didn't test it, but I think you get the general idea. This way, instead of branch penalty, we get size mismatch penalty, I think its smaller...

Code:
;push number \ call isqrtSSE
isqrtSSE:
   mov eax,[esp+4]
   mov ecx,dword[sign2unsign+4] ;load high dword of double
   cdq ;edx contains 0 or 0FFFFh depending on 'sign'
   and eax,7FFFFFFFh
   cvtsi2sd xmm0,eax
   and edx,ecx ;edx contains high dword of 0.0 or 2147483648.0
   mov dword[sign2unsign+4],edx
   movsd xmm1,qword[sign2unsign]
   addsd xmm0,xmm1  ;make a 32bit UNSIGNED double
   sqrtsd xmm0,xmm0
   cvttsd2si eax,xmm0 ; notice the 2 T's for truncate
   ret 4
 align 16
 sign2unsign dq 2147483648.0


;call by push 0 \ push number \ call isqrtFPU
isqrtFPU: ; returns rounded results
   fild qword[esp+4]
   fsqrt
   fistp dword[esp+4]
   mov eax,[esp+4]
   ret 8 
    
Post 19 Aug 2005, 17:52
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 19 Aug 2005, 18:36
Shocking as it is, the speed of the NON branch version is SLOWER than that of the branched version, using random 32bit input benchmarks.

Code:
;runs slower NO BRANCH PREDICTION
isqrtSSE1:  ;NO BRANCH
   mov eax,[esp+4]
;   mov ecx,eax
   cdq
   and eax,7FFFFFFFh
   push edx ;pairable 
   push edx
   cvtsi2sd xmm0,eax
   movq xmm1,qword[esp]
   pand xmm1,dqword[sign2unsign]
   addsd xmm0,xmm1 ;;either adds 0 or add sign2unsign
   sqrtsd xmm0,xmm0
   cvttsd2si eax,xmm0
   pop ecx ;pairable, little faster than add esp,8 because func epilog
   pop edx ;
   ret 4
 align 16
 sign2unsign dq 2147483648.0,0
    


I'm pretty shocked apparently the extra code costs more than the prediction fails (which for a short jump of that kind are around 24+ clocks).

Code:
;runs fastest, it's my initial code with 1 less instruction
isqrtSSE:   ;minus the extra MOVQ
   mov eax,[esp+4]
   mov ecx,eax
   and eax,7FFFFFFFh
   shl ecx,1 ;runs faster than BT
   cvtsi2sd xmm0,eax
   jnc .skip
   addsd xmm0,qword[sign2unsign] ;add from memory
.skip:
   sqrtsd xmm0,xmm0
   cvttsd2si eax,xmm0
   ret 4
 align 16
 sign2unsign dq 2147483648.0
    


The branchless code should run faster but it doesnt :[
Post 19 Aug 2005, 18:36
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
El Tangas



Joined: 11 Oct 2003
Posts: 120
Location: Sunset Empire
El Tangas 19 Aug 2005, 18:44
Ah, well, with these modern CPUs one never knows what to expect Confused
Maybe the stack [esp] is not qword aligned? Does it matter?
Post 19 Aug 2005, 18:44
View user's profile Send private message Reply with quote
MCD



Joined: 21 Aug 2004
Posts: 602
Location: Germany
MCD 19 Aug 2005, 19:09
Yep. Modern CPUs have quiet a lot of unlinear code optimizations.

I'm just interested in how good my versions performs. r22, tested yet?
Post 19 Aug 2005, 19:09
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 19 Aug 2005, 19:33
MCD on my p4 3.2ghz
I edited your function with the 1 instruction less speed up so it'd be an equal test

Code:
isqrtSSE1:
   mov eax,[esp+4]
   btr eax,31 ; Changed, see below
   cvtsi2sd xmm0,eax
   jnc .skip
   addsd xmm0,qword[sign2unsign1]
 .skip:
   sqrtsd xmm0,xmm0
   cvttsd2si eax,xmm0 ; notice the 2 T's for truncate
   ret 4
 align 16
 sign2unsign1 dq 2147483648.0

isqrtSSE:   ;minus the extra MOVQ
   mov eax,[esp+4]
   mov ecx,eax
   and eax,7FFFFFFFh
   shl ecx,1 ;runs faster than BT
   cvtsi2sd xmm0,eax
   jnc .skip
   addsd xmm0,qword[sign2unsign] ;add from memory
.skip:
   sqrtsd xmm0,xmm0
   cvttsd2si eax,xmm0
   ret 4
 align 16
 sign2unsign dq 2147483648.0
    



1FFFFFFh calls to each function

isqrtSSE1 953ms without the edit 1000ms
isqrtSSE 891ms

I think the BTR opcode on the pentium is a little slow

I don't know if the btr instruction is slower on a pentium or not
Post 19 Aug 2005, 19:33
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 20 Aug 2005, 19:09
bt* were slow on older Pentiums (7 clocks) but are now one clock. bsf were up to 40 clocks and bsr up to 73 clocks on these old ones, but newer Pentiums do these aswell in 1 clock. Nice upgrade Very Happy
Post 20 Aug 2005, 19:09
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.