flat assembler
Message board for the users of flat assembler.
Index
> Main > Cool Integer Square Root algo Goto page Previous 1, 2 |
Author |
|
MCD 19 Aug 2005, 12:37
Good r22, but another small optimization:
Code: ;push number \ call isqrtSSE isqrtSSE: mov eax,[esp+4] movsd xmm1,qword[sign2unsign] ; used if bit 31 is set btr eax,31 ; Changed, see below cvtsi2sd xmm0,eax jnc .skip addsd xmm0,xmm1 ;make a 32bit UNSIGNED double .skip: sqrtsd xmm0,xmm0 cvttsd2si eax,xmm0 ; notice the 2 T's for truncate ret 4 align 16 sign2unsign dq 2147483648.0 On my AMD Athlon XP, the "BTx" instructions are as fast as plain binary logical instructions. (Don't confuse those with the Bit Scan instructions, these are somewhat slower ) _________________ MCD - the inevitable return of the Mad Computer Doggy -||__/ .|+-~ .|| || |
|||
19 Aug 2005, 12:37 |
|
El Tangas 19 Aug 2005, 17:52
My turn. I think we can remove the branch with some huh... branch removing code. I didn't test it, but I think you get the general idea. This way, instead of branch penalty, we get size mismatch penalty, I think its smaller...
Code: ;push number \ call isqrtSSE isqrtSSE: mov eax,[esp+4] mov ecx,dword[sign2unsign+4] ;load high dword of double cdq ;edx contains 0 or 0FFFFh depending on 'sign' and eax,7FFFFFFFh cvtsi2sd xmm0,eax and edx,ecx ;edx contains high dword of 0.0 or 2147483648.0 mov dword[sign2unsign+4],edx movsd xmm1,qword[sign2unsign] addsd xmm0,xmm1 ;make a 32bit UNSIGNED double sqrtsd xmm0,xmm0 cvttsd2si eax,xmm0 ; notice the 2 T's for truncate ret 4 align 16 sign2unsign dq 2147483648.0 ;call by push 0 \ push number \ call isqrtFPU isqrtFPU: ; returns rounded results fild qword[esp+4] fsqrt fistp dword[esp+4] mov eax,[esp+4] ret 8 |
|||
19 Aug 2005, 17:52 |
|
r22 19 Aug 2005, 18:36
Shocking as it is, the speed of the NON branch version is SLOWER than that of the branched version, using random 32bit input benchmarks.
Code: ;runs slower NO BRANCH PREDICTION isqrtSSE1: ;NO BRANCH mov eax,[esp+4] ; mov ecx,eax cdq and eax,7FFFFFFFh push edx ;pairable push edx cvtsi2sd xmm0,eax movq xmm1,qword[esp] pand xmm1,dqword[sign2unsign] addsd xmm0,xmm1 ;;either adds 0 or add sign2unsign sqrtsd xmm0,xmm0 cvttsd2si eax,xmm0 pop ecx ;pairable, little faster than add esp,8 because func epilog pop edx ; ret 4 align 16 sign2unsign dq 2147483648.0,0 I'm pretty shocked apparently the extra code costs more than the prediction fails (which for a short jump of that kind are around 24+ clocks). Code: ;runs fastest, it's my initial code with 1 less instruction isqrtSSE: ;minus the extra MOVQ mov eax,[esp+4] mov ecx,eax and eax,7FFFFFFFh shl ecx,1 ;runs faster than BT cvtsi2sd xmm0,eax jnc .skip addsd xmm0,qword[sign2unsign] ;add from memory .skip: sqrtsd xmm0,xmm0 cvttsd2si eax,xmm0 ret 4 align 16 sign2unsign dq 2147483648.0 The branchless code should run faster but it doesnt :[ |
|||
19 Aug 2005, 18:36 |
|
El Tangas 19 Aug 2005, 18:44
Ah, well, with these modern CPUs one never knows what to expect
Maybe the stack [esp] is not qword aligned? Does it matter? |
|||
19 Aug 2005, 18:44 |
|
MCD 19 Aug 2005, 19:09
Yep. Modern CPUs have quiet a lot of unlinear code optimizations.
I'm just interested in how good my versions performs. r22, tested yet? |
|||
19 Aug 2005, 19:09 |
|
r22 19 Aug 2005, 19:33
MCD on my p4 3.2ghz
I edited your function with the 1 instruction less speed up so it'd be an equal test Code: isqrtSSE1: mov eax,[esp+4] btr eax,31 ; Changed, see below cvtsi2sd xmm0,eax jnc .skip addsd xmm0,qword[sign2unsign1] .skip: sqrtsd xmm0,xmm0 cvttsd2si eax,xmm0 ; notice the 2 T's for truncate ret 4 align 16 sign2unsign1 dq 2147483648.0 isqrtSSE: ;minus the extra MOVQ mov eax,[esp+4] mov ecx,eax and eax,7FFFFFFFh shl ecx,1 ;runs faster than BT cvtsi2sd xmm0,eax jnc .skip addsd xmm0,qword[sign2unsign] ;add from memory .skip: sqrtsd xmm0,xmm0 cvttsd2si eax,xmm0 ret 4 align 16 sign2unsign dq 2147483648.0 1FFFFFFh calls to each function isqrtSSE1 953ms without the edit 1000ms isqrtSSE 891ms I think the BTR opcode on the pentium is a little slow I don't know if the btr instruction is slower on a pentium or not |
|||
19 Aug 2005, 19:33 |
|
Madis731 20 Aug 2005, 19:09
bt* were slow on older Pentiums (7 clocks) but are now one clock. bsf were up to 40 clocks and bsr up to 73 clocks on these old ones, but newer Pentiums do these aswell in 1 clock. Nice upgrade
|
|||
20 Aug 2005, 19:09 |
|
Goto page Previous 1, 2 < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.