flat assembler
Message board for the users of flat assembler.
![]() Goto page Previous 1, 2, 3, 4 Next |
Author |
|
edfed 01 Sep 2011, 00:51
N=u²
4N=4u² ??? sqrt(4)=2 |
|||
![]() |
|
revolution 01 Sep 2011, 00:53
edfed wrote: N=u² |
|||
![]() |
|
LocoDelAssembly 01 Sep 2011, 04:31
revolution wrote:
Do you really need to go that far? Lets assume sqrt(2) = 1.4, how could it be part of a perfect square? |
|||
![]() |
|
revolution 01 Sep 2011, 05:39
LocoDelAssembly wrote: Lets assume sqrt(2) = 1.4, how could it be part of a perfect square? Let m = 5 m * m = 25 (a perfect square) 2 * m * m = 50 sqrt(2 * m * m)=sqrt(2) * m sqrt(2) * m = 7/5 * 5 = 7 LocoDelAssembly wrote: Do you really need to go that far? In general, if sqrt(2) = p/q then sqrt(2*q*q)=p If p and q are integers then there would be infinitely many perfect squares of the form 2*m*m. But if there is no integer set of {p,q} that satisfies sqrt(2)=p/q (i.e. it is irrational) then there will be no perfect squares of the form 2*m*m. |
|||
![]() |
|
Madis731 01 Sep 2011, 11:57
|
|||
![]() |
|
LocoDelAssembly 01 Sep 2011, 15:09
Quote:
This is so simple that even being very sleepy last night is no excuse... I'm so ready for retirement... |
|||
![]() |
|
r22 01 Sep 2011, 16:21
Here's a quick benchmark (note: I did not test for accuracy of the functions).
Results: Quote:
The results show that the Wikipedia method + FPU v3 when needed is the fastest for single call non-parallel implementations. In a parallel implementation (where you would compute on multiple input values at once) the SSE version would be much easier to extend. Perhaps the Wikipedia version could be adapted to SSE to allow for parallel execution. Interesting stuff, here's the code. Code: FORMAT PE CONSOLE ENTRY start include 'win32ax.inc' section '.text' code readable executable start: COUNTER = 7FFFFFFh ;;;; IF YOU DO NOT HAVE SSE4 COMMENT OUT THE FIRST STDCALL BENCHMARK stdcall Benchmark, COUNTER, IsSquareSSE cinvoke printf, <'%dms SSE v1',13,10,0>, eax ;;;; stdcall Benchmark, COUNTER, IsSquareSSE2 cinvoke printf, <'%dms SSE v2 doing nonFPU IMUL',13,10,0>, eax ;;;; stdcall Benchmark, COUNTER, IsSquareFPU cinvoke printf, <'%dms FPU v1',13,10,0>, eax ;;;; stdcall Benchmark, COUNTER, IsSquareFPU2 cinvoke printf, <'%dms FPU v2',13,10,0>, eax ;;;; stdcall Benchmark, COUNTER, IsSquareFPU3 cinvoke printf, <'%dms FPU v3 doing nonFPU IMUL',13,10,0>, eax ;;;; stdcall Benchmark, COUNTER, IsSquareWIKI cinvoke printf, <'%dms Wikipedia + FPU v3 if needed',13,10,0>, eax ;;;; cinvoke system, <'pause',0> invoke ExitProcess, 0 align 16 Benchmark: PUSH ebp MOV ebp, esp PUSH ebx PUSH esi PUSH edi MOV ebx, [ebp+8] MOV esi, [ebp+12] CALL [GetTickCount] MOV edi, eax JMP .loop align 16 .loop: PUSH ebx CALL esi DEC ebx JNZ .loop CALL [GetTickCount] SUB eax, edi POP edi POP esi POP ebx MOV esp, ebp POP ebp RET 8 align 16 IsSquareSSE: MOV ecx, [esp+4] XOR eax, eax CVTSI2SD xmm0, ecx SQRTSD xmm0, xmm0 ROUNDSD xmm0, xmm0, 11b MULSD xmm0, xmm0 CVTTSD2SI edx, xmm0 CMP edx, ecx SETE al RET 4 align 16 IsSquareSSE2: MOV ecx, [esp+4] XOR eax, eax CVTSI2SD xmm0, ecx SQRTSD xmm0, xmm0 CVTTSD2SI edx, xmm0 IMUL edx, edx CMP edx, ecx SETE al RET 4 align 16 IsSquareFPU: fild dword[esp+4] ; st0 = k² fsqrt ; st0 = k frndint ; st0 = round(k) fmul st0, st0 ; st0 = round(k)² ficomp dword [esp+4]; C3 := round(k)² = k² fstsw ax sahf MOV eax, 0 SETZ al RET 4 align 16 IsSquareFPU2: fild dword[esp+4] ; st0 = k² fsqrt ; st0 = k frndint ; st0 = round(k) MOV edx, [esp+4] fmul st0, st0 ; st0 = round(k)² FISTP dword[esp-4]; C3 := round(k)² = k² XOR eax, eax CMP edx, [esp-4] SETE al RET 4 align 16 IsSquareFPU3: fild dword[esp+4] ; st0 = k² fsqrt ; st0 = k FISTP dword[esp-4]; C3 := round(k)² = k² XOR eax, eax MOV edx, [esp-4] IMUL edx, edx CMP edx, [esp+4] SETE al RET 4 align 16 IsSquareWIKI: MOV eax, [esp+4] and eax,0ffh mov cl,al and al,0fh test al,al jz @Is_Zero cmp al,1 jz IsSquareFPU3 cmp al,4 jz @Is_Four cmp al,9 jz IsSquareFPU3 XOR eax, eax RET 4 align 16 @Is_Zero: mov al,cl shr al,4 test al,al jz IsSquareFPU3 cmp al,1 jz IsSquareFPU3 cmp al,4 jz IsSquareFPU3 cmp al,9 jz IsSquareFPU3 XOR eax, eax RET 4 align 16 @Is_Four: mov al,cl shr al,4 test al,1 jz IsSquareFPU3 XOR eax, eax RET 4 section '.data' data readable writeable _pause db 'pause',0 section '.idata' import data readable writeable library kernel32,'KERNEL32.DLL',\ msvcrt, 'MSVCRT.DLL' include 'api\Kernel32.inc' import msvcrt,\ printf, 'printf',\ system, 'system' |
|||
![]() |
|
Madis731 01 Sep 2011, 16:49
Code: ; On my T9300 (mobile) Core 2 3666ms SSE v1 2730ms SSE v2 doing nonFPU IMUL 3183ms FPU v1 3182ms FPU v2 2184ms FPU v3 doing nonFPU IMUL 2013ms Wikipedia + FPU v3 if needed When you, r22, first posted about SSE approach, I started looking at the timings of FSQRT, FISTP, SQRT**, VSQRT**, RSQRT**, VRSQRT, ROUND**, VROUND**. The (rather pessimistic) conclusion is: having a Sandy Bridge will help (clock counts wise) getting better benchmark results, but what is the percentage of folks having these CPUs lying around for perfect square calculations. I don't know if 80-bit FPU precision is better than 64-bit SSE precision, but if vectorized, the theoretical speed will double (SSE or AVX?). AVX registers have twice the size, but using the same amount of SQRT units apparently. Maybe it has got something to do with execution units not upgraded to full 256 bits. Enough of rambling... I will try to read and understand about http://mathcentral.uregina.ca/RR/database/RR.09.95/grzesina1.html EDIT: Scratch that! The geometrical square root didn't help a lot. This investigation, however, is interesting: http://www.azillionmonkeys.com/qed/sqroot.html There's some conclusion in the middle of the page saying FSQRT is the fastest you can get ![]() Last edited by Madis731 on 01 Sep 2011, 20:56; edited 1 time in total |
|||
![]() |
|
Goplat 01 Sep 2011, 18:04
Here's a way that might be faster than rounding:
Code: fild dword [num] fsqrt fstp qword [temp] mov eax, [temp] ; num is square if and only if eax=0 |
|||
![]() |
|
r22 02 Sep 2011, 12:15
@Goplat nice hack, it's slightly faster then the FPU + IMUL.
Quote:
Code: align 16 IsSquareFPU4: fild dword[esp+4] ; st0 = k² fsqrt ; st0 = k XOR eax, eax FSTP qword[esp-8]; TEST dword[esp-8], -1 SETZ al RET 4 |
|||
![]() |
|
Fanael 02 Sep 2011, 17:11
Something that doesn't perform any floating point operations:
Code: align 16 IntOnly_v1: mov eax, [esp + 4] mov ecx, eax and ecx, 0xF test [ecx + .lut], -1 jz .false mov edx, 0x40000000 push ebx xor ebx, ebx rept 15 { lea ecx, [ebx + edx] shr ebx, 1 cmp eax, ecx jb @f sub eax, ecx add ebx, edx @@: shr edx, 2 } lea ecx, [ebx + edx] shr ebx, 1 cmp eax, ecx jb @f add ebx, edx @@: imul ebx, ebx cmp ebx, [esp + 4 + 4] sete al movzx eax, al pop ebx ret 4 .false: xor eax, eax ret 4 .lut db 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 ![]() |
|||
![]() |
|
Goplat 02 Sep 2011, 17:54
Fanael: Ah, the good old integer square root algorithm
![]() |
|||
![]() |
|
Fanael 02 Sep 2011, 18:50
Oh yeah, so there's the code without that multiplication, hopefully slightly faster (still not optimized much, constant propagation will help by freeing up EDX):
Code: align 16 IntOnly_v2: mov eax, [esp + 4] mov ecx, eax and ecx, 0xF test [ecx + .lut], -1 jz .false mov edx, 0x40000000 push ebx xor ebx, ebx rept 15 { lea ecx, [ebx + edx] shr ebx, 1 cmp eax, ecx jb @f sub eax, ecx add ebx, edx @@: shr edx, 2 } lea ecx, [ebx + edx] pop ebx cmp eax, ecx jb @f sub eax, ecx setz al movzx eax, al ret 4 @@: test eax, eax setz al movzx eax, al ret 4 .false: xor eax, eax ret 4 .lut db 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 EDIT: removed the unnecessary "shr ebx, 1". Also, there's the v3, basically v2 + constant propagation: Code: align 16 IntOnly_v3: mov eax, [esp + 4] mov ecx, eax and ecx, 0xF test [ecx + .lut], -1 jz .false xor edx, edx rept 15 n:0 { lea ecx, [edx + (1 shl ((n xor 15)*2))] shr edx, 1 cmp eax, ecx jb @f sub eax, ecx add edx, (1 shl ((n xor 15)*2)) @@: } lea ecx, [edx + 1] cmp eax, ecx jb @f sub eax, ecx setz al movzx eax, al ret 4 @@: test eax, eax setz al movzx eax, al ret 4 .false: xor eax, eax ret 4 .lut db 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 |
|||
![]() |
|
edfed 05 Sep 2011, 10:02
in 32 bits, there are only 65536 perfect squares. then, only 65536 values to test.
no need to remember them. just need to evaluate, like the analog to digital conversion. a 32 bits perfect square is evaluable by the most significant set bit. this bit is there because it was set by a sqrt(bit order). bit0²=bit0 bit1²=bit2 bit2²=bit4 bit4²=bit8 etc... then, if bit 16 is set, means that bit 8 is the root. then, you have to evaluate the perfect square around our number to test. bsr is there to find the most significant set bit. in order to know what squared number to test first. Code: testperfectsquare: ;eax number to test ;return eax=root if ok, -1 if not ok bsr ecx,eax shr ecx,1 xor edx,edx bts edx,ecx mov ebx,edx imul edx,ebx cmp edx,eax je .ok jl .less .more: something like that, i am not at home, then, i can't test and code efficientlly. .notok: mov ecx,-1 .ok: mov eax,ecx ret |
|||
![]() |
|
revolution 05 Sep 2011, 11:25
edfed: Are you trying to reinvent the integer square root algorithm? I think you will find that isqrt is what you are implying.
|
|||
![]() |
|
edfed 05 Sep 2011, 11:42
yep!
i've found that if there is one, or more bits set to 0 between two bits set to 1, the square wil have more bits than root. otherwise, if there is only one single contiguous 1's bitstream, there will be the same number of bit set in square than in root. for example: 0111 gives 00110001, 3 bits for 3 bits. 0101 gives 00011001, 3 bits for 2 bits. 1100 gives 10010000, 2 bits for 2 bits 1101 gives 10101001, 4 bits for 3 bits. i suppose that for any combinaison of bits with only one single contiguous 1s region, there will be exactlly the same number of 1s in the square. otherwise, there will be more 1s in the square than in the root; |
|||
![]() |
|
revolution 05 Sep 2011, 12:02
Code: isqrt_eax: xor ebx,ebx xor edx,edx bsr edi,eax setnz cl sub cl,1 shr edi,cl and edi,-2 lea ecx,[edi+2] ror eax,cl .digit: shld ebx,eax,2 lea ecx,[edx*4+1] cmp ebx,ecx cmc sbb esi,esi adc edx,edx and ecx,esi sub ebx,ecx shl eax,2 sub edi,2 jns .digit mov eax,edx ret isqrt_eax_p2: ;pentium II with cmovcc xor ebx,ebx xor edx,edx bsr edi,eax cmovz edi,eax and edi,-2 lea ecx,[edi+2] ror eax,cl .digit: shld ebx,eax,2 lea ecx,[edx*4+1] mov esi,ebx sub ebx,ecx cmovb ebx,esi cmc adc edx,edx shl eax,2 sub edi,2 jns .digit mov eax,edx ret isqrt_eax_sse: cvtsi2ss xmm0,eax sqrtss xmm0,xmm0 cvttss2si eax,xmm0 ret isqrt_eax_sse2: cvtsi2sd xmm0,eax sqrtsd xmm0,xmm0 cvttsd2si eax,xmm0 ret |
|||
![]() |
|
edfed 05 Sep 2011, 14:51
the pattern of the first squares number is interresting.
first bit (bit 0) is equal to bit0 of the root, the pattern is 01 bit1 is always equal to 0 bit2 have the pattern 0010, and is equal to 1 when root and 3=2 bit3 have the pattern 00010100 etc... square numbers are interresting. ![]() Last edited by edfed on 18 Sep 2011, 20:52; edited 1 time in total |
|||
![]() |
|
Enko 05 Sep 2011, 14:56
Perhaps its stupid, but why not to generate a 4gb table? (for 32bit integers)
Quote:
|
|||
![]() |
|
Goto page Previous 1, 2, 3, 4 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.