flat assembler
Message board for the users of flat assembler.

Index > Main > quickest way to check if a number is a square

Goto page Previous  1, 2, 3, 4  Next
Author
Thread Post new topic Reply to topic
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17715
Location: In your JS exploiting you and your system
revolution
magicSqr wrote:
ya must've missed my

N = u²

post then Very Happy
You didn't to show how sqrt(2) cannot be part of a perfect square. The proof of irrationality is a very important step. Without that it is not a proof at all.
Post 31 Aug 2011, 21:04
View user's profile Send private message Visit poster's website Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4242
Location: 2018
edfed
N=u²
4N=4u²
???
sqrt(4)=2
Post 01 Sep 2011, 00:51
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17715
Location: In your JS exploiting you and your system
revolution
edfed wrote:
N=u²
4N=4u²
???
sqrt(4)=2
Yes. For every number that is a perfect square, four times that number will also be a perfect square. Same with nine times, sixteen times, twenty-five times, etc.
Post 01 Sep 2011, 00:53
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
revolution wrote:

The proof of irrationality is a very important step.

Do you really need to go that far? Lets assume sqrt(2) = 1.4, how could it be part of a perfect square?
Post 01 Sep 2011, 04:31
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17715
Location: In your JS exploiting you and your system
revolution
LocoDelAssembly wrote:
Lets assume sqrt(2) = 1.4, how could it be part of a perfect square?
Let's say sqrt(2) = 1.4 = 7/5

Let m = 5
m * m = 25 (a perfect square)
2 * m * m = 50
sqrt(2 * m * m)=sqrt(2) * m
sqrt(2) * m = 7/5 * 5 = 7
LocoDelAssembly wrote:
Do you really need to go that far?
Yes.

In general, if sqrt(2) = p/q then sqrt(2*q*q)=p

If p and q are integers then there would be infinitely many perfect squares of the form 2*m*m. But if there is no integer set of {p,q} that satisfies sqrt(2)=p/q (i.e. it is irrational) then there will be no perfect squares of the form 2*m*m.
Post 01 Sep 2011, 05:39
View user's profile Send private message Visit poster's website Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
Madis731

_________________
My updated idol Very Happy http://www.agner.org/optimize/
Post 01 Sep 2011, 11:57
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Quote:

sqrt(2) * m = 7/5 * 5 = 7

This is so simple that even being very sleepy last night is no excuse... I'm so ready for retirement...
Post 01 Sep 2011, 15:09
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
Here's a quick benchmark (note: I did not test for accuracy of the functions).
Results:
Quote:

1453ms SSE v1
1078ms SSE v2 doing nonFPU IMUL
1250ms FPU v1
1250ms FPU v2
859ms FPU v3 doing nonFPU IMUL
797ms Wikipedia + FPU v3 if needed


The results show that the Wikipedia method + FPU v3 when needed is the fastest for single call non-parallel implementations.

In a parallel implementation (where you would compute on multiple input values at once) the SSE version would be much easier to extend. Perhaps the Wikipedia version could be adapted to SSE to allow for parallel execution.

Interesting stuff, here's the code.

Code:
FORMAT PE CONSOLE
ENTRY start

include 'win32ax.inc'

section '.text' code readable executable

start:
COUNTER = 7FFFFFFh
;;;; IF YOU DO NOT HAVE SSE4 COMMENT OUT THE FIRST STDCALL BENCHMARK
        stdcall   Benchmark, COUNTER, IsSquareSSE
        cinvoke   printf, <'%dms SSE v1',13,10,0>, eax
;;;;
        stdcall   Benchmark, COUNTER, IsSquareSSE2
        cinvoke   printf, <'%dms SSE v2 doing nonFPU IMUL',13,10,0>, eax
;;;;
        stdcall   Benchmark, COUNTER, IsSquareFPU
        cinvoke   printf, <'%dms FPU v1',13,10,0>, eax
;;;;
        stdcall   Benchmark, COUNTER, IsSquareFPU2
        cinvoke   printf, <'%dms FPU v2',13,10,0>, eax
;;;;
        stdcall   Benchmark, COUNTER, IsSquareFPU3
        cinvoke   printf, <'%dms FPU v3 doing nonFPU IMUL',13,10,0>, eax
;;;;
        stdcall   Benchmark, COUNTER, IsSquareWIKI
        cinvoke   printf, <'%dms Wikipedia + FPU v3 if needed',13,10,0>, eax
;;;;
        cinvoke   system, <'pause',0>
        invoke    ExitProcess, 0

align 16
Benchmark:
        PUSH      ebp
        MOV       ebp, esp
        PUSH      ebx
        PUSH      esi
        PUSH      edi

        MOV       ebx, [ebp+8]
        MOV       esi, [ebp+12]
        CALL      [GetTickCount]
        MOV       edi, eax
        JMP       .loop
align 16
.loop:
        PUSH      ebx
        CALL      esi
        DEC       ebx
        JNZ       .loop
        CALL      [GetTickCount]
        SUB       eax, edi

        POP       edi
        POP       esi
        POP       ebx
        MOV       esp, ebp
        POP       ebp
        RET       8

align 16
IsSquareSSE:
        MOV       ecx, [esp+4]
        XOR       eax, eax
        CVTSI2SD  xmm0, ecx
        SQRTSD    xmm0, xmm0
        ROUNDSD   xmm0, xmm0, 11b
        MULSD     xmm0, xmm0
        CVTTSD2SI edx, xmm0
        CMP       edx, ecx
        SETE      al
        RET       4

align 16
IsSquareSSE2:
        MOV       ecx, [esp+4]
        XOR       eax, eax
        CVTSI2SD  xmm0, ecx
        SQRTSD    xmm0, xmm0
        CVTTSD2SI edx, xmm0
        IMUL      edx, edx
        CMP       edx, ecx
        SETE      al
        RET       4

align 16
IsSquareFPU:
        fild      dword[esp+4]  ; st0 = 
        fsqrt           ; st0 = k
        frndint         ; st0 = round(k)
        fmul      st0, st0    ; st0 = round(k)²
        ficomp    dword [esp+4]; C3 := round(k)² = 
        fstsw     ax
        sahf
        MOV       eax, 0
        SETZ      al
        RET       4

align 16
IsSquareFPU2:
        fild      dword[esp+4]  ; st0 = 
        fsqrt           ; st0 = k
        frndint         ; st0 = round(k)
        MOV       edx, [esp+4]
        fmul      st0, st0    ; st0 = round(k)²
        FISTP     dword[esp-4]; C3 := round(k)² = 
        XOR       eax, eax
        CMP       edx, [esp-4]
        SETE      al
        RET       4

align 16
IsSquareFPU3:
        fild      dword[esp+4]  ; st0 = 
        fsqrt           ; st0 = k
        FISTP     dword[esp-4]; C3 := round(k)² = 
        XOR       eax, eax
        MOV       edx, [esp-4]
        IMUL      edx, edx
        CMP       edx, [esp+4]
        SETE      al
        RET       4

align 16
IsSquareWIKI:
        MOV     eax, [esp+4]
        and     eax,0ffh
        mov     cl,al
        and     al,0fh
        test    al,al
        jz      @Is_Zero
        cmp     al,1
        jz      IsSquareFPU3
        cmp     al,4
        jz      @Is_Four
        cmp     al,9
        jz      IsSquareFPU3

        XOR     eax, eax
        RET     4
        
align 16
@Is_Zero:
        mov     al,cl
        shr     al,4
        test    al,al
        jz      IsSquareFPU3
        cmp     al,1
        jz      IsSquareFPU3
        cmp     al,4
        jz      IsSquareFPU3
        cmp     al,9
        jz      IsSquareFPU3
        
        XOR     eax, eax
        RET     4

align 16
@Is_Four:

        mov     al,cl
        shr     al,4
        test    al,1
        jz      IsSquareFPU3

        XOR     eax, eax
        RET     4

section '.data' data readable writeable

_pause  db      'pause',0

section '.idata' import data readable writeable

library kernel32,'KERNEL32.DLL',\
        msvcrt,  'MSVCRT.DLL'

include 'api\Kernel32.inc'
import  msvcrt,\
              printf, 'printf',\
              system, 'system'

    
Post 01 Sep 2011, 16:21
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
Madis731
Code:
; On my T9300 (mobile) Core 2
3666ms SSE v1
2730ms SSE v2 doing nonFPU IMUL
3183ms FPU v1
3182ms FPU v2
2184ms FPU v3 doing nonFPU IMUL
2013ms Wikipedia + FPU v3 if needed
    

When you, r22, first posted about SSE approach, I started looking at the timings of FSQRT, FISTP, SQRT**, VSQRT**, RSQRT**, VRSQRT, ROUND**, VROUND**.

The (rather pessimistic) conclusion is: having a Sandy Bridge will help (clock counts wise) getting better benchmark results, but what is the percentage of folks having these CPUs lying around for perfect square calculations.

I don't know if 80-bit FPU precision is better than 64-bit SSE precision, but if vectorized, the theoretical speed will double (SSE or AVX?).
AVX registers have twice the size, but using the same amount of SQRT units apparently. Maybe it has got something to do with execution units not upgraded to full 256 bits.

Enough of rambling... I will try to read and understand about http://mathcentral.uregina.ca/RR/database/RR.09.95/grzesina1.html

EDIT: Scratch that! The geometrical square root didn't help a lot.
This investigation, however, is interesting: http://www.azillionmonkeys.com/qed/sqroot.html
There's some conclusion in the middle of the page saying FSQRT is the fastest you can get Smile I agree when precision is important.


Last edited by Madis731 on 01 Sep 2011, 20:56; edited 1 time in total
Post 01 Sep 2011, 16:49
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
Goplat



Joined: 15 Sep 2006
Posts: 181
Goplat
Here's a way that might be faster than rounding:
Code:
        fild    dword [num]
 fsqrt
       fstp    qword [temp]
        mov     eax, [temp]
 ; num is square if and only if eax=0    
Explanation: if num is square, temp will be an integer between 0 and 46340, so as a double-precision float, its low 32 bits will be zero. If num is not square, temp will be irrational - its low 32 bits could be zero by "chance", but luckily this does not happen for any num from 0 to 0x7FFFFFFF.
Post 01 Sep 2011, 18:04
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
@Goplat nice hack, it's slightly faster then the FPU + IMUL.
Quote:

1438ms SSE v1
1078ms SSE v2 doing nonFPU IMUL
1266ms FPU v1
1250ms FPU v2
859ms FPU v3 doing nonFPU IMUL
844ms FPU v4 Goplat
765ms Wikipedia + FPU v3 if needed


Code:
align 16
IsSquareFPU4:
        fild      dword[esp+4]  ; st0 = 
        fsqrt           ; st0 = k
        XOR       eax, eax
        FSTP      qword[esp-8];
        TEST      dword[esp-8], -1
        SETZ      al
        RET       4 
    
Post 02 Sep 2011, 12:15
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Fanael



Joined: 03 Jul 2009
Posts: 168
Fanael
Something that doesn't perform any floating point operations:
Code:
align 16
IntOnly_v1:
  mov eax, [esp + 4]
  mov ecx, eax
  and ecx, 0xF
  test [ecx + .lut], -1
  jz .false

  mov edx, 0x40000000
  push ebx
  xor ebx, ebx
  rept 15
  {
    lea ecx, [ebx + edx]
    shr ebx, 1
    cmp eax, ecx
    jb @f
    sub eax, ecx
    add ebx, edx
    @@:
    shr edx, 2
  }
  lea ecx, [ebx + edx]
  shr ebx, 1
  cmp eax, ecx
  jb @f
  add ebx, edx
  @@:

  imul ebx, ebx
  cmp ebx, [esp + 4 + 4]
  sete al
  movzx eax, al

  pop ebx
  ret 4
.false:
  xor eax, eax
  ret 4

.lut db 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0    
Any questions? Very Happy
Post 02 Sep 2011, 17:11
View user's profile Send private message Reply with quote
Goplat



Joined: 15 Sep 2006
Posts: 181
Goplat
Fanael: Ah, the good old integer square root algorithm Smile But isn't the ending imul unnecessary? Since eax holds the "remainder" (original number minus ebx^2) you just need to check if it's 0. (after changing the last step to update eax, not ebx)
Post 02 Sep 2011, 17:54
View user's profile Send private message Reply with quote
Fanael



Joined: 03 Jul 2009
Posts: 168
Fanael
Oh yeah, so there's the code without that multiplication, hopefully slightly faster (still not optimized much, constant propagation will help by freeing up EDX):
Code:
align 16
IntOnly_v2:
  mov eax, [esp + 4]
  mov ecx, eax
  and ecx, 0xF
  test [ecx + .lut], -1
  jz .false

  mov edx, 0x40000000
  push ebx
  xor ebx, ebx
  rept 15
  {
    lea ecx, [ebx + edx]
    shr ebx, 1
    cmp eax, ecx
    jb @f
    sub eax, ecx
    add ebx, edx
    @@:
    shr edx, 2
  }
  lea ecx, [ebx + edx]
  pop ebx
  cmp eax, ecx
  jb @f
  sub eax, ecx
  setz al  
  movzx eax, al
  ret 4
  @@:

  test eax, eax
  setz al
  movzx eax, al
  ret 4
.false:
  xor eax, eax
  ret 4

.lut db 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0    

EDIT: removed the unnecessary "shr ebx, 1".
Also, there's the v3, basically v2 + constant propagation:
Code:
align 16
IntOnly_v3:
  mov eax, [esp + 4]
  mov ecx, eax
  and ecx, 0xF
  test [ecx + .lut], -1
  jz .false

  xor edx, edx
  rept 15 n:0
  {
    lea ecx, [edx + (1 shl ((n xor 15)*2))]
    shr edx, 1
    cmp eax, ecx
    jb @f
    sub eax, ecx
    add edx, (1 shl ((n xor 15)*2))
    @@:
  }
  lea ecx, [edx + 1]
  cmp eax, ecx
  jb @f
  sub eax, ecx
  setz al
  movzx eax, al
  ret 4
  @@:

  test eax, eax
  setz al
  movzx eax, al
  ret 4
.false:
  xor eax, eax
  ret 4

.lut db 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0    
Post 02 Sep 2011, 18:50
View user's profile Send private message Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4242
Location: 2018
edfed
in 32 bits, there are only 65536 perfect squares. then, only 65536 values to test.
no need to remember them.

just need to evaluate, like the analog to digital conversion.

a 32 bits perfect square is evaluable by the most significant set bit.

this bit is there because it was set by a sqrt(bit order).
bit0²=bit0
bit1²=bit2
bit2²=bit4
bit4²=bit8
etc...


then, if bit 16 is set, means that bit 8 is the root.
then, you have to evaluate the perfect square around our number to test.

bsr is there to find the most significant set bit. in order to know what squared number to test first.




Code:
testperfectsquare:
;eax number to test
;return eax=root if ok, -1 if not ok
bsr ecx,eax
shr ecx,1
xor edx,edx
bts edx,ecx
mov ebx,edx
imul edx,ebx
cmp edx,eax
je .ok
jl .less
.more:
something like that, i am not at home, then, i can't test and code efficientlly.
.notok:
mov ecx,-1
.ok:
mov eax,ecx
ret
    
Post 05 Sep 2011, 10:02
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17715
Location: In your JS exploiting you and your system
revolution
edfed: Are you trying to reinvent the integer square root algorithm? I think you will find that isqrt is what you are implying.
Post 05 Sep 2011, 11:25
View user's profile Send private message Visit poster's website Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4242
Location: 2018
edfed
yep!

i've found that if there is one, or more bits set to 0 between two bits set to 1, the square wil have more bits than root.
otherwise, if there is only one single contiguous 1's bitstream, there will be the same number of bit set in square than in root.

for example:
0111 gives 00110001, 3 bits for 3 bits.

0101 gives 00011001, 3 bits for 2 bits.

1100 gives 10010000, 2 bits for 2 bits

1101 gives 10101001, 4 bits for 3 bits.

i suppose that for any combinaison of bits with only one single contiguous 1s region, there will be exactlly the same number of 1s in the square. otherwise, there will be more 1s in the square than in the root;
Post 05 Sep 2011, 11:42
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17715
Location: In your JS exploiting you and your system
revolution
Code:
isqrt_eax:
       xor     ebx,ebx
     xor     edx,edx
     bsr     edi,eax
     setnz   cl
  sub     cl,1
        shr     edi,cl
      and     edi,-2
      lea     ecx,[edi+2]
 ror     eax,cl
.digit:   shld    ebx,eax,2
   lea     ecx,[edx*4+1]
       cmp     ebx,ecx
     cmc
 sbb     esi,esi
     adc     edx,edx
     and     ecx,esi
     sub     ebx,ecx
     shl     eax,2
       sub     edi,2
       jns     .digit
      mov     eax,edx
     ret

isqrt_eax_p2: ;pentium II with cmovcc
        xor     ebx,ebx
     xor     edx,edx
     bsr     edi,eax
     cmovz   edi,eax
     and     edi,-2
      lea     ecx,[edi+2]
 ror     eax,cl
.digit:   shld    ebx,eax,2
   lea     ecx,[edx*4+1]
       mov     esi,ebx
     sub     ebx,ecx
     cmovb   ebx,esi
     cmc
 adc     edx,edx
     shl     eax,2
       sub     edi,2
       jns     .digit
      mov     eax,edx
     ret

isqrt_eax_sse:
       cvtsi2ss xmm0,eax
   sqrtss  xmm0,xmm0
   cvttss2si eax,xmm0
  ret

isqrt_eax_sse2:
      cvtsi2sd xmm0,eax
   sqrtsd  xmm0,xmm0
   cvttsd2si eax,xmm0
  ret    
I have no idea which is faster.
Post 05 Sep 2011, 12:02
View user's profile Send private message Visit poster's website Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4242
Location: 2018
edfed
the pattern of the first squares number is interresting.

first bit (bit 0) is equal to bit0 of the root, the pattern is 01
bit1 is always equal to 0
bit2 have the pattern 0010, and is equal to 1 when root and 3=2
bit3 have the pattern 00010100
etc...

square numbers are interresting.
Image


Last edited by edfed on 18 Sep 2011, 20:52; edited 1 time in total
Post 05 Sep 2011, 14:51
View user's profile Send private message Visit poster's website Reply with quote
Enko



Joined: 03 Apr 2007
Posts: 678
Location: Mar del Plata
Enko
Perhaps its stupid, but why not to generate a 4gb table? (for 32bit integers)
Quote:

3:false
4:true
5:false
6:false
7:false
8:false
9:true
.......
16585484:false
16585484:false
etc...
Post 05 Sep 2011, 14:56
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2, 3, 4  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.