flat assembler
Message board for the users of flat assembler.

 Index > Main > quickest way to check if a number is a square Goto page Previous  1, 2, 3, 4  Next
Author
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 19097
revolution 31 Aug 2011, 21:04
magicSqr wrote:
ya must've missed my

N = u²

post then
You didn't to show how sqrt(2) cannot be part of a perfect square. The proof of irrationality is a very important step. Without that it is not a proof at all.
31 Aug 2011, 21:04
edfed

Joined: 20 Feb 2006
Posts: 4283
Location: Now
edfed 01 Sep 2011, 00:51
N=u²
4N=4u²
???
sqrt(4)=2
01 Sep 2011, 00:51
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 19097
revolution 01 Sep 2011, 00:53
edfed wrote:
N=u²
4N=4u²
???
sqrt(4)=2
Yes. For every number that is a perfect square, four times that number will also be a perfect square. Same with nine times, sixteen times, twenty-five times, etc.
01 Sep 2011, 00:53
LocoDelAssembly

Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 01 Sep 2011, 04:31
revolution wrote:

The proof of irrationality is a very important step.

Do you really need to go that far? Lets assume sqrt(2) = 1.4, how could it be part of a perfect square?
01 Sep 2011, 04:31
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 19097
revolution 01 Sep 2011, 05:39
LocoDelAssembly wrote:
Lets assume sqrt(2) = 1.4, how could it be part of a perfect square?
Let's say sqrt(2) = 1.4 = 7/5

Let m = 5
m * m = 25 (a perfect square)
2 * m * m = 50
sqrt(2 * m * m)=sqrt(2) * m
sqrt(2) * m = 7/5 * 5 = 7
LocoDelAssembly wrote:
Do you really need to go that far?
Yes.

In general, if sqrt(2) = p/q then sqrt(2*q*q)=p

If p and q are integers then there would be infinitely many perfect squares of the form 2*m*m. But if there is no integer set of {p,q} that satisfies sqrt(2)=p/q (i.e. it is irrational) then there will be no perfect squares of the form 2*m*m.
01 Sep 2011, 05:39

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia

_________________
My updated idol http://www.agner.org/optimize/
01 Sep 2011, 11:57
LocoDelAssembly

Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 01 Sep 2011, 15:09
Quote:

sqrt(2) * m = 7/5 * 5 = 7

This is so simple that even being very sleepy last night is no excuse... I'm so ready for retirement...
01 Sep 2011, 15:09
r22

Joined: 27 Dec 2004
Posts: 805
r22 01 Sep 2011, 16:21
Here's a quick benchmark (note: I did not test for accuracy of the functions).
Results:
Quote:

1453ms SSE v1
1078ms SSE v2 doing nonFPU IMUL
1250ms FPU v1
1250ms FPU v2
859ms FPU v3 doing nonFPU IMUL
797ms Wikipedia + FPU v3 if needed

The results show that the Wikipedia method + FPU v3 when needed is the fastest for single call non-parallel implementations.

In a parallel implementation (where you would compute on multiple input values at once) the SSE version would be much easier to extend. Perhaps the Wikipedia version could be adapted to SSE to allow for parallel execution.

Interesting stuff, here's the code.

Code:
```FORMAT PE CONSOLE
ENTRY start

include 'win32ax.inc'

start:
COUNTER = 7FFFFFFh
;;;; IF YOU DO NOT HAVE SSE4 COMMENT OUT THE FIRST STDCALL BENCHMARK
stdcall   Benchmark, COUNTER, IsSquareSSE
cinvoke   printf, <'%dms SSE v1',13,10,0>, eax
;;;;
stdcall   Benchmark, COUNTER, IsSquareSSE2
cinvoke   printf, <'%dms SSE v2 doing nonFPU IMUL',13,10,0>, eax
;;;;
stdcall   Benchmark, COUNTER, IsSquareFPU
cinvoke   printf, <'%dms FPU v1',13,10,0>, eax
;;;;
stdcall   Benchmark, COUNTER, IsSquareFPU2
cinvoke   printf, <'%dms FPU v2',13,10,0>, eax
;;;;
stdcall   Benchmark, COUNTER, IsSquareFPU3
cinvoke   printf, <'%dms FPU v3 doing nonFPU IMUL',13,10,0>, eax
;;;;
stdcall   Benchmark, COUNTER, IsSquareWIKI
cinvoke   printf, <'%dms Wikipedia + FPU v3 if needed',13,10,0>, eax
;;;;
cinvoke   system, <'pause',0>
invoke    ExitProcess, 0

align 16
Benchmark:
PUSH      ebp
MOV       ebp, esp
PUSH      ebx
PUSH      esi
PUSH      edi

MOV       ebx, [ebp+8]
MOV       esi, [ebp+12]
CALL      [GetTickCount]
MOV       edi, eax
JMP       .loop
align 16
.loop:
PUSH      ebx
CALL      esi
DEC       ebx
JNZ       .loop
CALL      [GetTickCount]
SUB       eax, edi

POP       edi
POP       esi
POP       ebx
MOV       esp, ebp
POP       ebp
RET       8

align 16
IsSquareSSE:
MOV       ecx, [esp+4]
XOR       eax, eax
CVTSI2SD  xmm0, ecx
SQRTSD    xmm0, xmm0
ROUNDSD   xmm0, xmm0, 11b
MULSD     xmm0, xmm0
CVTTSD2SI edx, xmm0
CMP       edx, ecx
SETE      al
RET       4

align 16
IsSquareSSE2:
MOV       ecx, [esp+4]
XOR       eax, eax
CVTSI2SD  xmm0, ecx
SQRTSD    xmm0, xmm0
CVTTSD2SI edx, xmm0
IMUL      edx, edx
CMP       edx, ecx
SETE      al
RET       4

align 16
IsSquareFPU:
fild      dword[esp+4]  ; st0 = k²
fsqrt           ; st0 = k
frndint         ; st0 = round(k)
fmul      st0, st0    ; st0 = round(k)²
ficomp    dword [esp+4]; C3 := round(k)² = k²
fstsw     ax
sahf
MOV       eax, 0
SETZ      al
RET       4

align 16
IsSquareFPU2:
fild      dword[esp+4]  ; st0 = k²
fsqrt           ; st0 = k
frndint         ; st0 = round(k)
MOV       edx, [esp+4]
fmul      st0, st0    ; st0 = round(k)²
FISTP     dword[esp-4]; C3 := round(k)² = k²
XOR       eax, eax
CMP       edx, [esp-4]
SETE      al
RET       4

align 16
IsSquareFPU3:
fild      dword[esp+4]  ; st0 = k²
fsqrt           ; st0 = k
FISTP     dword[esp-4]; C3 := round(k)² = k²
XOR       eax, eax
MOV       edx, [esp-4]
IMUL      edx, edx
CMP       edx, [esp+4]
SETE      al
RET       4

align 16
IsSquareWIKI:
MOV     eax, [esp+4]
and     eax,0ffh
mov     cl,al
and     al,0fh
test    al,al
jz      @Is_Zero
cmp     al,1
jz      IsSquareFPU3
cmp     al,4
jz      @Is_Four
cmp     al,9
jz      IsSquareFPU3

XOR     eax, eax
RET     4

align 16
@Is_Zero:
mov     al,cl
shr     al,4
test    al,al
jz      IsSquareFPU3
cmp     al,1
jz      IsSquareFPU3
cmp     al,4
jz      IsSquareFPU3
cmp     al,9
jz      IsSquareFPU3

XOR     eax, eax
RET     4

align 16
@Is_Four:

mov     al,cl
shr     al,4
test    al,1
jz      IsSquareFPU3

XOR     eax, eax
RET     4

_pause  db      'pause',0

section '.idata' import data readable writeable

library kernel32,'KERNEL32.DLL',\
msvcrt,  'MSVCRT.DLL'

include 'api\Kernel32.inc'
import  msvcrt,\
printf, 'printf',\
system, 'system'

```
01 Sep 2011, 16:21

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Code:
```; On my T9300 (mobile) Core 2
3666ms SSE v1
2730ms SSE v2 doing nonFPU IMUL
3183ms FPU v1
3182ms FPU v2
2184ms FPU v3 doing nonFPU IMUL
2013ms Wikipedia + FPU v3 if needed
```

When you, r22, first posted about SSE approach, I started looking at the timings of FSQRT, FISTP, SQRT**, VSQRT**, RSQRT**, VRSQRT, ROUND**, VROUND**.

The (rather pessimistic) conclusion is: having a Sandy Bridge will help (clock counts wise) getting better benchmark results, but what is the percentage of folks having these CPUs lying around for perfect square calculations.

I don't know if 80-bit FPU precision is better than 64-bit SSE precision, but if vectorized, the theoretical speed will double (SSE or AVX?).
AVX registers have twice the size, but using the same amount of SQRT units apparently. Maybe it has got something to do with execution units not upgraded to full 256 bits.

Enough of rambling... I will try to read and understand about http://mathcentral.uregina.ca/RR/database/RR.09.95/grzesina1.html

EDIT: Scratch that! The geometrical square root didn't help a lot.
This investigation, however, is interesting: http://www.azillionmonkeys.com/qed/sqroot.html
There's some conclusion in the middle of the page saying FSQRT is the fastest you can get I agree when precision is important.

Last edited by Madis731 on 01 Sep 2011, 20:56; edited 1 time in total
01 Sep 2011, 16:49
Goplat

Joined: 15 Sep 2006
Posts: 181
Goplat 01 Sep 2011, 18:04
Here's a way that might be faster than rounding:
Code:
```        fild    dword [num]
fsqrt
fstp    qword [temp]
mov     eax, [temp]
; num is square if and only if eax=0    ```
Explanation: if num is square, temp will be an integer between 0 and 46340, so as a double-precision float, its low 32 bits will be zero. If num is not square, temp will be irrational - its low 32 bits could be zero by "chance", but luckily this does not happen for any num from 0 to 0x7FFFFFFF.
01 Sep 2011, 18:04
r22

Joined: 27 Dec 2004
Posts: 805
r22 02 Sep 2011, 12:15
@Goplat nice hack, it's slightly faster then the FPU + IMUL.
Quote:

1438ms SSE v1
1078ms SSE v2 doing nonFPU IMUL
1266ms FPU v1
1250ms FPU v2
859ms FPU v3 doing nonFPU IMUL
844ms FPU v4 Goplat
765ms Wikipedia + FPU v3 if needed

Code:
```align 16
IsSquareFPU4:
fild      dword[esp+4]  ; st0 = k²
fsqrt           ; st0 = k
XOR       eax, eax
FSTP      qword[esp-8];
TEST      dword[esp-8], -1
SETZ      al
RET       4
```
02 Sep 2011, 12:15
Fanael

Joined: 03 Jul 2009
Posts: 168
Fanael 02 Sep 2011, 17:11
Something that doesn't perform any floating point operations:
Code:
```align 16
IntOnly_v1:
mov eax, [esp + 4]
mov ecx, eax
and ecx, 0xF
test [ecx + .lut], -1
jz .false

mov edx, 0x40000000
push ebx
xor ebx, ebx
rept 15
{
lea ecx, [ebx + edx]
shr ebx, 1
cmp eax, ecx
jb @f
sub eax, ecx
@@:
shr edx, 2
}
lea ecx, [ebx + edx]
shr ebx, 1
cmp eax, ecx
jb @f
@@:

imul ebx, ebx
cmp ebx, [esp + 4 + 4]
sete al
movzx eax, al

pop ebx
ret 4
.false:
xor eax, eax
ret 4

.lut db 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0    ```
Any questions?
02 Sep 2011, 17:11
Goplat

Joined: 15 Sep 2006
Posts: 181
Goplat 02 Sep 2011, 17:54
Fanael: Ah, the good old integer square root algorithm But isn't the ending imul unnecessary? Since eax holds the "remainder" (original number minus ebx^2) you just need to check if it's 0. (after changing the last step to update eax, not ebx)
02 Sep 2011, 17:54
Fanael

Joined: 03 Jul 2009
Posts: 168
Fanael 02 Sep 2011, 18:50
Oh yeah, so there's the code without that multiplication, hopefully slightly faster (still not optimized much, constant propagation will help by freeing up EDX):
Code:
```align 16
IntOnly_v2:
mov eax, [esp + 4]
mov ecx, eax
and ecx, 0xF
test [ecx + .lut], -1
jz .false

mov edx, 0x40000000
push ebx
xor ebx, ebx
rept 15
{
lea ecx, [ebx + edx]
shr ebx, 1
cmp eax, ecx
jb @f
sub eax, ecx
@@:
shr edx, 2
}
lea ecx, [ebx + edx]
pop ebx
cmp eax, ecx
jb @f
sub eax, ecx
setz al
movzx eax, al
ret 4
@@:

test eax, eax
setz al
movzx eax, al
ret 4
.false:
xor eax, eax
ret 4

.lut db 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0    ```

EDIT: removed the unnecessary "shr ebx, 1".
Also, there's the v3, basically v2 + constant propagation:
Code:
```align 16
IntOnly_v3:
mov eax, [esp + 4]
mov ecx, eax
and ecx, 0xF
test [ecx + .lut], -1
jz .false

xor edx, edx
rept 15 n:0
{
lea ecx, [edx + (1 shl ((n xor 15)*2))]
shr edx, 1
cmp eax, ecx
jb @f
sub eax, ecx
add edx, (1 shl ((n xor 15)*2))
@@:
}
lea ecx, [edx + 1]
cmp eax, ecx
jb @f
sub eax, ecx
setz al
movzx eax, al
ret 4
@@:

test eax, eax
setz al
movzx eax, al
ret 4
.false:
xor eax, eax
ret 4

.lut db 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0    ```
02 Sep 2011, 18:50
edfed

Joined: 20 Feb 2006
Posts: 4283
Location: Now
edfed 05 Sep 2011, 10:02
in 32 bits, there are only 65536 perfect squares. then, only 65536 values to test.
no need to remember them.

just need to evaluate, like the analog to digital conversion.

a 32 bits perfect square is evaluable by the most significant set bit.

this bit is there because it was set by a sqrt(bit order).
bit0²=bit0
bit1²=bit2
bit2²=bit4
bit4²=bit8
etc...

then, if bit 16 is set, means that bit 8 is the root.
then, you have to evaluate the perfect square around our number to test.

bsr is there to find the most significant set bit. in order to know what squared number to test first.

Code:
```testperfectsquare:
;eax number to test
;return eax=root if ok, -1 if not ok
bsr ecx,eax
shr ecx,1
xor edx,edx
bts edx,ecx
mov ebx,edx
imul edx,ebx
cmp edx,eax
je .ok
jl .less
.more:
something like that, i am not at home, then, i can't test and code efficientlly.
.notok:
mov ecx,-1
.ok:
mov eax,ecx
ret
```
05 Sep 2011, 10:02
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 19097
revolution 05 Sep 2011, 11:25
edfed: Are you trying to reinvent the integer square root algorithm? I think you will find that isqrt is what you are implying.
05 Sep 2011, 11:25
edfed

Joined: 20 Feb 2006
Posts: 4283
Location: Now
edfed 05 Sep 2011, 11:42
yep!

i've found that if there is one, or more bits set to 0 between two bits set to 1, the square wil have more bits than root.
otherwise, if there is only one single contiguous 1's bitstream, there will be the same number of bit set in square than in root.

for example:
0111 gives 00110001, 3 bits for 3 bits.

0101 gives 00011001, 3 bits for 2 bits.

1100 gives 10010000, 2 bits for 2 bits

1101 gives 10101001, 4 bits for 3 bits.

i suppose that for any combinaison of bits with only one single contiguous 1s region, there will be exactlly the same number of 1s in the square. otherwise, there will be more 1s in the square than in the root;
05 Sep 2011, 11:42
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 19097
revolution 05 Sep 2011, 12:02
Code:
```isqrt_eax:
xor     ebx,ebx
xor     edx,edx
bsr     edi,eax
setnz   cl
sub     cl,1
shr     edi,cl
and     edi,-2
lea     ecx,[edi+2]
ror     eax,cl
.digit:   shld    ebx,eax,2
lea     ecx,[edx*4+1]
cmp     ebx,ecx
cmc
sbb     esi,esi
and     ecx,esi
sub     ebx,ecx
shl     eax,2
sub     edi,2
jns     .digit
mov     eax,edx
ret

isqrt_eax_p2: ;pentium II with cmovcc
xor     ebx,ebx
xor     edx,edx
bsr     edi,eax
cmovz   edi,eax
and     edi,-2
lea     ecx,[edi+2]
ror     eax,cl
.digit:   shld    ebx,eax,2
lea     ecx,[edx*4+1]
mov     esi,ebx
sub     ebx,ecx
cmovb   ebx,esi
cmc
shl     eax,2
sub     edi,2
jns     .digit
mov     eax,edx
ret

isqrt_eax_sse:
cvtsi2ss xmm0,eax
sqrtss  xmm0,xmm0
cvttss2si eax,xmm0
ret

isqrt_eax_sse2:
cvtsi2sd xmm0,eax
sqrtsd  xmm0,xmm0
cvttsd2si eax,xmm0
ret    ```
I have no idea which is faster.
05 Sep 2011, 12:02
edfed

Joined: 20 Feb 2006
Posts: 4283
Location: Now
edfed 05 Sep 2011, 14:51
the pattern of the first squares number is interresting.

first bit (bit 0) is equal to bit0 of the root, the pattern is 01
bit1 is always equal to 0
bit2 have the pattern 0010, and is equal to 1 when root and 3=2
bit3 have the pattern 00010100
etc...

square numbers are interresting.

Last edited by edfed on 18 Sep 2011, 20:52; edited 1 time in total
05 Sep 2011, 14:51
Enko

Joined: 03 Apr 2007
Posts: 676
Location: Mar del Plata
Enko 05 Sep 2011, 14:56
Perhaps its stupid, but why not to generate a 4gb table? (for 32bit integers)
Quote:

3:false
4:true
5:false
6:false
7:false
8:false
9:true
.......
16585484:false
16585484:false
etc...
05 Sep 2011, 14:56
 Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First

 Jump to: Select a forum Official----------------AssemblyPeripheria General----------------MainTutorials and ExamplesDOSWindowsLinuxUnixMenuetOS Specific----------------MacroinstructionsOS ConstructionIDE DevelopmentProjects and IdeasNon-x86 architecturesHigh Level LanguagesProgramming Language DesignCompiler Internals Other----------------FeedbackHeapTest Area
Goto page Previous  1, 2, 3, 4  Next

Forum Rules:
 You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot vote in polls in this forumYou cannot attach files in this forumYou can download files in this forum