flat assembler
Message board for the users of flat assembler.

Index > Main > Calculate if a number is equal to 1 mod 24 without using DIV

Goto page Previous  1, 2
Author
Thread Post new topic Reply to topic
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17716
Location: In your JS exploiting you and your system
revolution
edfed wrote:
with a 4G dwrods lut, you can obtain a relativelly fast division.
Maybe not. The cost of reading values from DRAM could surpass the time taken to compute a DIV. Each system will vary in its capabilities so you would need to test it and see which works best.
Post 27 May 2014, 15:21
View user's profile Send private message Visit poster's website Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
I think edfed is on to something.

x MOD 24 == (x MOD 8 ) + ((x MOD 3) * 8 )
But we're stuck with MOD 3, which everyone hates.

If we made a LUT of bytes (0, 8, 16, ...) of length 2^29 that would be 500MB which is still too damn big.

So we need a smaller LUT, we could use POPCNT to help

Code:
LUT db 2,0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0,1
MAGIC dd 0AAAAAAAAh
...
Mod3:
MOV eax, [esp+4]
XOR eax, [MAGIC]
POPCNT ecx, eax
MOVZX al, [LUT + ecx]
RET 0
    


That's MOD 3.

The challenge now is figuring out a minimal MASK + POPCNT + LUT for MOD 24. FYI: Paolo Bonzin was a smart guy.

Code:
LUT db 16,0,8,16,0,8,16,0,8,16,0,8,16,0,8,16,0,8,16,0,8,16,0,8,16,0,8,16,0,8,16,0,8
MAGIC dd 0AAAAAAAAh
...
Mod24:
MOV eax, [esp+4] ;; = n
MOV edx, eax ;; = n
XOR eax, [MAGIC] ;; 
AND edx, 7 ;; = MOD 8
POPCNT ecx, eax ;; = Pop Count eax
MOVZX al, [LUT + ecx] ;; = 8 * (n MOD 3)
ADD eax, edx ;;  = 8 * (n MOD 3)  + (n MOD Cool
RET 0
    


Too many instructions JUST DIVIDE BY 24
Post 28 May 2014, 03:05
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
baldr



Joined: 19 Mar 2008
Posts: 1651
baldr
magicSqr,

Modular arithmetic can be used for this. Given that 0xAAAAAAAB is a multiplicative inverse of 3, i.e. 3·0xAAAAAAAB ≡ 1 (mod 2³²), and it's big enough, so x = 3n+1 means x·0xAAAAAAAB ≥ 0xAAAAAAAB (mod 2³²).

Rough measurements show that even dumb implementation runs around 60% faster than div (SU2300 @1.2GHz). Your mileage may vary.
Post 29 May 2014, 23:28
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17716
Location: In your JS exploiting you and your system
revolution
baldr wrote:
Rough measurements show that even dumb implementation runs around 60% faster than div (SU2300 @1.2GHz). Your mileage may vary.
Where is that figure from? Is it a real program or some synthetic test? Synthetic tests are useless here because of all the other overriding factors like cache usage and memory accesses.


Last edited by revolution on 30 May 2014, 11:05; edited 1 time in total
Post 29 May 2014, 23:44
View user's profile Send private message Visit poster's website Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4242
Location: 2018
edfed
what is needed is mod 24, then the factor isn't any stated above.

i am pretty sure there is an easy, fast and efficient way to compute that.... but how?
Post 30 May 2014, 09:29
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17716
Location: In your JS exploiting you and your system
revolution
edfed wrote:
what is needed is mod 24, then the factor isn't any stated above.
You are correct, but when using reciprocal multiplication you can't get the remainder without first getting the quotient.
edfed wrote:
i am pretty sure there is an easy, fast and efficient way to compute that.... but how?
Easy: DIV.
Post 30 May 2014, 09:33
View user's profile Send private message Visit poster's website Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4242
Location: 2018
edfed
Smile
Post 30 May 2014, 11:02
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17716
Location: In your JS exploiting you and your system
revolution
I had to edit my post above because I was misunderstanding what baldr was saying. I think I now understand what bald was suggesting. I hope I got this right:
Code:
mov rax,0xaaaaaaaaaaaaaaab ;ceiling(2^65/3)
mul [value]
add rax,rdx
cmp rax,[value]
jz .is_a_multiple_of_3
;not a multiple of 3
ud2
.is_a_multiple_of_3:
ud2    
And you can just subtract 1 before calling to check if it is 1 mod 3.

Can that be generalised to all values?
Post 30 May 2014, 11:21
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17716
Location: In your JS exploiting you and your system
revolution
Hmm, perhaps this is what is meant?
Code:
mov rax,[value]
mov rcx,0xaaaaaaaaaaaaaaab ;ceiling(2^65/3)
imul rax,rcx
cmp rax,rcx
jae .is_1_mod_3
;...    
Post 30 May 2014, 11:41
View user's profile Send private message Visit poster's website Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
Seems like one of those things that only works with prime numbers or powers of 2 plus/minus 1. Have you tried with 2^65/24?
Post 30 May 2014, 13:19
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
baldr



Joined: 19 Mar 2008
Posts: 1651
baldr
revolution wrote:
Code:
mov rcx,0xaaaaaaaaaaaaaaab ;ceiling(2^65/3)    
Comment is somewhat misleading, magic numbers are different beasts from multiplicative inverses (though they can be the same sometimes). These things are definitely tricky to get right. Wink

magicSqr explicitly stated that neither quotient, nor remainder needed, only the test result is. Thus the least significant 32 bits of x·0xAAAAAAAB are compared to the threshold value in order to make a decision about divisibility by 3. Additionally that product is tested to have 011 in the least significant bits (since we're looking for 24n+1).
revolution wrote:
Where is that figure from? Is it a real program or some synthetic test?
How it's supposed to be real? Wink
I've used this (highly artificial) program to roughly compare different methods:
Code:
        format  PE console
        include "Win32A.Inc"

        section ".text" executable

        cinvoke cputs, _testing
        xor     ebx, ebx
testing:
        test    bx, bx          ; check for keypress sometimes
        jnz     .proceed
        cinvoke kbhit
        test    eax, eax
        jz      .proceed
 .flush:
        cinvoke getch
        cinvoke kbhit
        test    eax, eax
        jnz     .flush
        cinvoke cputs, _skipped
        jmp     benching

 .proceed:
        mov     eax, ebx
        call    test_div
        mov     edi, eax
        mov     eax, ebx
        call    test_m04
        cmp     eax, edi
        je      .next
        cinvoke cprintf, _failed, ebx, edi, eax
        invoke  ExitProcess, 1

 .next: add     ebx, 1
        jnz     testing
        cinvoke cputs, _ok

benching:
        cinvoke cputs, _benching

irps variant, div m01 m02 m03 m04 {
        xor     ebx, ebx
        invoke  GetTickCount
        mov     [_start], eax
@@:     mov     eax, ebx
        call    test_#variant
        add     ebx, 1
        jnz     @b
        invoke  GetTickCount
        sub     eax, [_start]
        cinvoke cprintf, _fmt, _#variant, eax
}
        invoke  ExitProcess, 0

        align   16
test_div:
        mov     ecx, 24
        xor     edx, edx
        div     ecx
        xor     eax, eax
        cmp     edx, 1
        sete    al              ; 'cmove eax, edx' works too, but depends on prev instruction
        ret

INVERTED_3 = 0xAAAAAAAB; multiplicative inverse of 3 (mod 2**32)

        align   16
test_m01:
        imul    eax, INVERTED_3
        cmp     eax, INVERTED_3 ; luckily it's test value too
        jb      .didn't_pass
        and     eax, 7
        cmp     eax, 3
        jne     .didn't_pass
        mov     eax, 1
        ret
.didn't_pass:
        mov     eax, 0
        ret

        align   16
test_m02:
        mov     ecx, eax
        and     ecx, 7
        cmp     ecx, 1
        jne     .didn't_pass
        imul    eax, INVERTED_3
        cmp     eax, INVERTED_3
        jb      .didn't_pass
        mov     eax, 1
        ret
.didn't_pass:
        mov     eax, 0
        ret

        align   16
test_m03:
        mov     ecx, INVERTED_3
        mul     ecx
        cmp     eax, ecx
        jb      .didn't_pass
        and     eax, 7
        cmp     eax, 3
        jne     .didn't_pass
        mov     eax, 1
        ret
.didn't_pass:
        mov     eax, 0
        ret

        align   16
test_m04:
        mov     ecx, INVERTED_3
        imul    eax, ecx
        sub     eax, ecx
        sbb     ecx, ecx        ; 0 if 3n+1, -1 otherwise
        and     eax, 7          ; 0 if 8m+1
        or      eax, ecx        ; 0 if 24k+1
        neg     eax
        sbb     eax, eax
        inc     eax
        ret

        section ".data" readable writable
_start rd 1
_testing db     "Testing for consistency...", 0
_ok     db      " OK", 0
_skipped db     " skipped", 0
_failed db      " failed @%#x: div: %d, mul: %d.", 13, 10, 0
_benching db    ".", 13, 10, "Benchmarking...", 13, 10, 0

irps variant, div m01 m02 m03 m04 { _#variant db `variant, 0 }
_fmt    db      "%s(): %d ticks.", 13, 10, 0

        section ".idata" readable import
        library kernel32, "Kernel32.DLL",\
                msvcrt, "MSVCRT.DLL"

        include "API/Kernel32.Inc"
        import  msvcrt,\
                cprintf, "_cprintf",\
                cputs, "_cputs",\
                getch, "_getch",\
                kbhit, "_kbhit"    
The results were similar to the following:
Code:
>"mul vs. div.exe"
Testing for consistency... skipped.
Benchmarking...
div(): 48516 ticks.
m01(): 31933 ticks.
m02(): 28954 ticks.
m03(): 28517 ticks.
m04(): 25693 ticks.    
My remark about 60% speedup was semi-correct: div routine spended ~50k ticks, m01 got ~31k, so div was ~61% slower than mul (50/31-1 ≈ 0.61), but mul was only 38% faster than div (1-31/50 ≈ 0.38).
Post 30 May 2014, 13:48
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17716
Location: In your JS exploiting you and your system
revolution
r22 wrote:
Seems like one of those things that only works with prime numbers or powers of 2 plus/minus 1. Have you tried with 2^65/24?
Wouldn't work like that. You just need to test for the factors three and eight separately. Dividing by eight is easy: shr reg,3
Post 30 May 2014, 13:57
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17716
Location: In your JS exploiting you and your system
revolution
baldr wrote:
<Some code and analysis>
Thanks the the clarification.

Regarding the speed test: I'm sure you are aware that 32-bit test code timing is not going to be applicable to 64-bit run code.
Post 30 May 2014, 13:59
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17716
Location: In your JS exploiting you and your system
revolution
I converted the code to 64-bit and ran on an AMD fusion under W7-64.
Code:
        format  PE64 console
        include "Win64A.Inc"

        section ".text" executable

entry $
        and     rsp,-16
        cinvoke cputs, _testing
        xor     ebx, ebx
testing:
        test    bx, bx          ; check for keypress sometimes
        jnz     .proceed
        cinvoke kbhit
        test    eax, eax
        jz      .proceed
 .flush:
        cinvoke getch
        cinvoke kbhit
        test    eax, eax
        jnz     .flush
        cinvoke cputs, _skipped
        jmp     benching

 .proceed:
        mov     eax, ebx
        call    test_div
        mov     edi, eax
        mov     eax, ebx
        call    test_m04
        cmp     eax, edi
        je      .next
        cinvoke cprintf, _failed, ebx, edi, eax
        invoke  ExitProcess, 1

 .next: add     ebx, 1
        jnz     testing
        cinvoke cputs, _ok

benching:
        cinvoke cputs, _benching

irps variant, div m01 m02 m03 m04 {
        xor     ebx, ebx
        invoke  GetTickCount
        mov     [_start], rax
@@:     mov     eax, ebx
        call    test_#variant
        add     ebx, 1
        test    ebx,1 shl 28 -1
        jnz     @b
        invoke  GetTickCount
        sub     rax, [_start]
        cinvoke cprintf, _fmt, _#variant, rax
}
        invoke  ExitProcess, 0

        align   16
test_div:
        mov     ecx, 24
        xor     edx, edx
        div     rcx
        xor     eax, eax
        cmp     edx, 1
        sete    al              ; 'cmove eax, edx' works too, but depends on prev instruction
        ret

INVERTED_3 = 0xAAAAAAAAAAAAAAAB; multiplicative inverse of 3 (mod 2**64)

        align   16
test_m01:
        mov     rcx, INVERTED_3
        imul    rax, rcx
        cmp     rax, rcx
        jb      .didn't_pass
        and     eax, 7
        cmp     eax, 3
        jne     .didn't_pass
        mov     eax, 1
        ret
.didn't_pass:
        mov     eax, 0
        ret

        align   16
test_m02:
        mov     rcx, rax
        and     ecx, 7
        cmp     ecx, 1
        jne     .didn't_pass
        mov     rcx, INVERTED_3
        imul    rax, rcx
        cmp     rax, rcx
        jb      .didn't_pass
        mov     eax, 1
        ret
.didn't_pass:
        mov     eax, 0
        ret

        align   16
test_m03:
        mov     rcx, INVERTED_3
        mul     rcx
        cmp     rax, rcx
        jb      .didn't_pass
        and     eax, 7
        cmp     eax, 3
        jne     .didn't_pass
        mov     eax, 1
        ret
.didn't_pass:
        mov     eax, 0
        ret

        align   16
test_m04:
        mov     rcx, INVERTED_3
        imul    rax, rcx
        sub     rax, rcx
        sbb     ecx, ecx        ; 0 if 3n+1, -1 otherwise
        and     eax, 7          ; 0 if 8m+1
        or      eax, ecx        ; 0 if 24k+1
        neg     eax
        sbb     eax, eax
        inc     eax
        ret

        section ".data" readable writable
_start rq 1
_testing db     "Testing for consistency...", 0
_ok     db      " OK", 0
_skipped db     " skipped", 0
_failed db      " failed @%#x: div: %d, mul: %d.", 13, 10, 0
_benching db    ".", 13, 10, "Benchmarking...", 13, 10, 0

irps variant, div m01 m02 m03 m04 { _#variant db `variant, 0 }
_fmt    db      "%s(): %I64u ticks.", 13, 10, 0

        section ".idata" readable import
        library kernel32, "Kernel32.DLL",\
                msvcrt, "MSVCRT.DLL"

        include "API/Kernel32.Inc"
        import  msvcrt,\
                cprintf, "_cprintf",\
                cputs, "_cputs",\
                getch, "_getch",\
                kbhit, "_kbhit"    
And the result:
Code:
Benchmarking...
div(): 15116 ticks.
m01(): 2122 ticks.
m02(): 1809 ticks.
m03(): 2091 ticks.
m04(): 1575 ticks.    
I am shocked that this label is valid ".didn't_pass" Shocked
Post 30 May 2014, 14:14
View user's profile Send private message Visit poster's website Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
Facinating! I had convinced myself that having to combine/composite the mod3 and the AND7 (mod8) results wouldn't be able to compete with a DIV. These synthetic results are preconception shattering; now I'm curious if an optimizing compiler like GCC would come up with similar results.
Post 30 May 2014, 14:31
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17716
Location: In your JS exploiting you and your system
revolution
r22 wrote:
I'm curious if an optimizing compiler like GCC would come up with similar results.
The use case is tiny. I would hope that HLL compiler writers are spending their time on more important things like vectorisation, prefetching and cache awareness.
Post 30 May 2014, 14:41
View user's profile Send private message Visit poster's website Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
Code:
VARIABLE % CONSTANT_A == CONSTANT_B ? true : false;
    

Seems broad enough that the magic black box (and the wizards that summon it from the ether) would want to optimize it appropriately. I'm just curious if the code path it chooses would look like any of the ones baldr presented.
Post 30 May 2014, 15:45
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
Here's what GCC 4.6.3 comes up with for
Code:
int main() {
    unsigned hi, lo;
    unsigned long long num;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    num = ((unsigned long long)lo)|( ((unsigned long long)hi)<<32);
    if((num % 24) == 1) {
        return 1;
    } else {
        return 0;
    }
}
    

Code:
gcc -m64 -Ofast -S -masm=intel main.c
    

Code:
rdtsc
        mov     rcx, rdx
        mov     eax, eax
        movabs  rdx, -6148914691236517205
        sal     rcx, 32
        or      rcx, rax
        mov     rax, rcx
        mul     rdx
        shr     rdx, 4
        lea     rax, [rdx+rdx*2]
        sal     rax, 3
        sub     rcx, rax
        xor     eax, eax
        cmp     rcx, 1
        sete    al
        ret
    

Didn't expect the SAL.
Post 06 Jun 2014, 01:57
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17716
Location: In your JS exploiting you and your system
revolution
r22 wrote:
Here's what GCC 4.6.3 comes up with for ...
Code:
;...
        mov     eax, eax    
Clever.
r22 wrote:
Code:
        movabs  rdx, -6148914691236517205    
0xAAA...AAB. But why "movabs"?

So just your basic recip-mul there.
Post 06 Jun 2014, 03:29
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.