flat assembler
Message board for the users of flat assembler.
Index
> Main > Calculate if a number is equal to 1 mod 24 without using DIV Goto page Previous 1, 2 |
Author |
|
r22 28 May 2014, 03:05
I think edfed is on to something.
x MOD 24 == (x MOD 8 ) + ((x MOD 3) * 8 ) But we're stuck with MOD 3, which everyone hates. If we made a LUT of bytes (0, 8, 16, ...) of length 2^29 that would be 500MB which is still too damn big. So we need a smaller LUT, we could use POPCNT to help Code: LUT db 2,0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0,1 MAGIC dd 0AAAAAAAAh ... Mod3: MOV eax, [esp+4] XOR eax, [MAGIC] POPCNT ecx, eax MOVZX al, [LUT + ecx] RET 0 That's MOD 3. The challenge now is figuring out a minimal MASK + POPCNT + LUT for MOD 24. FYI: Paolo Bonzin was a smart guy. Code: LUT db 16,0,8,16,0,8,16,0,8,16,0,8,16,0,8,16,0,8,16,0,8,16,0,8,16,0,8,16,0,8,16,0,8 MAGIC dd 0AAAAAAAAh ... Mod24: MOV eax, [esp+4] ;; = n MOV edx, eax ;; = n XOR eax, [MAGIC] ;; AND edx, 7 ;; = MOD 8 POPCNT ecx, eax ;; = Pop Count eax MOVZX al, [LUT + ecx] ;; = 8 * (n MOD 3) ADD eax, edx ;; = 8 * (n MOD 3) + (n MOD RET 0 Too many instructions JUST DIVIDE BY 24 |
|||
28 May 2014, 03:05 |
|
baldr 29 May 2014, 23:28
magicSqr,
Modular arithmetic can be used for this. Given that 0xAAAAAAAB is a multiplicative inverse of 3, i.e. 3·0xAAAAAAAB ≡ 1 (mod 2³²), and it's big enough, so x = 3n+1 means x·0xAAAAAAAB ≥ 0xAAAAAAAB (mod 2³²). Rough measurements show that even dumb implementation runs around 60% faster than div (SU2300 @1.2GHz). Your mileage may vary. |
|||
29 May 2014, 23:28 |
|
revolution 29 May 2014, 23:44
baldr wrote: Rough measurements show that even dumb implementation runs around 60% faster than div (SU2300 @1.2GHz). Your mileage may vary. Last edited by revolution on 30 May 2014, 11:05; edited 1 time in total |
|||
29 May 2014, 23:44 |
|
edfed 30 May 2014, 09:29
what is needed is mod 24, then the factor isn't any stated above.
i am pretty sure there is an easy, fast and efficient way to compute that.... but how? |
|||
30 May 2014, 09:29 |
|
revolution 30 May 2014, 09:33
edfed wrote: what is needed is mod 24, then the factor isn't any stated above. edfed wrote: i am pretty sure there is an easy, fast and efficient way to compute that.... but how? |
|||
30 May 2014, 09:33 |
|
edfed 30 May 2014, 11:02
|
|||
30 May 2014, 11:02 |
|
revolution 30 May 2014, 11:21
I had to edit my post above because I was misunderstanding what baldr was saying. I think I now understand what bald was suggesting. I hope I got this right:
Code: mov rax,0xaaaaaaaaaaaaaaab ;ceiling(2^65/3) mul [value] add rax,rdx cmp rax,[value] jz .is_a_multiple_of_3 ;not a multiple of 3 ud2 .is_a_multiple_of_3: ud2 Can that be generalised to all values? |
|||
30 May 2014, 11:21 |
|
revolution 30 May 2014, 11:41
Hmm, perhaps this is what is meant?
Code: mov rax,[value] mov rcx,0xaaaaaaaaaaaaaaab ;ceiling(2^65/3) imul rax,rcx cmp rax,rcx jae .is_1_mod_3 ;... |
|||
30 May 2014, 11:41 |
|
r22 30 May 2014, 13:19
Seems like one of those things that only works with prime numbers or powers of 2 plus/minus 1. Have you tried with 2^65/24?
|
|||
30 May 2014, 13:19 |
|
baldr 30 May 2014, 13:48
revolution wrote:
magicSqr explicitly stated that neither quotient, nor remainder needed, only the test result is. Thus the least significant 32 bits of x·0xAAAAAAAB are compared to the threshold value in order to make a decision about divisibility by 3. Additionally that product is tested to have 011 in the least significant bits (since we're looking for 24n+1). revolution wrote: Where is that figure from? Is it a real program or some synthetic test? I've used this (highly artificial) program to roughly compare different methods: Code: format PE console include "Win32A.Inc" section ".text" executable cinvoke cputs, _testing xor ebx, ebx testing: test bx, bx ; check for keypress sometimes jnz .proceed cinvoke kbhit test eax, eax jz .proceed .flush: cinvoke getch cinvoke kbhit test eax, eax jnz .flush cinvoke cputs, _skipped jmp benching .proceed: mov eax, ebx call test_div mov edi, eax mov eax, ebx call test_m04 cmp eax, edi je .next cinvoke cprintf, _failed, ebx, edi, eax invoke ExitProcess, 1 .next: add ebx, 1 jnz testing cinvoke cputs, _ok benching: cinvoke cputs, _benching irps variant, div m01 m02 m03 m04 { xor ebx, ebx invoke GetTickCount mov [_start], eax @@: mov eax, ebx call test_#variant add ebx, 1 jnz @b invoke GetTickCount sub eax, [_start] cinvoke cprintf, _fmt, _#variant, eax } invoke ExitProcess, 0 align 16 test_div: mov ecx, 24 xor edx, edx div ecx xor eax, eax cmp edx, 1 sete al ; 'cmove eax, edx' works too, but depends on prev instruction ret INVERTED_3 = 0xAAAAAAAB; multiplicative inverse of 3 (mod 2**32) align 16 test_m01: imul eax, INVERTED_3 cmp eax, INVERTED_3 ; luckily it's test value too jb .didn't_pass and eax, 7 cmp eax, 3 jne .didn't_pass mov eax, 1 ret .didn't_pass: mov eax, 0 ret align 16 test_m02: mov ecx, eax and ecx, 7 cmp ecx, 1 jne .didn't_pass imul eax, INVERTED_3 cmp eax, INVERTED_3 jb .didn't_pass mov eax, 1 ret .didn't_pass: mov eax, 0 ret align 16 test_m03: mov ecx, INVERTED_3 mul ecx cmp eax, ecx jb .didn't_pass and eax, 7 cmp eax, 3 jne .didn't_pass mov eax, 1 ret .didn't_pass: mov eax, 0 ret align 16 test_m04: mov ecx, INVERTED_3 imul eax, ecx sub eax, ecx sbb ecx, ecx ; 0 if 3n+1, -1 otherwise and eax, 7 ; 0 if 8m+1 or eax, ecx ; 0 if 24k+1 neg eax sbb eax, eax inc eax ret section ".data" readable writable _start rd 1 _testing db "Testing for consistency...", 0 _ok db " OK", 0 _skipped db " skipped", 0 _failed db " failed @%#x: div: %d, mul: %d.", 13, 10, 0 _benching db ".", 13, 10, "Benchmarking...", 13, 10, 0 irps variant, div m01 m02 m03 m04 { _#variant db `variant, 0 } _fmt db "%s(): %d ticks.", 13, 10, 0 section ".idata" readable import library kernel32, "Kernel32.DLL",\ msvcrt, "MSVCRT.DLL" include "API/Kernel32.Inc" import msvcrt,\ cprintf, "_cprintf",\ cputs, "_cputs",\ getch, "_getch",\ kbhit, "_kbhit" Code: >"mul vs. div.exe" Testing for consistency... skipped. Benchmarking... div(): 48516 ticks. m01(): 31933 ticks. m02(): 28954 ticks. m03(): 28517 ticks. m04(): 25693 ticks. |
|||
30 May 2014, 13:48 |
|
revolution 30 May 2014, 13:57
r22 wrote: Seems like one of those things that only works with prime numbers or powers of 2 plus/minus 1. Have you tried with 2^65/24? |
|||
30 May 2014, 13:57 |
|
revolution 30 May 2014, 13:59
baldr wrote: <Some code and analysis> Regarding the speed test: I'm sure you are aware that 32-bit test code timing is not going to be applicable to 64-bit run code. |
|||
30 May 2014, 13:59 |
|
revolution 30 May 2014, 14:14
I converted the code to 64-bit and ran on an AMD fusion under W7-64.
Code: format PE64 console include "Win64A.Inc" section ".text" executable entry $ and rsp,-16 cinvoke cputs, _testing xor ebx, ebx testing: test bx, bx ; check for keypress sometimes jnz .proceed cinvoke kbhit test eax, eax jz .proceed .flush: cinvoke getch cinvoke kbhit test eax, eax jnz .flush cinvoke cputs, _skipped jmp benching .proceed: mov eax, ebx call test_div mov edi, eax mov eax, ebx call test_m04 cmp eax, edi je .next cinvoke cprintf, _failed, ebx, edi, eax invoke ExitProcess, 1 .next: add ebx, 1 jnz testing cinvoke cputs, _ok benching: cinvoke cputs, _benching irps variant, div m01 m02 m03 m04 { xor ebx, ebx invoke GetTickCount mov [_start], rax @@: mov eax, ebx call test_#variant add ebx, 1 test ebx,1 shl 28 -1 jnz @b invoke GetTickCount sub rax, [_start] cinvoke cprintf, _fmt, _#variant, rax } invoke ExitProcess, 0 align 16 test_div: mov ecx, 24 xor edx, edx div rcx xor eax, eax cmp edx, 1 sete al ; 'cmove eax, edx' works too, but depends on prev instruction ret INVERTED_3 = 0xAAAAAAAAAAAAAAAB; multiplicative inverse of 3 (mod 2**64) align 16 test_m01: mov rcx, INVERTED_3 imul rax, rcx cmp rax, rcx jb .didn't_pass and eax, 7 cmp eax, 3 jne .didn't_pass mov eax, 1 ret .didn't_pass: mov eax, 0 ret align 16 test_m02: mov rcx, rax and ecx, 7 cmp ecx, 1 jne .didn't_pass mov rcx, INVERTED_3 imul rax, rcx cmp rax, rcx jb .didn't_pass mov eax, 1 ret .didn't_pass: mov eax, 0 ret align 16 test_m03: mov rcx, INVERTED_3 mul rcx cmp rax, rcx jb .didn't_pass and eax, 7 cmp eax, 3 jne .didn't_pass mov eax, 1 ret .didn't_pass: mov eax, 0 ret align 16 test_m04: mov rcx, INVERTED_3 imul rax, rcx sub rax, rcx sbb ecx, ecx ; 0 if 3n+1, -1 otherwise and eax, 7 ; 0 if 8m+1 or eax, ecx ; 0 if 24k+1 neg eax sbb eax, eax inc eax ret section ".data" readable writable _start rq 1 _testing db "Testing for consistency...", 0 _ok db " OK", 0 _skipped db " skipped", 0 _failed db " failed @%#x: div: %d, mul: %d.", 13, 10, 0 _benching db ".", 13, 10, "Benchmarking...", 13, 10, 0 irps variant, div m01 m02 m03 m04 { _#variant db `variant, 0 } _fmt db "%s(): %I64u ticks.", 13, 10, 0 section ".idata" readable import library kernel32, "Kernel32.DLL",\ msvcrt, "MSVCRT.DLL" include "API/Kernel32.Inc" import msvcrt,\ cprintf, "_cprintf",\ cputs, "_cputs",\ getch, "_getch",\ kbhit, "_kbhit" Code: Benchmarking... div(): 15116 ticks. m01(): 2122 ticks. m02(): 1809 ticks. m03(): 2091 ticks. m04(): 1575 ticks. |
|||
30 May 2014, 14:14 |
|
r22 30 May 2014, 14:31
Facinating! I had convinced myself that having to combine/composite the mod3 and the AND7 (mod8) results wouldn't be able to compete with a DIV. These synthetic results are preconception shattering; now I'm curious if an optimizing compiler like GCC would come up with similar results.
|
|||
30 May 2014, 14:31 |
|
revolution 30 May 2014, 14:41
r22 wrote: I'm curious if an optimizing compiler like GCC would come up with similar results. |
|||
30 May 2014, 14:41 |
|
r22 30 May 2014, 15:45
Code: VARIABLE % CONSTANT_A == CONSTANT_B ? true : false; Seems broad enough that the magic black box (and the wizards that summon it from the ether) would want to optimize it appropriately. I'm just curious if the code path it chooses would look like any of the ones baldr presented. |
|||
30 May 2014, 15:45 |
|
r22 06 Jun 2014, 01:57
Here's what GCC 4.6.3 comes up with for
Code: int main() { unsigned hi, lo; unsigned long long num; __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi)); num = ((unsigned long long)lo)|( ((unsigned long long)hi)<<32); if((num % 24) == 1) { return 1; } else { return 0; } } Code: gcc -m64 -Ofast -S -masm=intel main.c Code: rdtsc mov rcx, rdx mov eax, eax movabs rdx, -6148914691236517205 sal rcx, 32 or rcx, rax mov rax, rcx mul rdx shr rdx, 4 lea rax, [rdx+rdx*2] sal rax, 3 sub rcx, rax xor eax, eax cmp rcx, 1 sete al ret Didn't expect the SAL. |
|||
06 Jun 2014, 01:57 |
|
revolution 06 Jun 2014, 03:29
r22 wrote: Here's what GCC 4.6.3 comes up with for ... r22 wrote:
So just your basic recip-mul there. |
|||
06 Jun 2014, 03:29 |
|
Goto page Previous 1, 2 < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.