flat assembler
Message board for the users of flat assembler.
![]() Goto page 1, 2 Next |
Author |
|
bitRAKE 21 Mar 2025, 13:13
Typically, in practical use, only a couple primes are needed - usually as part of a larger algorithm and during program initialization. We can get away with a much simpler function than if we were trying to list a bunch of primes.
Code: ; Reduce RCX if not prime (or two for 0 and 1). (no data version) find_prime: cmp rcx, 3 ; Special cases are clamped to prime two. jc .two bt ecx, 0 ; Adjust for even numbers, and counteract adc rcx, 1 ; below subtraction by two. jc .over ; Two overflow case require special handling. .next_candidate: push 1 push 1 pop r9 ; N² pop r8 ; N, divisor sub rcx, 2 ; prior odd canidate .try: lea rdx, [4*r8 + 4] ; next odd square, (N+2)² = N² +4N +4 sub r8, -2 ; next odd number, N+2 add r9, rdx jc .prime mov rax, rcx xor edx, edx cmp rcx, r9 ; If the square of the divisor exceeds jc .prime ; the number it must be prime. div r8 test rdx, rdx jnz .try cmp rcx, 3+1 jnc .next_candidate push 3 pop rcx .prime: retn .two: push 2 pop rcx retn .over: push (1 shl 64) - 59 ; largest prime < 2^64 pop rcx retn ; ECX is always prime, CF=1 At the other end of the spectrum is the prime sieve. |
|||
![]() |
|
bitRAKE 22 Mar 2025, 20:51
That's essentially it, but I've halved the loop and it works bottom up - because of the likelihood of divisibility by smaller numbers. Working bottom up also requires an upper bound - the square root of the candidate number.
The prime/2 plus the prime gap times the first divisor determines how long it takes to find a prime. This can be substantial. Code: or rbx, -1 .more: mov rcx, rbx call find_prime ; The `wsprintf` function supports the "I64" prefix for use with 64-bit ; values - otherwise they will be truncated. (Use with: d, i, o, u, x, or X.) mov r8, rcx wsprintfA & .buffer, <A "%I64u, ">, r8 xchg r8d, eax ; characters WriteConsoleA [.hStdOut], & .buffer, r8d, 0, 0 shr rbx, 1 jnz .more Wikipedia wrote: For example, there is no composite < 2^64 that is a strong pseudoprime to all of the seven bases 2, 325, 9375, 28178, 450775, 9780504, and 1795265022. |
|||
![]() |
|
bitRAKE 23 Mar 2025, 21:54
Understanding the Prime-Finding Algorithm: From Theory to Implementation
When developing algorithms to find prime numbers, we often begin with simple approaches and refine them through several optimizations. The assembly code provided is a sophisticated implementation of prime number detection, incorporating multiple clever optimizations. Let's break down how this algorithm was developed, starting from first principles. The Fundamental Concept: Testing for Primality A prime number is defined as a natural number greater than 1 that cannot be formed by multiplying two smaller natural numbers. The most straightforward approach to determine if a number is prime is to check if it's divisible by any number from 2 up to the number itself (excluding 1 and itself). However, we can immediately make two key observations:
Initial Optimization: Square Root Boundary The first significant optimization comes from mathematics: if a number n is not prime, it must have a divisor less than or equal to √n. This dramatically reduces our search space. In our assembly code, this principle appears as: Code: cmp rcx, r9 ; If the square of the divisor exceeds jc .prime ; the number it must be prime. Arithmetic Progression Optimization When testing consecutive odd numbers, calculating the square of each divisor would be inefficient. Instead, the algorithm uses an arithmetic progression to track the squares: Code: lea rdx, [4*r8 + 4] ; next odd square, (N+2)² = N² +4N +4 sub r8, -2 ; next odd number, N+2 add r9, rdx This clever approach avoids multiplication operations. If we have the square of n, we can compute the square of n+2 using: (n+2)² = n² + 4n + 4 By tracking both the current divisor (n) and its square (n²), we can efficiently generate the next values using addition. Focusing on Odd Numbers Another key optimization is testing only odd divisors (except for the special case of 2): Code: bt ecx, 0 ; Adjust for even numbers, and counteract adc rcx, 1 ; below subtraction by two. This effectively halves the number of divisibility tests needed. The algorithm uses bit testing (bt) to check if the number is even, and adjusts accordingly. Handling Edge Cases The algorithm carefully handles several edge cases: 1. Numbers less than 3 are clamped to the prime 2: Code: cmp rcx, 3 ; Special cases are clamped to prime two. jc .two 2. Overflow conditions are properly managed: Code: jc .over ; Two overflow case require special handling. 3. The largest prime less than 2⁶⁴ is returned for overflow cases: Code: .over: push (1 shl 64) - 59 ; largest prime < 2^64 pop rcx retn The Complete Algorithm Flow
Efficiency Considerations The algorithm makes excellent use of register operations and minimizes memory access. By using arithmetic progressions instead of multiplication, it avoids costly operations. The focus on odd numbers immediately halves the work compared to naive approaches. Final Thoughts This assembly implementation demonstrates how mathematical understanding combined with low-level optimization techniques can produce highly efficient algorithms. The progression from the basic primality test to this optimized version shows the value of refining algorithms through mathematical insights and careful implementation. What makes this algorithm particularly elegant is how it combines theoretical insights (like the square root boundary and odd-number testing) with practical assembly-level optimizations like register usage and arithmetic progressions. |
|||
![]() |
|
six_L 24 Mar 2025, 17:08
Hello,bitRAKE
interesting the algorithm. Quote: 97, 97, 97, 97, 89, 89, 89, 89, 89, 89, input 100, why did output many repetitive primes? |
|||
![]() |
|
Feryno 24 Mar 2025, 17:39
Great example and great explanation!
It is optimized for minimal size of the code. If someone needs speed optimization, the loops should begin at 16 bytes boundary and also the used push/pop instructions are slower than a mov instruction. Something like this may run faster in some situations - when the loops are repeated a lot of times: Code: align 10h find_prime: cmp rcx, 3 ; Special cases are clamped to prime two. jc .two bt ecx, 0 ; Adjust for even numbers, and counteract adc rcx, 1 ; below subtraction by two. jc .over ; Two overflow case require special handling. mov r10d, 1 ; processor zero extends this mov from 32 to 64 bits align 10h .next_candidate: mov r9d, r10d ; N² mov r8d, r10d ; N, divisor sub rcx, 2 ; prior odd candidate align 10h .try: lea rdx, [4*r8 + 4] ; next odd square, (N+2)² = N² +4N +4 IIRC moves like mov r10d, 1 are faster than mov r10, 1 and then perhaps mov r8d, r10d faster than mov r8, r10 but I'm not sure about this. Moves that do not require REX prefix (like mov edx, eax) are 1 byte smaller than those which do require it (mov r8d, eax) - I'm sure at least in this. When the alignment adds a lot of dummy NOP instructions, the code size increases and then the CPU uses more time to load it into cache so in some situations smaller code with unaligned loops runs faster. I remember a calculation of all primes upto some number using Eratosthenes sieve which I did decades ago and used maybe 8 or 16 MB of memory for the sieve (half of the RAM installed in the machine). I did that in not too much clever way because storing all numbers as a bit array, I could double the calculation range when storing only odd numbers. |
|||
![]() |
|
bitRAKE 24 Mar 2025, 21:11
six_L wrote: input 100, why did output many repetitive primes? That's also the reason I've coded it for size - to amplify the fact it is not for generating a list of primes. Last edited by bitRAKE on 25 Mar 2025, 01:15; edited 3 times in total |
|||
![]() |
|
bitRAKE 24 Mar 2025, 21:15
Feryno wrote: Great example and great explanation! Is there a difference between NOP spacing the code, and just using sufficient encodings to get the alignment? fasm2 allows such flexibility, but I haven't coded this for speed, per se. |
|||
![]() |
|
bitRAKE 24 Mar 2025, 23:15
Claude 3.7 missed a couple things of note: the numbers 3-8 directly return the close odd primes: 3,5,7 - without division. I count that as a corner case as well. The carry being set is superfluous, but a product of my design methodology - allowing the algorithm to be adapted to a number of scenarios. For example, I don't see any point in returning 0|1. Yet, a use might need to error on their use - easy to make that change.
|
|||
![]() |
|
macomics 25 Mar 2025, 03:42
@bitRAKE & @Feryno, this is how it happens to align start of the loop to 16 bytes.
Code: ; Reduce RCX if not prime (or two for 0 and 1). (no data version) align 16 find_prime: ; cmp rcx, 3 ; Special cases are clamped to prime two. db 0x48, 0x81, 0xF9, 0x03, 0x00, 0x00, 0x00 jc .two push r10 ; push 1 ; pop r10 db 0x49, 0xBA, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 bt ecx, 0 ; Adjust for even numbers, and counteract adc rcx, r10 ; below subtraction by two. jc .over ; Two overflow case require special handling. xchg ax, ax ; 2 bytes NOP ; align 16 .next_candidate: mov r8, r10 ; N, divisor mov r9, r10 ; N² ; sub rcx, 2 ; prior odd canidate db 0x48, 0x81, 0xE9, 0x02, 0x00, 0x00, 0x00 db 0x66, 0x67, 0x90 ; 3 bytes NOP ; align 16 .try: lea rdx, [4 * r8 + 4] ; next odd square, (N+2)² = N² +4N +4 add r8, 2 ; next odd number, N+2 add r9, rdx jc .prime mov rax, rcx xor edx, edx cmp rcx, r9 ; If the square of the divisor exceeds jc .prime ; the number it must be prime. div r8 test rdx, rdx jnz .try cmp rcx, 3+1 jnc .next_candidate lea rcx, [3 * r10] .prime: pop r10 retn .over: push -59 ; (1 shl 64) - 59 ; largest prime < 2^64 pop rcx pop r10 retn .two: push 2 pop rcx retn ; ECX is always prime, CF=1 |
|||
![]() |
|
macomics 25 Mar 2025, 04:11
without r10
Code: ; Reduce RCX if not prime (or two for 0 and 1). (no data version) align 16 find_prime: cmp rcx, 3 ; Special cases are clamped to prime two. jc .two bt ecx, 0 ; Adjust for even numbers, and counteract adc rcx, 1 ; below subtraction by two. jc .over ; Two overflow case require special handling. ; align 16 .next: mov r8, 1 ; N, divisor mov r9, r8 ; N² sub rcx, 2 ; prior odd canidate xchg ax, ax ; 2 bytes NOP ; without NOP ; sub rcx, r8 ; sub rcx, r9 ; align 16 .try: lea rdx, [4 * r8 + 4] ; next odd square, (N+2)² = N² +4N +4 add r8, 2 ; next odd number, N+2 add r9, rdx jc .prime mov rax, rcx xor edx, edx cmp rcx, r9 ; If the square of the divisor exceeds jc .prime ; the number it must be prime. div r8 test rdx, rdx jnz .try cmp rcx, 3+1 jnc .next mov rcx, 3 ; align 16 .prime: retn align 16 .over: ; push -59 ; (1 shl 64) - 59 ; largest prime < 2^64 ; pop rcx mov rcx, -59 retn align 16 .two: mov rcx, 2 retn ; ECX is always prime, CF=1 |
|||
![]() |
|
six_L 25 Mar 2025, 09:59
Hello,bitRAKE
Thanks for your respones. when I input (2^64 - 1 = 0FFFFFFFFFFFFFFFEh = 18446744073709551614d ), It was crashed. Regards six_L |
|||
![]() |
|
bitRAKE 25 Mar 2025, 10:34
six_L wrote: when I input (2^64 - 1 = 0FFFFFFFFFFFFFFFEh = 18446744073709551614d ), It was crashed. 2^64-1 = 0xFFFFFFFFFFFFFFFF, which overflows, returning 2^64 - 59 0xFFFFFFFFFFFFFFFE, gets processed as 0xFFFFFFFFFFFFFFFD, which is divisible by 13, ... finally it returns 2^64 - 59. I have no clue what you are experiencing. Believe it or not, I did write test cases for this algorithm - just as a sanity check: Code: ; https://en.wikipedia.org/wiki/Prime-counting_function ; Some known values, primes under a power of ten: {const:8} label π10_Table:8 .temp = 1 iterate p,\ 4,\ 25,\ 168,\ 1229,\ 9592,\ 78498,\ 664579,\ 5761455,\ 50847534,\ 455052511,\ 4118054813,\ 37607912018,\ 346065536839,\ 3204941750802,\ 29844570422669,\ 279238341033925,\ 2623557157654233,\ 24739954287740860,\ 234057667276344607,\ 2220819602560918840,\ 21127269486018731928,\ 201467286689315906290,\ 1925320391606803968923,\ 18435599767349200867866,\ 176846309399143769411680,\ 1699246750872437141327603,\ 16352460426841680446427399,\ 157589269275973410412739598,\ 1520698109714272166094258063 .temp = temp * 10 if .temp < 1 shl 64 {const:8} dq p else break end if end iterate Code: {const:16} label Two_Table:1 ; Visit the Prime Pages: https://t5k.org/lists/2small/0bit.html ; For given <power> of two, the negative <offsets> are the distance to primes: ; (i.e. 2^8 - 5 = 251 is prime.) iterate <power, offsets>,\ 5, <1,3,9,13,15,19,21,25,27,29>,\ 6, <3,5,11,17,21,23,27,33,35,41>,\ 7, <1,15,19,21,25,27,31,39,45,49>,\ 8, <5,15,17,23,27,29,33,45,57,59>,\ 9, <3,9,13,21,25,33,45,49,51,55>,\ 10, <3,5,11,15,27,33,41,47,53,57>,\ 11, <9,19,21,31,37,45,49,51,55,61>,\ 12, <3,5,17,23,39,45,47,69,75,77>,\ 13, <1,13,21,25,31,45,69,75,81,91>,\ 14, <3,15,21,23,35,45,51,65,83,111>,\ 15, <19,49,51,55,61,75,81,115,121,135>,\ 16, <15,17,39,57,87,89,99,113,117,123>,\ 17, <1,9,13,31,49,61,63,85,91,99>,\ 18, <5,11,17,23,33,35,41,65,75,93>,\ 19, <1,19,27,31,45,57,67,69,85,87>,\ 20, <3,5,17,27,59,69,129,143,153,185>,\ 21, <9,19,21,55,61,69,105,111,121,129>,\ 22, <3,17,27,33,57,87,105,113,117,123>,\ 23, <15,21,27,37,61,69,135,147,157,159>,\ 24, <3,17,33,63,75,77,89,95,117,167>,\ 25, <39,49,61,85,91,115,141,159,165,183>,\ 26, <5,27,45,87,101,107,111,117,125,135>,\ 27, <39,79,111,115,135,187,199,219,231,235>,\ 28, <57,89,95,119,125,143,165,183,213,273>,\ 29, <3,33,43,63,73,75,93,99,121,133>,\ 30, <35,41,83,101,105,107,135,153,161,173>,\ 31, <1,19,61,69,85,99,105,151,159,171>,\ 32, <5,17,65,99,107,135,153,185,209,267>,\ 33, <9,25,49,79,105,285,301,303,321,355>,\ 34, <41,77,113,131,143,165,185,207,227,281>,\ 35, <31,49,61,69,79,121,141,247,309,325>,\ 36, <5,17,23,65,117,137,159,173,189,233>,\ 37, <25,31,45,69,123,141,199,201,351,375>,\ 38, <45,87,107,131,153,185,191,227,231,257>,\ 39, <7,19,67,91,135,165,219,231,241,301>,\ 40, <87,167,195,203,213,285,293,299,389,437>,\ 41, <21,31,55,63,73,75,91,111,133,139>,\ 42, <11,17,33,53,65,143,161,165,215,227>,\ 43, <57,67,117,175,255,267,291,309,319,369>,\ 44, <17,117,119,129,143,149,287,327,359,377>,\ 45, <55,69,81,93,121,133,139,159,193,229>,\ 46, <21,57,63,77,167,197,237,287,305,311>,\ 47, <115,127,147,279,297,339,435,541,619,649>,\ 48, <59,65,89,93,147,165,189,233,243,257>,\ 49, <81,111,123,139,181,201,213,265,283,339>,\ 50, <27,35,51,71,113,117,131,161,195,233>,\ 51, <129,139,165,231,237,247,355,391,397,439>,\ 52, <47,143,173,183,197,209,269,285,335,395>,\ 53, <111,145,231,265,315,339,343,369,379,421>,\ 54, <33,53,131,165,195,245,255,257,315,327>,\ 55, <55,67,99,127,147,169,171,199,207,267>,\ 56, <5,27,47,57,89,93,147,177,189,195>,\ 57, <13,25,49,61,69,111,195,273,363,423>,\ 58, <27,57,63,137,141,147,161,203,213,251>,\ 59, <55,99,225,427,517,607,649,687,861,871>,\ 60, <93,107,173,179,257,279,369,395,399,453>,\ 61, <1,31,45,229,259,283,339,391,403,465>,\ 62, <57,87,117,143,153,167,171,195,203,273>,\ 63, <25,165,259,301,375,387,391,409,457,471>,\ 64, <59,83,95,179,189,257,279,323,353,363>,\ \; Not used for testing, currently: 65, <49,79,115,141,163,229,301,345,453,493>,\ 66, <5,45,173,203,275,297,387,401,443,495>,\ 67, <19,31,49,57,61,75,81,165,181,237>,\ 68, <23,83,125,147,149,167,285,315,345,357>,\ 69, <19,91,93,103,129,153,165,201,255,385>,\ 70, <35,71,167,215,263,267,273,447,473,585>,\ 71, <231,325,411,435,441,465,559,577,601,721>,\ 72, <93,107,129,167,249,269,329,347,429,473>,\ 73, <69,181,199,273,319,433,475,501,523,645>,\ 74, <35,45,57,135,153,237,257,275,461,465>,\ 75, <97,207,231,271,279,289,325,381,409,427>,\ 76, <15,63,117,123,143,189,215,267,285,347>,\ 77, <33,43,145,163,195,261,295,379,433,451>,\ 78, <11,95,111,123,147,153,191,263,303,507>,\ 79, <67,199,249,277,355,367,405,447,477,511>,\ 80, <65,93,117,143,285,317,549,645,765,933>,\ 81, <51,63,163,205,333,349,429,433,481,553>,\ 82, <57,113,185,315,363,365,375,453,623,635>,\ 83, <55,97,117,121,139,285,307,405,429,561>,\ 84, <35,69,213,215,333,399,525,563,587,597>,\ 85, <19,61,181,295,411,433,469,519,531,823>,\ 86, <35,41,65,71,113,255,261,293,357,461>,\ 87, <67,129,181,195,201,217,261,277,289,339>,\ 88, <299,455,483,563,605,719,735,743,753,797>,\ 89, <1,21,31,49,69,99,103,265,321,441>,\ 90, <33,41,53,75,227,263,273,291,297,317>,\ 91, <45,81,111,201,315,339,567,619,655,771>,\ 92, <83,149,197,317,363,419,485,497,519,537>,\ 93, <25,51,79,105,273,405,489,553,571,579>,\ 94, <3,11,105,173,273,297,321,395,407,431>,\ 95, <15,37,211,339,387,415,441,447,555,561>,\ 96, <17,87,93,147,165,189,237,243,315,347>,\ 97, <141,165,349,399,453,595,729,741,859,885>,\ 98, <51,65,107,117,141,227,273,363,471,525>,\ 99, <115,145,247,319,381,427,675,717,1207,1231>,\ 100, <15,99,153,183,267,285,357,479,603,833>,\ ; Note: the largest gap present above is 84. So, we can represent all the ; offsets as byte values: start at zero and every byte is added to the prior ; total, which is to be subtracted from the power of two. assert power - 4 = % ; no missing lines ; Note: Only values of 64-bits are less are being used. if power < 65 .last = 0 ; db power iterate O,offsets {const:16} db O - .last .last = O end iterate {const:16} align 16 ; shift power to find offsets end if ; Range checking, used to debug list: .last = 0 iterate O,offsets assert .last < O & %% = 10 ; increasing values if O - .last > 255 ; byte value? repeat 1,p:power,o:O display 10,9,'warning: 2^',`p,'-',`o,' exceeds byte scheme.' end repeat end if .last = O end iterate end iterate |
|||
![]() |
|
six_L 25 Mar 2025, 12:45
Hello,bitRAKE
It's my fault, the output buffer definition is too small. A great algorithm ! All is fine. 18446744073709551614 >= Prime ( 18446744073709551557 ) if using the Eratosthenes sieve algorithm, It consumes a lot of memory[1048575(T)]. Quote: 2^64 = 18446744073709551615 Quote: Intel(R) Core(TM) i5-9400H CPU @ 2.50GHz Thank again. |
|||
![]() |
|
bitRAKE 25 Mar 2025, 20:38
six_L wrote: if using the Eratosthenes sieve algorithm, It consumes a lot of memory |
|||
![]() |
|
six_L 26 Mar 2025, 02:55
Hello,bitRAKE
Quote: At the other end of the spectrum is the prime sieve. Thanks you for the link. I haven't figured out your macro yet. 1, PRIME_LIMIT = ? 2, How do use the macros "PRIME_TAB_Initialize" and Print a prime? |
|||
![]() |
|
bitRAKE 26 Mar 2025, 04:02
It's a little convoluted because it does positive and negative logic (primes can be 0-bits or 1-bits)*.
Looking at problems 3, we can see initialize requires: Code: ; somewhere you can put data: PRIME_LIMIT = 1_000_000 ; the limit of covered numbers INCLUDE "prime_sieve.inc" ; in the code section: PRIME_TAB_Initialize It might be helpful to examine its use in other problems, too. I'm happy to create a simplified example of just the bit sieve. You'd be surprised how fast a billion primes can be generated. Just post a template and I'll adapt the code to it. * If you strip out the neg_logic (just assume it's zero) then it'll simplify the code. You could search the board as well - I've posted it here long ago without the polarity diversity. |
|||
![]() |
|
Mike Gonta 26 Mar 2025, 21:26
From the-idiom - what's your license bitRAKE?
FASMg is the hidden back end to generate the-idiom code from assembly language. Code: to decide if a @number is prime is: $8B442404; // mov eax, [esp+4] $83F802; // cmp eax, 2 $743B; // je .exit $83F801; // cmp eax, 1 $7504; // jne .1 $F7D0; // not eax $EB30; // jmp .cmp_exit // .1: $0FBAE000; // bt eax, 0 $7327; // jnc .setc_exit $89C1; // mov ecx, eax $BE01000000; // mov esi, 1 $89F7; // mov edi, esi // .loop: $8D14B504000000; // lea edx, [esi*4+4] $83C602; // add esi, 2 $01D7; // add edi, edx $7210; // jc .setc_exit $89C8; // mov eax, ecx $31D2; // xor edx, edx $39F8; // cmp eax, edi $7208; // jc .setc_exit $F7F6; // div esi $89D0; // mov eax, edx $85C0; // test eax, eax $75E2; // jne .loop // .setc_exit: $0F92C0; // setc al // .cmp_exit: $3C01; // cmp al, 1 // .exit: $C20400; // ret 4 |
|||
![]() |
|
bitRAKE 26 Mar 2025, 23:07
Mike Gonta wrote: From the-idiom - what's your license bitRAKE? |
|||
![]() |
|
six_L 27 Mar 2025, 10:48
Hello,bitRAKE
bitRAKE wrote: You could search the board as well - I've posted it here long ago without the polarity diversity. I have still not understood how the P003.asm works. I searched the board again and again, could not find the worked codes(exe) of P003.asm. Would you give me a direct link? Regards six_L |
|||
![]() |
|
Goto page 1, 2 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.