flat assembler
Message board for the users of flat assembler.
Index
> Tutorials and Examples > Mathematical processing routines |
| Author |
|
|
FoXx 17 Dec 2025, 08:34
Methods for finding the square root of a 64-bit value. When using x64, the value can be increased to 128 bits.
Code: ;------------------------------------------------ ; * * * Func Double Square Root * * * ;------------------------------------------------ proc Square64 ; mov EDI, HighValue ; mov ESI, LowValue xor ECX, ECX ; int Square = 0; mov EBX, 80000000h ; int FindBit = 1 << 31; @@: ; do shr EBX, 1 ; { do mov EAX, EBX ; { or EAX, ECX ; FindBit =>> 1; mov EDX, EAX ; int A = Square | FindBit; mul EDX ; int64 X = A * A; cmp EDX, EDI ; } ja continue ; whilw ( X.high > HighValue || ; X.low > LowValue ); cmp EAX, ESI ; if ( X.low == LowValue ) break; ja continue je skip ; Square = Square | FindBit; or ECX, EBX ; } or EBX, EBX ; while(!FindBit ); jnz continue @@: mov EAX, ECX or EAX, EBX ; return ( Square | FindBit ); ret endp Last edited by FoXx on 17 Dec 2025, 15:18; edited 1 time in total |
|||
|
|
FoXx 17 Dec 2025, 09:28
This is a very fast square root search.The maximum number of attempts is equal to the number of bits in the number.
Last edited by FoXx on 17 Dec 2025, 16:48; edited 1 time in total |
|||
|
|
bitRAKE 17 Dec 2025, 14:06
If you're interested in performance: MUL is definitely not needed in the loop, two branches are sufficient. I haven't extended this to 128-bit, but I did write a general implementation for fasmg (probably the easiest to understand the algorithm).
_________________ ¯\(°_o)/¯ AI may [not] have aided with the above reply. |
|||
|
|
FoXx 17 Dec 2025, 17:05
bitRAKE wrote: MUL is definitely not needed in the loop... |
|||
|
|
revolution 17 Dec 2025, 18:25
Some notes on the stated cycles counts above (now removed?):
Measurement is the only way to know how code will perform. Predicting performance from reading the code is not possible these days. Run the code, measure the result, compare A to B and pick the one that runs best on the system under test. Different systems can give different results, so test each system separately. CPUs built today no longer use naive shift+add to multiply, they use much more capable circuits that can complete a full 64x64->128 is just a few cycles. |
|||
|
|
FoXx 18 Dec 2025, 07:42
revolution wrote: CPUs built today no longer use naive shift+add to multiply, they use much more capable circuits that can complete a full 64x64->128 is just a few cycles.
X += 0 * X = X Code: ; mov EAX, X ; X * Y ; mov EBX, Y mov ERX, EAX ; system register xor EAX, EAX mov EDX, EAX mov ECX, 32 ; register size @@: shr EBX, 1 jc skip add EAX, ERX adc EDX, 0 @@: shl EAX, 1 rcl EDX, 1 loop continue ; EDX:EAX = result revolution wrote: Some notes on the stated cycles counts above (now removed?) However, I use the reverse method (top-down). When a lower value is found, we set a bit. This position becomes the start of the next search. This is similar to quicksearch. This method eliminates upward searches. The maximum number of cycles will not exceed the number of bits in the variable minus 1. Last edited by FoXx on 18 Dec 2025, 07:52; edited 1 time in total |
|||
|
|
revolution 18 Dec 2025, 07:51
FoXx wrote: Modern architecture accelerates processor operation through the use of pipelines. The fundamental principle of numerical multiplication remains unchanged. The speed of multiplication depends on the number of bits in the number. Modern multiplication circuits are very advanced, and it is fascinating to learn about how they work internally to get such amazing performance. Some of the low power in-order Cortex-M CPUs can do a full multiply in a single cycle! |
|||
|
|
FoXx 18 Dec 2025, 08:13
revolution, I completely agree with you.
The dawn of the digital camera era led to a race to speed up the JPEG algorithm. The hardware was slow, and various companies paid big money to develop fast matrix multiplication algorithms. I can't remember now, but Canon probably used only 29 multiplications and ??? additions for an 8x8 matrix. Those were the good old days. I had experience writing a procedure for fast exponentiation. My ingenuity wasn't enough, so I had to learn a standard fast exponentiation algorithm. After understanding the mathematical formalism, I was able to write it. Quote: Some of the low power in-order Cortex-M CPUs can do a full multiply in a single cycle! |
|||
|
< Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.