Mathematical processing routines

Index > Tutorials and Examples > Mathematical processing routines

Author

Thread

FoXx

Joined: 25 Feb 2025
Posts: 33

FoXx 17 Dec 2025, 08:27

Procedures for finding square roots.

Code:

continue        equ @r
skip            equ @f
;------------------------------------------------
;       * * *  Func Square Root  * * *
;------------------------------------------------
proc Square

;       mov ESI, Value 

        xor ECX, ECX                    ;       int Square  = 0;
        mov EBX, 10000h                 ;       int FindBit = 1 << 16;
        @@:                             ;       do
                shr EBX, 1              ;       {       do
                mov EAX, EBX            ;               {
                or  EAX, ECX            ;                 FindBit =>> 1;
                mov EDX, EAX            ;                 int A = Square | FindBit;
                mul EDX                 ;                 int X = A * A;
                cmp EAX, ESI            ;               }
                ja continue             ;               whilw( X  > Value );
                je skip                 ;                  if( X == Value ) break;

                        or ECX, EBX     ;               Square = Square | FindBit;
                        or EBX, EBX     ;       }
                        jnz continue    ;       while(!FindBit );
        @@:
        mov EAX, ECX
        or  EAX, EBX                    ;       return ( Square | FindBit );
        ret
endp

Last edited by FoXx on 18 Dec 2025, 08:51; edited 5 times in total

17 Dec 2025, 08:27

FoXx

Joined: 25 Feb 2025
Posts: 33

FoXx 17 Dec 2025, 08:34

Methods for finding the square root of a 64-bit value. When using x64, the value can be increased to 128 bits.

Code:

;------------------------------------------------
;       * * *  Func Double Square Root  * * *
;------------------------------------------------
proc Square64

;       mov EDI, HighValue
;       mov ESI, LowValue

        xor ECX, ECX                            ;       int Square  = 0;
        mov EBX, 80000000h                      ;       int FindBit = 1 << 31;
        @@:                                     ;       do
                shr EBX, 1                      ;       {       do
                mov EAX, EBX                    ;               {
                or  EAX, ECX                    ;                 FindBit =>> 1;
                mov EDX, EAX                    ;                 int   A = Square | FindBit;
                mul EDX                         ;                 int64 X = A * A;
                cmp EDX, EDI                    ;               }
                ja continue                     ;               whilw ( X.high > HighValue ||
                                                ;                       X.low  > LowValue  );
                cmp EAX, ESI                    ;                  if ( X.low == LowValue  ) break;
                ja continue
                je skip                         ;               Square = Square | FindBit;

                        or ECX, EBX             ;       }
                        or EBX, EBX             ;       while(!FindBit );
                        jnz continue
        @@:
        mov EAX, ECX
        or  EAX, EBX                            ;       return ( Square | FindBit );
        ret
endp

Last edited by FoXx on 17 Dec 2025, 15:18; edited 1 time in total

17 Dec 2025, 08:34

FoXx

Joined: 25 Feb 2025
Posts: 33

FoXx 17 Dec 2025, 09:28

This is a very fast square root search.The maximum number of attempts is equal to the number of bits in the number.

Last edited by FoXx on 17 Dec 2025, 16:48; edited 1 time in total

17 Dec 2025, 09:28

bitRAKE

Joined: 21 Jul 2003
Posts: 4364
Location: vpcmpistri

bitRAKE 17 Dec 2025, 14:06

If you're interested in performance: MUL is definitely not needed in the loop, two branches are sufficient. I haven't extended this to 128-bit, but I did write a general implementation for fasmg (probably the easiest to understand the algorithm).

_________________
¯\(°_o)/¯ AI may [not] have aided with the above reply.

17 Dec 2025, 14:06

FoXx

Joined: 25 Feb 2025
Posts: 33

FoXx 17 Dec 2025, 17:05

bitRAKE wrote:

MUL is definitely not needed in the loop...

The processor architecture multiplies numbers by shifting and adding. These are the processor's internal cycles. Completely replacing the multiplication operation with a series of additions will not speed up the process. Any jumps in the multiplication cycle will cause the processor to crash, as the cache memory becomes breaked..

17 Dec 2025, 17:05

revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20864
Location: In your JS exploiting you and your system

revolution 17 Dec 2025, 18:25

Some notes on the stated cycles counts above (now removed?):

Measurement is the only way to know how code will perform. Predicting performance from reading the code is not possible these days. Run the code, measure the result, compare A to B and pick the one that runs best on the system under test. Different systems can give different results, so test each system separately.

CPUs built today no longer use naive shift+add to multiply, they use much more capable circuits that can complete a full 64x64->128 is just a few cycles.

17 Dec 2025, 18:25

FoXx

Joined: 25 Feb 2025
Posts: 33

FoXx 18 Dec 2025, 07:42

revolution wrote:

CPUs built today no longer use naive shift+add to multiply, they use much more capable circuits that can complete a full 64x64->128 is just a few cycles.

Modern architecture accelerates processor operation through the use of pipelines. The fundamental principle of numerical multiplication remains unchanged. The speed of multiplication depends on the number of bits in the number. Because:

X += 1 * X = X + X
X += 0 * X = X

Code:

;       mov EAX, X      ;       X * Y
;       mov EBX, Y

        mov ERX, EAX    ;       system register
        xor EAX, EAX
        mov EDX, EAX
        mov ECX, 32     ;       register size
@@:
        shr EBX, 1
        jc  skip

                add EAX, ERX
                adc EDX, 0
@@:
        shl EAX, 1
        rcl EDX, 1

        loop continue

;       EDX:EAX = result

revolution wrote:

Some notes on the stated cycles counts above (now removed?)

I estimated the number of computational cycles using the "triangle" method. This method searches nodes in ascending order. It's similar to sorting.

However, I use the reverse method (top-down). When a lower value is found, we set a bit. This position becomes the start of the next search. This is similar to quicksearch. This method eliminates upward searches. The maximum number of cycles will not exceed the number of bits in the variable minus 1.

Last edited by FoXx on 18 Dec 2025, 07:52; edited 1 time in total

18 Dec 2025, 07:42

revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20864
Location: In your JS exploiting you and your system

revolution 18 Dec 2025, 07:51

FoXx wrote:

Modern architecture accelerates processor operation through the use of pipelines. The fundamental principle of numerical multiplication remains unchanged. The speed of multiplication depends on the number of bits in the number.

It is true that pipelines are used to increase throughput, but pipelines can't improve latency. Many of the latest CPUs can compute a full length multiply with a latency of just a few cycles. Then on top of that, pipelines are used to get more throughput, so multiple multiplies can be in-flight at once. In some CPUs the MUL throughput can be more than one-per-cycle, with latencies of 3 or 4 cycles.

Modern multiplication circuits are very advanced, and it is fascinating to learn about how they work internally to get such amazing performance.

Some of the low power in-order Cortex-M CPUs can do a full multiply in a single cycle!

18 Dec 2025, 07:51

FoXx

Joined: 25 Feb 2025
Posts: 33

FoXx 18 Dec 2025, 08:13

revolution, I completely agree with you.
The dawn of the digital camera era led to a race to speed up the JPEG algorithm. The hardware was slow, and various companies paid big money to develop fast matrix multiplication algorithms. I can't remember now, but Canon probably used only 29 multiplications and ??? additions for an 8x8 matrix. Those were the good old days.

I had experience writing a procedure for fast exponentiation. My ingenuity wasn't enough, so I had to learn a standard fast exponentiation algorithm. After understanding the mathematical formalism, I was able to write it.

Quote:

Some of the low power in-order Cortex-M CPUs can do a full multiply in a single cycle!

As I understand it, we are talking about a microcontroller?

18 Dec 2025, 08:13

< Last Thread | Next Thread >

Forum Rules:

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum