flat assembler
Message board for the users of flat assembler.

Index > Tutorials and Examples > Mathematical processing routines

Author
Thread Post new topic Reply to topic
FoXx



Joined: 25 Feb 2025
Posts: 33
FoXx 17 Dec 2025, 08:27
Procedures for finding square roots.
Code:
continue        equ @r
skip            equ @f
;------------------------------------------------
;       * * *  Func Square Root  * * *
;------------------------------------------------
proc Square

;       mov ESI, Value 

        xor ECX, ECX                    ;       int Square  = 0;
        mov EBX, 10000h                 ;       int FindBit = 1 << 16;
        @@:                             ;       do
                shr EBX, 1              ;       {       do
                mov EAX, EBX            ;               {
                or  EAX, ECX            ;                 FindBit =>> 1;
                mov EDX, EAX            ;                 int A = Square | FindBit;
                mul EDX                 ;                 int X = A * A;
                cmp EAX, ESI            ;               }
                ja continue             ;               whilw( X  > Value );
                je skip                 ;                  if( X == Value ) break;

                        or ECX, EBX     ;               Square = Square | FindBit;
                        or EBX, EBX     ;       }
                        jnz continue    ;       while(!FindBit );
        @@:
        mov EAX, ECX
        or  EAX, EBX                    ;       return ( Square | FindBit );
        ret
endp    


Last edited by FoXx on 18 Dec 2025, 08:51; edited 5 times in total
Post 17 Dec 2025, 08:27
View user's profile Send private message Reply with quote
FoXx



Joined: 25 Feb 2025
Posts: 33
FoXx 17 Dec 2025, 08:34
Methods for finding the square root of a 64-bit value. When using x64, the value can be increased to 128 bits.
Code:
;------------------------------------------------
;       * * *  Func Double Square Root  * * *
;------------------------------------------------
proc Square64

;       mov EDI, HighValue
;       mov ESI, LowValue

        xor ECX, ECX                            ;       int Square  = 0;
        mov EBX, 80000000h                      ;       int FindBit = 1 << 31;
        @@:                                     ;       do
                shr EBX, 1                      ;       {       do
                mov EAX, EBX                    ;               {
                or  EAX, ECX                    ;                 FindBit =>> 1;
                mov EDX, EAX                    ;                 int   A = Square | FindBit;
                mul EDX                         ;                 int64 X = A * A;
                cmp EDX, EDI                    ;               }
                ja continue                     ;               whilw ( X.high > HighValue ||
                                                ;                       X.low  > LowValue  );
                cmp EAX, ESI                    ;                  if ( X.low == LowValue  ) break;
                ja continue
                je skip                         ;               Square = Square | FindBit;

                        or ECX, EBX             ;       }
                        or EBX, EBX             ;       while(!FindBit );
                        jnz continue
        @@:
        mov EAX, ECX
        or  EAX, EBX                            ;       return ( Square | FindBit );
        ret
endp    


Last edited by FoXx on 17 Dec 2025, 15:18; edited 1 time in total
Post 17 Dec 2025, 08:34
View user's profile Send private message Reply with quote
FoXx



Joined: 25 Feb 2025
Posts: 33
FoXx 17 Dec 2025, 09:28
This is a very fast square root search.The maximum number of attempts is equal to the number of bits in the number.


Last edited by FoXx on 17 Dec 2025, 16:48; edited 1 time in total
Post 17 Dec 2025, 09:28
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4330
Location: vpcmpistri
bitRAKE 17 Dec 2025, 14:06
If you're interested in performance: MUL is definitely not needed in the loop, two branches are sufficient. I haven't extended this to 128-bit, but I did write a general implementation for fasmg (probably the easiest to understand the algorithm).

_________________
¯\(°_o)/¯ AI may [not] have aided with the above reply.
Post 17 Dec 2025, 14:06
View user's profile Send private message Visit poster's website Reply with quote
FoXx



Joined: 25 Feb 2025
Posts: 33
FoXx 17 Dec 2025, 17:05
bitRAKE wrote:
MUL is definitely not needed in the loop...
The processor architecture multiplies numbers by shifting and adding. These are the processor's internal cycles. Completely replacing the multiplication operation with a series of additions will not speed up the process. Any jumps in the multiplication cycle will cause the processor to crash, as the cache memory becomes breaked..
Post 17 Dec 2025, 17:05
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20798
Location: In your JS exploiting you and your system
revolution 17 Dec 2025, 18:25
Some notes on the stated cycles counts above (now removed?):

Measurement is the only way to know how code will perform. Predicting performance from reading the code is not possible these days. Run the code, measure the result, compare A to B and pick the one that runs best on the system under test. Different systems can give different results, so test each system separately.

CPUs built today no longer use naive shift+add to multiply, they use much more capable circuits that can complete a full 64x64->128 is just a few cycles.
Post 17 Dec 2025, 18:25
View user's profile Send private message Visit poster's website Reply with quote
FoXx



Joined: 25 Feb 2025
Posts: 33
FoXx 18 Dec 2025, 07:42
revolution wrote:
CPUs built today no longer use naive shift+add to multiply, they use much more capable circuits that can complete a full 64x64->128 is just a few cycles.
Modern architecture accelerates processor operation through the use of pipelines. The fundamental principle of numerical multiplication remains unchanged. The speed of multiplication depends on the number of bits in the number. Because:
    X += 1 * X = X + X
    X += 0 * X = X
Code:
;       mov EAX, X      ;       X * Y
;       mov EBX, Y

        mov ERX, EAX    ;       system register
        xor EAX, EAX
        mov EDX, EAX
        mov ECX, 32     ;       register size
@@:
        shr EBX, 1
        jc  skip

                add EAX, ERX
                adc EDX, 0
@@:
        shl EAX, 1
        rcl EDX, 1

        loop continue

;       EDX:EAX = result    

revolution wrote:
Some notes on the stated cycles counts above (now removed?)
I estimated the number of computational cycles using the "triangle" method. This method searches nodes in ascending order. It's similar to sorting.

However, I use the reverse method (top-down). When a lower value is found, we set a bit. This position becomes the start of the next search. This is similar to quicksearch. This method eliminates upward searches. The maximum number of cycles will not exceed the number of bits in the variable minus 1.


Last edited by FoXx on 18 Dec 2025, 07:52; edited 1 time in total
Post 18 Dec 2025, 07:42
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20798
Location: In your JS exploiting you and your system
revolution 18 Dec 2025, 07:51
FoXx wrote:
Modern architecture accelerates processor operation through the use of pipelines. The fundamental principle of numerical multiplication remains unchanged. The speed of multiplication depends on the number of bits in the number.
It is true that pipelines are used to increase throughput, but pipelines can't improve latency. Many of the latest CPUs can compute a full length multiply with a latency of just a few cycles. Then on top of that, pipelines are used to get more throughput, so multiple multiplies can be in-flight at once. In some CPUs the MUL throughput can be more than one-per-cycle, with latencies of 3 or 4 cycles.

Modern multiplication circuits are very advanced, and it is fascinating to learn about how they work internally to get such amazing performance.

Some of the low power in-order Cortex-M CPUs can do a full multiply in a single cycle!
Post 18 Dec 2025, 07:51
View user's profile Send private message Visit poster's website Reply with quote
FoXx



Joined: 25 Feb 2025
Posts: 33
FoXx 18 Dec 2025, 08:13
revolution, I completely agree with you.
The dawn of the digital camera era led to a race to speed up the JPEG algorithm. The hardware was slow, and various companies paid big money to develop fast matrix multiplication algorithms. I can't remember now, but Canon probably used only 29 multiplications and ??? additions for an 8x8 matrix. Those were the good old days.

I had experience writing a procedure for fast exponentiation. My ingenuity wasn't enough, so I had to learn a standard fast exponentiation algorithm. After understanding the mathematical formalism, I was able to write it.
Quote:
Some of the low power in-order Cortex-M CPUs can do a full multiply in a single cycle!
As I understand it, we are talking about a microcontroller?
Post 18 Dec 2025, 08:13
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.