flat assembler
Message board for the users of flat assembler.

Index > Windows > Is the "shift and add" faster than the MUL instruc

Author
Thread Post new topic Reply to topic
NanoBytes



Joined: 02 Jun 2011
Posts: 57
Location: Iowa, United States
NanoBytes 31 Dec 2011, 21:09
I am trying to optimize my program and I wanted to know if replacing my mul instructions will improve speed/memory.

The same question goes for division

_________________
He is no fool who gives what he cannot
keep to gain what he cannot loose.
Post 31 Dec 2011, 21:09
View user's profile Send private message Send e-mail Visit poster's website Reply with quote
AsmGuru62



Joined: 28 Jan 2004
Posts: 1694
Location: Toronto, Canada
AsmGuru62 01 Jan 2012, 11:46
You can also multiply with LEA:
Code:
;
; EAX MUL 5
;
lea      eax, [eax*4 + eax]
;
; EAX MUL 3
;
lea      eax, [eax*2 + eax]
    
Post 01 Jan 2012, 11:46
View user's profile Send private message Send e-mail Reply with quote
typedef



Joined: 25 Jul 2010
Posts: 2909
Location: 0x77760000
typedef 01 Jan 2012, 12:25
Try the multiplication by the size of a register Tomasz quoted someone about. I can't remember the actual thread but it's recent.

happy new year
Post 01 Jan 2012, 12:25
View user's profile Send private message Reply with quote
NanoBytes



Joined: 02 Jun 2011
Posts: 57
Location: Iowa, United States
NanoBytes 01 Jan 2012, 18:21
I am looking for the fastest way to multiply, is your LEA command faster than the MUL instruction?
Post 01 Jan 2012, 18:21
View user's profile Send private message Send e-mail Visit poster's website Reply with quote
NanoBytes



Joined: 02 Jun 2011
Posts: 57
Location: Iowa, United States
NanoBytes 01 Jan 2012, 18:22
And if so, is there a corresponding way to divide?
Post 01 Jan 2012, 18:22
View user's profile Send private message Send e-mail Visit poster's website Reply with quote
AsmGuru62



Joined: 28 Jan 2004
Posts: 1694
Location: Toronto, Canada
AsmGuru62 01 Jan 2012, 19:50
LEA is faster, but how to divide - no idea.
Tell me why exactly you optimizing?
Is your code running slower? -- I mean noticeably slower.
Post 01 Jan 2012, 19:50
View user's profile Send private message Send e-mail Reply with quote
NanoBytes



Joined: 02 Jun 2011
Posts: 57
Location: Iowa, United States
NanoBytes 02 Jan 2012, 04:08
No see, I am trying to make a compiler piggybacking of of fasm, and I figured that I need the fastest code that I could get to be credible, but it doesn't matter because the LEA approach only multiples by constants, and you usually arnt going to be seeing many of those in an upper level language.
Post 02 Jan 2012, 04:08
View user's profile Send private message Send e-mail Visit poster's website Reply with quote
typedef



Joined: 25 Jul 2010
Posts: 2909
Location: 0x77760000
typedef 02 Jan 2012, 04:56
Are you trying to link with FASM produced object files?
Post 02 Jan 2012, 04:56
View user's profile Send private message Reply with quote
cod3b453



Joined: 25 Aug 2004
Posts: 618
cod3b453 02 Jan 2012, 19:05
It would be easier to answer with more information.

In general the best way to multiply is mul but when I say mul this could be GPR mul, FPU fmul or SSE (v)pmul according to your requirements. When small constants or powers of 2 are used either lea or shl/psll/pshuf can be used for faster results.

The same is true for division, in general the div instructions are the best but can be replaced by shr/psrl/pshuf for powers of 2 or reciprocal multiplication.
Post 02 Jan 2012, 19:05
View user's profile Send private message Reply with quote
NanoBytes



Joined: 02 Jun 2011
Posts: 57
Location: Iowa, United States
NanoBytes 02 Jan 2012, 20:18
Code:
MOV    EAX, 0007    ; Load up registers
        MOV    EBX, 0005

        PUSH   EBX
        PUSH   ECX
        PUSH   EDX

        XOR    ECX,ECX

        CMP    EBX, 0
        JZ     FIN

MORE:   MOV    EDX,EBX

        AND    EDX, 0001    ; Is lowest bit 1?
        JZ     NOADD       ; No, do not add.
        ADD    ECX,EAX

NOADD:  SHL    EAX,1        ; Shift AX left (double).
        SHR    EBX,1        ; Shift BX right (integer halve, next bit).
        JNZ    MORE        ; Keep going until no more bits in BX.


FIN:    MOV    EAX,ECX

        POP    EDX          ; Restore registers and return.
        POP    ECX
        POP    EBX   
        ; EAX Holds the value
    

This is an example of a "shift and add" multiplication, and I wanted to know if it is faster than MUL

I can slightly modify it to do division so I wanted to know if this is productive or destructive

_________________
He is no fool who gives what he cannot
keep to gain what he cannot loose.
Post 02 Jan 2012, 20:18
View user's profile Send private message Send e-mail Visit poster's website Reply with quote
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 02 Jan 2012, 22:59
OK, from your code we can tell that you're a newbie, but that's fine: everyone has to start somewhere.
About the newbie stuff: I see the sequence
MOV EDX, EBX
AND EDX, 0001
JZ NOADD
in a context where EDX is subsequently discarded. In ISAs with flags, there is a TEST instruction which performs an AND, discarding the results but setting the flags appropriately. Thus
TEST EBX, 0001
JZ NOADD
would have the same effect but save you an instruction. In ISAs without flags, obviously there can be no TEST but Alpha had a conditional branch according to the status of the LSB of any register already.

Your inner loop contains a conditional branch so that assuming EBX can take on different values going into your code sequence there will likely be one or more branch mispredictions, each of which is more expensive in both a latency or throughput sense than an IMUL instruction. You could fix this with CMOVxx instructions:

MORE:
TEST EBX, 0001
LEA EDX, [ECX+EAX]
CMOVNZ ECX, EDX

But even so the code will be much more expensive than IMUL. For one thing, if all values of EBX aren't within a factor of two of each other, the loop will have nonconstant trip count and so will lead to a branch misprediction at the end. Also the hradware multiplier on current x86 processors is really fast: Agner Fog's instruction_tables.pdf list latency 3 clocks, throughput 1 per clock for IMUL r32, r32.

So my recommendation is that you download Agner Fog's optimization manuals. Also think about what it means for an instruction sequence to be 'fast'. For a sequential processor like a 486 there is not much ambiguity, but for pipelined processors there is latency (the number of clock cycles it takes to execute an instruction sequence) and throughput (the number of instruction sequences that can be issued in a given time span). It can be the case that an instruction sequence can have long latency but execute in otherwise unused ports, thus making it effectively faster that a lower latency sequence that fights other useful instructions for ports.

But don't be discouraged about making wrong guesses the first time out. If you spend some time working with assembly language code you will get a better vision for how the processor works and after a while you will see that people who never did any assembly language programming have very limited insight regarding low-level optimization.
Post 02 Jan 2012, 22:59
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.