flat assembler
Message board for the users of flat assembler.
![]() |
Author |
|
AsmGuru62 01 Jan 2012, 11:46
You can also multiply with LEA:
Code: ; ; EAX MUL 5 ; lea eax, [eax*4 + eax] ; ; EAX MUL 3 ; lea eax, [eax*2 + eax] |
|||
![]() |
|
typedef 01 Jan 2012, 12:25
Try the multiplication by the size of a register Tomasz quoted someone about. I can't remember the actual thread but it's recent.
happy new year |
|||
![]() |
|
NanoBytes 01 Jan 2012, 18:21
I am looking for the fastest way to multiply, is your LEA command faster than the MUL instruction?
|
|||
![]() |
|
NanoBytes 01 Jan 2012, 18:22
And if so, is there a corresponding way to divide?
|
|||
![]() |
|
AsmGuru62 01 Jan 2012, 19:50
LEA is faster, but how to divide - no idea.
Tell me why exactly you optimizing? Is your code running slower? -- I mean noticeably slower. |
|||
![]() |
|
NanoBytes 02 Jan 2012, 04:08
No see, I am trying to make a compiler piggybacking of of fasm, and I figured that I need the fastest code that I could get to be credible, but it doesn't matter because the LEA approach only multiples by constants, and you usually arnt going to be seeing many of those in an upper level language.
|
|||
![]() |
|
typedef 02 Jan 2012, 04:56
Are you trying to link with FASM produced object files?
|
|||
![]() |
|
cod3b453 02 Jan 2012, 19:05
It would be easier to answer with more information.
In general the best way to multiply is mul but when I say mul this could be GPR mul, FPU fmul or SSE (v)pmul according to your requirements. When small constants or powers of 2 are used either lea or shl/psll/pshuf can be used for faster results. The same is true for division, in general the div instructions are the best but can be replaced by shr/psrl/pshuf for powers of 2 or reciprocal multiplication. |
|||
![]() |
|
NanoBytes 02 Jan 2012, 20:18
Code: MOV EAX, 0007 ; Load up registers MOV EBX, 0005 PUSH EBX PUSH ECX PUSH EDX XOR ECX,ECX CMP EBX, 0 JZ FIN MORE: MOV EDX,EBX AND EDX, 0001 ; Is lowest bit 1? JZ NOADD ; No, do not add. ADD ECX,EAX NOADD: SHL EAX,1 ; Shift AX left (double). SHR EBX,1 ; Shift BX right (integer halve, next bit). JNZ MORE ; Keep going until no more bits in BX. FIN: MOV EAX,ECX POP EDX ; Restore registers and return. POP ECX POP EBX ; EAX Holds the value This is an example of a "shift and add" multiplication, and I wanted to know if it is faster than MUL I can slightly modify it to do division so I wanted to know if this is productive or destructive _________________ He is no fool who gives what he cannot keep to gain what he cannot loose. |
|||
![]() |
|
Xorpd! 02 Jan 2012, 22:59
OK, from your code we can tell that you're a newbie, but that's fine: everyone has to start somewhere.
About the newbie stuff: I see the sequence MOV EDX, EBX AND EDX, 0001 JZ NOADD in a context where EDX is subsequently discarded. In ISAs with flags, there is a TEST instruction which performs an AND, discarding the results but setting the flags appropriately. Thus TEST EBX, 0001 JZ NOADD would have the same effect but save you an instruction. In ISAs without flags, obviously there can be no TEST but Alpha had a conditional branch according to the status of the LSB of any register already. Your inner loop contains a conditional branch so that assuming EBX can take on different values going into your code sequence there will likely be one or more branch mispredictions, each of which is more expensive in both a latency or throughput sense than an IMUL instruction. You could fix this with CMOVxx instructions: MORE: TEST EBX, 0001 LEA EDX, [ECX+EAX] CMOVNZ ECX, EDX But even so the code will be much more expensive than IMUL. For one thing, if all values of EBX aren't within a factor of two of each other, the loop will have nonconstant trip count and so will lead to a branch misprediction at the end. Also the hradware multiplier on current x86 processors is really fast: Agner Fog's instruction_tables.pdf list latency 3 clocks, throughput 1 per clock for IMUL r32, r32. So my recommendation is that you download Agner Fog's optimization manuals. Also think about what it means for an instruction sequence to be 'fast'. For a sequential processor like a 486 there is not much ambiguity, but for pipelined processors there is latency (the number of clock cycles it takes to execute an instruction sequence) and throughput (the number of instruction sequences that can be issued in a given time span). It can be the case that an instruction sequence can have long latency but execute in otherwise unused ports, thus making it effectively faster that a lower latency sequence that fights other useful instructions for ports. But don't be discouraged about making wrong guesses the first time out. If you spend some time working with assembly language code you will get a better vision for how the processor works and after a while you will see that people who never did any assembly language programming have very limited insight regarding low-level optimization. |
|||
![]() |
|
< Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.