flat assembler
Message board for the users of flat assembler.

Which instructions are slow?

fpissarra



Joined: 10 Apr 2019
Posts: 64
fpissarra 12 Oct 2019, 13:50
Anyway... You should avoid using the INC, DEC, XLAT and LOOP instructions on processors after the 486.

INC or DEC: ADD and SUB are 1 cycle faster (INC/DEC need to read-modify-write the RFLAGS to preserve CF);
XLAT: MOV AL,[RBX] is faster and you can use ANY GPR as a pointer (not only RBX);
LOOP: SUB RCX,1/JNZ is faster (but affects CF).

This applies to 32 bits too...
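For reference, a minimal sketch of the pairs being compared, written for fasm in 64-bit mode. The routine names and the choice of RSI as the table pointer are only illustrative assumptions. Note that XLAT reads from [RBX+AL], so a like-for-like replacement needs the index zero-extended first; a plain MOV AL,[RBX] only matches when the index is zero.
Code:
use64

inc_step:
        inc     rax                 ; RAX = RAX+1, CF keeps its old value
        ret
add_step:
        add     rax,1               ; RAX = RAX+1, CF is overwritten
        ret

xlat_lookup:
        xlatb                       ; AL = [RBX+AL], the pointer must be RBX
        ret
mov_lookup:
        movzx   eax,al              ; zero-extend the index into RAX
        mov     al,[rsi+rax]        ; same lookup, any GPR can hold the table address
        ret

; both countdown routines assume RCX > 0 on entry
loop_countdown:
  .again:
        loop    .again              ; RCX = RCX-1, jump while RCX <> 0, no flags touched
        ret
sub_countdown:
  .again:
        sub     rcx,1               ; RCX = RCX-1, CF and the other arithmetic flags are rewritten
        jnz     .again
        ret
Which side of each pair is actually faster is something this listing does not show; it only lays out what each replacement has to do to stay equivalent.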
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 12 Oct 2019, 13:57
It is very tricky to statically determine the execution times of code. Different CPUs will exhibit different results. Ideally these things should be tested to know if they have any good or bad effects in any specific situation.
fpissarra



Joined: 10 Apr 2019
Posts: 64
fpissarra 12 Oct 2019, 14:45
I agree, but these 4 instructions (and their derivatives like LOOPZ, LOOPNZ and XLATB) are guaranteed to create slower code. Not a single high-level language compiler has used them since the 486 became popular... This is especially true on modern architectures.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 12 Oct 2019, 15:10
fpissarra wrote:
... but these 4 instructions (and their derivatives like LOOPZ, LOOPNZ and XLATB) are guaranteed to create slower code
Well, that depends. If a longer encoding replacement instruction is causing cache thrashing then any guarantees won't be valid anymore.

Also, sometimes we might want to preserve the carry flag in a loop, so LOOP can be an excellent way to avoid trashing it. A replacement sequence that saves and restores the flag could end up worse. Since most HLL compilers never care about flags being carried over to the top of a loop, they never have this problem.

It all depends upon what you are doing.
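A minimal sketch of the carry case, assuming 32-bit code and a hypothetical routine bignum_add that adds the ECX-dword number at ESI into the one at EDI (ECX > 0 on entry). The carry produced by each ADC has to survive until the next iteration, so nothing in the loop may rewrite CF.
Code:
use32

bignum_add:
        clc                         ; no carry into the lowest dword
  .next:
        mov     eax,[esi]
        adc     [edi],eax           ; consumes CF and produces the next one
        lea     esi,[esi+4]         ; LEA advances the pointers without touching flags
        lea     edi,[edi+4]
        loop    .next               ; ECX = ECX-1, jump while ECX <> 0, CF untouched
        ret                         ; CF = carry out of the highest dword
DEC ECX followed by JNZ would also leave CF alone here, while SUB ECX,1 would overwrite it and force a save and restore around the counter update. Whether LOOP or DEC/JNZ is quicker in this loop is again something to measure on the target CPU.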
Furs



Joined: 04 Mar 2016
Posts: 2493
Furs 12 Oct 2019, 15:13
fpissarra wrote:
Anyway... You should avoid using the INC, DEC, XLAT and LOOP instructions on processors after the 486.

INC or DEC: ADD and SUB are 1 cycle faster (INC/DEC need to read-modify-write the RFLAGS to preserve CF);
XLAT: MOV AL,[RBX] is faster and you can use ANY GPR as a pointer (not only RBX);
LOOP: SUB RCX,1/JNZ is faster (but affects CF).

This applies to 32 bits too...
This is false, INC/DEC are emitted by any modern compiler. Only old Pentium CPUs are slower with them, because they didn't separate the flags register.

Well, except GCC which still emits ADD/SUB because of the developers' ego.

INC/DEC can be slower than ADD/SUB if you actually make use of the CF after using them, otherwise no. However in this case, ADD/SUB would yield invalid code so it's not like it's an issue.

INC/DEC are shorter than ADD/SUB, so they are theoretically better when everything else is just as fast. They have a smaller instruction cache footprint.
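To put rough numbers on the size difference, here is a small listing that can be fed to fasm. The byte values in the comments are the standard encodings as far as I know them; comparing against the assembled output is the way to be sure.
Code:
use32
        inc     eax                 ; 40            (1 byte)
        add     eax,1               ; 83 C0 01      (3 bytes)
        dec     ecx                 ; 49            (1 byte)
        sub     ecx,1               ; 83 E9 01      (3 bytes)

use64
        inc     eax                 ; FF C0         (2 bytes, the 1-byte forms are REX prefixes in long mode)
        add     eax,1               ; 83 C0 01      (3 bytes)
        inc     rax                 ; 48 FF C0      (3 bytes)
        add     rax,1               ; 48 83 C0 01   (4 bytes)
So the saving is two bytes per instruction in 32-bit code and one byte in 64-bit code.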
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 12 Oct 2019, 15:23
A lot of current CPUs actually completely separate the carry flag and use flag-renaming (similar to register-renaming) to break the dependencies.

It's quite common for this old advice about inc/dec to still keep turning up today. A lot of old websites and documents are still around.

But, the only way to know for sure is to test it. Test it in your real code. Not some artificial one billion times dedicated loop or something.
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8351
Location: Kraków, Poland
Tomasz Grysztar 12 Oct 2019, 15:34
revolution wrote:
But, the only way to know for sure is to test it. Test it in your real code. Not some artificial one billion times dedicated loop or something.
I would like to stress this point further. Take a look at this quickly patched-together example:
Code:
format PE NX GUI

include 'win32wx.inc'

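macro measured_call proc {
        push    eax ecx edx             ; save registers around the API call
        invoke  GetTickCount            ; eax = start time in ms
        pop     edx ecx
        xchg    eax,[esp]               ; restore eax, leave the start time on the stack
        call    proc
        pushfd                          ; keep the flags returned by proc
        xchg    eax,[esp+4]             ; eax = start time, eax returned by proc goes to the stack
        push    ebx ecx edx
        mov     ebx,eax                 ; ebx is preserved by the stdcall API call
        invoke  GetTickCount
        sub     eax,ebx                 ; eax = elapsed ms
        cinvoke wsprintf,debug_buffer,'%d ms',eax
        invoke  MessageBox,0,debug_buffer,`proc,MB_ICONINFORMATION
        pop     edx ecx ebx
        popfd
        pop     eax                     ; registers and flags are back as proc left them
}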

section '.text' code readable executable

entry $

        measured_call incdec
        measured_call addsub

        invoke  ExitProcess,0

incdec:
        xor     ecx,ecx                 ; ecx = 0, so the loop below runs 2^32 times
  .again:
        inc     ebx
        inc     edx
        inc     eax
        dec     esi
        dec     edi
        dec     ecx
        jnz     .again
        ret

addsub:
        sub     ecx,ecx                 ; ecx = 0, so the loop below runs 2^32 times
  .again:
        add     ebx,1
        add     edx,1
        add     eax,1
        sub     esi,1
        sub     edi,1
        sub     ecx,1
        jnz     .again
        ret

section '.data' data readable writeable

  debug_buffer rb 1000h

section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL',\
          user32,'USER32.DLL'

  include 'api\kernel32.inc'
  include 'api\user32.inc'

section '.reloc' fixups data readable discardable    
When I run it on my i5-6300U, the INC/DEC variant finishes in about 2 seconds, while ADD/SUB requires almost 3. On your machine you may get very different results, depending on many factors (as others have already pointed out). And even on my machine ADD/SUB may turn out to be faster under different circumstances. My artificial example may have proven that INC/DEC can be faster (contrary to your claim), but this does not prove that they would be faster in my actual code, unless I test it. It works in both directions.

On the other hand, the overall size of code can also affect the performance in a testable way.

Oh, and one more example. The LOOP instruction is terribly slow on my CPU, but even if I replace all occurrences of LOOP in the source of fasmg, the assembly times (some of them quite long) remain the same. The difference is too small to even put a dent in them. And by using LOOP there, at least I have smaller code.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 13 Oct 2019, 13:30
Tomasz Grysztar wrote:
Oh, and one more example. The LOOP instruction is terribly slow on my CPU, but even if I replace all occurrences of LOOP in the source of fasmg, the assembly times (some of them quite long) remain the same. The difference is too small to even put a dent in them. And by using LOOP there, at least I have smaller code.
You could be seeing two effects cancel each other out here.

LOOP is slow
DEC/JNZ is fast

But:

LOOP is short
DEC/JNZ is long

So:

The smaller LOOP is more friendly to cache, thus the "slowness" is compensated for by more efficient cache usage.

Or:

The "slowness" of LOOP is adding 1 million cycles to the run, and the cache is not more effective at all. But 1 million cycles is basically unmeasurable over runs of many tens of seconds for a modern CPU.

Or:

Something else completely. Who knows. :)

Anyhow, just a guess. I don't know the actual details of your CPU or the code. But certainly these things are never as simple as they might first appear.
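For the "short" versus "long" part, a small listing for fasm in 32-bit mode. The label names are only illustrative, and the sizes in the comments assume the short (rel8) jump forms, which is what these nearby targets should get.
Code:
use32

with_loop:
        loop    with_loop           ; E2 rel8                   2 bytes

with_dec:
        dec     ecx                 ; 49
        jnz     with_dec            ; 75 rel8                   3 bytes for the pair

with_sub:
        sub     ecx,1               ; 83 E9 01
        jnz     with_sub            ; 75 rel8                   5 bytes for the pair
That is one to three bytes saved per loop, so the cache effect would need many such loops in hot code before it could outweigh anything.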
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8351
Location: Kraków, Poland
Tomasz Grysztar 13 Oct 2019, 13:58
revolution wrote:
Anyhow, just a guess. I don't know the actual details of your CPU or the code. But certainly these things are never as simple as they might first appear.
Yeah, pretty much. I agree with everything you just said, and yet I feel we know just as much as we knew before. :D
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 14 Oct 2019, 09:26
Tomasz Grysztar wrote:
... I feel we know just as much as we knew before. :D
We have learned that there is nothing to learn. :P

Perhaps we can develop a standard answer to all the posts asking if some code will be faster:
"Dunno. Test it!".
guignol



Joined: 06 Dec 2008
Posts: 763
guignol 14 Oct 2019, 10:09
and that revolution is a dummy with a smoke grenade :lol:
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
