flat assembler
Message board for the users of flat assembler.
Which instructions are slow?
revolution 12 Oct 2019, 13:57
fpissarra wrote: Anyway... You should avoid using the INC, DEC, XLAT and LOOP instructions on processors after the 486.
fpissarra 12 Oct 2019, 14:45
revolution wrote:
I agree, but these 4 instructions (and their derivatives like LOOPZ, LOOPNZ and XLATB) are guaranteed to create slower code. Not a single high-level language compiler has used them since the 486 became popular... This is especially true on modern architectures.
revolution 12 Oct 2019, 15:10
fpissarra wrote: [...] but these 4 instructions (and their derivatives like LOOPZ, LOOPNZ and XLATB) are guaranteed to create slower code
Also, sometimes we might want to preserve the carry flag in a loop, so LOOP can be an excellent way to avoid trashing it. A replacement sequence to save and restore the flag could end up worse. Since most HLL compilers never care about flags being carried to the top of a loop, they would never have this problem. It all depends upon what you are doing.
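To illustrate the carry-flag point, here is a minimal sketch of a multi-dword addition in which the carry has to survive from one iteration to the next. The label names and the register assignment (ESI = source, EDI = destination, ECX = number of dwords, assumed greater than zero) are assumptions made for this example, not code from the thread.

```asm
use32

; [edi] += [esi], both ECX dwords long (ECX > 0 assumed)
add_bignum:
        clc                     ; start with no carry
    .next:
        mov     eax,[esi]
        adc     [edi],eax       ; add with the carry from the previous dword
        lea     esi,[esi+4]     ; LEA advances the pointers without touching flags
        lea     edi,[edi+4]
        loop    .next           ; decrements ECX and branches, no flags modified
        ret                     ; CF = carry out of the highest dword
```

Replacing LOOP with `sub ecx,1` / `jnz .next` would overwrite CF and break the ADC chain, which is the situation described above.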
Furs 12 Oct 2019, 15:13
fpissarra wrote: Anyway... You should avoid using the INC, DEC, XLAT and LOOP instructions on processors after the 486.
Well, except GCC, which still emits ADD/SUB because of the developers' ego. INC/DEC can be slower than ADD/SUB if you actually make use of CF after using them, otherwise not. However, in that case ADD/SUB would yield invalid code, since they overwrite CF, so it's not really an issue. INC/DEC are shorter than ADD/SUB, so they are theoretically better provided everything else is just as fast: they have a smaller instruction cache footprint.
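For reference, a small sketch of the size difference in 32-bit code. The choice of registers is arbitrary; the byte values in the comments are the standard encodings of these forms.

```asm
use32                           ; 32-bit code, where the one-byte INC/DEC forms exist

        inc     eax             ; 40          (1 byte)
        add     eax,1           ; 83 C0 01    (3 bytes)
        dec     esi             ; 4E          (1 byte)
        sub     esi,1           ; 83 EE 01    (3 bytes)
```

Note that in 64-bit mode the opcodes 40h-4Fh are REX prefixes, so INC/DEC on a register need the two-byte FF-based form there and the saving is one byte rather than two.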
revolution 12 Oct 2019, 15:23
A lot of current CPUs actually completely separate the carry flag and use flag-renaming (similar to register-renaming) to break the dependencies.
It's quite common for this old advice about INC/DEC to keep turning up today. A lot of old websites and documents are still around. But the only way to know for sure is to test it. Test it in your real code, not in some artificial loop that runs one billion times.
Tomasz Grysztar 12 Oct 2019, 15:34
revolution wrote: But the only way to know for sure is to test it. Test it in your real code, not in some artificial loop that runs one billion times.
Code:
format PE NX GUI
include 'win32wx.inc'

; Calls proc and reports the elapsed time (in ms) in a message box,
; preserving the registers and flags that proc leaves behind.
macro measured_call proc
{
        push    eax ecx edx
        invoke  GetTickCount            ; eax = start time
        pop     edx ecx
        xchg    eax,[esp]               ; restore eax, keep start time on the stack
        call    proc
        pushfd
        xchg    eax,[esp+4]             ; eax = start time, save eax returned by proc
        push    ebx ecx edx
        mov     ebx,eax
        invoke  GetTickCount
        sub     eax,ebx                 ; eax = elapsed milliseconds
        cinvoke wsprintf,debug_buffer,'%d ms',eax
        invoke  MessageBox,0,debug_buffer,`proc,MB_ICONINFORMATION
        pop     edx ecx ebx
        popfd
        pop     eax
}

section '.text' code readable executable

  entry $

        measured_call incdec
        measured_call addsub

        invoke  ExitProcess,0

  incdec:
        xor     ecx,ecx                 ; ecx = 0, so the loop runs 2^32 times
    .again:
        inc     ebx
        inc     edx
        inc     eax
        dec     esi
        dec     edi
        dec     ecx
        jnz     .again
        ret

  addsub:
        sub     ecx,ecx                 ; ecx = 0, so the loop runs 2^32 times
    .again:
        add     ebx,1
        add     edx,1
        add     eax,1
        sub     esi,1
        sub     edi,1
        sub     ecx,1
        jnz     .again
        ret

section '.data' data readable writeable

  debug_buffer rb 1000h

section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL',\
          user32,'USER32.DLL'

  include 'api\kernel32.inc'
  include 'api\user32.inc'

section '.reloc' fixups data readable discardable

On the other hand, the overall size of code can also affect the performance in a testable way.
Oh, and one more example. The LOOP instruction is terribly slow on my CPU, but even if I replace all occurrences of LOOP in the source of fasmg, the assembly times (some of them quite long) remain the same. It makes too small a difference to even put a dent in them. And by using LOOP there, at least I have smaller code.
revolution 13 Oct 2019, 13:30
Tomasz Grysztar wrote: Oh, and one more example. The LOOP instruction is terribly slow on my CPU, but even if I replace all occurrences of LOOP in the source of fasmg, the assembly times (some of them quite long) remain the same. It makes too small a difference to even put a dent in them. And by using LOOP there, at least I have smaller code.
LOOP is slow
DEC/JNZ is fast
But:
LOOP is short
DEC/JNZ is long
So:
The smaller LOOP is friendlier to the cache, so the "slowness" is compensated for by more efficient cache usage.
Or:
The "slowness" of LOOP adds 1 million cycles to the run, and the cache is no more effective at all. But 1 million cycles is basically unmeasurable over runs of many tens of seconds on a modern CPU.
Or:
Something else completely. Who knows.
Anyhow, just a guess. I don't know the actual details of your CPU or the code. But certainly these things are never as simple as they might first appear.
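The "short versus long" part can be made concrete with a small sketch for 32-bit code. The labels and the iteration count are made up for this example; the byte values in the comments are the standard short-branch encodings.

```asm
use32

        mov     ecx,1000
    with_loop:
        loop    with_loop       ; E2 FE       (2 bytes)

        mov     ecx,1000
    with_dec_jnz:
        dec     ecx             ; 49          (1 byte)
        jnz     with_dec_jnz    ; 75 FD       (2 bytes, 3 bytes for the pair)
```

As for the "unmeasurable" part: assuming a clock of 3 GHz (a figure chosen only for illustration), 1 million cycles is about 0.33 ms, which is roughly 0.001% of a 30-second run.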
Tomasz Grysztar 13 Oct 2019, 13:58
revolution wrote: Anyhow, just a guess. I don't know the actual details of your CPU or the code. But certainly these things are never as simple as they might first appear.
revolution 14 Oct 2019, 09:26
Tomasz Grysztar wrote: ...
I feel we know just as much as we knew before. Perhaps we can develop a standard answer to all the posts asking whether some code will be faster: "Dunno. Test it!".
guignol 14 Oct 2019, 10:09
and that revolution is a dummy with a smoky grenade
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.