flat assembler
Message board for the users of flat assembler.
Azu 22 May 2009, 04:20
In situations where either will work and they're the same size, is one better than the other? And if so, why?
E.g. add eax,4 vs lea eax,[eax+4]
manfred 22 May 2009, 05:50
Better in what respect?
_________________ Sorry for my English...
sinsi 22 May 2009, 06:13
Same clocks, but lea won't change flags.
Azu 22 May 2009, 12:51
I mean, which one is better optimized, i.e. which one can run in parallel with other instructions more often?
bitRAKE 22 May 2009, 17:14
One of the great things about FASM is the ability to redefine instructions. So, to simplify testing, ADD could be used and then an ADD macro (changing ADD to LEA where applicable) could be conditionally included to compare the results.
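As a rough illustration of that idea, here is a minimal sketch in fasm 1 syntax. It is only an assumption about how such a macro might look, not code from this thread. It rewrites just the "32-bit register plus constant" form, and it is only safe where nothing after the ADD reads the flags, since LEA leaves them untouched. The register list (and leaving esp out of it) is a choice made for this example.

```asm
; Hypothetical ADD override for A/B testing (fasm 1 syntax).
; Inside a fasm 1 macro, its own name refers to the previous meaning,
; so "add" in the body below is the real instruction, not a recursive call.
macro add dest, src
{
    if dest in <eax,ebx,ecx,edx,esi,edi,ebp> & src eqtype 0
        lea dest, [dest+src]    ; 32-bit register + constant: use LEA
    else
        add dest, src           ; everything else: the original ADD
    end if
}
```

With the macro in effect, `add eax, 4` should assemble as `lea eax, [eax+4]`, while `add eax, ebx` or `add dword [x], 1` still assemble as a plain ADD. Leaving the macro (or the include file that carries it) out gives the ADD build to compare against.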
Azu 23 May 2009, 01:35
Thank you
Lea is faster when I test it. I don't have a bunch of CPUs to test on though, so I was wondering which is faster on average.
revolution 23 May 2009, 02:59
I don't think it is correct to ask "on average". What is an average computer?
Azu 23 May 2009, 03:03
Like if lea is faster in 30% of the cases and add is faster in 70% of them, then add is faster on average.
revolution 23 May 2009, 03:09
But there is no average case in general. It all depends upon your code, your algorithms and your computer that you are testing everything on.
Last edited by revolution on 23 May 2009, 14:21; edited 1 time in total
pal 23 May 2009, 13:50
Why not benchmark a few code snippets where you change which instruction you use? E.g. execute the same arbitrary statement a million times or so and time the difference using a high-precision timer. If it depends on the situation, then you would need a variety of different snippets, I guess.
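One possible high-precision timer on x86 is the time-stamp counter. The sketch below is an assumption about how such a measurement could look in fasm, not code from this thread; `tsc_time` and `tsc_start` are hypothetical names. It counts TSC ticks rather than milliseconds, and on some CPUs the tick rate changes with power management, so treat the numbers as relative.

```asm
; Hypothetical helper: leaves the elapsed TSC ticks for `func` in edx:eax.
; Note that cpuid clobbers eax, ebx, ecx and edx.
macro tsc_time func
{
    xor eax, eax
    cpuid                       ; serialize before the first read
    rdtsc                       ; edx:eax = start tick count
    mov [tsc_start], eax
    mov [tsc_start+4], edx
    call func
    xor eax, eax
    cpuid                       ; serialize again before the second read
    rdtsc
    sub eax, [tsc_start]
    sbb edx, [tsc_start+4]      ; edx:eax = elapsed ticks (64-bit subtract)
}

tsc_start dd 0, 0               ; low and high dword of the start value
```

Running each variant several times and keeping the smallest count is a common way to reduce noise from interrupts and task switches.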
Borsuc 23 May 2009, 22:53
Don't they result in the same micro-ops anyway, except for the flags?
LocoDelAssembly 23 May 2009, 22:58
It depends on the registers, actually. I don't have the AMD manuals with me now, but if I recall correctly "lea reg, [reg*1+reg]" incurs two cycles of latency. So even if you don't write a scaled index, one can still be there (multiplying only by 1) because the encoding of the address requires the SIB byte.
When the above does not apply (i.e., the reg isn't implicitly scaled by one), perhaps the code could perform faster because you free up an ALU, since the address is calculated by an AGU?
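To make the encoding point concrete, here is a small sketch of the 32-bit forms discussed in this thread with the bytes they are expected to produce. The byte values assume the assembler picks the first register as the base; with the other choice the SIB byte differs, but the length stays the same.

```asm
use32
add eax, 4              ; 83 C0 04     opcode, ModRM, imm8
lea eax, [eax+4]        ; 8D 40 04     opcode, ModRM, disp8 (no SIB needed)
lea eax, [eax+ebx]      ; 8D 04 18     opcode, ModRM, SIB (two registers)
lea ebx, [ebx+eax+1]    ; 8D 5C 03 01  opcode, ModRM, SIB, disp8
```

The first two are the same size, as in the original question, and only the two-register forms carry the SIB byte mentioned above.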
LocoDelAssembly 24 May 2009, 19:03
OK, what's happening here?
Code:
    format PE console 4.0
    entry start
    include 'win32ax.inc'

    macro tester func
    {
      local ..loop
      invoke Sleep, 1000
      xor eax, eax
      cpuid
      call [GetTickCount]
      mov [timestart], eax
      mov ebx, $80000000
      call func
      ; Serialize
      xor eax, eax
      cpuid
      call [GetTickCount]
      sub eax, [timestart]
      push eax
      call @f
      db `func, 0
    @@:
      push fmt
      call [printf]
      add esp, 12
      align 16
    }

    fmt db "%s: ", "%dms", 10, 0
    timestart dd 0

    start:
      invoke GetCurrentProcess
      invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
      invoke GetCurrentThread
      invoke SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL
      tester lea_adder
      tester add_adder
      tester lea_adder_longer_chain
      tester add_adder_longer_chain
      invoke ExitProcess, 0

    lea_adder:
      xor ebx, ebx
      mov eax, 1
      align 16
    .loop:
      lea ebx, [ebx+eax+1]
      imul ecx, ebx, 3
      imul ecx, ebx, 3
      test ecx, ecx
      jnz .loop
      ret

    add_adder:
      xor ebx, ebx
      mov eax, 2
      align 16
    .loop:
      add ebx, eax
      imul ecx, ebx, 3
      imul ecx, ebx, 3
      test ecx, ecx
      jnz .loop
      ret

    add_adder_longer_chain:
      xor ebx, ebx
      mov eax, 2
      align 16
    .loop:
      add ebx, eax
      imul ecx, ebx, 3
      imul edx, ecx, 3
      test edx, edx
      jnz .loop
      ret

    lea_adder_longer_chain:
      xor ebx, ebx
      mov eax, 1
      align 16
    .loop:
      lea ebx, [ebx+eax+1]
      imul ecx, ebx, 3
      imul edx, ecx, 3
      test edx, edx
      jnz .loop
      ret

    align 4
    data import
      library msvcrt, 'msvcrt.dll',\
              kernel32, 'kernel32.dll'
      import msvcrt,\
             printf, 'printf'
      include 'api/kernel32.inc'
    end data
Results:
Code:
    C:\Documents and Settings\Hernan\Escritorio\Assembly>bench.exe
    lea_adder: 3781ms
    add_adder: 5406ms
    lea_adder_longer_chain: 4328ms
    add_adder_longer_chain: 3781ms

    C:\Documents and Settings\Hernan\Escritorio\Assembly>bench.exe
    lea_adder: 3781ms
    add_adder: 5406ms
    lea_adder_longer_chain: 4328ms
    add_adder_longer_chain: 3782ms

    C:\Documents and Settings\Hernan\Escritorio\Assembly>bench.exe
    lea_adder: 3781ms
    add_adder: 5390ms
    lea_adder_longer_chain: 4312ms
    add_adder_longer_chain: 3781ms
(AMD Athlon64 Venice 2.0 GHz)
[edit]Added two more tests[/edit]
Last edited by LocoDelAssembly on 24 May 2009, 20:55; edited 1 time in total
Borsuc 24 May 2009, 20:01
That's really interesting. My guess is that the micro-ops are different?
_________________ Previously known as The_Grey_Beast
Madis731 24 May 2009, 20:39
Actually, Core 2 can schedule three consecutive ADDs in the same time as one LEA. What has happened here is unfortunate micro-op scheduling. LEA always needs to go to port 0, while ADD can go to any of ports 0, 1 or 5. If for whatever reason ADD "chose" port 1 or 5 and forced the IMULs to be scheduled in the following clocks instead, then the loop took one clock longer.
My tests on 65nm Core 2 (T7200) showed:
Code:
    D:\Programs\FASM\Proged\Bench_ADD_LEA>bench
    lea_adder: 2360ms
    add_adder: 2281ms

    D:\Programs\FASM\Proged\Bench_ADD_LEA>bench
    lea_adder: 2235ms
    add_adder: 2234ms

    D:\Programs\FASM\Proged\Bench_ADD_LEA>bench
    lea_adder: 2344ms
    add_adder: 2234ms
If I changed the line LEA EBX,[EBX+EAX+1] to LEA EBX,[EBX+EAX] then the bench showed:
Code:
    D:\Programs\FASM\Proged\Bench_ADD_LEA>bench
    lea_adder: 4375ms
    add_adder: 2235ms

    D:\Programs\FASM\Proged\Bench_ADD_LEA>bench
    lea_adder: 4547ms
    add_adder: 2234ms

    D:\Programs\FASM\Proged\Bench_ADD_LEA>bench
    lea_adder: 4391ms
    add_adder: 2234ms
LocoDelAssembly 24 May 2009, 20:57
I have edited my post, please check.
pal 24 May 2009, 21:20
Intel Core2 Quad CPU 2.40 GHz:
Code:
    J:\My Files\Programming\fasmw16738\Codes>leaaddbench.exe
    lea_adder: 1794ms
    add_adder: 1904ms
    lea_adder_longer_chain: 1872ms
    add_adder_longer_chain: 1856ms

    J:\My Files\Programming\fasmw16738\Codes>leaaddbench.exe
    lea_adder: 1856ms
    add_adder: 1826ms
    lea_adder_longer_chain: 1856ms
    add_adder_longer_chain: 1825ms

    J:\My Files\Programming\fasmw16738\Codes>leaaddbench.exe
    lea_adder: 1856ms
    add_adder: 1841ms
    lea_adder_longer_chain: 1857ms
    add_adder_longer_chain: 1841ms
So they are around the same. I would work out the standard deviation but I ain't that bored.
Gawd damn thing, I had to log in as an administrator and disable my AV to get this to work.
I'll maybe have a test with my PS3 later if I can be bothered (just I'll have to configure it all).
LocoDelAssembly 24 May 2009, 21:55
Seems that Intel does not exhibit the same behavior then.
Madis wrote:
But have you changed the "mov eax, 1" to "mov eax, 2" to keep the comparison fair?
[edit] If I change the lea occurrences to "lea ebx, [ebx+eax]" and the occurrences of "mov eax, 1" to "mov eax, 2", I get these results:
Code:
    C:\Documents and Settings\Hernan\Escritorio\Assembly>bench.exe
    lea_adder: 5406ms
    add_adder: 5406ms
    lea_adder_longer_chain: 3782ms
    add_adder_longer_chain: 3782ms

    C:\Documents and Settings\Hernan\Escritorio\Assembly>bench.exe
    lea_adder: 5406ms
    add_adder: 5406ms
    lea_adder_longer_chain: 3781ms
    add_adder_longer_chain: 3781ms

    C:\Documents and Settings\Hernan\Escritorio\Assembly>bench.exe
    lea_adder: 5391ms
    add_adder: 5391ms
    lea_adder_longer_chain: 3781ms
    add_adder_longer_chain: 3781ms
revolution 25 May 2009, 01:24
All of those tests are artificial. None of those results will help you in a real program.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.