flat assembler
Message board for the users of flat assembler.
A question on performance
revolution 12 Jan 2017, 15:19
system error wrote: That may not be the actual microcode but that shows the amount of WORK (cpu cycles) to actually implement a RET instruction.
system error 12 Jan 2017, 15:39
revolution wrote:
CALL/RET is directly supported by hardware, I agree. But that's not the point. The issue is LOOP versus SUB/JNZ, which takes us to the discussion of simple vs complex instructions. I did time my code (see the previous code). Replacing LOOP with SUB/JNZ yields better clock times over 10,000,000 iterations. (You must have been sleeping then.) My point: the general strategy, if you are aiming for speed, is to minimize complex instructions and replace them with their plain counterparts. You can conduct your own tests on STOSB and LOOP if you want. Nobody is stopping you.
And another point on optimization I'd like to share:
2. Use the registers the way they were originally designed to be used. For example, you should use ECX/RCX for the loop count instead of other registers like R15 or EDI. This yields faster results. Try it.
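For readers following along, here is a minimal sketch of the two loop forms being compared. It is not system error's test code (which does not appear on this page); the label names and the `ITERATIONS` constant are made up for illustration, and the loop bodies are left empty on purpose.

```asm
; Sketch only: two equivalent count-down loops, LOOP vs SUB/JNZ.
format PE64 console
entry start

ITERATIONS = 10000000                   ; iteration count quoted in the thread

section '.text' code readable executable

start:
        sub     rsp, 8*5                ; shadow space, keeps RSP 16-byte aligned

        ; Variant A: LOOP decrements RCX and jumps while RCX <> 0
        mov     ecx, ITERATIONS         ; writing ECX zero-extends into RCX
  .loop_a:
        ; (loop body would go here)
        loop    .loop_a

        ; Variant B: the same control flow spelled out
        mov     ecx, ITERATIONS
  .loop_b:
        ; (loop body would go here)
        sub     ecx, 1                  ; unlike LOOP, this modifies the flags
        jnz     .loop_b

        xor     ecx, ecx                ; exit code 0
        call    [ExitProcess]

section '.idata' import data readable writeable

        dd      0,0,0,RVA kernel_name,RVA kernel_table
        dd      0,0,0,0,0               ; end of import directory

  kernel_table:
        ExitProcess dq RVA _ExitProcess
        dq      0

  kernel_name db 'kernel32.dll',0

  _ExitProcess dw 0
        db      'ExitProcess',0
```

The only functional difference between the two tails is that SUB changes the flags and LOOP does not. Which one is faster depends on the processor, so it has to be measured on the target machine.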
Xorpd! 13 Jan 2017, 05:49
I haven't written any FASM in a while, so I thought I would try to get back in practice. In the original snippets of this thread, the processor wasn't clearly defined and the results were given in terms of wall clock times rather than the more useful processor cycles.
I attempted to remedy the situation by offering code that times empty code and both snippets 5 times each and prints out the results in processor cycles.
Code:
format PE64 console
entry start

section '.text' code readable executable

start:
        sub     rsp, 8*5                ; shadow space + 5th argument slot

; Get Console handle
        mov     ecx, -11                ; STD_OUTPUT_HANDLE
        call    [GetStdHandle]
        mov     [ConsoleHandle], rax

; Write header for logo
        and     qword [rsp+8*4], 0      ; lpReserved = NULL (5th argument lives at [rsp+32])
        mov     r9, _written
        mov     r8, logo_size
        mov     rdx, _logo
        mov     rcx, [ConsoleHandle]
        call    [WriteConsole]

; Get logo
        mov     eax, 80000002h
        cpuid
        mov     [_brand], eax
        mov     [_brand+4], ebx
        mov     [_brand+8], ecx
        mov     [_brand+12], edx
        mov     eax, 80000003h
        cpuid
        mov     [_brand+16], eax
        mov     [_brand+20], ebx
        mov     [_brand+24], ecx
        mov     [_brand+28], edx
        mov     eax, 80000004h
        cpuid
        mov     [_brand+32], eax
        mov     [_brand+36], ebx
        mov     [_brand+40], ecx
        mov     [_brand+44], edx

; Write logo
        mov     rcx, _brand
        call    [lstrlen]
        and     qword [rsp+8*4], 0
        mov     r9, _written
        mov     r8, rax
        mov     rdx, _brand
        mov     rcx, [ConsoleHandle]
        call    [WriteConsole]
        call    newline

; Get family, model, stepping
        mov     eax, 1
        cpuid
        mov     [_fms], eax

; write family (EAX bits 11:8)
        mov     rcx, _brand
        mov     rdx, _family_fmt
        mov     r8d, [_fms]
        shr     r8d, 8
        and     r8d, 0fh
        call    [wsprintf]
        mov     rcx, _brand
        call    [lstrlen]
        and     qword [rsp+8*4], 0
        mov     r9, _written
        mov     r8, rax
        mov     rdx, _brand
        mov     rcx, [ConsoleHandle]
        call    [WriteConsole]

; write extended family (EAX bits 27:20)
        mov     rcx, _brand
        mov     rdx, _extended_family_fmt
        mov     r8d, [_fms]
        shr     r8d, 20
        and     r8d, 0ffh
        call    [wsprintf]
        mov     rcx, _brand
        call    [lstrlen]
        and     qword [rsp+8*4], 0
        mov     r9, _written
        mov     r8, rax
        mov     rdx, _brand
        mov     rcx, [ConsoleHandle]
        call    [WriteConsole]

; write model (EAX bits 7:4)
        mov     rcx, _brand
        mov     rdx, _model_fmt
        mov     r8d, [_fms]
        shr     r8d, 4
        and     r8d, 0fh
        call    [wsprintf]
        mov     rcx, _brand
        call    [lstrlen]
        and     qword [rsp+8*4], 0
        mov     r9, _written
        mov     r8, rax
        mov     rdx, _brand
        mov     rcx, [ConsoleHandle]
        call    [WriteConsole]

; write extended model (EAX bits 19:16)
        mov     rcx, _brand
        mov     rdx, _extended_model_fmt
        mov     r8d, [_fms]
        shr     r8d, 16
        and     r8d, 0fh
        call    [wsprintf]
        mov     rcx, _brand
        call    [lstrlen]
        and     qword [rsp+8*4], 0
        mov     r9, _written
        mov     r8, rax
        mov     rdx, _brand
        mov     rcx, [ConsoleHandle]
        call    [WriteConsole]

; write stepping (EAX bits 3:0, so no shift is needed; the posted run
; used "shr r8d, 8" here, which repeats the family value)
        mov     rcx, _brand
        mov     rdx, _stepping_fmt
        mov     r8d, [_fms]
        and     r8d, 0fh
        call    [wsprintf]
        mov     rcx, _brand
        call    [lstrlen]
        and     qword [rsp+8*4], 0
        mov     r9, _written
        mov     r8, rax
        mov     rdx, _brand
        mov     rcx, [ConsoleHandle]
        call    [WriteConsole]

        mov     ebx, 0                  ; index into _times

; Time successive RDTSCs
empty:
        rdtsc
        mov     dword [_times+8*rbx], eax
        mov     dword [_times+8*rbx+4], edx
        rdtsc
        sub     eax, dword [_times+8*rbx]
        sbb     edx, dword [_times+8*rbx+4]
        mov     dword [_times+8*rbx], eax
        mov     dword [_times+8*rbx+4], edx
        inc     ebx
        cmp     ebx, 5
        jb      empty

; Time with single instruction in loop
op1:
        mov     rax, 3ff0000000000000h  ; 1.0 as a double
        movq    xmm0, rax
        inc     rax                     ; next representable double above 1.0
        movq    xmm1, rax
        rdtsc
        mov     dword [_times+8*rbx], eax
        mov     dword [_times+8*rbx+4], edx
        mov     rcx, 10000000
  .spin:
        mulsd   xmm0, xmm1
        loop    .spin
        rdtsc
        sub     eax, dword [_times+8*rbx]
        sbb     edx, dword [_times+8*rbx+4]
        mov     dword [_times+8*rbx], eax
        mov     dword [_times+8*rbx+4], edx
        inc     ebx
        cmp     ebx, 10
        jb      op1

; Time with 2 instructions in loop
op2:
        mov     rax, 3ff0000000000000h
        movq    xmm0, rax
        inc     rax
        movq    xmm1, rax
        rdtsc
        mov     dword [_times+8*rbx], eax
        mov     dword [_times+8*rbx+4], edx
        mov     rcx, 10000000
  .spin:
        mulsd   xmm0, xmm1
        cvtsd2si rax, xmm0
        loop    .spin
        rdtsc
        sub     eax, dword [_times+8*rbx]
        sbb     edx, dword [_times+8*rbx+4]
        mov     dword [_times+8*rbx], eax
        mov     dword [_times+8*rbx+4], edx
        inc     ebx
        cmp     ebx, 15
        jb      op2

; Write out times for empty code
        and     qword [rsp+8*4], 0
        mov     r9, _written
        mov     r8, empty_size
        mov     rdx, _empty
        mov     rcx, [ConsoleHandle]
        call    [WriteConsole]
        mov     ebx, 0
write_empty:
        mov     rcx, _brand
        mov     rdx, _decimal_fmt
        mov     r8d, dword [_times+8*rbx]
        call    [wsprintf]
        mov     rcx, _brand
        call    [lstrlen]
        and     qword [rsp+8*4], 0
        mov     r9, _written
        mov     r8, rax
        mov     rdx, _brand
        mov     rcx, [ConsoleHandle]
        call    [WriteConsole]
        inc     ebx
        cmp     ebx, 5
        jb      write_empty
        call    newline

; Write out times for single instruction loop (EBX continues from 5)
        and     qword [rsp+8*4], 0
        mov     r9, _written
        mov     r8, op1_size
        mov     rdx, _op1
        mov     rcx, [ConsoleHandle]
        call    [WriteConsole]
write_op1:
        mov     rcx, _brand
        mov     rdx, _decimal_fmt
        mov     r8d, dword [_times+8*rbx]
        call    [wsprintf]
        mov     rcx, _brand
        call    [lstrlen]
        and     qword [rsp+8*4], 0
        mov     r9, _written
        mov     r8, rax
        mov     rdx, _brand
        mov     rcx, [ConsoleHandle]
        call    [WriteConsole]
        inc     ebx
        cmp     ebx, 10
        jb      write_op1
        call    newline

; Write out times for 2 instruction loop (EBX continues from 10)
        and     qword [rsp+8*4], 0
        mov     r9, _written
        mov     r8, op2_size
        mov     rdx, _op2
        mov     rcx, [ConsoleHandle]
        call    [WriteConsole]
write_op2:
        mov     rcx, _brand
        mov     rdx, _decimal_fmt
        mov     r8d, dword [_times+8*rbx]
        call    [wsprintf]
        mov     rcx, _brand
        call    [lstrlen]
        and     qword [rsp+8*4], 0
        mov     r9, _written
        mov     r8, rax
        mov     rdx, _brand
        mov     rcx, [ConsoleHandle]
        call    [WriteConsole]
        inc     ebx
        cmp     ebx, 15
        jb      write_op2
        call    newline

        xor     ecx, ecx                ; exit code 0
        call    [ExitProcess]

; Output newline
newline:
        sub     rsp, 8*5
        and     qword [rsp+8*4], 0
        mov     r9, _written
        mov     r8, crlf_size
        mov     rdx, _crlf
        mov     rcx, [ConsoleHandle]
        call    [WriteConsole]
        add     rsp, 8*5
        ret

section '.data' data readable writeable

ConsoleHandle rq 1
_written rq 1
_times rq 15
_brand rd 12
_fms rd 1
_logo db 'Processor: '
logo_size = $-_logo
_crlf db 13,10
crlf_size = $-_crlf
_family_fmt db 'Family: %X',10,0
_extended_family_fmt db 'Extended Family: %X',10,0
_model_fmt db 'Model: %X',10,0
_extended_model_fmt db 'Extended Model: %X',10,0
_stepping_fmt db 'Stepping: %X',10,0
_empty db 'Empty:'
empty_size = $-_empty
_op1 db '1 operation:'
op1_size = $-_op1
_op2 db '2 operations:'
op2_size = $-_op2
_decimal_fmt db ' %ld',0

section '.idata' import data readable writeable

        dd      0,0,0,RVA kernel_name,RVA kernel_table
        dd      0,0,0,RVA user_name,RVA user_table
        dd      0,0,0,0,0               ; terminating entry of the import directory

kernel_table:
        GetStdHandle dq RVA _GetStdHandle
        WriteConsole dq RVA _WriteConsole
        lstrlen dq RVA _lstrlen
        ExitProcess dq RVA _ExitProcess
        dq      0
user_table:
        wsprintf dq RVA _wsprintf
        dq      0

kernel_name db 'kernel32.dll',0
user_name db 'user32.dll',0

_GetStdHandle dw 0
        db      'GetStdHandle',0
_WriteConsole dw 0
        db      'WriteConsoleA',0
_lstrlen dw 0
        db      'lstrlenA',0
_wsprintf dw 0
        db      'wsprintfA',0
_ExitProcess dw 0
        db      'ExitProcess',0
The results on my Atom processor:
Code:
D:\FASM\code\atom_test>atom_test
Processor: Intel(R) Atom(TM) x5-Z8330 CPU @ 1.44GHz
Family: 6
Extended Family: 0
Model: C
Extended Model: 4
Stepping: 6
Empty: 36 36 36 18 36
1 operation: 111028680 98019792 98329248 97719498 97695216
2 operations: 82669392 83004282 82680390 82665396 82668150
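As a reading aid, the totals above can be turned into a per-iteration cost by dividing by the 10,000,000 iterations each loop runs. The sketch below takes the smallest value in each row; that choice is an assumption made here, not something the post specifies. RDTSC ticks are also not guaranteed to equal core clock cycles on processors whose time-stamp counter runs at a fixed rate.

```latex
\[
\frac{97\,695\,216}{10^{7}} \approx 9.77 \ \text{ticks per iteration (1 operation)},
\qquad
\frac{82\,665\,396}{10^{7}} \approx 8.27 \ \text{ticks per iteration (2 operations)}
\]
```

The "Empty" row (18 to 36 ticks for the whole measurement) is negligible against totals of this size, so subtracting it does not change the figures at this precision.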
system error 13 Jan 2017, 11:57
xorpd
If you replaced LOOP with SUB/JNZ, the result would be the opposite. So the real culprits behind most slowdowns are the complex and the 'wrapper' instructions, not SSE instructions as I initially thought. They always give people the false impression of 'cleaner and shorter' code, without considering the real history behind such ABBA-era, Vietnam War mnemonics. But the worst kind are ASM programmers from that era forcing people to stick to 'their' old-school 'conventions' on modern CPUs, with complete disregard for how those old instructions actually perform on 'our' CPUs. That's where you get most of the negative responses every time you want to talk about optimization on this board.
Tomasz Grysztar 13 Jan 2017, 13:26
system error wrote: 2. Use the registers the way they were originally designed to be used. For example, you should use ECX/RCX for the loop count instead of other registers like R15 or EDI. This yields faster results. Try it.
My personal method is to optimize individual instructions for speed only in the critical sections of code, and only when it gives an actual gain on the machines where the code is going to be used. Otherwise, I go mostly for size optimization. Changing an instruction to a longer but faster equivalent in a non-critical routine usually gives no noticeable speed gain, while using shorter code may often add up to a noticeable effect overall and is a more universal advantage.
For example, you may find LOOP used in a few places in fasmg (even more in old fasm), usually where I considered it the aesthetically pleasing choice. Replacing them with alternatives yields no perceivable difference in execution time (at least not on the machines I've been working with). On the other hand, the shorter length of code adds up to measurable differences, sometimes to an important effect (like when I strove to keep fasm working within the 64k code segment limit of unreal mode).
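To put rough numbers on the size side of this trade-off, here is a small sketch that lets fasm report the encoded length of three loop tails at assembly time. The labels and constant names are invented for illustration, and the file is only meant to be assembled, not executed.

```asm
; Sketch: compare the encoded size of three loop tails in 64-bit mode.
; Assembling this file prints the sizes; the output binary is not meant to run.
format binary
use64

tail_loop:
        loop    tail_loop               ; E2 rel8
size_loop = $ - tail_loop

tail_dec:
        dec     ecx                     ; FF C9
        jnz     tail_dec                ; 75 rel8
size_dec = $ - tail_dec

tail_sub:
        sub     ecx, 1                  ; 83 E9 01
        jnz     tail_sub                ; 75 rel8
size_sub = $ - tail_sub

; Each size is a single digit, so adding '0' turns it into a printable character.
display 'loop:      ', '0' + size_loop, ' bytes', 13, 10
display 'dec + jnz: ', '0' + size_dec, ' bytes', 13, 10
display 'sub + jnz: ', '0' + size_sub, ' bytes', 13, 10
```

With the encodings noted in the comments this should report 2, 4 and 5 bytes. In 32-bit code `dec ecx` has a one-byte form, so the gap to LOOP is smaller there.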
Xorpd! 13 Jan 2017, 16:15
@system error, I was hoping you would just try running my code and post the results so those following the thread could at last determine which processor you have and see the performance in processor cycles.
system error 13 Jan 2017, 17:58
xorpd, I've given you the suggestion to run the code for the other case (using SUB/JNZ in place of LOOP). Since you are already at it, show us the results for both cases. What's 'wrong' with you?
Your code is a bit off for my liking. I mean, what kind of hippopotamus would perform 2 COSTLY non-critical memory writes in between 2 RDTSCs? Don't you think they will add even more CPU cycles to the overall result than the instructions we are actually interested in testing? What's wrong with keeping them in two registers? Assigning RCX before RDTSC is even smarter, my son.
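A minimal sketch of the arrangement suggested here, assuming an empty SUB/JNZ loop as the code under test: the counter is loaded before the first RDTSC and the start time stays in a register. The names `ITERATIONS` and `elapsed` are hypothetical, and nothing is printed; the result is only stored.

```asm
; Sketch: time a loop with the start timestamp kept in a register.
format PE64 console
entry start

ITERATIONS = 10000000

section '.text' code readable executable

start:
        sub     rsp, 8*5                ; shadow space, keeps RSP 16-byte aligned

        mov     ecx, ITERATIONS         ; counter is ready before timing starts
        rdtsc                           ; EDX:EAX = time-stamp counter
        shl     rdx, 32
        or      rax, rdx
        mov     r8, rax                 ; start time, no memory write

  .spin:
        sub     ecx, 1
        jnz     .spin

        rdtsc
        shl     rdx, 32
        or      rax, rdx
        sub     rax, r8                 ; elapsed ticks
        mov     [elapsed], rax          ; single store, after the timed region

        xor     ecx, ecx                ; exit code 0
        call    [ExitProcess]

section '.data' data readable writeable

  elapsed dq ?

section '.idata' import data readable writeable

        dd      0,0,0,RVA kernel_name,RVA kernel_table
        dd      0,0,0,0,0               ; end of import directory

  kernel_table:
        ExitProcess dq RVA _ExitProcess
        dq      0

  kernel_name db 'kernel32.dll',0

  _ExitProcess dw 0
        db      'ExitProcess',0
```

Whether this changes the measurement much is another matter. The "Empty" row in the results posted earlier shows the two stores plus the RDTSC pair costing a few dozen ticks, against totals near 10^8. RDTSC is also not a serializing instruction, so neither layout pins down exactly which instructions fall inside the timed window.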
system error 13 Jan 2017, 18:07
Tomasz Grysztar wrote:
Well, since the rest of us here are not compiler writers, I am quite sure this discussion is irrelevant to FASM's internal design.
Tomasz Grysztar 13 Jan 2017, 18:57
system error wrote: Well, since the rest of us here are not compiler writers, I am quite sure this discussion is irrelevant to FASM's internal design.
system error 13 Jan 2017, 19:32
TG, I don't see any relevance of this topic to what you are doing in FASM's internals. This is application-level optimization. The FASM compiler basically translates user code into its machine counterpart. If the programmer chose to use slower instructions, the compiler translates them exactly as written - you can't blame the assembler. Do you have something else in mind?
Trinitek 13 Jan 2017, 19:53
system error wrote: TG, I don't see any relevance of this topic to what you are doing in FASM's internals. This is application-level optimization. The FASM compiler basically translates user code into its machine counterpart. If the programmer chose to use slower instructions, the compiler translates them exactly as written - you can't blame the assembler. Do you have something else in mind?
Tomasz Grysztar 13 Jan 2017, 19:54
system error wrote: TG, I don't see any relevance of this topic to what you are doing in FASM's internals. This is application-level optimization. The FASM compiler basically translates user code into its machine counterpart. If the programmer chose to use slower instructions, the compiler translates them exactly as written - you can't blame the assembler. Do you have something else in mind?
system error 13 Jan 2017, 20:30
Trinitek wrote:
If I give you one .obj file to be linked by a linker, how would you tell that it was assembled with FASM and not with NASM? See what I meant by relevance here?
system error 13 Jan 2017, 20:37
Tomasz Grysztar wrote:
Well, of course. That's why I am a big fan of FASM. Here's the thing, Tomasz: the fact that you have been keeping yourself away from optimization discussions all this time gave me the false impression that you were not particularly interested in this type of bloodbath. So when you stepped in, it gave me the creeps and was quite a shock, tbh. Well, that's quite natural, I think, considering your position. So my general idea is to keep you away! hahaha
Trinitek 13 Jan 2017, 21:00
system error wrote:
system error 13 Jan 2017, 23:18
Trinitek
Discussions on performance are the same across different compilers. The talking points we are discussing here also apply to the MASM, NASM and POASM boards, because once the sources have been turned into binaries, they all look the same to the CPU/linker regardless of the compiler/assembler used to produce them. At that point, the assembler/compiler used is irrelevant. If you want to talk about compiler optimization (time, size), you go to Compiler Internals, as I already did here.
fragment 14 Jan 2017, 14:59
AsmGuru62 wrote: Welcome to the forum
thanx ...
so you use 'Loop' as a representative of higher-level instructions that optimize your approach (easier coding, more readable source etc.) and 'sub reg,1/jcc' constructions as a representative of instructions that lack a bit in that respect but optimize the code ... let's look at the result: most of your code is unoptimized/suboptimal/less optimal (the part you considered less relevant) except for a few fragments which you considered relevant ... if I asked you now about this:
a) easier coding but suboptimal code
b) suboptimal coding but optimal code
c) mixing a) and b) and changing what you prioritize
d) in general the easiest possible coding AND in general the best possible code
what would you choose now?
Quote: I must add also that your figure of ~100% is probably not correct.
I did not say/mean that the LOOP instruction is ~100% slower on all computers - which would anyway not fit my statement about my i7 or my general conclusion. But in case I was a bit unclear: this ~100% was about my laptop etc.
Tomasz Grysztar 14 Jan 2017, 15:55
fragment wrote: if I asked you now about this:
And even when speed is your only priority, just as there may exist many possible solutions for code resolving and fasm arrives at just one, there may exist many different variants of code that would be just as "optimal" (that is: just as fast). For instance, it might happen that when you replace all the complex instructions in a block of code with their faster equivalents, you make the entire block just long enough that it no longer fits well in cache - and it may turn out that both variants execute equally fast (though for completely different reasons). As revolution likes to point out, this is all very context dependent and any rule of thumb may be completely wrong in some cases.
So my approach is: write good, well-structured code and this should pay off all by itself. When the speed of a particular bunch of instructions has a negligible effect on the overall execution, choose a variant that fits the structure of your program well and is the most elegant (this is subjective, of course - use your own style). The "hard" instruction-level optimization should come last, and only where and when it is really required.
AsmGuru62 14 Jan 2017, 16:08
I am not sure I understand your question on a, b, c, d.
I currently write code using FASM and I will do the following:
1. If any of my loops or functions deal with a lot of data - I will optimize as per the Intel manuals: no complex instructions (LOOP is out), using mostly registers, aligning labels, etc.
2. If any of my loops or functions do NOT deal with a lot of data, then I will prefer the smaller (in bytes) code. I will use LOOPs - yes.
However, before coding I will design the proper algorithm.
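As an illustration of point 1, here is one way a data-heavy loop might look with those habits applied: an aligned loop label, the running sum and pointer kept in registers, and SUB/JNZ for the loop control. This is a sketch under assumptions, not AsmGuru62's code; the array `values`, the constant `COUNT` and the 16-byte alignment are choices made for the example.

```asm
; Sketch: summing a dword array with an aligned, register-only inner loop.
format PE64 console
entry start

COUNT = 1024

section '.text' code readable executable

start:
        sub     rsp, 8*5                ; shadow space, keeps RSP 16-byte aligned

        lea     rdx, [values]           ; pointer into the array
        mov     ecx, COUNT              ; remaining elements
        xor     eax, eax                ; running sum

        align   16                      ; NOP padding up to the loop label
  .sum:
        add     eax, [rdx]
        add     rdx, 4
        sub     ecx, 1
        jnz     .sum

        mov     ecx, eax                ; return the sum as the exit code
        call    [ExitProcess]

section '.data' data readable writeable

  values dd COUNT dup (1)               ; 1024 ones, so the sum is 1024

section '.idata' import data readable writeable

        dd      0,0,0,RVA kernel_name,RVA kernel_table
        dd      0,0,0,0,0               ; end of import directory

  kernel_table:
        ExitProcess dq RVA _ExitProcess
        dq      0

  kernel_name db 'kernel32.dll',0

  _ExitProcess dw 0
        db      'ExitProcess',0
```

Returning the sum as the exit code just keeps the loop result observable. Whether the alignment padding pays for itself depends on the processor and on where the label would otherwise land, so it is worth timing both ways.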
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.