flat assembler
Message board for the users of flat assembler.
Thread: A question on performance
system error 25 Aug 2016, 23:43
No, aligning the spin won't help the first version either. You just have to add one more instruction before or after mulsd to make it faster. A simple nop should do, actually. What I don't understand is, WHY?
rugxulo 26 Aug 2016, 02:14
(N.B. I know effectively nothing about optimization. It's still interesting but indeed a minefield. Sorry, just giving some general tips. Caveat emptor!)
system error wrote: I am on Atom. This tells us nothing. There are too many Atoms. (BTW, I have none.) Wikipedia mentions "Silverthorne, Diamondville, Pineview, Lincroft, Centerton, Avoton, Silvermont". While I don't see anything obvious from a quick check of newer GCC docs, IIRC, -march/-mtune "atom" was introduced with GCC 4.5.x (2010). I assume "native" would also work (if self-hosted). Quote:
It's very inaccurate to test (or trust) such short-run benchmarks. Anything less than a few seconds is unreliable, IMHO. Power management can also adversely affect speed. Quote:
fetch, decode, execute ... clocks, hertz ... latency, throughput ... pipelining, pairing, cpu cache(s) size, prefetching, alignment ... multi-core, power management, threads ... FPU + SSE ... removing dead code, unrolling loops, strength reduction ... register spilling, stalls ... static or dynamic branch prediction ... micro ops, macro-op fusion ... WHO THE HECK KNOWS?? (Although I've never bothered, you may need to use Linux "perf stat" or "oprofile" or something commercial like Intel vTune or AMD CodeAnalyst or whatever. I doubt RDTSCP is enough help by itself.) Quote:
The only guaranteed optimizations are for size! Smaller is always smaller, no matter which family / model / stepping! Anyways, it's all a very complex subject. Read Agner Fog's manuals. To be honest, you should first make sure your program runs correctly (including verifying by test suite), then improve the overall (high-level) algorithm, reduce unnecessary slow disk access, and only then worry about micro-optimizations at the cpu level. (Oftentimes speed isn't important anyways; the code is only executed occasionally. There are other concerns like readability or portability.) As much as I hate to say it, sometimes hand-written assembly is slower than HLLs. GCC is fairly complicated and smart by now. (For instance, I still haven't proven it, but I think one silly program of mine is tons slower because it's not disk-buffered behind the scenes automatically. So the C version is always tons faster, even with the oldest and dumbest compilers. But mine is smaller, of course.) Quote: No, aligning the spin won't help the first version either. You just have to add one more instruction before or after mulsd to make it faster. A simple nop should do, actually. What I don't understand is, WHY? AFAIK, "pause" is an alias for "rep nop", meant for spin loops. Though I don't know if that will help much here.
system error 26 Aug 2016, 04:14
Thanks for the great answers rugx, although I don't understand half of what you're saying because I'm a complete noob at optimization.
What puzzles me is that the second snippet runs a lot faster than the first despite the fact that:
1. there's an obvious dependency on XMM0
2. cvtsd2si is a 23-clocker
So just by looking at the code, most people would believe that the first one runs faster than the other (who can blame them?), while it's actually the opposite. It breaks my heart to know that at some point, all else being equal, adding more instructions can actually make your code run faster, contrary to my previous belief. Nobody told me this before -_- Even the honorable Mr Agner Fog never mentions this, or never really puts enough emphasis on this type of optimization technique. I guess he's getting too bald and old to keep up with the latest offerings. hehehe
revolution 26 Aug 2016, 07:46
Welcome to the world of optimisation. Optimisation is hard. Really hard. And sometimes counter-intuitive.
But your test above is not representative of real-world usage. You can't substitute a contrived benchmark and expect it to relate to actual use in a real app. My suggestion is to not bother with specialised "performance" loops and similar things; they will tell you nothing useful when you want to apply it to the final product. Also, testing on only one system tells you nothing about performance on other systems. I guess all of the above comes across as negative, but really, this type of testing gives no valuable results in almost all situations. Your statement above, "adding more instructions can actually make your code run faster", is an example of the result of the internal complexity of contemporary CPUs. But it is not always true, of course; it will only be true in certain specific situations. Contriving a specific example is not too hard, but it won't necessarily still be true when the code is transplanted into another situation. So in conclusion: always do your testing on the real-world application.
rugxulo 26 Aug 2016, 19:36
system error wrote:
Who knows? Speculative execution? Register renaming? Again, pipelining roughly means that the CPU can start one instruction and finish it in the background while also starting newer ones. Or maybe "loop" is flushing the instruction queue prematurely? Quote:
Which is still blazingly fast on multi-ghz cpus. Quote:
Supposedly the 8086 was more efficient with small and tight code, as was even the 386. But on the 486 (pipelining!), it was faster to use simpler RISC-y instructions (mov [di],al // inc di) instead of CISC instructions (stosb). Even the Pentium / 586 (or, should I say 587) was pipelined, allegedly working faster if you interleaved FMULs and FADDs (or whatever) to let them cooperate, not to mention the U (full) and V (weak) pipes, which were a big deal for compilers back then (e.g. GCC 2.8.x), requiring a recompile to really give significant speedups. The 486 itself was allegedly very sensitive to alignment. (I can also guarantee you that ENTER and LEAVE are much, much slower on this [Westmere] Core i5 than simple 8086 instructions. Supposedly they were faster on an actual 186 [clone?], but not anymore.) Quote:
Modern cpus are very very sophisticated. They try insanely hard to figure things out on their own. I found an old (1999) Dr. Dobbs article the other day on MMX. Here's just a small excerpt to prove my point: Max I. Fomitchev wrote:
See? Even back then it was complicated. Nowadays it's even MORE complicated! Ugh. Quote:
There are a lot of reasons. It just takes further study. Don't stress too hard; most things aren't so extremely timing-sensitive. Treat it as a hobby, learn as you go, and have fun. Just explore and investigate. At worst, like I said, you learn new instructions or better ways to optimize for size, which (unless it horribly slows down everything, but that's rare) is always good in a pinch.
cod3b453 26 Aug 2016, 22:06
It's very implementation-specific, but most recent CPUs have pipelines around 20-30 stages deep and, depending on the implementation, any branch could incur a flush delay on the order of as many cycles. Hence, you can add an instruction with a similar delay without incurring negative effects on the overall time.
In this particular case, the initial tight loop on the same instruction could be hiding further delay: higher-than-usual instruction fetch rates cause additional stalls and saturate the bus, which is likely wide enough for several instructions, so adding more instructions has no negative effect (possibly even a positive one).
system error 10 Sep 2016, 06:06
Thanks for the valuable input, guys. Now here's another test I ran on standard PUSH-POP operations. And this one isn't even mentioned anywhere (Agner Fog, textbooks, etc.)
Code:
        mov rcx,10000000
spin:
        ;Standard RTFM version
        ;clocks 0.029xx - 0.039xx, average 0.033xx
        ;stalled cycles in the backend
        push rax
        push rbx
        push rcx
        push rdx
        pop rdx
        pop rcx
        pop rbx
        pop rax

        ;Not-mentioned-in-any-AgnerFog's version
        ;clocks 0.023xx - 0.033xx, average 0.027xx
        ;No stalled cycles on either end
        ;sub rsp,8*4
        ;mov [rsp+24],rdx
        ;mov [rsp+16],rcx
        ;mov [rsp+8],rbx
        ;mov [rsp],rax
        ;mov rdx,[rsp+24]
        ;mov rcx,[rsp+16]
        ;mov rbx,[rsp+8]
        ;mov rax,[rsp]
        ;add rsp,8*4

        sub rcx,1
        jnz spin
I am starting to believe that the "read Agner Fog" song is getting pretty obsolete, at least down at the very basic level.
system error 10 Sep 2016, 06:10
Oh btw, do avoid the LOOP instruction like the plague.
revolution 10 Sep 2016, 10:25
system error wrote: Oh btw, do avoid the LOOP instruction like the plague.
Why?
AsmGuru62 10 Sep 2016, 13:21
Why, indeed.
I use LOOP all the time and have never seen a slowdown.
DimonSoft 10 Sep 2016, 14:25
revolution wrote: Why?
AsmGuru62 wrote: Why, indeed.
Might be something about the fact that complex CISC-style instructions generally tend to have a worse impact on modern processors' pipelining than simpler RISC-style instructions. On the other hand, smaller code might be better in terms of caches and the processors' internal instruction prefetch buffers. Add to this the fact that most pieces of code are not bottlenecks, and in most cases we don't see any visible difference.
revolution 10 Sep 2016, 14:43
Indeed, it is situation-dependent. I don't believe any rule of thumb will be correct in every situation.
fragment 12 Jan 2017, 10:12
good morning, folx ...
yep, I am new! Revolution/AsmGuru62: 'loop' is very handy, but unfortunately it seems to be significantly slow on many, many platforms: http://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently

I am writing at this very moment on a typical/cheap and relatively recent mass-market laptop: Lenovo/Silvermont/N2830, with a similar experience: the 'loop' instruction on this computer is ~100% slower than some other instruction alternatives. My performance test code is quasi-compatible with the one from Intel, which, unlike Agner Fog's, also uses the newer RDTSCP/CPUID combination. But anyway, it doesn't really matter in this case. Even when I just started/stopped the test loops manually by key press, watching the kitchen clock, the general pattern was obvious. However, on my other computer (an i7/i860/Lynnfield/Nehalem) the 'loop' instruction behaves similarly to the other ones - so no huge problem there ...

My conclusion so far is that if you want to write more optimized code, then you should avoid using 'loop'. It seems to never be the fastest (at best, not slower) on the relevant computers in use, and on many it is definitely slower ... I might have overlooked it, but if I recall correctly, neither Intel ("64-ia-32-architectures-optimization-manual") nor Agner Fog in his optimization guides ever use 'loop'. People like Mark Larson advise explicitly not to use it: http://www.mark.masmcode.com/ (please look for: "3. complex instructions")
AsmGuru62 12 Jan 2017, 13:47
Welcome to the forum!
When we talk about performance - we should always consider the need for optimization. I will give you two examples:

#1: Let's say I am displaying a dialog box with a list inside, filled with some items. Depending on the amount of items - I will use LOOP, or I will use the more performant option:
Code:
@@:
        ...
        sub ecx, 1
        jnz @b

#2: You are writing a text editor and you need to open 100s of files in one shot. Those files must all be parsed line-by-line for some features, etc. In this case I will definitely be optimizing right away, without even measuring the code. I will align all labels and will not use any LOOPs, because the amount of processing is large.

In cases where you suspect that LOOP is slowing you down - you need to measure how much time your code takes and make a decision based on that, just like revolution pointed out.

Edited: I must also add that your figure of ~100% is probably not correct. I once measured my code with LOOP vs SUB/JNZ and I came up with a ~15% slowdown. As I mentioned in the dialog case - this is the time taken for a dialog to be filled with items and shown to the user. Human perception will fail to see the difference with a small amount of items.
system error 12 Jan 2017, 14:46
Here are my conclusions on performance. Not much, but not wrong either.
If you're considering optimizing your code, treat all complex, CISC-style instructions with extreme prejudice and replace them with their plain counterparts. Instructions like LOOP, STOS, PUSH, POP, and even CALL and RET are generally slower on modern CPUs. Wait, did I mention CALL / RET? Reasons:
- These high-level instructions are just wrappers around their plain siblings, from a looong time ago when memory space was very scarce. So Intel decided to come up with shorter instructions to save space, at the expense of speed.
- These high-level instructions share the same circuitry with their plain RISC-style siblings. Unless AMD/Intel dedicate special circuitry to these instructions, there's no reason to favor them over the others.
- Complex instructions commit more MICROCODE time compared to their plain RISC siblings. Here's the path taken by the RET instruction. So while people talk about the instruction cache, they tend to forget about MICROCODE.

Faster RISC-style instructions for RET
Code:
        add esp,4
        jmp dword[esp-4]

Slower CISC-based instruction RET (microcode)
Code:
switch(Instruction)
{
case NearReturn:
    if(OperandSize == 32 && !IsWithinStackLimits(TopStackBytes(12)))
        Exception(SS(0)); //top 12 bytes of stack not within stack limits
    else if(!IsWithinStackLimits(TopStackBytes(6))) //OperandSize == 16
        Exception(SS(0)); //top 6 bytes of stack not within stack limits
    TemporaryEIP = Pop();
    TemporaryEIP = TemporaryEIP & 0xFFFF;
    if(!IsWithinCodeSegmentLimits(TemporaryEIP))
        Exception(GP(0));
    EIP = TemporaryEIP;
    if(HasImmediateOperand()) //instruction has immediate operand
    {
        if(StackAddressSize == 32)
            ESP = ESP + Source; //release parameters from stack
        else //StackAddressSize == 16
            SP = SP + Source; //release parameters from stack
    }
    break;

case FarReturn:
    //Real-address mode or virtual-8086 mode
    if(PE == 0 || (PE == 1 && VM == 1))
    {
        if(OperandSize == 32)
        {
            if(!IsWithinStackLimits(TopStackBytes(12)))
                Exception(SS(0)); //top 12 bytes of stack not within stack limits
            EIP = Pop();
            CS = Pop(); //32-bit pop, high-order 16 bits discarded
        }
        else //OperandSize == 16
        {
            if(!IsWithinStackLimits(TopStackBytes(6)))
                Exception(SS(0)); //top 6 bytes of stack not within stack limits
            TemporaryEIP = Pop();
            TemporaryEIP = TemporaryEIP & 0xFFFF;
            if(!IsWithinCodeSegmentLimits(TemporaryEIP))
                Exception(GP(0));
            EIP = TemporaryEIP;
            CS = Pop(); //16-bit pop
        }
        if(HasImmediateOperand())
            SP = SP + Source; //instruction has immediate operand; release parameters from stack
    }
    //Protected mode, not virtual-8086 mode
    else if(PE == 1 && VM == 0)
    {
        if(OperandSize == 32 && !IsWithinStackLimits(OffsetStackBytes(4, 4)))
            Exception(SS(0)); //second doubleword on stack is not within stack limits
        else if(!IsWithinStackLimits(OffsetStackBytes(2, 2))) //OperandSize == 16
            Exception(SS(0)); //second word on stack is not within stack limits
        if(ReturnCode.SegmentSelector == 0)
            Exception(GP(Selector));
        if(!IsWithinDescriptorTableLimits(ReturnCode.SegmentSelector))
            Exception(GP(Selector));
        //Obtain descriptor to which return code segment selector points from descriptor table
        ReturnCode.SegmentDescriptor = ObtainSegmentDescriptor();
        if(!IsCodeSegment(ReturnCode.SegmentDescriptor))
            Exception(GP(Selector));
        if(ReturnCode.SegmentSelector.RPL < CPL)
            Exception(GP(Selector));
        if(IsConforming(ReturnCode.SegmentDescriptor) && ReturnCode.Segment.DPL > ReturnCode.SegmentSelector.RPL)
            Exception(GP(Selector));
        if(!IsPresent(ReturnCode.SegmentDescriptor))
            Exception(NP(Selector));
        if(ReturnCode.SegmentSelector.RPL > CPL)
        {
            //Return to outer privilege level
            if(OperandSize == 32 && !IsWithinStackLimits(TopStackBytes(16 + Source)))
                Exception(SS(0)); //top 16 + Source bytes of stack not within stack limits
            else if(!IsWithinStackLimits(TopStackBytes(8 + Source))) //OperandSize == 16
                Exception(SS(0)); //top 8 + Source bytes of stack not within stack limits
            ReturnSegmentSelector = ReadReturnSegmentSelector();
            if(StackSegmentSelector == 0)
                Exception(GP(0));
            if(!IsWithinDescriptorTableLimits(ReturnStack.SegmentSelector.Index))
                Exception(GP(Selector));
            if(StackSegmentSelector.RPL != ReturnCode.SegmentSelector.RPL
                || !IsWritableDataSegment(StackSegment)
                || StackSegmentDescriptor.DPL != ReturnCode.SegmentSelector.RPL)
                Exception(GP(Selector));
            if(!IsPresent(StackSegment))
                Exception(SS(StackSegmentSelector));
            if(!IsWithinLimits(ReturnCode.SegmentLimit, ReturnInstructionPointer))
                Exception(GP(0));
            CPL = ReturnCode.SegmentSelector.RPL;
            if(OperandSize == 32)
            {
                EIP = Pop();
                CS = Pop(); //32-bit pop, high-order 16 bits discarded; segment descriptor information also loaded
                CS.RPL = CPL;
                ESP = ESP + Source; //release parameters from called procedure's stack
                TemporaryESP = Pop();
                TemporarySS = Pop(); //32-bit pop, high-order 16 bits discarded; segment descriptor information also loaded
                ESP = TemporaryESP;
                SS = TemporarySS;
            }
            else //OperandSize == 16
            {
                EIP = Pop();
                EIP = EIP & 0xFFFF;
                CS = Pop(); //16-bit pop; segment descriptor information also loaded
                CS.RPL = CPL;
                ESP = ESP + Source; //release parameters from called procedure's stack
                TemporaryESP = Pop();
                TemporarySS = Pop(); //16-bit pop; segment descriptor information also loaded
                ESP = TemporaryESP;
                SS = TemporarySS;
            }
            SegmentRegisters[] = {ES, FS, GS, DS};
            while(SegmentRegister = NextSegmentRegister(SegmentRegisters))
            {
                if(IsDataPointer(SegmentRegister) || !IsConformingCodeSegment(SegmentRegister)
                    && CPL > SegmentDescriptor.DPL) //DPL in hidden part of segment register
                    SegmentSelector = 0; //segment register is invalid, null segment selector
                if(!IsWithinDescriptorTableLimits(SegmentSelector.Index)
                    || (!IsData(SegmentDescriptor) && !IsReadableCodeSegment(SegmentDescriptor))
                    || (IsData(SegmentDescriptor) && !IsConformingCodeSegment(SegmentDescriptor)
                        && SegmentDescriptor.DPL < CPL && SegmentDescriptor.DPL < CodeSegment.SegmentSelector.RPL))
                    SegmentSelectorRegister = NullSelector;
                ESP = ESP + Source; //release parameters from called procedure's stack
            }
        }
        else
        {
            //Return to same privilege level
            if(!IsWithinLimits(ReturnCode.SegmentLimit, ReturnInstructionPointer))
                Exception(GP(0));
            if(OperandSize == 32)
            {
                EIP = Pop();
                CS = Pop(); //32-bit pop, high-order 16 bits are discarded
                ESP = ESP + Source; //release parameters from stack
            }
            else //OperandSize == 16
            {
                EIP = Pop();
                EIP = EIP & 0xFFFF;
                ESP = ESP + Source; //release parameters from stack
            }
        }
    }
    break;
}
So before you worry about cache and pipeline, take a good look at the microcode implementation of such an instruction first. Let me hear your opinion.
system error 12 Jan 2017, 14:52
This is what I love about this board. While people are busy talking about caches and pipes, we go even deeper, to microcode programming. Why don't you people just die already?
system error 12 Jan 2017, 14:54
fragment wrote: good morning, folx ...
It's time for the good guys like us to enlighten our sinful brethren like AsmGuru and revolution.
system error 12 Jan 2017, 15:00
Oh btw, the true philosophy behind the Microsoft fastcall convention and ABI is not to give you a brain tumor, but simply to avoid those slower complex instructions like PUSH and POP.
Take it or leave it.
revolution 12 Jan 2017, 15:00
system error wrote: Faster RISC-style instruction for RET
Quote: Slower CISC-based instruction RET (microcode)
Anyhow, as mentioned above, test the code to make sure you get what you expect. There is a lot of old, and just plain wrong, advice on the Internet. Don't blindly trust it. CPUs are constantly changing in their internal designs, so you never know what is now better and what is now worse.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.