flat assembler
Message board for the users of flat assembler.
revolution 13 Nov 2016, 21:48
How do you want to judge "better"? If you mean the one which uses the least number of clock cycles to run, then the answer depends upon your application. There is no one answer that is "better" in all cases.
Test each method in the final application on the target platform. Then choose which method to use based upon the results. Keep in mind that different systems will give different results.
MrFox 15 Nov 2016, 05:41
Thanks for the explanation.
JohnFound 15 Nov 2016, 08:37
@revolution - your "classic" answer to such questions is of course correct, but not at all constructive.
Of course "it depends", but in most cases using cmov is better, because it is usually faster and also improves the readability of the code, which is good in itself.
revolution 15 Nov 2016, 09:01
If the speed is so important to an application then simply using "rules of thumb" is not useful or constructive. IMO people should always be encouraged to actually test what they are writing, and not assume anything when the difference could be critical.
OTOH if the speed is not really important then it makes no difference which one you use. But here is the thing that a lot of people don't try to figure out: Is speed really important in the code you are currently editing? Lots of people assume it is important but very often forget about the gorilla in the room in code they don't see (i.e. library and API calls). This is why I encourage people to test it. Mostly because in almost all cases I doubt they will see any difference in the actual usage case. And where they do see a difference, the testing data become very useful for further optimising trials. And most importantly: When possible, test in the actual application. Using synthetic loops to try to prove something is faster/better rarely provides any useful data, except in the very grossest of margins where it probably would've been obvious anyway.
JohnFound 15 Nov 2016, 10:17
@revolution, one cannot test on every x86 CPU out there. It is completely impossible and useless after all, because optimizing the program for one single CPU means that most of your program's users will run it in a suboptimal variant and only a few will get the maximal performance.
Also, optimizing to the last extent is very rarely needed, so teaching beginners to optimize this way is not constructive. But advanced programmers will have their own thoughts on this matter. Also, you seem to completely ignore the readability of the code, which in most cases is more important than maximal performance.
revolution 15 Nov 2016, 10:28
JohnFound wrote: @revolution, one cannot test on every x86 CPU out there. It is completely impossible and useless after all, because optimizing the program for one single CPU means that most of your program's users will run it in a suboptimal variant and only a few will get the maximal performance.

JohnFound wrote: Also, optimizing to the last extent is very rarely needed, so teaching beginners to optimize this way is not constructive.

JohnFound wrote: But advanced programmers will have their own thoughts on this matter.

JohnFound wrote: Also, you seem to completely ignore the readability of the code, which in most cases is more important than maximal performance.
rugxulo 20 Nov 2016, 09:52
Actually, I heard that CMOV is slower unless the jumps are rare and thus not predictable. Besides, you need a 686+ CPU in order to support it. (Not a huge problem anymore these days, but there are still many legacy machines, so unless you're 1000% sure that it's 1000% faster, supporting older CPUs can't hurt.)

JohnFound wrote: @revolution, one cannot test on every x86 CPU out there. It is completely impossible and useless after all, because optimizing the program for one single CPU means that most of your program's users will run it in a suboptimal variant and only a few will get the maximal performance.

The "blended" approach uses code that works well for all targeted processors (e.g. both 486 and 586). The other way is to manually use CPUID and CPU-specific code for the rare tasks where it would make a major difference (e.g. MMX or SSE2, if available).
revolution 20 Nov 2016, 10:04
I have actually never encountered a situation where cmov vs jmp makes any real-world difference. But just in case there are some cases where it will make a noticeable difference, testing will tell you both 1) whether or not it matters at all (this is the most likely case), and 2) which method to apply in each case.
So perhaps there is a "rule" that we can apply to all these 'is A better than B' questions: Answer: No, A is not better than B, and B is not better than A either; unless you can conclusively prove otherwise with proper testing in the real application.
Furs 07 Dec 2016, 12:27
Nah, cmov is obviously better. Because it's cooler. cmov is not just conditional move. It's a cool move. Jumps are lame.
revolution 07 Dec 2016, 12:37
So:
CMOVcc = cool
JMP = lame
I'm not so convinced.
system error 07 Dec 2016, 14:07
CMOV uses the same ALU circuitry as CMP/JCC, for better or worse. Please note that there's no dedicated flip-flop for CMOV. CMOV is also a complex instruction that uses the same logical construct as CMP/JCC down at the circuitry level. Speedwise, there's no difference. Bytewise, CMOV wins only by 1 byte.
CMOV as a complex instruction, as defined by Intel:

Code:
temp = Source;
if (Condition == true)
    Destination = temp;

So tell me, how is this any faster than a plain CMP/JCC combo?
revolution 07 Dec 2016, 14:11
system error wrote: CMOV uses the same ALU circuitry as CMP/JCC ...

Last edited by revolution on 08 Dec 2016, 01:50; edited 1 time in total
system error 07 Dec 2016, 14:58
Agreed. Run some synthetic ALU tests on both Intel and AMD, and AMD is, unfortunately, inferior.
redsock 07 Dec 2016, 22:18
I was interested to see the difference between my Intel and AMD platforms, so I wrote the following bit of [useless] code:
Code:
format ELF64
public _start
_start:
        mov     ebx, 1 shl 31
        xor     ecx, ecx
.loop:
if defined use_cmov
        lea     eax, [ecx+1]
        test    ebx, 1
        cmovnz  ecx, eax
else
        test    ebx, 1
        jz      .noadd
        add     ecx, 1
.noadd:
end if
        sub     ebx, 1
        jnz     .loop
        mov     eax, 60         ; exit
        xor     edi, edi        ; return code
        syscall

On my Intel 3930K:
cmov based takes 2107ms with 6.6B cycles
jmp based takes 1546ms with 4.8B cycles

On my older AMD 6282SE Opteron:
cmov based takes 2481ms with 7.1B cycles
jmp based takes 4101ms with 11.9B cycles

(mind you, both machines are busy so the perf stats might not be quite right)

Just goes to show that @revo's usual sentiment, that things like this really are architecture-dependent, is indeed true. Here, the AMD likes cmov much better than jmp, and the Intel prefers the jump variety.

I'd like to add though that depending on how complex the branch actually is, the misprediction and instruction-decode penalties for the branch-based version can be much worse than for the cmov variety (my silly example above isn't affected because it is too simple). YMMV.
revolution 08 Dec 2016, 00:33
While I am pleased to see someone finally actually test this (thanks redsock), it pains me somewhat to see it using only a synthetic test, but the saving grace is that it was able to directly compare two architectures with the same code. However I would like to stress that these synthetic tests do nothing towards showing what would happen in someone's real application. With different cache contents, different branch histories, different instructions preceding and succeeding in the queue, different data patterns, etc. the outcomes will be different.
l_inc 08 Dec 2016, 23:52
I'd like to oppose revolution's opinion about measurements here, because it seems to be falsely supported. Blind measurements are useless, while proper general notions are to a large extent useful even without measurements. If you measure, measure to check your general notions, not to find out what's faster while having no clue about what happens under the CPU's hood.
If you have good general notions, then you'll have 20, maybe 30 degrees of freedom, such as latencies, throughputs, alignments, dependency chains, cache hits, TLB hits, branch predictions, the possibility of engaging special mechanisms (such as loop-stream detection or macro-fusion), availability of different CPU resources, etc. If you meaningfully apply these notions, you'll be good 95% of the time, finding a nearly optimal organization of instructions that is significantly resilient to changes in its environment. If you just cluelessly test, then you'll have an unlimited number of degrees of freedom: you may manage to "optimize" one little piece of your software, but then any little change in any other piece may cause dramatic performance degradation, and the whole "optimization" will have to start from scratch.

So I guess I support JohnFound's comment here: the classical advice revolution gives, like "there are so many variables, nothing is predictable, go test everything", while being partially correct, is totally useless. Testing is bad as an optimization guide, but it is good for examining reasonable predictions. revolution is right that conscious optimization requires considering many things and side-effects, and the test results can indeed be surprising. But that only means that the test results should be used to improve general notions, not to get completely disappointed and fall back to black-box testing.

My point is, btw., well supported by redsock's test. The test with no discussion is just black-box testing, and literally nothing can be inferred from the results.
While I have similar results (with an i5-4200M CPU @ 2.50GHz, Haswell) for the exact same code:

Code:
jz: 1054ms; cmovnz: 2105ms

However, if I have a notion about how branches are tagged in the branch predictor on Haswell, I can easily invert the results by introducing a couple of innocuous nop's, so that both branches (i.e., their last bytes) fall into the same 16-byte block:

Code:
format ELF64
public _start
_start:
        times 2 nop
        mov     ebx, 1 shl 31
        xor     ecx, ecx
.loop:
if defined use_cmov
        lea     eax, [ecx+1]
        test    ebx, 1
        cmovnz  ecx, eax
else
        test    ebx, 1
        jz      .noadd
        add     ecx, 1
.noadd:
end if
        sub     ebx, 1
        jnz     .loop
        mov     eax, 60         ; exit
        xor     edi, edi        ; return code
        syscall

Note the conflict due to the first branch ending at 0x600120 while the second branch ends in the same 16-byte block at 0x600128:

Code:
0000000000600110 <_start>:
  600110:  90                      nop
  600111:  90                      nop
  600112:  bb 00 00 00 80          mov    $0x80000000,%ebx
  600117:  31 c9                   xor    %ecx,%ecx
  600119:  f7 c3 01 00 00 00       test   $0x1,%ebx
  60011f:  74 03                   je     600124 <_start+0x14>
  600121:  83 c1 01                add    $0x1,%ecx
  600124:  83 eb 01                sub    $0x1,%ebx
  600127:  75 f0                   jne    600119 <_start+0x9>
  600129:  b8 3c 00 00 00          mov    $0x3c,%eax
  60012e:  31 ff                   xor    %edi,%edi
  600130:  0f 05                   syscall

Now the results are:

Code:
jz: 3136ms; cmovnz: 2094ms

One can nicely see how the branch predictor conflict gradually resolves when replacing the immediate in test ebx, 1 with 2, 4, 8, 0x10, ..., as the taken/untaken pattern of the combined branching becomes more regular:

Code:
0x01: 3135ms; 0x02: 2759ms; 0x04: 2705ms; 0x08: 2444ms; 0x10: 2213ms

Is cmovnz now better? It depends. Not on some useless test results, but on reasonable considerations. The general notion in redsock's comparison test is that a control-flow dependency is traded for a value dependency (as cmovnz and lea mutually depend on each other's results).
Now, provided perfect branch prediction (as in redsock's test on the Intel machine), the trade-off becomes like: "Is a hammer or a screwdriver better for fixing a painting to a wooden wall, provided you have only nails?". After I broke branch prediction, the trade-off condition became: "..., provided you have only screws". To say it in other words: of course the control-flow dependency is better if there's effectively no control-flow dependency due to perfect branch prediction.

Btw., is it fair at all to make cmovnz dependent in such a tight loop? In many cases the dependent calculation can be interleaved with other anyway-necessary calculations. To make the test more "fair", we can replace lea eax, [ecx+1] with add eax, 1. Of course, the calculations are then not equivalent anymore, but cmovnz would get at least comparable microarchitectural conditions, because, same as with jz, there's no mutual dependency anymore. Without the devastating 2 nop's:

Code:
jz: 1072ms; cmovnz: 1022ms

Seems like cmovnz is even somewhat better in a "fair" competition.

_________________
Faith is a superposition of knowledge and fallacy
system error 09 Dec 2016, 01:56
You guys are nuts. We're not talking about a general design strategy for complete code but rather about micro-optimization, or more specifically the performance comparison of cmov vs jmp. I believe that for code to be fully optimized, one has to design it right from the beginning with the right optimization mindset, using a top-to-bottom approach. But to come up with a bold statement that CMOV is faster/better than the plain cmp/jcc may not be well-founded as far as the instructions are concerned. That's all there is.
So in this kind of micro-optimization between instructions, a synthetic test is the most appropriate measure, without having to go through code alignment and other complicated stuff, because those would take different kinds of measures, tools and strategies altogether. Because in the end everybody will come up with an "it depends" answer without coming to anything conclusive. That will render any future discussion of optimization on this board useless and pretty much closed. Try to make it more interesting sometimes. Troll people around. Throw in some jokes. But my take on whether CMOV is faster than JCC, on the instruction set per se, is NO. That's my conclusion.
l_inc 09 Dec 2016, 02:26
system error wrote: Because in the end everybody will come up with "it depends" answer without coming to anything conclusive.

The conclusion "it depends" is worth nothing unless that particular "everybody" explains what exactly it depends on.

system error wrote: So in this kind of micro-optimization between instructions, a synthetic test is the most appropriate measure

Again, this measure on its own gives contradictory results and is therefore useless.

_________________
Faith is a superposition of knowledge and fallacy
revolution 09 Dec 2016, 02:57
l_inc: Your synthetic tests also prove nothing other than "it depends". Which in itself can be useful for people to realise that there is no general answer. Every application is different, so every result will be different.

system error: The answer to most "which is better?" questions we see here is, unfortunately, "it depends". That's just the way it is. It is better to have a "proper" answer than to lie and pretend that we can know that A is always "better" than B.

So let's review: Test it. Testing is good. But test in your final application on the target platform, after everything is working. Fully expect the test to show that there will be no significant difference when choosing between minor things like jcc vs cmovcc. Perhaps you will be lucky and your application has a very hot loop where it makes a huge difference, and testing will show this to you, so great. But you won't know until you test it.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.