flat assembler
Message board for the users of flat assembler.
flat assembler > Main > Conditional moves vs jumps

MrFox



Joined: 17 Aug 2016
Posts: 52
Location: Russia
Conditional moves vs jumps
Hi, I have a question about CPU performance with conditional moves versus conditional jumps. I suppose it mainly comes down to branch prediction and pipeline flushes on mispredictions, which are costly. I'm not sure whether cmovXX is better than jXX. Can you clarify?

Two pieces of code doing exactly the same thing:

Code:
mov eax,1
mov ebx,2
mov ecx,3
cmp ecx,3
cmove eax,ebx ; Conditional Mov if Equal (ZF is set)
...

and

Code:
mov eax,1
mov ebx,2
mov ecx,3
cmp ecx,3
jne @f
mov eax,ebx
@@:
...


Which one is better and why?
Post 13 Nov 2016, 08:00
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 14585
Location: Planet Dirt
How do you want to judge "better"? If you mean the one which uses the least number of clock cycles to run, then the answer depends upon your application. There is no one answer that is "better" in all cases.

Test each method in the final application on the target platform. Then choose which method to use based upon the results. Keep in mind that different systems will give different results.
Post 13 Nov 2016, 21:48
MrFox



Joined: 17 Aug 2016
Posts: 52
Location: Russia
Thanks for the explanation.
Post 15 Nov 2016, 05:41
JohnFound



Joined: 16 Jun 2003
Posts: 3415
Location: Bulgaria
@revolution - your "classic" answer to such questions is of course right, but totally unconstructive.

Of course "it depends", but in most cases using cmov is better: it is usually faster and also improves the readability of the code, which is good in itself.
Post 15 Nov 2016, 08:37
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 14585
Location: Planet Dirt
If the speed is so important to an application then simply using "rules of thumb" is not useful or constructive. IMO people should always be encouraged to actually test what they are writing, and not assume anything when the difference could be critical.

OTOH if the speed is not really important then it makes no difference which one you use.

But here is the thing that a lot of people don't try to figure out: Is speed really important in the code you are currently editing? Lots of people assume it is important but very often forget about the gorilla in the room in code they don't see (i.e. library and API calls). This is why I encourage people to test it. Mostly because in almost all cases I doubt they will see any difference in the actual usage case. And where they do see a difference then the testing data become very useful for further optimising trials.

And most importantly: When possible test in the actual application. Using synthetic loops to try to prove something is faster/better rarely provides any useful data, except in the very grossest of margins where it probably would've been obvious anyway.
Post 15 Nov 2016, 09:01
JohnFound



Joined: 16 Jun 2003
Posts: 3415
Location: Bulgaria
@revolution, one cannot test on every x86 CPU out there. It is simply impossible, and pointless anyway, because optimizing the program for one single CPU means that most of your users will run a suboptimal variant and only a few will get the maximal performance.

Also, optimizing to the last extent is very rarely needed, so teaching beginners to optimize this way is not constructive. Advanced programmers will have their own thoughts on the matter.

Also, you seem to completely ignore the readability of the code, which in most cases is more important than maximal performance.
Post 15 Nov 2016, 10:17
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 14585
Location: Planet Dirt

JohnFound wrote:
@revolution, one cannot test on every x86 CPU out there. It is simply impossible, and pointless anyway, because optimizing the program for one single CPU means that most of your users will run a suboptimal variant and only a few will get the maximal performance.

Yes, that is why we test things: to find out where things are non-performant. And testing will also reveal where it makes no difference, and this is the more important point because ...

JohnFound wrote:
Also, optimizing to the last extent is very rarely needed, so teaching beginners to optimize this way is not constructive.

Thanks for finishing my sentence. Razz

JohnFound wrote:
But the advanced programmers will have their thoughts on this matter.

And they are often wrong, because so many people don't even test what they write, advanced or not. If you simply assume cmov is faster and never test it, then you could miss discovering a cache-thrashing problem that is solved by using a jump instead. So all these "rules" are not rules at all, and only proper testing can reveal that.

JohnFound wrote:
Also, you seem to completely ignore the readability of the code, which in most cases is more important than maximal performance.

I like to steer people away from useless optimising by having them test things. Once they realise that the performance makes no difference for most of their code then they can concentrate on other things that you mention like readability, ease of expansion, whatever. But testing is not an easy thing to do so any experience learned will be useful later when the real performance bottlenecks need to be solved.
Post 15 Nov 2016, 10:28
rugxulo



Joined: 09 Aug 2005
Posts: 2096
Location: Usono (aka, USA)

JohnFound wrote:

Of course "it depends", but in most cases using cmov is better, because it is usually faster and also improves the readability of the code, which is good by itself.



Actually, I heard that CMOV is slower unless the branches are rare and thus not predictable. Besides, you need a 686+ CPU to support it. (Not a huge problem anymore these days, but there are still many legacy machines, so unless you're 1000% sure that it's 1000% faster, supporting older CPUs can't hurt.)


Quote:
@revolution, one can not test for every x86 CPU out there. It is fully impossible and useless after all, because optimizing the program for one single CPU means that most of your program users will use it in suboptimal variant and only few will get the maximal performance.



The "blended" approach uses code that works well for all targeted processors (e.g. both 486 and 586). The other way is to manually use CPUID and cpu-specific code for the rare tasks where it would make a major difference (e.g. MMX or SSE2, if available).
Post 20 Nov 2016, 09:52
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 14585
Location: Planet Dirt
I actually have never encountered a situation where cmov vs jmp will make any real world difference. But just in case there are some cases where it will make a noticeable difference then testing will tell you both 1) whether or not it matters at all (this is the most likely case), and 2) which method to apply in each case.

So perhaps there is a "rule" that we can apply to all these 'is A better than B' questions: Answer - No, A is not better than B, and B is not better than A either; unless you can conclusively prove otherwise with proper testing in the real application.
Post 20 Nov 2016, 10:04
Furs



Joined: 04 Mar 2016
Posts: 230
Nah, cmov is obviously better. Because it's cooler. cmov is not just conditional move. It's a cool move. Jumps are lame.
Post 07 Dec 2016, 12:27
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 14585
Location: Planet Dirt
So:
CMOVcc = Cool
JMP = Rolling Eyes

I'm not so convinced Confused
Post 07 Dec 2016, 12:37
system error



Joined: 01 Sep 2013
Posts: 438
CMOV uses the same ALU circuitry as CMP/JCC, for better or worse. Please note that there's no dedicated flip flop for CMOV. CMOV is also a complex instruction that uses the same logical construct as CMP/JCC down at the circuitry level. Speedwise, there's no difference. Bytewise, CMOV wins only by 1 byte.

CMOV complex instruction as defined by Intel:


Code:
temp = Source;
if (Condition == TRUE)
    Destination = temp;



So tell me, how is this any faster than a plain CMP/JCC combo?
Post 07 Dec 2016, 14:07
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 14585
Location: Planet Dirt

system error wrote:
CMOV uses the same ALU circuitry as CMP/JCC ...

Surely that depends upon the specific CPU? Past, present and future CPUs, from different manufacturers, will almost certainly have different circuitry layouts, reuse policies and timing dependencies.


Last edited by revolution on 08 Dec 2016, 01:50; edited 1 time in total
Post 07 Dec 2016, 14:11
system error



Joined: 01 Sep 2013
Posts: 438
Agreed. Run some synthetic ALU tests on both Intel and AMD, and AMD is, unfortunately, inferior.
Post 07 Dec 2016, 14:58
redsock



Joined: 09 Oct 2009
Posts: 251
Location: Australia
I was interested to see the difference between my Intel and AMD platforms, so I wrote the following bit of [useless] code:

Code:
        format ELF64

public _start
_start:
        mov     ebx, 1 shl 31
        xor     ecx, ecx
.loop:
if defined use_cmov
        lea     eax, [ecx+1]
        test    ebx, 1
        cmovnz  ecx, eax
else
        test    ebx, 1
        jz      .noadd
        add     ecx, 1
.noadd:
end if
        sub     ebx, 1
        jnz     .loop

        mov     eax, 60         ; exit
        xor     edi, edi        ; return code
        syscall




On my Intel 3930k:
cmov based takes 2107ms with 6.6B cycles
jmp based takes 1546ms with 4.8B cycles

On my older AMD 6282SE opteron:
cmov based takes 2481ms with 7.1B cycles
jmp based takes 4101ms with 11.9B cycles

(mind you, both machines are busy so the perf stats might not be quite right)

Just goes to show that @revo's usual sentiment about things like this really being architecture-dependent is indeed true. Here, the AMD likes cmov much better than the jump, and the Intel prefers the jump variety.

I'd like to add though that depending on how complex the branch actually is, the misprediction and instruction decode penalties for branch-based can be much worse than the cmov variety (my silly example above isn't affected because it is too simple).

YMMV Smile

_________________
2 Ton Digital - https://2ton.com.au/
Post 07 Dec 2016, 22:18
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 14585
Location: Planet Dirt
While I am pleased to see someone finally actually test this (thanks redsock), it pains me somewhat to see it using only a synthetic test, but the saving grace is that it was able to directly compare two architectures with the same code. However I would like to stress that these synthetic tests do nothing towards showing what would happen in someone's real application. With different cache contents, different branch histories, different instructions preceding and succeeding in the queue, different data patterns, etc. the outcomes will be different.
Post 08 Dec 2016, 00:33
l_inc



Joined: 23 Oct 2009
Posts: 868
I'd like to oppose revolution's opinion about measurements here, because it seems to be supported by faulty reasoning. Blind measurements are useless, while proper general notions are to a large extent useful even without measurements. If you measure, measure to check your general notions, not to find out what's faster while having no clue about what happens under the CPU's hood.

If you have good general notions, then you'll have 20, maybe 30 degrees of freedom such as latencies, throughputs, alignments, dependency chains, cache hits, TLB hits, branch predictions, possibility of engagement of special mechanisms (such as loop-stream detection or macro-fusion), availability of different CPU resources, etc. If you meaningfully apply these notions, you'll be good 95% of the time finding a nearly optimal organization of instructions significantly resilient to the changes in their environment.

If you just cluelessly test, then you'll have an unlimited number of degrees of freedom: you may manage to "optimize" one little piece of your software, but then any little change in any other piece may cause dramatic performance degradation, and the whole "optimization" will have to start from scratch.

So I guess I support JohnFound's comment here: the classical advice revolution gives, like "there are so many variables, nothing is predictable, go test everything", while being partially correct, is totally useless. Testing is bad as an optimization guide, but it is good for examining reasonable predictions. revolution is right that conscious optimization requires many things and side effects to be considered, and the test results can indeed be surprising. But that only means the test results should be used to improve one's general notions, not to get completely disappointed and fall back to black-box testing.

My point is, btw., well supported by redsock's test. A test with no discussion is just black-box testing, and literally nothing can be inferred from the results. While I have similar results (with an i5-4200M CPU @ 2.50GHz, Haswell) for the exact same code:

Code:
jz:     1054ms;            cmovnz: 2105ms


— if I have a notion of how branches are tagged in the branch predictor on Haswell, I can easily invert the results by introducing a couple of innocuous nops, so that both branches (i.e., their last bytes) fall into the same 16-byte block:

Code:
        format ELF64

public _start
_start:
times 2 nop
        mov     ebx, 1 shl 31
        xor     ecx, ecx
.loop:
if defined use_cmov
        lea     eax, [ecx+1]
        test    ebx, 1
        cmovnz  ecx, eax
else
        test    ebx, 1
        jz      .noadd
        add     ecx, 1
.noadd:
end if
        sub     ebx, 1
        jnz     .loop

        mov     eax, 60         ; exit
        xor     edi, edi        ; return code
        syscall


Note the conflict: the first branch ends at 0x600120 while the second branch ends in the same 16-byte block at 0x600128:

Code:
0000000000600110 <_start>:
  600110:       90                      nop
  600111:       90                      nop
  600112:       bb 00 00 00 80          mov    $0x80000000,%ebx
  600117:       31 c9                   xor    %ecx,%ecx
  600119:       f7 c3 01 00 00 00       test   $0x1,%ebx
  60011f:       74 03                   je     600124 <_start+0x14>
  600121:       83 c1 01                add    $0x1,%ecx
  600124:       83 eb 01                sub    $0x1,%ebx
  600127:       75 f0                   jne    600119 <_start+0x9>
  600129:       b8 3c 00 00 00          mov    $0x3c,%eax
  60012e:       31 ff                   xor    %edi,%edi
  600130:       0f 05                   syscall 


Now the results are:

Code:
jz:     3136ms;            cmovnz: 2094ms


One can nicely see how the branch predictor conflict gradually resolves when replacing the immediate in test ebx, 1 with 2, 4, 8, 0x10, ..., as the taken/untaken pattern of the combined branching becomes more regular:

Code:
0x01:  3135ms;     0x02:  2759ms;     0x04:  2705ms;  0x08:  2444ms;  0x10: 2213ms



Is cmovnz now better? It depends. Not on some useless test results, but on reasonable considerations. The general notion in redsock's comparison test is that a control-flow dependency is traded for a value dependency (as cmovnz and lea mutually depend on each other's results). Now, provided perfect branch prediction (as in redsock's test on the Intel machine), the trade-off becomes: "Is a hammer or a screwdriver better for fixing a painting to a wooden wall, provided you have only nails?". After I broke branch prediction, the trade-off condition became: "..., provided you have only screws". To put it another way: of course the control-flow dependency is better, if there's effectively no control-flow dependency due to perfect branch prediction.

Btw., is it fair at all to make cmovnz dependent in such a tight loop? In many cases the dependent calculation can be interleaved with other calculations that are necessary anyway. To make the test more "fair" we can replace lea eax, [ecx+1] with add eax, 1. Of course, the calculations are then not equivalent anymore, but cmovnz gets at least comparable microarchitectural conditions, because, same as with jz, there's no mutual dependency anymore. Without the devastating 2 nops:

Code:
jz:     1072ms;            cmovnz: 1022ms



Seems like cmovnz is even somewhat better in a "fair" competition. Smile On my Haswell at least. I see the difference as statistically significant (because I've informally done multiple tests), and it can be explained by the larger resource footprint of jz (specifically in the branch-predictor table and the branch target buffer), which interferes with the footprint of interrupt handlers and parallel threads.

_________________
Faith is a superposition of knowledge and fallacy
Post 08 Dec 2016, 23:52
system error



Joined: 01 Sep 2013
Posts: 438
You guys are nuts. We're not talking about a general design strategy for complete code but rather about micro-optimization, or more specifically the performance comparison of cmov vs jmp. I believe that for code to be fully optimized, one has to design it right from the beginning with the right optimization mindset, a top-to-bottom approach. But coming up with the bold statement that CMOV is faster/better than a plain cmp/jcc may not be well-founded as far as the instructions themselves are concerned. That's all there is.

So in this kind of micro-optimization between instructions, a synthetic test is the most appropriate measure, without having to go through code alignment and other complicated stuff, because that would take different kinds of measures, tools and strategies altogether. Because in the end everybody will come up with the "it depends" answer without reaching anything conclusive. That would render any future discussion of optimization on this board useless and pretty much closed. Try to make it more interesting sometimes. Troll people around. Throw in some jokes.

But my take on whether CMOV is faster than JCC, on the instruction set per se, is NO. That's my conclusion.
Post 09 Dec 2016, 01:56
l_inc



Joined: 23 Oct 2009
Posts: 868

system error wrote:
Because in the end everybody will come up with "it depends" answer without coming to anything conclusive.


The conclusion "it depends" is worth nothing unless that particular "everybody" explains what exactly it depends on.

system error wrote:
So in this kind of micro-optimization between instructions, a synthetic test is the most appropriate measure


Again, this measure on its own gives contradictory results and is therefore useless.

_________________
Faith is a superposition of knowledge and fallacy
Post 09 Dec 2016, 02:26
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 14585
Location: Planet Dirt
l_inc: Your synthetic tests also prove nothing other than "it depends". Which in itself can be useful for people to realise that there is no general answer. Every application is different, so every result will be different.

system error: The answer for most "which is better?" questions we see here is, unfortunately, "it depends". That's just the way it is. It is better to have a "proper" answer than to lie and pretend that we can know that A is always "better" than B.

So let's review:

Test it. Testing is good. But test in your final application on the target platform, after everything is working.

Fully expect the test to show no significant difference when choosing between minor things like jcc vs cmovcc. Perhaps you will be lucky and your application has a very hot loop where it makes a huge difference, and testing will show this to you, so great. But you won't know until you test it.
Post 09 Dec 2016, 02:57
Copyright © 2004-2016, Tomasz Grysztar.