flat assembler
Message board for the users of flat assembler.
![]() Goto page 1, 2, 3 Next |
Author |
|
vivik 08 Aug 2017, 10:25
So, 0x2E and 0x3E prefixes are actually useful or not?
|
|||
![]() |
|
vivik 08 Aug 2017, 10:47
You keep telling "try for yourself", but testing on just one or two computers is pretty useless.
Maybe I'll ask people here to test those, on their machines... Wonder how many people are willing to do that. |
|||
![]() |
|
revolution 08 Aug 2017, 10:56
Without the tests you can't know if it is worthwhile. You might be including code that has no effect, or has a negative effect.
Test in the expected end-use case. If you don't know the end-use case then I don't know what to suggest. But guessing at the outcome would not be productive, there are too many interacting subsystems inside the CPUs that things become complicated very quickly. |
|||
![]() |
|
vivik 08 Aug 2017, 11:13
I'm making a game, it's expected to run on all shitty computers without problems. I own only two relatively shitty ones, not shitty enough though.
|
|||
![]() |
|
revolution 08 Aug 2017, 11:22
If it is code that relies upon realtime responses then I would suggest that if the tiny improvement you might get from using branch hints is a make-or-break situation then you need to rethink the approach. There should be sufficient headroom to allow for unpredictable variances.
I'd expect branch hints to be useful where you have some long running task and would like it to finish a bit faster. In such cases the hardware is already known and appropriate tests can easily be run. |
|||
![]() |
|
vivik 08 Aug 2017, 11:53
I just don't want to do something stupid like ignoring something that cpu makers did for me.
Okay, so, this link https://stackoverflow.com/questions/14332848/intel-x86-0x2e-0x3e-prefix-branch-prediction-actually-used tells that those are no longer used, since pentium 4. But this page still metions those, for some reason https://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts So, pentium 4 was the first and last processor that actually used those prefixes? |
|||
![]() |
|
Furs 08 Aug 2017, 11:57
I was going to mention http://www.agner.org/optimize/, but it was already linked in the first link on stackoverflow you saw. Yes they're no longer used.
It's a good thing because it abstracts the whole thing, making newer CPUs able to do more without having to recompile the code at all. Plus, it's one byte less bloat. ![]() |
|||
![]() |
|
revolution 08 Aug 2017, 12:06
The P4 was "special" in that the pipeline was very long to facilitate higher clock speeds during the clock speed wars. Also the branch table, and other enhancements, were much more limited. So Intel decided to pass the responsibility on to the programmer to ensure good performance in critical loops. Since then things have changed, radically, multiple times. Things are very different today. And this is why I encourage testing. Because tomorrow when someone encounters this discussion with their new-fangled CPU the answer will still be valid: Test it. Don't rely on some long since outdated Intel article. Also, there is no guarantee the branch hints won't be reintroduced in the future.
|
|||
![]() |
|
vivik 08 Aug 2017, 12:11
His microarchitecture.pdf is so detailed, I wouldn't be able to digest it. I guess I'll believe stackoverflow then.
I found some C and C++ code that uses a lot of likely() unlikely() in it. I thought they actually add those prefixes, but looks like they just rearrange branches around. What's the point of those when, I can rearrange it by hand, it'll even be more readable. |
|||
![]() |
|
Furs 08 Aug 2017, 12:24
Likely/unlikely are high-level code hints, nothing to do with prefixes (unless you compile for P4 I guess). What do you mean with re-arranging by hand? The compiler performs a lot of transformations of your code. A branch could even be eliminated. You can't know that or shouldn't rely on that.
They don't always have an effect, but with HLL code it's to specify intent more so than an actual, direct effect in the output. It could re-arrange it or it could not, it depends on its optimizations. Don't rely on this behavior. For example a branch doesn't need to be marked unlikely for it to be marked "cold" -- if it calls a cold function only, it is also marked cold in GCC. |
|||
![]() |
|
vivik 08 Aug 2017, 12:37
Um, that intel link says "when writing if-else or switch statements, check the most common cases first and work progressively down to the least common".
Doesn't that mean that this code Code: if (someFunc()) { expectedThis(); } else { notExpectedThis(); } and this code Code: if (likely(someFunc())) { expectedThis(); } else { notExpectedThis(); } are equivalent? And maybe this one is same too. Code: if (unlikely(someFunc())) { notExpectedThis(); } else { expectedThis(); } The library I've seen it is supposed to be well tested, so I wonder wth. Yes, I'll test it, I'll test it. This one is actually possible to test. Not sure how I'm supposed to study compiler's output, gdb is a torture device. Last edited by vivik on 08 Aug 2017, 12:41; edited 1 time in total |
|||
![]() |
|
revolution 08 Aug 2017, 12:40
If you test it I think you will find the difference is unmeasurable. Really. These things make no measurable difference for most cases. These are meant for micro-improvements in critical tight loops.
|
|||
![]() |
|
Furs 08 Aug 2017, 16:17
vivik wrote: Um, that intel link says "when writing if-else or switch statements, check the most common cases first and work progressively down to the least common". likely() is a hint to the compiler's optimizer. A compiler's optimizer is allowed to change and transform your code as long as the output is the same as defined by the language. It cannot change the chain of ifs differently because then the program can do something else (imagine if a function has side effects, like printing something) and the language explicitly forbids it. likely() won't change the if-else chain, it will only hint the optimizer to optimize differently -- for example, unrolling the likely branch even if it doesn't happen as long as no side-effects are done (i.e. function calls with side effects won't participate in this optimization). For switch statements, it doesn't make much sense, especially for large ones. They will be mostly implemented as a lookup jump table. ![]() |
|||
![]() |
|
vivik 09 Aug 2017, 19:28
Tried an example from this page http://blog.man7.org/2012/10/how-much-do-builtinexpect-likely-and.html
Assembly looks the same for correct and incorrect prediction. Result is also mostly same. Weird. |
|||
![]() |
|
Furs 09 Aug 2017, 20:34
__builtin_expect (what likely/unlikely use) is not about branch predicting in the CPU. It's about the compiler. Of course, the compiler could use such information to improve branch prediction in the CPU, but it does it anyway if it knows which path is likely or not -- various optimizations provide such information (and there's also profiling). That's in the optimization passes of course.
You can use __builtin_expect on variables as well though, if you expect them to have a certain value or be lower than something, the compiler could in theory generate code using such assumption as "likely". Either way it's a compiler thing and doesn't translate to anything in the CPU directly. For certain code, it may do nothing at all. (I mean, literally, same asm code output) |
|||
![]() |
|
system error 10 Aug 2017, 03:15
@vivik
The design of your code's branch prediction is NOT centered around the if-else constructs (or anything similar) per-se. The design is done much earlier and much bigger than that. By the time it reaches the if-else statement, it is probably, and at some point, useless to think about prediction/misprediction because the overall design of such code in question is probably not prediction-friendly to begin with in the first place. Let's take FuncA(). This is designed around a "Yes" logical path. So whatever you do in your if-else statement, this function will always be prediction-friendly function because it was designed around the "truth". Any "false" jumps will not mess up with the main path most of the time. FuncB(). No hot path consideration. It's up to the variables to decide the jumps. This will suffer the most and any micro design of if-else constructs inside it will not improve the speed much. |
|||
![]() |
|
system error 10 Aug 2017, 10:57
EDIT:
To reduce the risk of branch mispredictions, you need to introduce some bias in your code design, favoring the most likely path that can do the main task to be solved at hand. Letting the code freely decides what branches to take, without any bias from the programmers will not help you crafting a better if-else constructs. The main strategy is always to lay out the main sequence of logical constructs that can achieve the final objective and insert any "unlikely" cases later in the code, where necessary. For example, lets take a main "solution" sequence, biased (True case) towards a positive input from (RAX) and see how it can avoid misprediction that may occur. Code: mov rbx,10 xor rdx,rdx mul rbx nop nop add rax,1 ret Now, let's build an if-else construct around it (False case) Code: format elf64 public main main: mov rcx,100_000_000 @@: mov rax,5 ;-5 ;test value call t_biased_function sub rcx,1 jnz @b ret ;------------------------------------ ;In favor of T ;Design: RAX is around +ve value ;------------------------------------ t_biased_function: mov rbx,10 cmp rax,0 jl .ok ;don't take the jump xor rdx,rdx mul rbx nop nop .ok: add rax,1 ret The result Code: perf stat -B ./testcode 0.21xxxx secs for T (true) 0.26xxxx secs for F (false) This despite the facts that; i) T-case has far more codes ii) T-case has an obvious dependency chain (MUL/ADD) iii) T-case has a MUL instruction Last edited by system error on 10 Aug 2017, 13:12; edited 2 times in total |
|||
![]() |
|
Furs 10 Aug 2017, 11:22
I have no idea what your test is supposed to accomplish? Of course the mixed case is slower because the CPU fails to branch predict it. It's faster since in that case the CPU will branch predict it properly in all cases. Branch mispredictions are extremely expensive because the entire predicted operations in the pipeline have to be flushed when it is proven wrong. The more a branch is taken in one way (i.e. always taken or always not taken) the more the CPU can predict it the next time to be the same way. If it is the same way, it's 10-20+ clock cycles faster. Read Agner's pdf it explains this and the mispredicted hits.
The branch is not slower by itself. The fact is that the CPU does not wait for the operations to finish. Even if they have a dependency. It looks for more instructions to decode and execute. But which path should it decode? That's the whole point of branch prediction. If it's proven false, everything it decoded past that point is discarded. All that time was wasted. It has to start at the branch again after such a long time. Of course it's slower. Note that most CPUs assume the branch by default is not taken. So it's better to make it so that it's not taken in the hot case. After a large amount of branches the differences get smaller though. PS: your code is extremely badly written, I hope it was just for demonstration, it has a lot of needless bloat and useless instructions (even for what it does), a compiler would easily outdo it. Especially Intel's Compiler. xor rdx, rdx is useless also and can be removed completely -- it doesn't even make the code slower, except for worse cache (in this case it doesn't matter since it's small enough), since it just renames the register and doesn't use any port. (nevermind that it should be xor edx, edx -- one byte less to decode, better cache utilization). In general avoid the r* registers unless you absolutely need them, since they require REX prefix, and all operations on e* registers zero-extend by design anyway (GCC even emits xor r8d, r8d for example, even if it also uses REX prefix anyway). This is why "int" is 32-bit in C on amd64, since it's what type promotion would use (almost all operations, except those using int64_t or pointers, are 32-bit). Also add reg,1 is a stupid antiquated practice from the days of P4, please use inc/dec. The only time they will ever be slower is when the add would break the code (depending on carry flag being unchanged). All decent processors track flag bits individually, except P4. Intel's Compiler, obviously, emits inc/dec. Fairly sure they know their CPUs best. |
|||
![]() |
|
system error 10 Aug 2017, 11:36
^
^ as usual, ALL TALK. No proof. Wait till that guy above me quoting something from Microsoft / Intel to make him looks "smart" and "important" while he himself remain largely clueless and probably takes 2 full threads to finally understand the simplest thing even a Gorilla can understand! xD |
|||
![]() |
|
Goto page 1, 2, 3 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2023, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.