flat assembler
Message board for the users of flat assembler.
> Main > Instructions cost
revolution 12 Jun 2021, 23:03
It isn't so easy to talk about cycle counts of a single instruction in a modern high-performance CPU.
The Out-Of-Order (OOO) core makes the timings variable. The same instruction at the same address can take different times depending upon the context of previous and future instructions in the queues, and on bottlenecks in other parts of the CPU.
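As a rough illustration of what the OOO core does (a sketch in assumed fasm syntax, not from the thread): two loops with the same instruction count can retire at very different rates depending on whether the work can be overlapped:

```asm
; Loop A: every add depends on the previous one - a serial chain,
; so the OOO core cannot overlap them (latency-bound).
        mov rcx, 1000000
        xor rax, rax
chainA:
        add rax, 1
        add rax, 1
        add rax, 1
        add rax, 1
        dec rcx
        jnz chainA

; Loop B: the same number of adds, but to independent registers -
; the OOO core can issue them in parallel (throughput-bound).
        mov rcx, 1000000
        xor rax, rax
        xor rbx, rbx
        xor rdx, rdx
        xor rsi, rsi
chainB:
        add rax, 1
        add rbx, 1
        add rdx, 1
        add rsi, 1
        dec rcx
        jnz chainB
```

The instructions are identical; only the dependency pattern differs, which is one reason a single per-instruction "cost" number cannot exist.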
Overclick 12 Jun 2021, 23:54
As I said, that is a secondary question that everyone understands, but the basics are the basics. If "shifting" costs more than "move reg to reg", I would prefer to prepare the required value in some empty register, for example.
revolution 13 Jun 2021, 00:23
You can't know the cost until you execute it.
The cost is not fixed. Sometimes it will be cheaper, sometimes more expensive. Static analysis like that simply isn't possible - not because we are too dumb to figure it out, but because the internal state of the CPU is unknown and unknowable at compile time.
bitRAKE 13 Jun 2021, 05:57
Some resources if you want comparative measurements, but they should be taken with a grain of salt.
https://www.uops.info/table.html
https://asmjit.com/asmgrid/
http://instlatx64.atw.hu/
..and of course, Agner Fog's wonderful resource: https://www.agner.org/optimize/instruction_tables.pdf
Things get complicated very quickly - just reading from memory has the dynamics of the first read, and then of when it will be evicted from cache, etc. Developing a test framework is the only practical way if you are serious.
Is CDQ the fastest way to write to a register?
_________________
¯\(°_o)/¯ "languages are not safe - uses can be" Bjarne Stroustrup
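For context on the CDQ question, here is a hedged sketch (assumed fasm syntax) of what CDQ writes and the usual alternatives, with encoding sizes from the instruction reference:

```asm
; CDQ sign-extends EAX into EDX: EDX = (EAX < 0) ? -1 : 0.
; It is a single-byte opcode (99h), so it is very compact,
; but it carries a dependency on EAX.
        mov eax, -5
        cdq                 ; edx = 0FFFFFFFFh

; Common alternatives for writing a known value to a register:
        xor edx, edx        ; edx = 0; 2 bytes, recognized zeroing idiom
        mov edx, -1         ; edx = -1; 5 bytes, no input dependency
```

Which one is "fastest" depends on the surrounding dependency chains, which is exactly why the tables above only give comparative numbers.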
Overclick 13 Jun 2021, 06:03
But it's easy to measure with test code: just execute a loop that fits 100 percent inside L1 and where each step needs the result of the previous one. A simple benchmark, isn't it? The other optimisation tricks are not the question at the moment.
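A minimal sketch of such a dependent-chain benchmark (assumed fasm syntax for 64-bit code; the loop body and iteration count are illustrative, and as noted elsewhere in the thread the numbers will still vary with CPU state):

```asm
; Time a serial dependency chain with RDTSC. Each shift depends on
; the previous result, so the loop measures latency, not throughput.
        rdtsc
        shl rdx, 32
        or  rax, rdx
        mov r8, rax          ; r8 = start timestamp

        mov rcx, 100000000
        mov rax, 1
.loop:
        shl rax, 1           ; depends on the previous iteration
        shr rax, 1
        dec rcx
        jnz .loop

        rdtsc
        shl rdx, 32
        or  rax, rdx
        sub rax, r8          ; rax = elapsed reference cycles
```

Note that RDTSC counts reference cycles, not core cycles, and is not serializing; fencing or a serializing instruction around it tightens the measurement.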
bitRAKE 13 Jun 2021, 06:08
Check out these ...
https://github.com/nidud/asmc/tree/master/source/test/benchmark |
Overclick 13 Jun 2021, 06:13
bitRAKE: Thanks, man.
revolution 13 Jun 2021, 07:09
Overclick wrote:
If you use some sort of test code that only ever does one thing 1000000 times in a loop, then the results will only be valid for that code, on that CPU, running that OS, etc., and be useless for when it is run inside the real application. Real applications do other things that alter the internal state and thus alter all the timings. That is what is meant by the timings not being fixed.

All the timings are dependent upon other things happening in the CPU, and also upon what happened in the past (i.e. things like cache state and resource port availability) and what happens in the future (i.e. the OOO engine might do something else first). So a single-purpose test case won't show results that mean anything outside of that test case.

If your timing measurements in the normal use case don't show any reliable results (which is quite common IME), then that means it doesn't matter whether you choose add over mov or whatever, because other stuff happening in the system has a much larger effect and you can never know if add vs mov is even making a difference.
sinsi 13 Jun 2021, 08:13
On another forum we were obsessed with clock cycles for a few years.
Eventually we gave up because of the variation in timings. The exact same basic setup (CPU, memory, Windows version) would give different results.
bitRAKE 13 Jun 2021, 14:41
AMD has some good tools for profiling code - signed drivers to read the performance counters, heat maps, power profiling, ... (that's something I haven't tried doing - coding explicitly to reduce power consumption)
https://developer.amd.com/amd-uprof/
Intel probably has something similar.
Unless it's for learning, one really has to ask oneself, "Is this going to execute billions of times?" Or, if you have a tool that just displays the performance data as you code, you learn rough patterns which are biased/limited by those rules.
Processor improvements have been more a selection of trade-offs for a while. That's why we can see cycle increases in some instructions and decreases in others. I think quantum tunneling is the only way we know how to transfer information without generating heat (or was it entanglement - need to find a PhD to inquire).
Overclick 13 Jun 2021, 19:10
Agner Fog's tables are exactly the info I was searching for. The rest is code optimisation; I know that. My own benchmark is my audio engine. The problem is I cannot rebuild it all the time just to give up again and again - I have been doing that for a full month. Now I see which instructions are going to be my favourites and can optimise my work with them. No divide at all, for example.
An open question about bit shifting: will it take a cycle for each bit or not? I'll see soon, but in any case it is the best alternative to division, which is one of the slowest instructions in a CPU, no matter whether integer or float.
One more question, about address variations:
mov [rdi+rcx*4], r32
How does the CPU work with this kind of addressing? Will it calculate the address with normal add/mul operations, or does it use some other mechanism?
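For the power-of-two case being alluded to, a shift can replace a divide, and the scaled-index address is computed inside the memory operand itself (a sketch in assumed fasm syntax; the shift equivalence holds for unsigned values, and signed values need a fix-up):

```asm
; Unsigned divide by 8 via shift: value / 8 == value >> 3 for unsigned.
        mov eax, 200
        shr eax, 3          ; eax = 25, no DIV issued

; Scaled-index addressing: rdi + rcx*4 is computed by the AGU as part
; of the memory operand - no separate add/mul instruction is needed.
        mov [rdi + rcx*4], eax

; The same effective address built explicitly with LEA:
        lea rbx, [rdi + rcx*4]
        mov [rbx], eax
```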
bitRAKE 13 Jun 2021, 23:50
IIRC, address calculation is done by the AGU (address generation unit) - a specialized ALU - and modern processors have a couple of them. Heavy contention on them can cost extra cycles, but the processor always seems to move stuff around once the instructions are broken down (it seems a low priority for optimization to me).
Where complex addressing sometimes is an issue is instruction decoding - the instructions are longer, and simpler addressing might decode faster. This is also less of an issue in modern processors due to micro-op caching -- decode would only happen the first time through.
shr reg/mem, 1
shr reg/mem, cl
shr reg/mem, #
...are all different encodings. Some processors have different paths for each of them. You need to look them up/test.
Sometimes it helps to look at a processor block diagram:
https://en.wikichip.org/wiki/intel/microarchitectures/skylake
https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram
Try to understand what each part is responsible for. There will always be bubbles of inactivity in the pipeline, but sometimes you will find more efficient utilization of resources - first you must know what resources are available.
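The three shift forms really do assemble to distinct encodings (shown here for the 32-bit register form, per the x86 opcode reference):

```asm
        shr eax, 1      ; D1 E8     - dedicated shift-by-1 opcode, 2 bytes
        shr eax, cl     ; D3 E8     - shift by CL, 2 bytes
        shr eax, 5      ; C1 E8 05  - shift by immediate, 3 bytes
```

Whether any of these is internally faster than the others is processor-specific, which is why testing on the target CPU is still required.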
Overclick 14 Jun 2021, 07:56
It seems the classic movs instructions are slow too, even with rep. It takes 4 MOPs in the pipeline. Interesting...
What is the difference between movaps and movdqa?
revolution 14 Jun 2021, 08:35
Overclick wrote:
Seems like classical movs instructions are slowly too even by rep. It takes 4 MOPs at conveyor. Interesting...
And, as usual, always test your code to see if there is any effect. Just reading the docs is not enough.
Furs 14 Jun 2021, 13:07
Overclick wrote:
What is a different between movaps vs movdqa?
movapd is useless; movaps is better since it's smaller, so always use it. Use movdqa if you work with integers and don't mind larger code, but otherwise movaps works fine - I'd use it all the time, honestly. It's a different thing if you use the AVX-prefixed versions, though; then it's not smaller.
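The size difference comes from the mandatory 66h prefix (encodings sketched for the register-to-register forms):

```asm
        movaps xmm0, xmm1   ; 0F 28 C1      - 3 bytes, no mandatory prefix
        movapd xmm0, xmm1   ; 66 0F 28 C1   - 4 bytes
        movdqa xmm0, xmm1   ; 66 0F 6F C1   - 4 bytes
; With VEX (AVX) encoding the size advantage disappears:
        vmovaps xmm0, xmm1  ; C5 F8 28 C1   - 4 bytes
```

All three legacy forms copy the same 128 bits; the float/integer distinction only matters on microarchitectures that penalize crossing between floating-point and integer execution domains.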
Overclick 15 Jun 2021, 11:26
Thanks, mates.
The engine is almost recreated: 52224000 divided values per second, 13056000 divps instructions. It seems it will burn my CPU again )) I just want to see it before completely moving to bit shifting... I'll have to start a new discussion about audio filters; it seems I am doing it the wrong way. Simple counters aren't simple (for performance).
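One common way to cut down a divps-heavy inner loop, offered here only as a hedged sketch (the thread does not say this is what the engine does): when the divisor is reused across samples, divide once up front and multiply thereafter:

```asm
; Divide a stream of packed floats by a reused divisor:
; mulps generally has much lower latency and higher throughput
; than divps on recent cores.
        movaps xmm7, dqword [divisor4]  ; 4 copies of the divisor
        rcpps  xmm7, xmm7               ; fast ~12-bit reciprocal
                                        ; (compute 1.0/d with divps
                                        ;  once if full precision
                                        ;  is required)
.sample_loop:
        movaps xmm0, [rsi]
        mulps  xmm0, xmm7               ; x * (1/d) instead of x / d
        movaps [rdi], xmm0
        add rsi, 16
        add rdi, 16
        dec rcx
        jnz .sample_loop
```

For audio, whether the reduced precision of rcpps is acceptable depends on the signal path, so the one-time divps reciprocal is the safer variant.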
Overclick 15 Jun 2021, 20:06
Finally it runs a few times faster than the APO I used before. No need to panic. I only did global optimisations and it already gives impressive performance. There are a lot of little things I can improve locally too.
Melissa 18 Sep 2021, 11:53
Overclick wrote:
All of us need to know opcode length and how many cycles instructions need. Where can I find this simple but very important information? All I've got this picture only. I don't ask about parallel optimisation and jump predictor blah-blah-blah. I just want to know cycles cost.
You shouldn't think about micro-optimisations; that is a job for the compiler. A human beats the compiler by making a faster algorithm.
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.