flat assembler
Message board for the users of flat assembler.

Index > Main > Instructions cost

Overclick



Joined: 11 Jul 2020
Posts: 394
Location: Ukraine
All of us need to know opcode lengths and how many cycles instructions take. Where can I find this simple but very important information? All I've got is this picture. I'm not asking about parallel optimisation, jump predictors and all that blah-blah-blah. I just want to know the cycle cost.
How does bit shifting work, for example? Does it take a cycle per shifted bit, or does it handle all of the bits at once?


Attachment: part101_infographics_v08.png (118.91 KB)

Post 12 Jun 2021, 20:16
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18222
Location: In your JS exploiting you and your system
It isn't so easy to talk about cycle counts of a single instruction in a modern high performance CPU.

The Out-Of-Order (OOO) core makes the timings variable. The same instruction at the same address can take different times depending upon the context of previous and future instructions in the queues, and the bottlenecks in other parts of the CPU.
Post 12 Jun 2021, 23:03
Overclick



Joined: 11 Jul 2020
Posts: 394
Location: Ukraine
As I said, that is a secondary question everyone understands, but basics are basics. If a "shift" costs more than a "move reg to reg", I'd prefer to prepare the required value in some spare register, for example.
Post 12 Jun 2021, 23:54
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18222
Location: In your JS exploiting you and your system
You can't know the cost until you execute it.

The cost is not fixed. Sometimes it will be cheaper, sometimes more expensive.

Static analysis like that simply isn't possible. Not because we are too dumb to figure it out, but because the internal state of the CPU is unknown and unknowable at compile time.
Post 13 Jun 2021, 00:23
bitRAKE



Joined: 21 Jul 2003
Posts: 3307
Location: vpcmipstrm
Here are some resources if you want comparative measurements, though they should be taken with a grain of salt.

https://www.uops.info/table.html
https://asmjit.com/asmgrid/
http://instlatx64.atw.hu/

..and of course, Agner Fog's wonderful resource:
https://www.agner.org/optimize/instruction_tables.pdf

Things get complicated very quickly - even just reading from memory has its own dynamics: the first access, then whether the line has since been evicted from cache, etc. Developing a test framework is the only practical way if you are serious.

Is CDQ the fastest way to write to a register?

_________________
¯\(°_o)/¯ unlicense.org


Last edited by bitRAKE on 13 Jun 2021, 06:08; edited 2 times in total
Post 13 Jun 2021, 05:57
Overclick



Joined: 11 Jul 2020
Posts: 394
Location: Ukraine
Quote:

The cost is not fixed

But it's easy to measure with test code: just execute a loop that fits 100 per cent inside L1 and in which each step needs the result of the previous one. A simple benchmark, isn't it? The other optimisation tricks are not the question at the moment.
Post 13 Jun 2021, 06:03
bitRAKE



Joined: 21 Jul 2003
Posts: 3307
Location: vpcmipstrm
Post 13 Jun 2021, 06:08
Overclick



Joined: 11 Jul 2020
Posts: 394
Location: Ukraine
bitRAKE Thanks man.
Post 13 Jun 2021, 06:13
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18222
Location: In your JS exploiting you and your system
Overclick wrote:
Quote:

The cost is not fixed

But it's easy to measure with test code: just execute a loop that fits 100 per cent inside L1 and in which each step needs the result of the previous one. A simple benchmark, isn't it? The other optimisation tricks are not the question at the moment.
You can benchmark, but not with simulation code. Benchmark the normal functioning program in its normal use case.

If you use some sort of test code that only ever does one thing 1000000 times in a loop, then the results will only be valid for that code, on that CPU, running that OS, etc., and will be useless when it is run inside the real application. Real applications do other things that alter the internal state and thus alter all the timings.

That is what is meant by the timings not being fixed. All the timings depend upon other things happening in the CPU: what happened in the past (things like cache state and resource port availability) and what happens in the future (the OOO engine might do something else first). So a single-purpose test case won't show results that mean anything outside of that test case.

If your timing measurements in the normal use case don't show any reliable results (which is quite common IME), then it doesn't matter whether you choose add over mov or whatever, because other stuff happening in the system has a much larger effect and you can never know if add vs mov is even making a difference.
Post 13 Jun 2021, 07:09
sinsi



Joined: 10 Aug 2007
Posts: 738
Location: Adelaide
On another forum we were obsessed with clock cycles for a few years.
Eventually we gave up because of the variation in timings.
The exact same basic setup (CPU, memory, Windows version) would give different results.
Post 13 Jun 2021, 08:13
bitRAKE



Joined: 21 Jul 2003
Posts: 3307
Location: vpcmipstrm
AMD has some good tools for profiling code - signed drivers to read the performance counters, heat maps, power profiling, ... (that's something I haven't tried doing - coding explicitly to reduce power consumption)
https://developer.amd.com/amd-uprof/

Intel probably has something similar.

Unless it's for learning, one really has to ask oneself: "Is this going to execute billions of times?" Alternatively, if you have a tool that simply displays the performance data as you code, you learn rough patterns - which are biased/limited by those rules.

Processor improvements have been more a selection of trade-offs for a while now. That's why we see cycle counts increase for some instructions and decrease for others. I think quantum tunneling is the only way we know of to transfer information without generating heat (or was it entanglement - I need to find a PhD to ask).

_________________
¯\(°_o)/¯ unlicense.org
Post 13 Jun 2021, 14:41
Overclick



Joined: 11 Jul 2020
Posts: 394
Location: Ukraine
Agner Fog's tables are exactly the info I was searching for. The rest is code optimisation, I know that. My own benchmark is my audio engine. The problem is I cannot rebuild it all the time just to give up again and again; I've been doing that for a full month. Now I see which instructions are going to be my favourites and can optimise my work around them. No division at all, for example.
Still an open question about bit shifting: does it take a cycle for each bit or not? I'll see soon, but in any case it is the best alternative to division, which is one of the slowest instructions in the CPU, integer or float.
One more question, about addressing variants: mov [rdi+rcx*4],r32
How does the CPU handle this kind of addressing? Does it calculate the address like normal add/mul instructions, or does it use some other mechanism?
Post 13 Jun 2021, 19:10
bitRAKE



Joined: 21 Jul 2003
Posts: 3307
Location: vpcmipstrm
IIRC, address calculation is done by the AGU/ACU (address generation unit) - a specialized ALU; modern processors have a couple of them. Heavy contention on them can cost extra cycles, but the processor always seems to move stuff around once the instructions are broken down (it seems a low priority for optimization to me).

Where complex addressing sometimes is an issue is instruction decoding: the instructions are longer, and simpler addressing might decode faster. This is also less of an issue on modern processors due to micro-op caching - decoding only happens the first time through.

shr reg/mem,1
shr reg/mem,cl
shr reg/mem,#

...are all different encodings, and some processors have different paths for each of them. You'd need to look them up or test.

Sometimes it helps to look at a processor block diagram:
https://en.wikichip.org/wiki/intel/microarchitectures/skylake
https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram

Try to understand what each part is responsible for. There will always be bubbles of inactivity in the pipeline, but sometimes you will find more efficient utilization of resources - first you must know what resources are available.

_________________
¯\(°_o)/¯ unlicense.org
Post 13 Jun 2021, 23:50
Overclick



Joined: 11 Jul 2020
Posts: 394
Location: Ukraine
Seems like the classic movs instructions are slow too, even with rep. It takes 4 MOPs in the pipeline. Interesting...

What is the difference between movaps and movdqa?
Post 14 Jun 2021, 07:56
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18222
Location: In your JS exploiting you and your system
Overclick wrote:
Seems like the classic movs instructions are slow too, even with rep. It takes 4 MOPs in the pipeline. Interesting...
These times are not absolute. Other CPUs, either older or newer, have different timings. Also, the rate can be "faster" when the R/E/CX value is "large". Check the full Intel/AMD docs to see how it is implemented for each CPU.

And, as usual, always test your code to see if there is any effect. Just reading the docs is not enough.
Post 14 Jun 2021, 08:35
Furs



Joined: 04 Mar 2016
Posts: 1647
Overclick wrote:
What is the difference between movaps and movdqa?
Some CPUs assign a "domain" (floating or integer) to a register depending on the instruction. Having the wrong domain and then executing an instruction from the other domain causes a small latency penalty. However, movapd is useless: no current CPU cares about the difference, since movaps and movapd both use the floating-point domain.

movaps is better since it's smaller, so use it whenever you can. Use movdqa if you work with integers and don't mind larger code; otherwise movaps works fine - I'd use it all the time, honestly.

It's a different story if you use the AVX-prefixed versions, though. Then movaps is not smaller.
Post 14 Jun 2021, 13:07
Overclick



Joined: 11 Jul 2020
Posts: 394
Location: Ukraine
Thanks mates
The engine is almost recreated: 52224000 divided values per second, which is 13056000 divps instructions. Seems it will burn my CPU again ))
I just want to see it before I completely move to bit shifting...
I'll have to start a new discussion about audio filters; it seems I'm doing them the wrong way. Simple counters aren't so simple (for performance).
Post 15 Jun 2021, 11:26
Overclick



Joined: 11 Jul 2020
Posts: 394
Location: Ukraine
Finally it runs a few times faster than the APO I used before. No need to panic. I only did global optimisations and it already gives impressive performance. There are a lot of little things I can improve locally too.
Post 15 Jun 2021, 20:06
Melissa



Joined: 12 Apr 2012
Posts: 125
Overclick wrote:
All of us need to know opcode lengths and how many cycles instructions take. Where can I find this simple but very important information? All I've got is this picture. I'm not asking about parallel optimisation, jump predictors and all that blah-blah-blah. I just want to know the cycle cost.
How does bit shifting work, for example? Does it take a cycle per shifted bit, or does it handle all of the bits at once?


You shouldn't think about micro-optimisations; that is a job for the compiler.
A human beats the compiler by making a faster algorithm.
Post 18 Sep 2021, 11:53


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.