flat assembler
Message board for the users of flat assembler.

IPC on modern processors
Hrstka



Joined: 05 May 2008
Posts: 18
Location: Czech republic
Recently, I've been playing with AMD μProf in order to speed up some code. I have noticed that IPC (instructions per cycle) is mostly between 2 and 3 for compiler-generated code, whereas my hand-optimized assembly code can execute about 4-7 instructions per cycle. So I would like to ask if it's possible to reach an even higher IPC or if I'm already near the limits.
Post 03 Feb 2020, 15:11
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17248
Location: In your JS exploiting you and your system
You won't be able to get the IPC[*] above the number of execution ports in the CPU core. If your CPU core has six execution ports then your maximum IPC will be six. Note that not all the ports are equal: some are ALU, some are LOAD or STORE, maybe one is for FPU, etc. So you often need to mix the instruction types to reach the maximum and keep all the ports saturated continuously.

The yet-to-be-released Mill CPU is claimed to achieve an IPC of 30+.

* Some CPUs can fuse two instructions into one issue cycle. For example, CMP+Jcc could be issued in a single cycle to a single port. Whether this happens depends on your CPU.
Post 03 Feb 2020, 16:16
bitRAKE



Joined: 21 Jul 2003
Posts: 2900
Location: [RSP+8*5]
I'm amazed at the improvements which continue to be made. I wouldn't count instruction prefixes, but Ryzen can discard any number of redundant prefixes during decode. I'm not sure how to test it, but I think some instructions never make it to the ports.

There is a class of instructions that only require renaming, and AMD has also removed the false dependencies. Things like: mov* reg0,reg1; xor* reg,reg; cmpeq* reg,reg. It should be possible to get 8 IPC out of the op cache on Zen2. Sustaining that while doing anything useful is basically impossible.
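
As a rough illustration of that class, the fragment below lists a few such instructions in fasm syntax. Whether each one is actually resolved at rename depends on the microarchitecture, so treat the comments as assumptions to check against the vendor's optimization guide. The register choices are arbitrary.

```asm
; Fragment only: examples of the mov* / xor* / cmpeq* patterns named above.
        mov     rax,rbx                 ; mov* reg0,reg1: integer register copy
        movaps  xmm0,xmm1               ; mov* reg0,reg1: vector register copy
        xor     ecx,ecx                 ; xor* reg,reg: zeroing idiom, result does not depend on old ECX
        pxor    xmm2,xmm2               ; xor* reg,reg: vector zeroing idiom
        pcmpeqd xmm3,xmm3               ; cmpeq* reg,reg: all-ones idiom, result does not depend on old XMM3
```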

https://en.wikichip.org/wiki/File:zen_2_core_diagram.svg

_________________
¯\(°_o)/¯ unlicense.org
Post 04 Feb 2020, 00:31
Hrstka



Intel apparently uses CPI (cycles per instruction), which is the reciprocal of IPC. According to this web page, a CPI value of 0.75 is considered good, but that corresponds to an IPC of only about 1.33, so I think this information is a bit outdated. Even old Pentiums 20 years ago were able to execute 2 instructions in one cycle. Anyway, thanks for the comments. My Ryzen processor probably has 8 execution ports, so I guess that an IPC above 5.5 is actually pretty good.
Post 05 Feb 2020, 12:08
bitRAKE



Well, let us design an artificial loop to test:
Code:
align 16
_more:
; ... instructions under test ...
sub rcx,1
jnz _more
We know the SUB/JNZ pair gets fused into one op, and we can easily make the timing overhead negligible by increasing the loop count. How much can we put in the loop and still execute one iteration in a single cycle? The class of instructions I mentioned is at the top of the utilization list. NOP is in there, as are all the other reg/reg XCHGs.

From there we can build up a library of kernels with efficient execution properties. Most of these kernels will be completely worthless. Exploiting symmetries we can quickly prune the search space.
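
One such kernel could be sketched as a complete program, so that a profiler can report its IPC. This is only a sketch under stated assumptions: it targets x86-64 Linux (the ELF64 format and the sys_exit call are not from the thread), and the iteration count and the choice of filler instructions are arbitrary. It prints nothing; the idea is to run it under AMD μProf or `perf stat` and read the measured IPC there.

```asm
; Sketch only: loop kernel for IPC experiments, assuming x86-64 Linux.
format ELF64 executable 3
entry start

segment readable executable

start:
        mov     ecx,1000000000          ; iteration count (arbitrary), zero-extended into RCX
align 16
_more:
        mov     rax,rbx                 ; reg-reg move
        xor     edx,edx                 ; zeroing idiom
        mov     rsi,rdi                 ; reg-reg move
        xor     r8d,r8d                 ; zeroing idiom
        nop
        sub     rcx,1                   ; loop counter, adjacent to JNZ so the pair can fuse
        jnz     _more

        xor     edi,edi                 ; exit status 0
        mov     eax,60                  ; sys_exit on x86-64 Linux
        syscall
```

Adding or removing filler instructions in the loop body and re-measuring shows where the IPC stops scaling on a given CPU.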

Zen2 has 11 ports: 7 integer and 4 float (refer to the diagram). They are designed for SMT, though. If compiled code is only getting 3-4 IPC, then memory is more of a bottleneck than execution resources, unless all the threads are trying to do the same thing. SMT works best with heterogeneous workloads.

Correctly scheduling the U/V pipes on Pentiums was an art form. Two IPC was an asymptote few got close to in small loops. Here is a page that shows some of that work (starts about 1/3 down the page). http://www.azillionmonkeys.com/qed/asmexample.html

Post 06 Feb 2020, 05:58
revolution
In general, to take advantage of alignment you would want to align to a multiple of the cache line size. For current desktop and laptop CPUs that is probably going to be 64 bytes. But check your CPU directly to confirm that.

The cache line size is readable at runtime, so you can have a few prewritten sections of code that are aligned to different line sizes and select which of those to execute to match the line size.
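
A minimal sketch of such a runtime check, assuming an x86-64 CPU that supports CLFLUSH and therefore reports its line size in CPUID leaf 1 (EBX bits 15:8, in units of 8 bytes). The labels `select_kernel`, `kernel_64` and `kernel_generic` are hypothetical placeholders for the prewritten code sections, so this is a fragment to be placed in a larger program rather than a complete one.

```asm
; Sketch only: pick a code path by cache line size.
; kernel_64 / kernel_generic are hypothetical labels, not from the thread.
select_kernel:
        push    rbx                     ; CPUID overwrites EBX, which is callee-saved
        mov     eax,1                   ; CPUID leaf 1: feature information
        cpuid                           ; also overwrites EAX, ECX and EDX
        shr     ebx,8
        and     ebx,0FFh                ; EBX[15:8] = CLFLUSH line size / 8
        shl     ebx,3                   ; EBX = line size in bytes
        cmp     ebx,64
        pop     rbx                     ; POP does not modify the flags
        je      kernel_64               ; variant laid out for 64-byte lines
        jmp     kernel_generic          ; fallback for any other line size
```

Each variant would then start with the matching `align` directive, for example `align 64` ahead of `kernel_64`.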
Post 06 Feb 2020, 06:55
bitRAKE



Interesting paper:

Engineering Faster Sorters for Small Sets of Items
https://arxiv.org/abs/2002.05599

Instruction cache pressure could be reduced by using only 32-bit compares, with the high DWORD being the reference values.

What about sorting byte/word values in SIMD registers? It would need some pre-processing to adapt the data to the problem. I think it has already been done on GPUs.
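
One building block for that idea is a lane-wise compare-exchange, the basic step of a sorting network. A minimal sketch for unsigned bytes using SSE2 follows. The register assignment is arbitrary, and this is not taken from the paper.

```asm
; Fragment only: compare-exchange of 16 unsigned byte lanes.
; In:  xmm0, xmm1 = 16 unsigned bytes each
; Out: xmm0 = lane-wise minimum, xmm1 = lane-wise maximum
        movdqa  xmm2,xmm0               ; keep a copy of the first input
        pminub  xmm0,xmm1               ; xmm0[i] = min(a[i], b[i])
        pmaxub  xmm1,xmm2               ; xmm1[i] = max(a[i], b[i])
```

This only orders corresponding lanes of two registers, so the data has to be arranged beforehand so that the elements to be compared sit in matching lanes.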
Post 15 Feb 2020, 01:18

Copyright © 1999-2020, Tomasz Grysztar.
