flat assembler
Message board for the users of flat assembler.

Index > Main > Intel plans doubling 16 general purpose registers to 32

Goto page Previous  1, 2, 3, 4  Next
Furs



Joined: 04 Mar 2016
Posts: 2542
Furs 02 Aug 2023, 13:16
revolution wrote:
If it isn't about performance then I am at a loss what it is about.
If something is micro-coded, you can bet your ass they didn't do it for performance.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20410
Location: In your JS exploiting you and your system
revolution 02 Aug 2023, 14:11
You are very focussed upon the instruction timing as some sort of proxy for performance. Have you ever measured the performance of a real application and deliberately used "slow" instructions to see how things change? If not, then you might be surprised to learn that for many applications it makes no difference. There is a lot more to application performance than a naive reading of the instruction latency/throughput tables.

You appear to be arguing against something that hasn't even happened. Please show your working as to how you come to the conclusion that Intel will kill themselves and let AMD "win". Perhaps AMD will also do this, then do we boycott everything? Razz
bitshifter



Joined: 04 Dec 2007
Posts: 796
Location: Massachusetts, USA
bitshifter 03 Aug 2023, 04:31
more registers = less memory IO = more performance
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20410
Location: In your JS exploiting you and your system
revolution 03 Aug 2023, 05:13
bitshifter wrote:
more registers = less memory IO = more performance
Yeah, that's the theory. Intel claims 10+%.

I wonder how long it will take till anyone sees some actual silicon.
bitRAKE



Joined: 21 Jul 2003
Posts: 4060
Location: vpcmpistri
bitRAKE 03 Aug 2023, 05:35
If someone wanted to dive in to improve their understanding of how instruction metrics are used to generate code: often the rules and metrics are provided by the manufacturer. (See the *.td files.)

Edit: AMD and Intel take the LLVM architecture and roll their own compiler. I'm assuming they forward changes back to LLVM.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Roman



Joined: 21 Apr 2012
Posts: 1821
Roman 03 Aug 2023, 08:57
The processor tracks these new instructions internally and fast-forwards
register data between matching PUSH2 and POP2 instructions without going
through memory.

Interesting, Push2.
What is the buffer size for push2? 16 values, 32 values, or more?
And if push2 doesn't go through the stack, could we sometimes skip the pop2 without the program crashing?
bitRAKE



Joined: 21 Jul 2003
Posts: 4060
Location: vpcmpistri
bitRAKE 03 Aug 2023, 16:13
Code:
{ppx} push2 rcx, rdx ; PPX hint
{ppx} pop2  rdi, rsi ; PPX hint    

Why not just:
Code:
mov rdi, rdx
mov rsi, rcx    
(Are register move instructions going to disappear?)
... the hint is also supposed to work with:
Code:
{ppx} push rdx
{ppx} pop rdi    
... but they need the REX2 prefix.

The way I'm reading it, the PPX hint can be ignored - like the old jump hints. For example, maybe the efficiency cores lack the processing for the PPX hint.

That's a good question about the depth though! Should we do:
Code:
{ppx} push rcx
{ppx} push rdx
{ppx} pop rdi
{ppx} pop rsi    
... or ...
Code:
{ppx} push rcx
{ppx} pop rsi
{ppx} push rdx
{ppx} pop rdi    
... I think it's the latter - only one hint depth - may as well use push2/pop2.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20410
Location: In your JS exploiting you and your system
revolution 03 Aug 2023, 16:29
Roman wrote:
And if push2 doesn't go through the stack, could we sometimes skip the pop2 without the program crashing?
It is on the stack, and your code won't crash, the semantics are still the same.

The improvement is that you can ask the CPU to keep the values in an internal buffer. The CPU might choose to ignore the request and simply put them on the stack, or if it wants it can use an internal buffer to reduce memory/cache bandwidth usage. Either way, your code doesn't need to know, it can pretend it is all on the stack and it will all work fine.

You can think of it as just another form of cache but it is optimised for push/pop.
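This behaviour can be sketched as a toy model in Python. Everything here - the buffer size, the slot bookkeeping - is invented for illustration; the real mechanism is internal to the CPU and undocumented:

```python
# Toy model of a push/pop fast-forward buffer: the architectural state
# (the in-memory stack) stays correct whether or not the buffer is used.
class ToyStack:
    def __init__(self, buffer_slots=4):  # buffer size is a made-up assumption
        self.memory = []        # the "real" stack in memory
        self.buffer = {}        # internal buffer: stack slot -> value
        self.buffer_slots = buffer_slots

    def push(self, value, hinted=False):
        self.memory.append(value)          # architectural state always updated
        slot = len(self.memory) - 1
        if hinted and len(self.buffer) < self.buffer_slots:
            self.buffer[slot] = value      # kept close by for a fast pop

    def pop(self):
        slot = len(self.memory) - 1
        value = self.memory.pop()          # semantics identical either way
        self.buffer.pop(slot, None)        # buffered copy only saves a memory read
        return value

s = ToyStack()
s.push(1, hinted=True)
s.push(2, hinted=True)
assert s.pop() == 2 and s.pop() == 1   # same results with or without hints
```

The point of the model is the invariant: the value a pop returns never depends on whether the hint was honoured.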
Furs



Joined: 04 Mar 2016
Posts: 2542
Furs 04 Aug 2023, 11:09
bitRAKE wrote:
That's a good question about the depth though! ... I think it's the latter - only one hint depth - may as well use push2/pop2.
Hints are supposed to be for the same register, no? They don't have to follow each other directly.

Basically you push one reg in the prolog and restore it in epilog, that's what it's for. Or, say:
Code:
{ppx} push rdx
call blah
{ppx} pop rdx    
Cool feature, honestly, but I don't know why the CPU can't do it by itself. Hints on jumps are ignored for a reason.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20410
Location: In your JS exploiting you and your system
revolution 04 Aug 2023, 11:51
Furs wrote:
... but I don't know why the CPU can't do [hints] by itself.
Because the CPU doesn't know what you want.
Code:
push eax ebx ecx ; <-- if you "hint" these then it is worse
call c_function
add esp,12       ; <--- no matching pops, so hints are useless and wasteful    
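In the same illustrative spirit, a toy Python sketch (all names invented) of why an explicit stack-pointer adjustment wastes any hinted copies:

```python
# Toy illustration: an explicit stack-pointer adjustment discards values
# without popping them, so anything held in a hint buffer was wasted work.
stack, hint_buffer = [], []

def hinted_push(v):
    stack.append(v)
    hint_buffer.append(v)   # speculative fast-forward copy

def add_esp(n_slots):
    del stack[-n_slots:]    # caller cleans up arguments wholesale...
    hint_buffer.clear()     # ...so every buffered copy must be thrown away

for arg in (1, 2, 3):
    hinted_push(arg)
# call c_function ...
add_esp(3)
assert stack == [] and hint_buffer == []   # the hints bought nothing here
```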
bitRAKE



Joined: 21 Jul 2003
Posts: 4060
Location: vpcmpistri
bitRAKE 04 Aug 2023, 17:45
Furs wrote:
Hints are supposed to be for same registers, no? They don't have to follow each other directly.
The little documentation we have doesn't indicate the hint pairs should be the same register.

Sorry, I was being terse. Maybe this does integrate with existing stack optimizations in a broader way. Notice how it was explicitly stated that unbalanced hinting may turn off PPX optimization. Had me curious: At what level would this be implemented?

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
edfed



Joined: 20 Feb 2006
Posts: 4352
Location: Now
edfed 05 Aug 2023, 15:33
maybe it's time for intel to switch to a new architecture and break with the retrocompatibility principle.
a CPU with 16 banks of 16 registers, each composed of 16 parts of 16 bits, could be very cool to work with, especially if we can do fast 4-bit operations anywhere in this matrix.

like take a nibble from r16 as an N variable, to index the data from rN and so on.
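The idea can be mocked up as a toy model in Python. The whole layout - banks, parts, nibble indexing - is hypothetical, taken only from the description above, not from any real ISA:

```python
# Toy model of a proposed 16x16x16 register matrix with nibble indexing.
regs = [[[0] * 16 for _ in range(16)] for _ in range(16)]  # bank, register, 16-bit part

def get_nibble(value, which):
    """Extract 4-bit nibble `which` (0 = lowest) from a 16-bit part."""
    return (value >> (which * 4)) & 0xF

# Store 5 in part 0 of register 0 in bank 0, then use its low nibble as N...
regs[0][0][0] = 5
n = get_nibble(regs[0][0][0], 0)
# ...to address register rN, as in "take a nibble from r16 as N, index rN".
regs[0][n][0] = 0x1234
assert n == 5 and regs[0][5][0] == 0x1234
```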
Furs



Joined: 04 Mar 2016
Posts: 2542
Furs 05 Aug 2023, 16:32
edfed wrote:
maybe it's time for intel to switch to a new architecture and break with the retrocompatibility principle.
I hate to sound like a broken record but… Itanium. Smile
Furs



Joined: 04 Mar 2016
Posts: 2542
Furs 05 Aug 2023, 16:33
revolution wrote:
Furs wrote:
... but I don't know why the CPU can't do [hints] by itself.
Because the CPU doesn't know what you want.
Code:
push eax ebx ecx ; <-- if you "hint" these then it is worse
call c_function
add esp,12       ; <--- no matching pops, so hints are useless and wasteful    
Yeah, I suppose the call between the push/pop can be very large, so it can't look ahead that far. Otherwise it would be easy for it to track; it already has very long pipelines.
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 06 Aug 2023, 00:59
With all of this talk about push and pop hinting, I am wondering if it is implemented as a circular buffer that overwrites old values when the buffer (of, say, size 512) wraps around. I suppose Agner Fog will have all of the details once the stuff actually comes out.
Since we are on the topic of CPUs, can anyone confirm/deny the following floating point latencies (at least with Intel):
1. add used to be 3 cycles, mul and fmadd used to be 5 cycles
2. at some point there was a small blip where all three of add, mul, and fmadd were 4 cycles
3. now add is 2 cycles and mul and fmadd remain 4 cycles

EDIT: I am not seeing much difference in the latencies of add and mul on rocket lake, and I do not know what a cycle is.
bitRAKE



Joined: 21 Jul 2003
Posts: 4060
Location: vpcmpistri
bitRAKE 06 Aug 2023, 18:32
tthsqe wrote:
I do not know what a cycle is.
It's important to start here so that we know the units being measured. A concise way to put it: a cycle is the smallest synchronizing time-step used by the processor. Now that we have a synchronizing time-step, several internal processes can be run in parallel (regardless of how long they take) because they'll be synchronized later.

How can we have an IPC (instructions per cycle) greater than one?

Multiple instructions are fetched/decoded/executed/retired at once.

Why do I not see any difference in timing with changes at the instruction level?

The same port may execute FADD, FMUL, etc. and do so in the same amount of time. Or, other resource pressures could dictate synchronization. (Instruction changes shadowed by bubbles in the pipeline.)
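Both points can be illustrated with a toy model in Python; the port counts and cycle numbers are arbitrary assumptions chosen only to make the arithmetic visible:

```python
import math

def cycles(n_instructions, ports):
    # Independent instructions issue `ports` at a time, one cycle each.
    return math.ceil(n_instructions / ports)

def loop_cycles(instr_cycles, memory_cycles):
    # Crude bottleneck model: the loop runs at the pace of its slowest resource.
    return max(instr_cycles, memory_cycles)

# Two execution ports retire 8 independent ops in 4 cycles -> IPC = 8/4 = 2.
assert cycles(8, ports=2) == 4
# When memory is the bottleneck, a "faster" instruction mix changes nothing -
# the instruction change is shadowed by bubbles in the pipeline:
assert loop_cycles(instr_cycles=50, memory_cycles=200) == 200
assert loop_cycles(instr_cycles=80, memory_cycles=200) == 200
```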

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20410
Location: In your JS exploiting you and your system
revolution 06 Aug 2023, 18:50
A cycle nowadays isn't strictly a synchronisation point any more.

The cycle can be better thought of as the period of the clock. Since the clock travels across the chip more like a wave, each part of the chip is slightly out of phase with nearby parts, and regions further away will be larger amounts out of phase. In some chips the phase timing can be delayed by more than one cycle.

This is why it is common for CPUs to run slower when debugging is enabled: the chip has to wait for distant parts to catch up before it can proceed to the next step, or else the debug state will be captured wrongly.

Also, stop trying to use individual instruction cycle times as a proxy for how fast the CPU will progress. Measure, don't guess.
bitRAKE



Joined: 21 Jul 2003
Posts: 4060
Location: vpcmpistri
bitRAKE 06 Aug 2023, 20:21
Code:
+-------------------+     +----------------------+
| Theoretical       |     | Empirical            |
| Understanding     +-----> Measurement          |
| (Knowledge-based) |     | (Data-driven)        |
+-------------------+     +----------------------+
            ^                      |
            |                      v
    +-------+-------+       +-------+-------+
    | Informed      |       | Verify &      |
    | Decisions     |<------| Adjust        |
    +---------------+       +---------------+    
... focusing on either measurements or instruction selection alone is a one-sided discussion. Without code, talking about measurements is very limited; writing code means we need to reason at the instruction level. Also, the applicability of electrical characteristics is too granular for a general discussion (about programming) - next you'll want to include relativistic effects. Rolling Eyes

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 07 Aug 2023, 02:17
I am a bit confused here. I've got the code. I am not guessing at much and am measuring most things. Let me also say that rdtsc seems to be quite consistent on my box. Before y'all say something to the effect of making sure that rdtsc is not executed OoO, let me just say that the time spent between the two calls to rdtsc is large enough that this is not relevant.
Now, when timing big complicated things, it seems like a slightly useful exercise to find out nominal latencies for the basic double precision operations. I warm up the CPU, put vaddsd in a loop that runs n times, and make sure that the inputs of the next iteration depend on the outputs of the previous iteration to avoid more OoO shenanigans. Divide the rdtsc difference by n, and voila.
addsd: 2.5 rdtsc
mulsd: 2.5 rdtsc
divsd: 8.9 rdtsc
Now, Intel's online intrinsics guide lists latencies of 4, 4, and 14 respectively (I apparently don't have the 2-cycle add). This indicates that Intel's bogo cycles are 1.6× faster than my bogo rdtsc cycles, assuming a warm CPU etc.
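The 1.6× figure follows directly from the numbers quoted; a quick check shows the divsd ratio comes out slightly lower, about 1.57. A uniform scale factor like this would be consistent with a constant-rate TSC ticking slower than the boosted core clock, though that's an inference, not something measured here:

```python
# Ratio of documented latency (cycles) to measured latency (rdtsc ticks).
measured_rdtsc = {"addsd": 2.5, "mulsd": 2.5, "divsd": 8.9}   # figures from the post
documented_cycles = {"addsd": 4, "mulsd": 4, "divsd": 14}     # Intel intrinsics guide
ratios = {op: documented_cycles[op] / measured_rdtsc[op] for op in measured_rdtsc}
assert ratios["addsd"] == 1.6 and ratios["mulsd"] == 1.6
assert round(ratios["divsd"], 2) == 1.57   # 14 / 8.9 ≈ 1.573, not quite 1.6
```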
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 07 Aug 2023, 02:21
I guess the real question is: given that add supposedly used to be faster than mul (3 vs 5 cycles), why didn't I hear anyone complaining that add and mul now have the same latency? Smile



Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.