flat assembler
Message board for the users of flat assembler.

Intel plans doubling 16 general purpose registers to 32

revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20413
Location: In your JS exploiting you and your system
revolution 29 Jul 2023, 22:27
sylware wrote:
CMOVcc memory access fixed?
They are new instructions. So they only exist with REX2.

The current encoding for cmovcc still has the "broken" behaviour we have come to know and love. Razz
bitRAKE



Joined: 21 Jul 2003
Posts: 4060
Location: vpcmpistri
bitRAKE 29 Jul 2023, 23:00
Intel has tipped their hand at how they will be implementing the 64-bit only processor.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Furs



Joined: 04 Mar 2016
Posts: 2542
Furs 30 Jul 2023, 13:58
revolution wrote:
Furs wrote:
Sure, loads take offset to encode as well (from rbp or rsp), but they said 10% loads versus doubling the regs. So smells like bloat to me.
That is a consequence of the diminishing returns of providing more registers. You don't get half the loads by doubling registers. This applies regardless of how you encode the instructions. All architectures experience this same effect.

Going from 1 reg to 2 regs gives a great boost. Going from 2 to 4, a good boost. From 4 to 8, a moderate boost. etc. ... from 1G regs to 2G regs you get effectively zero benefit (and probably a big loss from all the overheads).
Yeah. In this context, 16 was perfectly fine, at least for GPRs, IMO.

Not sure what 3 operand instructions are for, considering moves are renamed and cost 0 cycles, not like you have a million moves due to destination operand. But at least it won't bloat the instruction stream since the mov costs bytes to encode as well.

I have a bad feeling Intel will just disable a lot of the current optimizations, expecting you to use their new bullshit instructions/encodings, and so current code will run like shit on their new CPUs.

In that case I will permanently switch to AMD until it changes.

revolution 31 Jul 2023, 01:22
Furs wrote:
Not sure what 3 operand instructions are for, considering moves are renamed and cost 0 cycles, not like you have a million moves due to destination operand. But at least it won't bloat the instruction stream since the mov costs bytes to encode as well.
Not sure what your beef is. If you don't want to "bloat the instruction stream" then simply don't use any of the new instructions or registers. Your code can continue to be "un-bloated" and you don't have to do anything different.
Furs wrote:
I have a bad feeling Intel will just disable a lot of the current optimizations, expecting you to use their new bullshit instructions/encodings, and so current code will run like shit on their new CPUs.

In that case I will permanently switch to AMD until it changes.
What is your basis for assuming Intel will sabotage themselves by making their CPUs undesirable? That makes no sense whatsoever. And if AMD don't follow, and it becomes Intel-only, then the whole thing would be kind of useless. So possibly AMD will have it also and then you have nowhere to go to avoid the "bloat". Smile

revolution 31 Jul 2023, 03:36
revolution wrote:
sylware wrote:
CMOVcc memory access fixed?
They are new instructions. So they only exist with REX2.
REX2 is not correct. The CFCMOVcc instructions are only enabled in the Extended EVEX (EEVEX?) encodings.

Furs 31 Jul 2023, 13:04
revolution wrote:
Not sure what your beef is. If you don't want to "bloat the instruction stream" then simply don't use any of the new instructions or registers. Your code can continue to be "un-bloated" and you don't have to do anything different.
And I'm allowed to complain and explain why they're stupid and I wouldn't use them in the first place. Rolling Eyes

Is this social media where only "upvotes" exist and people aren't allowed to express disapproval now?
revolution wrote:
What is your basis for assuming Intel will sabotage themselves by making their CPUs undesirable?
Itanium.

In hindsight, it's easy to act like a know-it-all about why it failed. But that was not the sentiment back then, except for a few people like me. Most people were super hyped about it. Guess how it turned out?

revolution 31 Jul 2023, 13:32
Furs wrote:
And I'm allowed to complain and explain why they're stupid and I wouldn't use them in the first place. Rolling Eyes
No one said you can't complain. But your "complaint" is silly. You aren't forced to "bloat" your code at all. You can choose to "bloat" your code if you want to, by using R16-R31, or the NDD thing. And you can also choose to never use R16-R31 or NDD. How does adding choice hurt you?
Furs wrote:
Itanium
You are moving the goal posts and talking about a different thing. Your suggestion above was that Intel would deliberately make non-REX2 stuff worse. But there is at least one competitor, AMD, so any deliberate reduction in performance would be suicide. That is entirely different from Itanium, a whole new architecture, with no competitor, and no idea if it would work. So let's get back to the original question before you distracted the discussion: what makes you think Intel will sabotage themselves by deliberately making their stuff worse than their competitor's?

Furs 01 Aug 2023, 13:12
revolution wrote:
No one said you can't complain. But your "complaint" is silly. You aren't forced to "bloat" your code at all. You can choose to "bloat" your code if you want to, by using R16-R31, or the NDD thing. And you can also choose to never use R16-R31 or NDD. How does adding choice hurt you?
When did I say I was forced to use them? That was not the reason I didn't plan on buying a new Intel CPU. That was part of explaining why they wouldn't be so useful after all.
revolution wrote:
You are moving the goal posts and talking about a different thing. Your suggestion above was that Intel would deliberately make non-REX2 stuff worse. But there is at least one competitor, AMD, so any deliberate reduction in performance would be suicide. That is entirely different from Itanium, a whole new architecture, with no competitor, and no idea if it would work. So let's get back to the original question before you distracted the discussion: what makes you think Intel will sabotage themselves by deliberately making their stuff worse than their competitor's?
Ehm, I don't think you understood what I was implying by mentioning it. Itanium could run x86 code via a built-in emulator. It was slow though. And AMD was a competitor back then as well. And they still did it. Wink

Intel could make it "slow" to free up transistor budget for this new bloated bs, on the reasoning that "we now have 3 operand instructions, no need for mov to be fast". It's not rocket science to figure it out.

Look what happened to x87 because it has a "replacement" (SSE). So what about old apps using x87 and their performance? They had no reason huh?

revolution 01 Aug 2023, 14:45
If you want to suggest that Intel intends to use REX2/EEVEX to replace all other encodings, then you have to show your evidence. If you did replace everything with REX2/EEVEX then currently you will only get a very small subset of available instructions, and almost none of the "normal" simple instructions.

So your argument that it is like Itanium makes no sense. Itanium was a replacement, not an extension. It didn't work out, but that is the way of things, sometimes things just don't go as planned.

AMD came along and extended x86 with x86-64. That was a great success. And now Intel have extended x86-64 to this new thing. Suggesting that Intel will sabotage themselves by somehow making their CPUs worse is silly; like I mentioned, it has no basis in reality. Plus the estimate of a 10% improvement makes your whole argument moot. If it is "bloat", then it is bloat that works to make stuff better. Embrace the bloat if it works, reject the bloat if it harms.

bitRAKE 01 Aug 2023, 14:55
My perspective on Intel is a little different, but with a similar conclusion:

Intel has optimized for their business position. Which means building processors for compilers - initially this meant their own compiler, but later it follows from compiler research more generally. The majority of code is compiled - so, this is an efficient way to produce better results for their customers.

Do they intentionally have poor performance elsewhere in their processor designs? No, this is a result of low priority and neglect. Could they do better? Sure.

To make it more concrete, let us look at just control flow instructions. Compilers don't use LOOPcc or J[RE]CXZ == very low priority for Intel. AMD has a more "holistic" approach in their design, imho. Which results in these instructions still being performant. This isn't something new - it's been this way for decades. The "knock-on" effect is that compilers aren't going to use these low priority instructions in the future either. (Should be a caveat here, but that's another discussion.)
Code:
uops.info - Table
                        Alder Lake-P                                          AMD Zen+/2/3/4
Instruction             Lat     TP            Uops      Ports                 Lat   TP     Uops  Ports
LOOP (Rel8)     BASE    2       2.00 / 4.94   7 / 6     1*p0156B+4*p06+1*p1   1     0.50   1
LOOPE (Rel8)    BASE    [1;3]   3.00 / 6.00   12 / 10   3*p0156B+6*p06+1*p1   1     0.50   1
LOOPNE (Rel8)   BASE    [1;3]   3.00 / 5.97   12 / 10   3*p0156B+6*p06+1*p1   1     0.50   1

JRCXZ (Rel8)    BASE            0.50 / 0.50   2 / 2     1*p0156B+1*p06              0.50   1
... best to view on https://www.uops.info/table.html
(LOOPcc on Intel is shit. Yet, on AMD same as CMP/Jcc - wow!)

Go through the whole ISA and find a similar result.
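
For a rough side-by-side reading of the quoted table, the throughput figures can be turned into ratios. This is only a sketch: it assumes the first number in each two-number Alder Lake-P cell is the one to compare (how those cells should be read is an assumption here), the dictionary and function names are made up for illustration, and it says nothing about how the instructions behave inside real code.

```python
# Throughput figures (cycles per instruction) copied from the quoted table.
# Assumption: for Alder Lake-P cells holding two numbers, the first is used.
THROUGHPUT = {
    "LOOP":   {"intel": 2.00, "amd": 0.50},
    "LOOPE":  {"intel": 3.00, "amd": 0.50},
    "LOOPNE": {"intel": 3.00, "amd": 0.50},
    "JRCXZ":  {"intel": 0.50, "amd": 0.50},
}


def intel_to_amd_ratio(instruction: str) -> float:
    """Return the Alder Lake-P figure divided by the Zen figure."""
    row = THROUGHPUT[instruction]
    return row["intel"] / row["amd"]


if __name__ == "__main__":
    # Print one ratio per instruction, e.g. "LOOP    4.0x".
    for name in THROUGHPUT:
        print(f"{name:<7} {intel_to_amd_ratio(name):.1f}x")
```

Read this way, the LOOPcc rows come out 4x to 6x apart, while the quoted JRCXZ row is level between the two.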

Roman



Joined: 21 Apr 2012
Posts: 1821
Roman 01 Aug 2023, 18:57
New 16 registers
And new 16 xmm16 to xmm31
Nice Smile

revolution 01 Aug 2023, 19:28
Latency and throughput numbers are meaningless on their own.

It all depends upon how they interact within the entire code stream, mixed in with all the other instructions surrounding them. Combine that with the previous states and the content of the caches and buffers and things, that is where the real performance benefits and hazards come from.
Ali.Z



Joined: 08 Jan 2018
Posts: 726
Ali.Z 01 Aug 2023, 20:01
calling the option to use an extended register set bloat is invalid.
why didn't someone complain about some instruction sets rather than extensions?

if adding an option causes the entire architecture to be slow, then modern CPUs should be slower than the 8086.

what would you say, invalid comparison? surely it is, for obvious reasons: I didn't take into account that modern CPUs are much faster, can execute instructions in parallel, OoOE, tiny transistors, different internal design, cache... among many other optimizations.

so if you call my arg invalid, then so is yours, as you didn't take into account what intel would change... as if intel tells you and keeps you up to date with all of their internal secret design of the architecture, which is bs nonsense. (and intel has always been an ass about sharing details, docs, secrets, and when they say a word it is likely to be vague)

...

_________________
Asm For Wise Humans


Last edited by Ali.Z on 01 Aug 2023, 21:00; edited 2 times in total

bitRAKE 01 Aug 2023, 20:01
Interpreting my post as being about performance is to completely miss the point.


revolution 01 Aug 2023, 20:14
bitRAKE wrote:
Interpreting my post as being about performance is to completely miss the point.
Sorry if I misinterpreted. But this:
bitRAKE wrote:
Which results in these instructions still being performant. <snip latency and throughput numbers>
How does one not interpret that as a performance measure via latency and throughput numbers?

bitRAKE 01 Aug 2023, 20:34
revolution wrote:
bitRAKE wrote:
Which results in these instructions still being performant. <snip latency and throughput numbers>
How does one not interpret that as a performance measure via latency and throughput numbers?
By reading all the words that came before it.


revolution 01 Aug 2023, 20:49
bitRAKE wrote:
By reading all the words that came before it.
This doesn't increase my understanding.
bitRAKE wrote:
Do they intentionally have poor performance elsewhere in their processor designs? ...

To make it more concrete, let us look at just control flow instructions. <snip latency and throughput numbers>
If it isn't about performance then I am at a loss what it is about.

My English isn't perfect, but I think it is okay enough for most purposes. But your comment about it not being about performance completely baffles me. Confused

bitRAKE 01 Aug 2023, 21:08
If I had a thesis it would be, "Intel designs for the compiler and neglects instructions not used by the compiler." The metrics presented are an indication of this. Look at the other non-compiler instructions in the ISA and you will see a similar pattern.

We are both aware of the complexity of measuring performance, but to claim that LOOPcc performs similarly on Intel and AMD is dishonest. We don't need to get lost down that alley though - that's not the point. We can just look at non-compiler instructions to see what Intel does.

Furs 02 Aug 2023, 13:12
revolution wrote:
If you want to suggest that Intel intends to use REX2/EEVEX to replace all other encodings, then you have to show your evidence. If you did replace everything with REX2/EEVEX then currently you will only get a very small subset of available instructions, and almost none of the "normal" simple instructions.

So your argument that it is like Itanium makes no sense. Itanium was a replacement, not an extension. It didn't work out, but that is the way of things, sometimes things just don't go as planned.

AMD came along and extended x86 with x86-64. That was a great success. And now Intel have extended x86-64 to this new thing. Suggesting that Intel will sabotage themselves by somehow making their CPUs worse is silly; like I mentioned, it has no basis in reality. Plus the estimate of a 10% improvement makes your whole argument moot. If it is "bloat", then it is bloat that works to make stuff better. Embrace the bloat if it works, reject the bloat if it harms.
No I'm talking about performance.

Itanium might have replaced x86, but the point from an end user perspective was that existing apps (x86) were slow, due to the emulator. They don't care that it was emulated. All they cared about is that the apps were slow, and to get performance they'd have to recompile for it.

So I simply said, if they drop optimizations (such as 0 latency move renames) because of this new crap (3 operand instructions for instance), then they will have a very similar situation to Itanium.

Existing apps will be slow. "Recompiling" will make them fast. And so on.

Same with x87 (and I mean scalar SSE obviously). Existing apps would become slow (though much later down the line), "recompiling" to scalar SSE would make them fast, and so on.

What's so confusing about what I said?

Furs 02 Aug 2023, 13:15
Ali.Z wrote:
calling the option to use an extended register set bloat is invalid.
why didn't someone complain about some instruction sets rather than extensions?
I did. x87 vs scalar SSE.

x87 is mostly micro-coded right now hence extremely slow.

You need more proof…?
Copyright © 1999-2024, Tomasz Grysztar.