flat assembler
Message board for the users of flat assembler.
Main > Intel plans doubling 16 general purpose registers to 32
revolution 02 Aug 2023, 14:11
You are very focussed upon the instruction timing as some sort of proxy for performance. Have you ever measured the performance of a real application and deliberately used "slow" instructions to see how things change? If not, then you might be surprised to learn that for many applications it makes no difference. There is a lot more to application performance than a naive reading of the instruction latency/throughput tables.

You appear to be arguing against something that hasn't even happened. Please show your working as to how you come to the conclusion that Intel will kill themselves and let AMD "win". Perhaps AMD will also do this, and then do we boycott everything?
bitshifter 03 Aug 2023, 04:31
more registers = less memory IO = more performance
revolution 03 Aug 2023, 05:13
bitshifter wrote: more registers = less memory IO = more performance

I wonder how long it will take till anyone sees some actual silicon.
bitRAKE 03 Aug 2023, 05:35
If someone wanted to dive in to improve their understanding of how the instruction metrics are used to generate code, the rules and metrics are often provided by the manufacturer. (See the *.td files.)

Edit: AMD and Intel take the LLVM architecture and roll their own compilers. I'm assuming they forward changes back to LLVM.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Roman 03 Aug 2023, 08:57
Quote: The processor tracks these new instructions internally and fast-forwards register data between matching PUSH2 and POP2 instructions without going through memory.

Interesting. What is the buffer size for push2: 16 values, 32 values, or more? And if push2 is not using the stack, could we sometimes skip the pop2 without the program crashing?
bitRAKE 03 Aug 2023, 16:13
Code:
	{ppx} push2 rcx, rdx ; PPX hint
	{ppx} pop2 rdi, rsi  ; PPX hint

Why not just:

Code:
	mov rdi, rdx
	mov rsi, rcx

...the hint is also supposed to work with:

Code:
	{ppx} push rdx
	{ppx} pop rdi

The way I'm reading it is that the PPX hint can be ignored - like the old jump hints. For example, maybe the efficiency cores lack the processing for the PPX hint. That's a good question about the depth though! Should we do:

Code:
	{ppx} push rcx
	{ppx} push rdx
	{ppx} pop rdi
	{ppx} pop rsi

or:

Code:
	{ppx} push rcx
	{ppx} pop rsi
	{ppx} push rdx
	{ppx} pop rdi
revolution 03 Aug 2023, 16:29
Roman wrote: And if push2 not using stack, sometimes we could not do pop2 and program not crashing.

The improvement is that you can ask the CPU to keep the values in an internal buffer. The CPU might choose to ignore the request and simply put them on the stack, or, if it wants, it can use an internal buffer to reduce memory/cache bandwidth usage. Either way, your code doesn't need to know; it can pretend it is all on the stack and it will all work fine. You can think of it as just another form of cache, but one that is optimised for push/pop.
Furs 04 Aug 2023, 11:09
bitRAKE wrote:
Basically you push one reg in the prolog and restore it in the epilog; that's what it's for. Or, say:

Code:
	{ppx} push rdx
	call blah
	{ppx} pop rdx
revolution 04 Aug 2023, 11:51
Furs wrote: ... but I don't know why the CPU can't do [hints] by itself.

Code:
	push eax ebx ecx ; <-- if you "hint" these then it is worse
	call c_function
	add esp,12       ; <-- no matching pops, so hints are useless and wasteful
bitRAKE 04 Aug 2023, 17:45
Furs wrote: Hints are supposed to be for same registers, no? They don't have to follow each other directly.

Sorry, I was being terse. Maybe this does integrate with existing stack optimizations in a broader way. Notice how it was explicitly stated that unbalanced hinting may turn off the PPX optimization. Had me curious: at what level would this be implemented?
edfed 05 Aug 2023, 15:33
maybe it's time for intel to switch to a new architecture and break with the retrocompatibility principle.
a CPU with 16 banks of 16 registers, each composed of 16 parts of 16 bits, could be very cool to work with, especially if we could do fast 4-bit operations anywhere in this matrix. like take a nibble from r16 as a variable N, to index the data from rN and so on.
Furs 05 Aug 2023, 16:32
edfed wrote: maybe it's time for intel to switch to a new architecture and break with the retrocomatibility principle.
Furs 05 Aug 2023, 16:33
revolution wrote:
tthsqe 06 Aug 2023, 00:59
With all of this talk about push and pop hinting, I am wondering if it is implemented as a cyclical buffer that overwrites old values when the buffer (of, say, size 512) overflows or wraps around. I suppose Agner Fog will have all of the details once the stuff actually comes out.

Since we are on the topic of cpus, can anyone confirm/deny the following floating point latencies (at least with intel):
1. add used to be 3 cycles, mul and fmadd used to be 5 cycles
2. at some point there was a small blip where all three of add, mul, and fmadd were 4 cycles
3. now add is 2 cycles and mul and fmadd remain 4 cycles

EDIT: I am not seeing much difference in the latencies of add and mul on rocket lake, and I do not know what a cycle is.
bitRAKE 06 Aug 2023, 18:32
tthsqe wrote: I do not know what a cycle is.

How can we have an IPC (instructions per cycle) greater than one? Multiple instructions are fetched/decoded/executed/retired at once.

Why do I not see any difference in timing with changes at the instruction level? The same port may execute FADD, FMUL, etc. and do so in the same amount of time. Or other resource pressures could dictate synchronization. (Instruction changes shadowed by bubbles in the pipeline.)
revolution 06 Aug 2023, 18:50
A cycle nowadays isn't strictly a synchronisation point any more.

The cycle can be better thought of as the clock frequency. Since the clock travels across the chip more like a wave, each part of the chip is slightly out of phase with nearby parts, and regions further away will be out of phase by larger amounts. In some chips the phase timing can be delayed by more than one cycle. This is why it is common for CPUs to run slower when debugging is enabled: the chip has to wait for distant parts to catch up before it can proceed to the next step, else the debug state would be wrongly captured.

Also, stop trying to use individual instruction cycle times as a proxy for how fast the CPU will progress. Measure, don't guess.
bitRAKE 06 Aug 2023, 20:21
Code:
	+-------------------+       +----------------------+
	|    Theoretical    |       |      Empirical       |
	|   Understanding   +-----> |     Measurement      |
	| (Knowledge-based) |       |    (Data-driven)     |
	+-------------------+       +----------------------+
	         ^                             |
	         |                             v
	+---------------+           +---------------+
	|    Informed   |           |    Verify &   |
	|   Decisions   | <---------|     Adjust    |
	+---------------+           +---------------+
tthsqe 07 Aug 2023, 02:17
I am a bit confused here. I've got the code. I am not guessing at much and am measuring most things. Let me also say that rdtsc seems to be quite consistent on my box. Before y'all say something to the effect of making sure that rdtsc is not executed OoO, let me just say that the time spent between the two calls to rdtsc is large enough that this is not relevant.

Now, when timing big complicated things, it seems like a slightly useful exercise to find out nominal latencies for the basic double precision operations. I warm up the cpu, put vaddsd in a loop that runs n times, and make sure that the inputs of the next iteration depend on the outputs of the previous iteration to avoid more OoO shenanigans. Divide the rdtsc difference by n, and voila:

addsd: 2.5 rdtsc
mulsd: 2.5 rdtsc
divsd: 8.9 rdtsc

Now, intel's online intrinsics guide lists latencies of 4, 4, and 14 respectively (I apparently don't have the 2 cycle add). This indicates that intel's bogo cycles are 1.6x faster than my bogo rdtsc cycles, assuming a warm cpu etc.
tthsqe 07 Aug 2023, 02:21
I guess the real question is: given that add supposedly used to be faster than mul (3 vs 5 cycles), why didn't I hear anyone complaining that add and mul now have the same latency?
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.