flat assembler
Message board for the users of flat assembler.
How to test the speed of VDPBF16PS?
Roman 10 Oct 2024, 15:47
As I understand it, VDPBF16PS does a matrix multiply.
I want to know how fast VDPBF16PS is. VDPBF16PS is an AVX-512 instruction.
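For reference, here is a minimal scalar sketch of what one VDPBF16PS operation computes, as I read the instruction description: each 32-bit lane accumulates the products of two adjacent BF16 pairs into a single-precision sum. That makes it a pairwise dot product that can serve as a building block for a matrix multiply, rather than a whole matrix multiply in one instruction. The names `bf16_to_f32`, `f32_to_bf16_trunc` and `dpbf16ps_ref` are hypothetical, and the exact rounding, denormal handling and summation order of the real instruction are assumptions this model does not try to reproduce.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Widen BF16 to FP32: a BF16 value is the upper 16 bits of an IEEE-754 float. */
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrow FP32 to BF16 by truncation. Only a helper for building inputs;
   real conversion instructions round instead of truncating. */
static uint16_t f32_to_bf16_trunc(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* Hypothetical scalar model of the dot-product step:
   acc[i] += a[2i]*b[2i] + a[2i+1]*b[2i+1]
   A 512-bit register would correspond to lanes = 16 (32 BF16 inputs each). */
static void dpbf16ps_ref(float *acc, const uint16_t *a, const uint16_t *b,
                         size_t lanes) {
    for (size_t i = 0; i < lanes; i++) {
        float even = bf16_to_f32(a[2 * i])     * bf16_to_f32(b[2 * i]);
        float odd  = bf16_to_f32(a[2 * i + 1]) * bf16_to_f32(b[2 * i + 1]);
        acc[i] += even + odd;
    }
}
```

A model like this is mainly useful for checking the results of an assembly version, not for judging its speed.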
macomics 10 Oct 2024, 18:11
So do a loop of 1`000`000`000 iterations with and without this instruction and look at QueryPerformanceCounter
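A rough sketch of that kind of measurement in C might look like the following. It assumes QueryPerformanceCounter on Windows and `clock_gettime(CLOCK_MONOTONIC)` elsewhere; the names `now_seconds`, `time_loop`, `empty_body` and `work_body` are hypothetical, and `work_body` is only a stand-in for the code under test. On hardware with AVX512_BF16 the body would instead contain the instruction itself (for example through the `_mm512_dpbf16_ps` intrinsic or inline assembly).

```c
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime in strict C modes */
#endif
#include <stddef.h>
#include <stdint.h>
#if defined(_WIN32)
#include <windows.h>
#else
#include <time.h>
#endif

typedef void (*bench_fn)(void *ctx);   /* hypothetical callback type */

/* Current time in seconds from a high-resolution monotonic source. */
static double now_seconds(void) {
#if defined(_WIN32)
    LARGE_INTEGER freq, t;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t);
    return (double)t.QuadPart / (double)freq.QuadPart;
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
#endif
}

/* Call fn iters times and return the elapsed wall time in seconds. */
static double time_loop(bench_fn fn, void *ctx, uint64_t iters) {
    double t0 = now_seconds();
    for (uint64_t i = 0; i < iters; i++)
        fn(ctx);
    return now_seconds() - t0;
}

/* Baseline: the loop without the instruction under test. */
static void empty_body(void *ctx) { (void)ctx; }

/* Stand-in workload; volatile keeps the compiler from deleting it. */
static void work_body(void *ctx) {
    volatile float *acc = ctx;
    *acc = *acc * 1.0000001f + 1.0f;
}
```

The difference `time_loop(work_body, ...) - time_loop(empty_body, ...)` over the same iteration count gives a rough cost for the body. An optimizing compiler may remove the empty loop entirely, so it is worth checking the generated code before trusting the baseline.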
macomics 10 Oct 2024, 20:03
Roman wrote: I don't have AVX-512.
revolution 10 Oct 2024, 20:08
macomics wrote: So do a loop of 1`000`000`000 iterations with and without this instruction and look at QueryPerformanceCounter
Either way, the results of synthetic tests might be meaningless without the full context of how the instructions are used in the real application. There is more that goes into the run times of programs than just some isolated throughput and/or latency measurements. A lot more. CPUs are very complex.
macomics 10 Oct 2024, 20:20
revolution wrote: Either way, the results of synthetic tests might be meaningless without the full context of how the instructions are used in the real application. There is more that goes into the run times of programs than just some isolated throughput and/or latency measurements. A lot more. CPUs are very complex.
revolution 10 Oct 2024, 20:50
Here is an example of unexpected run times, specifically with AVX512. I was recently updating code on some AVX512 servers. Based on the docs and test results I calculated that the AVX512 code would run faster than the existing AVX256 code.
So I coded it up and was surprised to discover that the AVX512 code ran slower than the old AVX256 code. It initially made no sense to me. All the synthetic tests I did, and the docs, clearly showed improvements with the new instructions.
After much time and consternation I finally discovered that the reason was thermal and clocking. The CPUs reduced their clocks when AVX512 instructions were used, and the temperatures went up, keeping the clocks low even after the AVX512 code had finished its run. Multi-threaded AVX512 code is very hard on power usage. Single-threaded code didn't see that happen.
This was no doubt due to some combination of the CPU/mobo/cooling that we have here, and other systems might get opposite results. Trusting some values given in a test, or in a doc, might be a great way to reduce the performance of your code. And you would never know without testing and comparing.
Last edited by revolution on 11 Oct 2024, 04:17; edited 1 time in total
Roman 11 Oct 2024, 00:56
With SSE, a 4x4 matrix multiply costs 74 instructions.
With AVX512 it is one instruction. VDPBF16PS is useful. But I wonder how much faster the new TDPBF16PS (avx10.2) could be?
revolution 11 Oct 2024, 01:09
Roman wrote: But I wonder how much faster the new TDPBF16PS (avx10.2) could be?
sinsi 11 Oct 2024, 03:22
Here's a good explanation on Stack Overflow:
SIMD instructions lowering CPU frequency
Roman 12 Oct 2024, 07:59
The new AMD Ryzen 9 9950X CPU supports AVX512 and VDPBF16PS.
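To find out whether a given machine can run the instruction at all, the CPUID feature bits can be queried. The sketch below assumes GCC or Clang (it uses their `<cpuid.h>` helper `__get_cpuid_count`) and checks AVX512F in CPUID.(EAX=7,ECX=0):EBX[16] and AVX512_BF16 in CPUID.(EAX=7,ECX=1):EAX[5]. The function name `cpu_has_avx512_bf16` is hypothetical, and the check deliberately leaves out the XGETBV test for operating-system support of the ZMM state, which complete detection code would also need.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang CPUID helpers */
#endif

/* Returns 1 if CPUID reports both AVX512F and AVX512_BF16, otherwise 0.
   Assumption: OS support for ZMM state (XCR0 via XGETBV) is NOT checked. */
static int cpu_has_avx512_bf16(void) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* Leaf 7, subleaf 0: EBX bit 16 = AVX512F, EAX = highest subleaf. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    int avx512f = (int)((ebx >> 16) & 1u);
    if (eax < 1)
        return 0;    /* subleaf 1 is not available */

    /* Leaf 7, subleaf 1: EAX bit 5 = AVX512_BF16. */
    if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx))
        return 0;
    int bf16 = (int)((eax >> 5) & 1u);

    return avx512f && bf16;
#else
    return 0;        /* not an x86 build */
#endif
}
```

A benchmark can call this first and skip the VDPBF16PS path when it returns 0, instead of crashing with an invalid-opcode fault.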
Copyright © 1999-2025, Tomasz Grysztar.