flat assembler
Message board for the users of flat assembler.

How can I test the speed of VDPBF16PS?

Roman



Joined: 21 Apr 2012
Posts: 1852
Roman 10 Oct 2024, 15:47
As I understand it, VDPBF16PS performs a matrix multiply.
I want to know how fast VDPBF16PS runs.

VDPBF16PS is an AVX-512 instruction.
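
For readers unsure what the instruction computes: as documented by Intel, VDPBF16PS is a packed dot product rather than a full matrix multiply. Each FP32 lane of the destination accumulates the products of two adjacent BF16 pairs taken from the two sources. Below is a minimal scalar sketch in C of the 512-bit form. The function names are made up for illustration, and the sketch assumes normal (non-denormal) inputs; the real instruction's denormal flushing and exact rounding behaviour are not modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* BF16 is the upper 16 bits of an IEEE-754 binary32, so widening is a shift. */
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical reference for one 512-bit VDPBF16PS:
   16 FP32 accumulators, each receives a 2-element BF16 dot product. */
static void vdpbf16ps_ref(float acc[16], const uint16_t a[32], const uint16_t b[32]) {
    assert(acc != 0 && a != 0 && b != 0);
    for (int i = 0; i < 16; i++) {
        acc[i] += bf16_to_f32(a[2 * i + 1]) * bf16_to_f32(b[2 * i + 1]);
        acc[i] += bf16_to_f32(a[2 * i])     * bf16_to_f32(b[2 * i]);
    }
}
```

A routine like this could also serve as the slow fallback path on CPUs without the instruction, and as the baseline to time the hardware version against.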
macomics



Joined: 26 Jan 2021
Posts: 1043
Location: Russia
macomics 10 Oct 2024, 18:11
So do a loop of 1`000`000`000 iterations with and without this instruction and look at QueryPerformanceCounter
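
The "with and without" measurement suggested here can be sketched roughly as follows. This is only an illustration in C: it uses the standard `clock()` function as a portable stand-in for QueryPerformanceCounter, and `kernel_work` is a placeholder where the instruction under test would go. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef void (*kernel_fn)(volatile float *sink);

/* Baseline: the loop overhead alone. */
static void kernel_empty(volatile float *sink) { (void)sink; }

/* Placeholder work; the instruction being measured would go here. */
static void kernel_work(volatile float *sink) { *sink = *sink * 0.5f + 1.0f; }

/* Run fn `iterations` times and return the elapsed CPU time in seconds. */
static double time_loop(kernel_fn fn, uint64_t iterations) {
    volatile float sink = 1.0f; /* volatile keeps the work from being optimized away */
    clock_t start = clock();
    for (uint64_t i = 0; i < iterations; i++)
        fn(&sink);
    clock_t end = clock();
    return (double)(end - start) / (double)CLOCKS_PER_SEC;
}

/* Estimated seconds per iteration attributable to the work itself. */
static double cost_per_iteration(uint64_t iterations) {
    assert(iterations > 0);
    double base = time_loop(kernel_empty, iterations);
    double work = time_loop(kernel_work, iterations);
    return (work - base) / (double)iterations;
}
```

Subtracting the empty-loop time removes the loop overhead from the estimate. On a noisy system the difference can even come out slightly negative for very cheap kernels, so repeated runs are advisable.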
Roman 10 Oct 2024, 18:39
I don't have AVX-512.
That is why I am asking.
And today Intel announced the new Arrow Lake CPU, the Core Ultra 9 285K, with AVX10.2 instructions.


Last edited by Roman on 11 Oct 2024, 00:59; edited 1 time in total
macomics 10 Oct 2024, 20:03
Roman wrote:
I don't have AVX-512.
Then forget about this instruction. If you want to use it anyway, you need a processor that supports it, and for now you will also have to add an availability check and a software emulation fallback. Otherwise most users will not be able to run your program. In 5 or 10 years it may be universal. For now, it's just a whim.
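
The availability check mentioned here could look roughly like the sketch below. It is written in C with the GCC/Clang `<cpuid.h>` helper and tests the AVX512_BF16 feature bit, which Intel documents as CPUID leaf 7, subleaf 1, EAX bit 5. The function name is hypothetical. This is not a complete check: production code would also need to confirm AVX512F and that the operating system has enabled the ZMM register state (via XGETBV).

```c
#include <assert.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Returns 1 if CPUID reports AVX512_BF16, otherwise 0.
   Only the feature bit is examined; OS support is NOT verified here. */
static int cpu_reports_avx512_bf16(void) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    /* Subleaf 0 of leaf 7 reports the highest valid subleaf in EAX. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) || eax < 1)
        return 0;
    if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((eax >> 5) & 1u);
#else
    return 0; /* not an x86 build: the instruction cannot exist */
#endif
}
```

A program would call this once at startup and select either the VDPBF16PS path or the emulation path accordingly.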
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20459
Location: In your JS exploiting you and your system
revolution 10 Oct 2024, 20:08
macomics wrote:
So do a loop of 1`000`000`000 iterations with and without this instruction and look at QueryPerformanceCounter
Or you can try RDTSC as well.

Either way, the results of synthetic tests might be meaningless without the full context of how the instructions are used in the real application. There is more that goes into the run times of programs than just some isolated throughput and/or latency measurements. A lot more. CPUs are very complex.
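
For reference, reading RDTSC from C might look like the sketch below, assuming GCC or Clang on x86-64 with the `__rdtsc()` and `_mm_lfence()` intrinsics from `<x86intrin.h>`. The helper names are invented for illustration. Note that on modern CPUs the time stamp counter ticks at a constant reference rate rather than the current core clock, so the caveats above about synthetic tests apply in full.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <x86intrin.h>
#endif

/* Read the time stamp counter, fenced so that earlier instructions
   have completed before the read. Returns 0 on non-x86-64 builds. */
static uint64_t read_tsc(void) {
#if defined(__x86_64__)
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
#else
    return 0;
#endif
}

/* Placeholder kernel; the code under test would replace it. */
static void nop_kernel(void) {}

/* TSC ticks spent running fn `iterations` times (includes loop overhead). */
static uint64_t tsc_ticks_for(void (*fn)(void), uint64_t iterations) {
    assert(fn != 0 && iterations > 0);
    uint64_t start = read_tsc();
    for (uint64_t i = 0; i < iterations; i++)
        fn();
    return read_tsc() - start;
}
```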
macomics 10 Oct 2024, 20:20
revolution wrote:
Either way, the results of synthetic tests might be meaningless without the full context of how the instructions are used in the real application. There is more that goes into the run times of programs than just some isolated throughput and/or latency measurements. A lot more. CPUs are very complex.
At least this way you can gauge how effective the instruction is compared with code that emulates it. But without hardware support, it still leads nowhere. @Roman, don't waste your time. Write something else.
revolution 10 Oct 2024, 20:50
Here is an example of unexpected run times, specifically with the AVX512 stuff. I was recently updating code on some AVX512 servers. Based on the docs and test results, I calculated that the AVX512 code would run faster than the existing 256-bit AVX code.

So I coded it up and was surprised to discover that the AVX512 code ran slower than the old 256-bit AVX code. It initially made no sense to me. All the synthetic tests I did, and the docs, clearly showed improvements with the new instructions. After much time and consternation I finally discovered that the reason was thermal and clocking. The CPUs reduced their clocks when AVX512 instructions were used, and the temperatures went up, keeping the clocks low even after the AVX512 code had finished its run. Multi-threaded AVX512 code is very hard on power usage. Single-threaded code didn't show this effect.

This was no doubt some combination of the CPU/mobo/cooling that we have here, and other systems might get opposite results.

Trusting some values given in a test, or in a doc, might be a great way to reduce the performance of your code. And you would never know without testing and comparing.


Last edited by revolution on 11 Oct 2024, 04:17; edited 1 time in total
Roman 11 Oct 2024, 00:56
With SSE, a 4x4 matrix multiply costs 74 asm instructions.
With AVX-512, it is one asm instruction.
VDPBF16PS is useful.

But it would be interesting to know how much faster the new TDPBF16PS (AVX10.2) could be.
revolution 11 Oct 2024, 01:09
Roman wrote:
But it would be interesting to know how much faster the new TDPBF16PS (AVX10.2) could be.
That entirely depends upon your application.
sinsi



Joined: 10 Aug 2007
Posts: 794
Location: Adelaide
sinsi 11 Oct 2024, 03:22
Here's a good explanation on Stack Overflow:
"SIMD instructions lowering CPU frequency"
Roman 12 Oct 2024, 07:59
The new AMD Ryzen 9 9950X CPU supports AVX-512 and VDPBF16PS.