flat assembler
Message board for the users of flat assembler.
Windows > Mandelbrot Benchmark FPU/SSE2 released
Alphonso 24 Jan 2011, 12:14
AFAIK the Linpack benchmark with AVX can show better than 80% improvement over SSE. It would be interesting to see what it could do for Mandelbrot.
tthsqe 27 Mar 2011, 08:17
Has anyone updated this CPU burn to AVX support yet?
Madis731 28 Mar 2011, 09:19
yeah, that would be sweet!
tthsqe 01 Apr 2011, 13:09
Ok! I'll see what I can do to modify the previously posted code.
I'll start a new thread in the projects section within the next week.
revolution 14 Mar 2012, 09:05
kalambong wrote:
    Can someone point me to the new thread that tthsqe mentioned above, please?

http://board.flatassembler.net/topic.php?p=127809#127809
It is the very next post by tthsqe after the one above.
kalambong 01 Jun 2012, 06:33
Thanks !!
rugxulo 01 Jun 2012, 16:20
You need at least Win7 SP1 to use AVX, right? <sarcasm> Anyways, AVX is obsolete, AVX2 is teh real dealz!!! </sarcasm>
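On the OS-support question: AVX needs the operating system to save and restore the YMM registers, and a program can check that at run time. A minimal C sketch, assuming GCC or Clang on x86 (`avx_usable` is a hypothetical helper name, not something from this thread or from the benchmark):

```c
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#endif

/* Returns 1 if the CPU supports AVX and the OS has enabled YMM state,
   otherwise 0. On non-x86 builds it simply returns 0. */
int avx_usable(void)
{
#ifdef HAVE_X86_CPUID
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 1: ECX bit 27 = OSXSAVE, ECX bit 28 = AVX. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    const unsigned int need = (1u << 27) | (1u << 28);
    if ((ecx & need) != need)
        return 0;

    /* XGETBV with ECX = 0 reads XCR0. Bits 1 (SSE state) and 2 (AVX state)
       must both be set by the OS. Only executed once OSXSAVE is confirmed,
       because XGETBV faults otherwise. */
    unsigned int lo, hi;
    __asm__ volatile ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    (void)hi;
    return (lo & 0x6u) == 0x6u;
#else
    return 0;
#endif
}
```

Checking only the CPUID AVX bit is not enough: on an OS without YMM support the bit can be set while AVX instructions still fault.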
Vitor_boss 04 Jun 2012, 01:16
Intel Core i3 M330 2130MHz
Cores: 2 Threads: 4
FPU: 528,867
SSE2: 1493,606
SSE4.1: 1511,689
EDIT: Using KMB V 0.53I-32b-MT
Bernhard Schornak 21 Aug 2012, 09:07
AMD FX-8150 (3600 MHz)
FPU : 1318.538
SSE2 : 4641.840
SSE4.1: 4711.906

Processors with one execution pipe per core greatly benefit from your code, because it performs multiple operations on one and the same register in a row. Hence, the results do not show how 'performant' a CPU/FPU combination really is. Your current code simply puts additional execution pipes to sleep. In the case of the FX-8xxx, three out of four pipes are not fed with appropriate food most of the time, rendering them quite useless. Reordering instructions properly can yield much better results for processors with multiple execution pipes.
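To make the interleaving idea concrete, here is a minimal C sketch. It is not the benchmark's code (which is assembly); the function names are made up for illustration. `mandel_single` is one serial dependency chain, while `mandel_pair` advances two independent points in the same loop body so that their multiplies and adds do not depend on each other.

```c
#include <assert.h>

/* One point: each step needs the previous zr/zi, so this is a single
   serial dependency chain. Returns the number of iterations survived. */
int mandel_single(double cr, double ci, int max_iter)
{
    assert(max_iter >= 0);
    double zr = 0.0, zi = 0.0;
    int n = 0;
    while (n < max_iter && zr * zr + zi * zi <= 4.0) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        n++;
    }
    return n;
}

/* Two points interleaved: chain 0 and chain 1 share no data, so a CPU
   with several FP pipes can work on both in the same cycles. */
void mandel_pair(double cr0, double ci0, double cr1, double ci1,
                 int max_iter, int *n0, int *n1)
{
    assert(max_iter >= 0 && n0 && n1);
    double zr0 = 0.0, zi0 = 0.0, zr1 = 0.0, zi1 = 0.0;
    int a = 0, b = 0;
    int live0 = 1, live1 = 1;   /* 1 while the point has not escaped */

    for (int i = 0; i < max_iter && (live0 | live1); i++) {
        live0 &= (zr0 * zr0 + zi0 * zi0 <= 4.0);
        live1 &= (zr1 * zr1 + zi1 * zi1 <= 4.0);

        double t0 = zr0 * zr0 - zi0 * zi0 + cr0;
        double t1 = zr1 * zr1 - zi1 * zi1 + cr1;
        zi0 = 2.0 * zr0 * zi0 + ci0;
        zi1 = 2.0 * zr1 * zi1 + ci1;
        zr0 = t0;
        zr1 = t1;

        /* An escaped point keeps being updated (its value is ignored),
           but its counter no longer advances. */
        a += live0;
        b += live1;
    }
    *n0 = a;
    *n1 = b;
}
```

Both functions return the same counts for the same inputs; only the instruction-level parallelism available to the CPU differs. Whether the interleaved form is actually faster depends on the compiler and on the core's out-of-order machinery.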
Xorpd! 21 Aug 2012, 21:28
Serial operations on a single register are the path to using a big chunk of the physical register file, given limited architected registers and out-of-order execution. In fact, the numbers attained in this thread are not all that far off the maximum floating-point throughput for pre-AVX processors.
The problem with Bulldozer is that an 8-core CPU only has 4 FPUs and you need to do as much work as possible using FMACs to attain maximum throughput. I am not sure that the code in the companion thread http://board.flatassembler.net/topic.php?p=127809#127809 does this; I think it's all written for Intel processors, which would mean that it does not. You just can't schedule 3 or 4 instruction streams, especially in 32-bit mode, without counting on the out-of-order properties of the processor to interleave instruction streams for you. Maybe with 128 FP registers, but that is an expensive processor and doesn't provide SIMD as far as I know.
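A back-of-the-envelope sketch of why the FMACs matter for peak throughput. `peak_gflops` is a hypothetical helper, and the hardware figures in the note below it are assumptions that are not stated in this thread.

```c
#include <assert.h>

/* Theoretical peak in GFLOPS:
   FP pipes x SIMD lanes per pipe x flops per lane per cycle x clock in GHz.
   flops_per_lane is 2 when every instruction is a fused multiply-add,
   1 when each instruction is a plain add or a plain multiply. */
double peak_gflops(int fp_pipes, int lanes, int flops_per_lane,
                   double clock_ghz)
{
    assert(fp_pipes > 0 && lanes > 0 && flops_per_lane > 0);
    assert(clock_ghz > 0.0);
    return (double)(fp_pipes * lanes * flops_per_lane) * clock_ghz;
}
```

Assuming 4 FPUs with two 128-bit FMAC pipes each (8 pipes, 2 doubles per pipe) at 3.6 GHz, `peak_gflops(8, 2, 2, 3.6)` gives about 115.2 double-precision GFLOPS when every instruction is an FMA, and `peak_gflops(8, 2, 1, 3.6)` gives about 57.6 when the same pipes only execute separate adds and multiplies. Code written without FMA is therefore capped at roughly half the peak under these assumptions.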
tthsqe 25 Aug 2012, 02:30
Hey, it looks like we have someone with a Bulldozer chip!
Could you tell me how Bulldozer performs on the program at http://board.flatassembler.net/topic.php?p=127809#127809 ? You have to hit "R" to change the calculation path used. Just explore a little bit and tell me the max GFLOPS (upper left) you encountered on each calculation path.
Kuemmel 27 Aug 2012, 16:57
I would also be curious whether tthsqe's benchmark shows an improvement regarding the "Bulldozer" issue. My benchmark is, let's say, kind of obsolete in times of AVX, but of course not everyone has an AVX chip yet, so it's still relevant for those...
I can only state that my benchmark was never developed to favour either Intel or AMD. I tested reordering of instructions on old Semprons and Phenoms and saw almost no difference. I remember that when I got help from Xorpd regarding the multiple instruction streams, the Intels with Core and Core 2 architecture were just so much faster out of the box, while the AMDs didn't gain much. Phenom only picked up speed because of (I think) the doubling of the path at that time. Either the out-of-order design or the overall number of instruction units of the AMDs is just not as good as Intel's. As Xorpd stated, with AVX fused multiply-add instructions it might be a different story. My benchmark maybe reflects more the past and current non-AVX software, where I would say Intel's FPU/SSE is just superior.
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.