flat assembler
Message board for the users of flat assembler.

Mandelbrot Benchmark FPU/SSE2 released

Kuemmel
Yup, it does quite well! It went up from 191 MItPS to 214 MItPS!!! So roughly a gain of 12 %... all this stuff seems really CPU dependent, as I see... maybe there could be an AMD Phenom optimized variant of yours, too?
Post 06 Jan 2008, 01:22

Xorpd!
That's encouraging. The code is probably OK on Phenom as well because it's better equipped to handle the extra loads than Core 2 Duo is. From experience with the X3 versions we know that the same memory location should not be reused between instruction streams on Phenom. That is why I used two different locations, [.mov12] and [.mov34] in the snippet I posted.

Although unrolling may yield more performance improvement than going to 4 exits on your processor, the latter is more or less a prerequisite for the former, so that is the next logical step, IMO, if you choose to keep going.
Post 06 Jan 2008, 02:21

Kuemmel
Hm, yep, it might be possible to close the gap between AMD and Intel regarding the efficiency by optimizing for each. On my Sempron the Quickman results are 303 MItPS (Intel code) compared to 342 MItPS (AMD code). So about 13%. For your X4 version the efficiency differs by about 26% between Core 2 Duo and Phenom.

The main difference in Quickman is the divergence check (AMD: general-purpose registers, Intel: SSE2), and of course maybe he didn't find every possible optimization in the other parts of the code.

My next updates might take a bit more time... the Christmas holidays are over and work's calling ;) So your basic recommendation is still to adapt both an X3 and an X4 version for 32-bit?
Post 06 Jan 2008, 10:27

Kuemmel
I made something like Xorpd!'s X3 version, iterating 6 points in the main loop. I called it 0.53D. It finally matches the predicted performance gain of 100 %:

V0.53 - Core 2 Duo (1867 MHz): 519.338 MItPS
V0.53D - Core 2 Duo (1867 MHz): 1038.667 MItPS :)

Actually there's still some room for optimizing: I'm still not using extra exits from the main loop and not feeding in new points if two diverged in the main loop. I just applied what I would call "tree"-style code:

1.) Iterate 6 points -> 2 diverged ? -> exit
2.) Iterate remaining 4 points optimized with 2 more free registers -> 2 diverged ? -> exit
3.) Iterate remaining 2 points optimized with 2 more free registers -> 2 diverged ? -> finish -> drawing
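
A minimal Python sketch of that "tree" control flow, for illustration only. It is not the benchmark's SSE2 code: it assumes the 6 points are handled as 3 pairs (one pair per xmm register) and that "2 diverged" means both points of one pair have left the radius. The name tree_iterate and the squared radius of 4.0 are assumptions made for this example, and the extra registers freed in stages 2 and 3 are not modelled.

```python
def tree_iterate(points, max_iter=256, radius_sq=4.0):
    """Iterate 6 points as 3 pairs; drop a pair once both of its points diverged.

    Returns (iteration counts per point, number of active pairs per loop trip).
    """
    assert len(points) == 6
    z = [0j] * 6
    count = [0] * 6
    alive = [True] * 6
    pairs = [(0, 1), (2, 3), (4, 5)]   # hypothetical: one pair per xmm register
    active_pairs = []
    for _ in range(max_iter):
        if not pairs:
            break
        active_pairs.append(len(pairs))
        for pair in pairs:
            for i in pair:
                if not alive[i]:
                    continue           # a diverged lane no longer counts
                z[i] = z[i] * z[i] + points[i]
                if z[i].real * z[i].real + z[i].imag * z[i].imag <= radius_sq:
                    count[i] += 1
                else:
                    alive[i] = False
        # "2 diverged ? -> exit": continue with the remaining pairs only
        pairs = [p for p in pairs if alive[p[0]] or alive[p[1]]]
    return count, active_pairs


counts, trips = tree_iterate([0j, 0j, 2 + 2j, 2 + 2j, 0j, 1 + 0j], max_iter=10)
print(counts, trips)
```

With these sample points the second pair drops out after the first trip, so the remaining nine trips run with two pairs only.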

I now also use PADDD to add the negative iteration counters, which saves an instruction per block; only before plotting do I apply a reverse subtract to get it right. Also loop unrolling (1 time) is applied for each iteration loop. Furthermore there's an AMD-optimized version included, based on the hints Xorpd! gave.

So it's still not the end; the further stuff lies more in the code logic of Xorpd!'s versions, but I think the air is getting thinner for 32-bit now, as it's only about 30 % away from the X4 64-bit version, and the main work is done at the "black" points of the Mandelbrot image, which run in the main loop perfectly.
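
Why adding the compare result directly works can be shown with a small Python model of one 32-bit lane. This is a sketch under assumptions: the mask is taken to come from a packed compare such as CMPLEPD (all ones while the point is still inside the radius), and the "reverse subtract" is modelled as a plain negation. The helper names are made up for this example.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def paddd_lane(a, b):
    """Emulate one 32-bit lane of PADDD (wrap-around addition)."""
    return (a + b) & MASK32


def compare_mask(inside):
    """A packed compare yields all ones (-1 as signed) when true, else zero."""
    return MASK32 if inside else 0


def count_iterations(inside_flags):
    """Accumulate the masks, then negate once before plotting."""
    counter = 0
    for inside in inside_flags:
        counter = paddd_lane(counter, compare_mask(inside))  # adds -1 or 0
    return -to_signed32(counter)


print(count_iterations([True, True, True, False, False]))
```

The counter runs negative inside the loop, so no separate AND with a constant 1 is needed per block; the sign is fixed once at the end.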


Last edited by Kuemmel on 13 Jan 2008, 20:03; edited 1 time in total
Post 13 Jan 2008, 14:36

bitRAKE
Pentium M Dothan 1.6 GHz

Kümmel Mandelbrot Benchmark V 0.53-MT-FPU
Speed [million iterations / second] : 107.168

Kümmel Mandelbrot Benchmark V 0.53D-32b-MT-SSE2_AMD
Speed [million iterations / second] : 167.095

Kümmel Mandelbrot Benchmark V 0.53D-32b-MT-SSE2_Intel
Speed [million iterations / second] : 171.514

Almost 25% faster compared to previous SSE2 version.
Post 13 Jan 2008, 18:18

Kuemmel
Hi there,

thanks for testing... I got the first strange evidence that the AMD Phenom might not be in line with the old AMD CPUs; in fact the Intel version is faster, so optimizing for the old AMD architecture doesn't work for the Phenom:

AMD Phenom 9600, quad-core 2.3 GHz, Vista 32-bit
FPU - 584.967
SSE2-AMD 1638.633
SSE2-Intel 1847.717
Post 13 Jan 2008, 20:08

Xorpd!
Kümmel, since you have about 2X performance of the original SSE2 code on Core 2 Duo, it means you are about on par with quickman for single-core 32-bit performance which is great because you've got multithreading and quickman will need major refactoring to get there.

Once again I perceive a deficiency in Phenom performance -- it should be slaughtering my Core 2 Duo on this benchmark but it's only a little faster. I tried a little experiment: I changed the moves through [.mov12] etc. back to xmm2 in the thread_draw_sse2_amd version and it became slightly faster than even the thread_draw_sse2_intel version.

Perhaps with a 32-bit X3 version, maybe even the X2 version, there is already so much memory traffic that moving through memory saps performance. I suggest you make this change in your AMD version and obtain new timing results, especially for Sempron and Phenom.

BTW, have you considered an X2 or X3 version of the FPU benchmark? On processors with 128-bit wide execution paths it won't be competitive, of course, but on earlier processors such as P4 or Athlon 64 it might be. Naturally such a version could be enhanced to extended precision, allowing slightly tighter zooms than could be readily attained with SSE2.
Post 14 Jan 2008, 07:21

Madis731
Maybe a stupid question, but what is FPU128 and where is it used? I'd like to think that using FPU32/64 might be faster even if you did multiple calculations to increase precision, but I don't even know what CPUs have it and how fast it really is. ^o)
Post 14 Jan 2008, 07:35

Xorpd!
Quote:

Maybe a stupid question, but what is FPU128 and where is it used?

Are you referring to my remark:
Quote:

On processors with 128-bit wide execution paths it won't be competitive

? If so, what I was trying to express was that, on Core 2 Duo and Phenom the processor can crunch up a 128-bit wide floating point operation consisting of two double-precision operations or four single-precision operations as a single instruction. When such an instruction is issued to a P4 or an Athlon 64, it is broken into halves, each half being issued in a different clock cycle.

Double precision FPU code can thus at least in principle issue at the same rate as double precision SSE2 code on Athlon 64, although it may not be possible on P4 because all the floating point instructions have to go through port 1. With HT, maybe there is an extra port 1, though, so it may still be possible. It would be an interesting experiment, however not one that I myself am going to invest time in coding. If you want to try it all I can do is cheer you on from the sidelines.
Post 14 Jan 2008, 08:21

Madis731
Xorpd! wrote:

Are you referring to my remark:
Quote:

On processors with 128-bit wide execution paths it won't be competitive

? If so, what I was trying to express was that, on Core 2 Duo and Phenom the processor can crunch up a 128-bit wide floating point operation consisting of two double-precision operations or four single-precision operations as a single instruction. When such an instruction is issued to a P4 or an Athlon 64, it is broken into halves, each half being issued in a different clock cycle.

No, I wasn't referring to your remark, and I know about the wide execution on Core 2. I think I should express myself more clearly, because the last time it sounded so "in context". Actually it's a bit OT, but at http://cpuid.com/pcwizard.php you'll find a table on the left with the supported OS/HW, and there's:
CPU Features

* MMX, 3DNow!, 3DNow! Enhanced, 3DNow! Pro
* SSE, SSE2, SSE3, S-SSE3
* SSE4a, SSE4.1, SSE4.2
* SSE5
* FPU128 ??? (Didn't find it with the help of Google...)

_________________
My updated idol :D http://www.agner.org/optimize/
Post 14 Jan 2008, 08:53

Kuemmel
Hi Xorpd!

You seem to have found the golden way for both worlds for the moment; exchanging the memory access for registers really helped for both. Just for the Phenom I still have to wait for results... it would be nice to maintain only one version ;)

AMD Sempron 1800 MHz: 220.340 (before 213.283)
Intel Core 2 Duo 1800 MHz: 1059.882 (before 1038.667)

I attached the latest version, called 0.53E, just for evaluation. I didn't include the source code; I have to clean everything up later and put it on my webpage.

Regarding the efficiencies, yep, for Core 2 Duo the comparison to Quickman looks fine; it's just for AMD that it looks quite strange to me. I put together a table of efficiencies (Iter/MHz for a single core):

[Image: table of efficiencies per CPU and benchmark version; not preserved]

You can see that it looks like the Quickman code still has the lead (per core only, of course) even compared to the 0.57 X4 version... I don't get why... such a huge impact from the iteration count or instruction interleaving...?

@BitRake: Can you test the 0.53E on your Pentium M again? Thanks

Regarding using the FPU... I also thought about paying some attention to that ancient ;) thing. There really could be an X2 or X3 version somehow, as theoretically there are 8 registers, I guess... it's just that awful stack design. Maybe next Christmas holiday ;)


Attachment: KMB_V0.53E_test.zip (2.76 KB)

Post 14 Jan 2008, 18:54

bitRAKE
164.6 - very consistently.
Post 14 Jan 2008, 19:43

Kuemmel
bitRAKE wrote:
164.6 - very consistently.
Uuups... we seem to have a Pentium M issue now, lower than before.

At least Phenom is happy now: 1867.108 (Phenom 9600) instead of 1638.633 or 1847.717 like before... it seems one can't make everybody happy ;)
Post 14 Jan 2008, 20:49

Madis731
1106.988 on T7200, but the result message's string says it's V0.53D ;) This hasn't been fixed...

Quote:

it seems one can't make everybody happy

It's like a movie "recommended for all ages". It usually performs very poorly... for every age :D
Post 15 Jan 2008, 07:16

Xorpd!
The AMD results aren't very good, are they? I looked at instruction_tables.pdf
Code:
movaps   xmm3, [.two]                  ; [1] FMISC
mulpd    xmm3, xmm1                    ; [1] FMUL
mulpd    xmm1, xmm1                    ; [1] FMUL
mulpd    xmm3, xmm0                    ; [1] FMUL
mulpd    xmm0, xmm0                    ; [1] FMUL
movaps   xmm2, xmm0                    ; [2] FA/M
subpd    xmm0, xmm1                    ; [1] FADD
addpd    xmm1, xmm2                    ; [2] FADD
cmplepd  xmm1, [.local_radiant]        ; [2] FADD
addpd    xmm0, [.local_rz0_12]         ; [1] FADD
movmskpd edi, xmm1                     ; [3] FADD
paddd    xmm1, [.local_iter_count_12]  ; [2] FA/M
movaps   [.local_iter_count_12], xmm1  ; [2] FMISC
movaps   xmm1, [.local_iz0]            ; [1] FMISC
addpd    xmm1, xmm3                    ; [1] FADD
test     edi, edi                      ; [3] ALU
jz       .continue_only_with_34_56     ; [3] ALU

In the above, instructions marked [1] are required for the basic Mandelbrot iteration, those marked [2] are required for updating the iteration counters, and those marked [3] are used to check whether the iteration is complete. The [3] instructions have been reduced in frequency by half due to unrolling, but the [2] instructions could also be reduced by unrolling if the scheme with 6 exit points were adopted.
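
Read lane by lane, the listing amounts to roughly the following scalar Python model. This is a sketch under assumptions: xmm0 is taken to hold the real parts and xmm1 the imaginary parts (suggested by the names .local_rz0_12 and .local_iz0), .local_radiant is taken to be the squared bailout radius, and the function name loop_body is made up here.

```python
def loop_body(re, im, rz0, iz0, radius_sq, counter):
    """One lane of one pass through the listing.

    Returns (new_re, new_im, new_counter, inside).
    """
    two_re_im = (2.0 * im) * re          # [1] xmm3 = [.two] * xmm1 * xmm0
    im_sq = im * im                      # [1] mulpd xmm1, xmm1
    re_sq = re * re                      # [1] mulpd xmm0, xmm0
    inside = (im_sq + re_sq) <= radius_sq  # [2] addpd + cmplepd
    new_re = (re_sq - im_sq) + rz0       # [1] subpd + addpd [.local_rz0_12]
    new_im = iz0 + two_re_im             # [1] load [.local_iz0] + addpd xmm3
    new_counter = counter + (-1 if inside else 0)  # [2] paddd with the mask
    return new_re, new_im, new_counter, inside     # [3] tests 'inside'


state = (0.0, 0.0, 0)
for _ in range(2):
    re, im, counter = state
    re, im, counter, inside = loop_body(re, im, 1.0, 0.5, 4.0, counter)
    state = (re, im, counter)
print(state)
```

The counter goes down by one per iteration while the point is inside the radius, matching the paddd on the compare mask.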

In the [1] instructions we count 3 FADD, 4 FMUL, and 2 FMISC instructions. The [3] instructions have 1 FADD and 2 ALU, and the [2] instructions have 2 FADD, 2 FA/M, and 1 FMISC. According to Agner Fog, the FA/M instructions alternate between FADD and FMUL, so the 2 FA/M are equivalent to 1 FADD and 1 FMUL. All these instructions except the ALU ones are double-decode on Athlon 64, so the logjam on the FADD unit must take at least (3+1/2+3)*2 = 15 clocks to complete a loop trip which counts as two iterations, so that would be 15/2 = 7.5 clocks per iteration. Going back to your table a couple of posts up, we see you are measuring 1000/122 = 8.2 clocks per iteration. You are therefore close to the maximum throughput possible for the loop as written.

Now it may be clearer why quickman is doing better on AMD: due to unrolling with 4 exit points as they do, they are reducing the overhead of the [2] instructions as well as the [3] instructions. If you went to 6 exit points and unrolling by 2, you should get (3+1/2+3/2)*2 = 10 clocks for a best case loop trip, thus 10/2 = 5 clocks per iteration, with an efficiency of 1000/5 = 200. This is comparable to what quickman is getting.

The situation for Phenom is similar except that all instructions are now single-decode except for the store and the movaps xmm2, xmm0 instruction is now FANY. Best-case throughput should be something like 3+1/2+2.5 = 7 clocks per loop trip, or 7/2 = 3.5 clocks per iteration. In practice you are measuring 1000/203 = 4.9 clocks per iteration, so not as close to throughput-limited as was the case for the Athlon 64. Even so I have high hopes that unrolling by two with 6 exit points would also help appreciably with the Athlon, if you chose to implement it.
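
The unit counts used in the estimate can be re-derived by tallying the listing. A small Python sketch: the list simply transcribes the category and unit comments of the 17 instructions, and counting each FA/M as half an FADD follows the Agner Fog remark quoted above. The function names are made up for this example.

```python
from collections import Counter

# (execution unit, category) for each instruction of the listing, in order.
LOOP = [
    ("FMISC", 1), ("FMUL", 1), ("FMUL", 1), ("FMUL", 1), ("FMUL", 1),
    ("FA/M", 2), ("FADD", 1), ("FADD", 2), ("FADD", 2), ("FADD", 1),
    ("FADD", 3), ("FA/M", 2), ("FMISC", 2), ("FMISC", 1), ("FADD", 1),
    ("ALU", 3), ("ALU", 3),
]


def tally(loop):
    """Count instructions per execution unit, grouped by category."""
    result = {}
    for unit, category in loop:
        result.setdefault(category, Counter())[unit] += 1
    return result


def fadd_load(counts):
    """Load on the FADD pipe per category, with each FA/M counted as half."""
    return {cat: c["FADD"] + 0.5 * c["FA/M"] for cat, c in counts.items()}


counts = tally(LOOP)
for category in sorted(counts):
    print(category, dict(counts[category]))
print(fadd_load(counts))
```

The FADD-pipe loads of 3, 3 and 1 for categories [1], [2] and [3] are the terms that appear in the clock estimates, with the [3] term halved by the unrolling.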
Post 15 Jan 2008, 08:56

Xorpd!
Oh, my! Where did I come up with (3+1/2+3)*2 = 15 or 3+1/2+2.5 = 7? I'm doing arithmetic like a mathematician here! Clearly there are more bubbles in the FADD pipeline than indicated in my previous post so the performance improvement possible if unrolling by two with 6 exit points is less than previously estimated. Still I do believe it to be significant.
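
Evaluating the quoted expressions as written gives the following. A short Python check, assuming (as the 1000/122 calculation in the previous post implies) that the efficiency figure is iterations per 1000 clocks; the variable and function names are made up here.

```python
def clocks_per_iteration(efficiency):
    """Convert an efficiency (iterations per 1000 clocks) to clocks/iteration."""
    return 1000.0 / efficiency


athlon_trip = (3 + 1 / 2 + 3) * 2        # quoted as 15 in the previous post
phenom_trip = 3 + 1 / 2 + 2.5            # quoted as 7 in the previous post
unrolled_trip = (3 + 1 / 2 + 3 / 2) * 2  # quoted as 10

print(athlon_trip, athlon_trip / 2)      # clocks per loop trip, per iteration
print(phenom_trip, phenom_trip / 2)
print(unrolled_trip, unrolled_trip / 2)
print(round(clocks_per_iteration(122), 1), round(clocks_per_iteration(203), 1))
```

So the formulas give 13 and 6 clocks per loop trip rather than 15 and 7, i.e. 6.5 and 3 clocks per iteration, which is further below the measured 8.2 and 4.9 than first stated.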
Post 15 Jan 2008, 16:48

Kuemmel
...I updated my webpage with some results of the 0.53E version. Especially interesting is the test on P4 with and without HT (Hyperthreading) on SSE2.

Before, with the 'shitty' code of 0.53, HT gained about 20 % in speed; now with 0.53E there's no gain any more, even a small penalty. Of course overall it's still a gain compared to 0.53.

Somehow it sounds logical: bad code that doesn't use the CPU ports efficiently benefits from HT, good code doesn't, as HT can only create "artificial cores", as far as I understood it.

It's of interest because Intel's next Core 2 Duo based revision "Nehalem" will include HT again... we can guess now how useful it is or not... I guess for optimized code like the X3/X4 versions here it won't do much, or even nothing...
Post 23 Jan 2008, 23:11

Kuemmel
In the meantime Paul Gentieu released a new version of Quickman, finally supporting multicore! Maybe he'll take back the performance crown ;)

It would be interesting to see results on AMD X2 or Core 2 Duo or Quad (I don't have these at hand, so if someone wants to test, that would be nice; in his application just set 'Double precision', algorithm 'Intel or AMD exact' and set MAX Iters, Mandel to 2048)

EDIT: Don't forget to set 'Threads' to, let's say, '16' for comparison. And maybe it's easiest to use his 'bmark.log' and hit 'next' a couple of times.

The link to his file: http://sourceforge.net/projects/quickman/

EDIT: Got some average results on Core 2 Duo, 1867 MHz:
Exact, AMD, 16 Threads: 867 MIter/s (Efficiency: 232)
Exact, Intel, 16 Threads: 916 MIter/s (Efficiency: 245)
Exact, Intel, 2 Threads: 942 MIter/s (Efficiency: 252)

So quite close to 0.53E (Efficiency: 280), 0.57 X4 (Efficiency: 361).
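
The efficiency figures above are consistent with "iterations per 1000 clock cycles per core". A small Python sketch of that calculation; the formula is inferred from the numbers in this thread rather than stated anywhere, and the function name is made up.

```python
def efficiency(mitps, mhz, cores):
    """Iterations per 1000 clock cycles per core.

    mitps: measured speed in million iterations per second
    mhz:   clock frequency in MHz
    cores: number of cores used by the benchmark
    """
    return 1000.0 * mitps / (mhz * cores)


# Quickman on the Core 2 Duo at 1867 MHz (2 cores)
for label, speed in [("AMD, 16 threads", 867),
                     ("Intel, 16 threads", 916),
                     ("Intel, 2 threads", 942)]:
    print(label, round(efficiency(speed, 1867, 2)))
```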

I'll try to get some insight into how he implemented it and post it later.
EDIT: Paul said "As for multithreading- it was pretty simple. I just split the image up into "stripes" and give each thread several stripes to calculate (you can look in the source code for more info)."
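
A minimal Python sketch of that stripe idea, for illustration only. It is not Quickman's code: the round-robin assignment of stripes to threads, the coordinate window and all function names are assumptions made for this example. (In CPython the threads will not actually run the arithmetic in parallel; the sketch only shows the partitioning.)

```python
from concurrent.futures import ThreadPoolExecutor


def make_stripes(height, stripe_height):
    """Split rows 0..height-1 into (first_row, end_row) stripes."""
    return [(y, min(y + stripe_height, height))
            for y in range(0, height, stripe_height)]


def assign_stripes(stripes, n_threads):
    """Round-robin: thread t gets stripes t, t+n, t+2n, ... (an assumption)."""
    return [stripes[t::n_threads] for t in range(n_threads)]


def mandel_count(c, max_iter):
    z = 0j
    for n in range(max_iter):
        z = z * z + c
        if z.real * z.real + z.imag * z.imag > 4.0:
            return n
    return max_iter


def render_stripes(stripes, width, height, max_iter):
    """Compute every row of the given stripes; returns {row: [counts]}."""
    rows = {}
    for y0, y1 in stripes:
        for y in range(y0, y1):
            im = -1.5 + 3.0 * y / height
            rows[y] = [mandel_count(complex(-2.5 + 4.0 * x / width, im), max_iter)
                       for x in range(width)]
    return rows


def render(width, height, n_threads=4, stripe_height=2, max_iter=64):
    work = assign_stripes(make_stripes(height, stripe_height), n_threads)
    image = {}
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for rows in pool.map(
                lambda s: render_stripes(s, width, height, max_iter), work):
            image.update(rows)
    return [image[y] for y in range(height)]


print(assign_stripes(make_stripes(10, 2), 2))
```

Giving each thread several interleaved stripes, rather than one big block, tends to even out the load, since rows crossing the set cost far more iterations than rows outside it.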
Post 03 Feb 2008, 01:09

Kuemmel
Finally I managed to release a new version of the benchmark, now called 0.53F. You can get it from my webpage as usual in the x86 section or from the attachment here.

It was quite hard to implement this 6-exit solution, especially all the logic for the end of the line and the loop unrolling, but it was worth it: up to 20 % faster than before. At first I tried, as before, with SSE2 instruction counters, where I had to use memory accesses due to the lack of available registers. This was quite slow and didn't improve performance.

So now I do the whole iteration count with general-purpose registers: 3 registers for 6 points, 2 counters in each register. So each counter is limited to 16 bits. This sets the maximum iteration limit to 32767, which is still okay, I guess, but once more the limitations of 32-bit coding show up. I tried 6 iteration counters in 32-bit memory locations, but that was really slow on Core 2 Duo. My old AMD Sempron didn't mind. Something learned again... though Phenom will probably behave more like Core 2 Duo in that case, as shown before.
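
The packing of two counters into one 32-bit register can be sketched in Python as follows. This is only an illustration of the idea, not the benchmark's code: how the increments are actually generated is not described in the post, and the names bump, unpack and count_six are made up. As long as a counter stays at or below the quoted limit of 32767, the low counter can never carry into the high one.

```python
MASK32 = 0xFFFFFFFF
LIMIT = 32767  # maximum iteration count quoted in the post


def bump(reg, low_inside, high_inside):
    """Increment the low and/or high 16-bit counter with one 32-bit add."""
    inc = (1 if low_inside else 0) + (0x10000 if high_inside else 0)
    return (reg + inc) & MASK32


def unpack(reg):
    """Split a 32-bit register into its (low, high) 16-bit counters."""
    return reg & 0xFFFF, reg >> 16


def count_six(inside_history):
    """inside_history: one 6-tuple of booleans per iteration (True = inside)."""
    regs = [0, 0, 0]                     # 3 registers, 2 counters each
    for flags in inside_history:
        for r in range(3):
            regs[r] = bump(regs[r], flags[2 * r], flags[2 * r + 1])
    counts = []
    for reg in regs:
        counts.extend(unpack(reg))
    return counts


history = [
    (True, True, True, True, True, True),
    (True, False, True, True, False, False),
    (True, False, False, True, False, False),
]
print(count_six(history))
```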

Here are the results for the SSE2 version:
V0.53F - Core 2 Duo (1867 MHz): 1252.933 MItPS (Efficiency: 335.5)
V0.53F - AMD Sempron (1800 MHz): 262.334 MItPS (Efficiency: 145.7)

Any comments or results from different CPUs are welcome as usual.

I'll try to implement more ideas from Paul/Xorpd and also finally enhance the FPU version some day...

P.S.: Avira AntiVir detects my code as a TR/Crypt.XPACK.Gen virus... I checked my code on 'www.virustotal.com' and only AntiVir and Webwasher detected it... as I see, it's a kind of common problem with FASM and other apps here too... I just have no idea what to do about it, or do I really have a virus here!?


Attachment: KMB_V0.53F-32b-MT.zip (19.91 KB)

Post 30 Mar 2008, 21:40

f0dder
Hm, the SSE2 version crashes on my Q6600...
Post 30 Mar 2008, 21:51

Copyright © 1999-2020, Tomasz Grysztar.