flat assembler
Message board for the users of flat assembler.

Mandelbrot Benchmark FPU/SSE2 released

Kuemmel
Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
bitRAKE wrote:
Currently, I'm using the onboard video, and doubt that has any effect on the results. My guess would be memory contention of the thread data between the two CPUs. This could be easily tested by having threads select a data area based on which CPU is being used - cacheline aligned and all that goodness. Eh, I'm lazy though, so maybe just 16 copies of the work you've already done - should see a change. Dirty cachelines going across the bus have to slow things down.

...hm, okay. Just for a test, can you run my very old slow version and post the results? It's there:
http://www.mikusite.de/x86/KMB_V0.53_MT.zip
I did the threading and drawing a bit differently... maybe that gives a hint...
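The "cacheline aligned" per-CPU data areas suggested in the quote can be pictured as one padded slot per thread, so that no two threads ever write into the same cache line. A minimal sketch, assuming a 64-byte cache line; `slot_offsets`, `record_size` and the 24-byte example size are made-up names and values for illustration, not anything taken from KMB:

```python
CACHE_LINE = 64  # assumed cache line size in bytes (typical for x86)


def slot_offsets(n_threads, record_size, line=CACHE_LINE):
    """Return the byte offset of each thread's private record.

    Every record is padded up to a whole number of cache lines, so two
    threads never share a line as long as the buffer itself starts on a
    cache-line boundary.
    """
    stride = -(-record_size // line) * line  # round up to a full line
    return [t * stride for t in range(n_threads)]


if __name__ == "__main__":
    # 16 threads with 24 bytes of per-thread state each (hypothetical size)
    print(slot_offsets(16, 24)[:4])
```

In an assembler build the same idea would just be an aligned buffer indexed by thread number times the padded stride.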
Post 04 Aug 2008, 19:09
bitRAKE
Joined: 21 Jul 2003
Posts: 2887
Location: [RSP+8*5]
Kuemmel wrote:
Just for a test, can you run my very old slow version and post the results? It's there: http://www.mikusite.de/x86/KMB_V0.53_MT.zip
I did the threading and drawing a bit differently... maybe that gives a hint...
( 1125.018 FPU, 2314.821 SSE2 ) Aren't all threads still writing to the same global data? Are you able to discern anything additional from these data points? All cores hit 100% as before.

Also, I ran your ten times version from the 2cpu.com thread: ( 2319.196 FPU, 4241.355 SSE2 ). All cores very close to 100% for the complete duration. I'm tuning up (i.e. turning off) Vista and only expect those numbers to improve.

I'll create a test of the current version the day after tomorrow.

_________________
¯\(°_o)/¯ unlicense.org
Post 05 Aug 2008, 03:41
f0dder
Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
Sorry for OT, but: is there any particular reason you got that monster machine, bitRAKE? Or is it "because I could"? :)
Post 05 Aug 2008, 04:20
bitRAKE
Joined: 21 Jul 2003
Posts: 2887
Location: [RSP+8*5]
f0dder wrote:
Sorry for OT, but: is there any particular reason you got that monster machine, bitRAKE? Or is it "because I could"? :)
My serious play is better served by more processing power. :) So, the answer is "yes".

Post 05 Aug 2008, 05:39
Kuemmel
Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
bitRAKE wrote:
( 1125.018 FPU, 2314.821 SSE2 ) Aren't all threads still writing to the same global data? Are you able to discern anything additional from these data points? All cores hit 100% as before.

Also, I ran your ten times version from the 2cpu.com thread: (2319.196 FPU, 4241.355 SSE2 ). All cores very close to 100% for the complete duration. I'm tuning up (i.e. turning off) Vista and only expect those numbers to improve.

I'll create a test of the current version the day after tomorrow.

Hm, at least your CPUs show 100% load, compared to the 70% of the 2cpu.com guy. It seems that the later version is less efficient; maybe it's really the changed screen drawing and threading. I will try to make some test versions, maybe without graphics output... I just might need about 4 weeks, as I'm going on holiday :) ...in the meantime, feel free to experiment with my code if you want to... it's just weird that these things happen only with 2 CPUs; with 1 CPU, even with 4 cores, everything seems okay...
Post 05 Aug 2008, 16:49
Madis731
Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Maybe the new Nehalem will give us answers, once the memory controller is moved onto the chip and memory-based operations get faster. Q4 will be interesting, as AnandTech "promises" the same WOW! effect that Core 2 had.
Post 05 Aug 2008, 17:27
bitRAKE
Joined: 21 Jul 2003
Posts: 2887
Location: [RSP+8*5]
Finally got the computer all together and development software moved over. (No more complaints about taking up the kitchen table.) My previous comments are in error - now I think it has more to do with Windows threading than anything happening with the memory. Was able to get a substantial (imho) 11% improvement with only 8 threads and an interleave of two:
Code:
2249.829 / 4190.117 ; 16 threads, interleave 1
2329.580 / 4276.216 ; 16 threads, interleave 2
2392.059 / 4459.482 ;  8 threads, interleave 1
2429.290 / 4676.610 ;  8 threads, interleave 2    
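For reference, the improvement can be worked out directly from the table. This sketch assumes the two columns are FPU / SSE2, in the same order as the figures quoted earlier in the thread; `gain_percent` is just an illustrative helper name:

```python
def gain_percent(old, new):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# (FPU, SSE2) figures copied from the table above; column order is assumed
base = (2249.829, 4190.117)  # 16 threads, interleave 1
best = (2429.290, 4676.610)  #  8 threads, interleave 2

if __name__ == "__main__":
    for name, old, new in zip(("FPU", "SSE2"), base, best):
        print(f"{name}: {gain_percent(old, new):+.1f}%")
```

Under that assumption the second column gains roughly 11.6% and the first roughly 8%, so the quoted 11% appears to refer to the SSE2 figure.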
I was curious if this had something to do with running 32-bit code on a 64-bit OS. So, I did a comparison with Xorpd!'s modified x4 algo:
Code:
4845.812 - 16 threads, interleave 1
4771.547 - 16 threads, interleave 2
5083.158 - 8 threads, interleave 1
5001.501 - 8 threads, interleave 2    
...both thread count and interleave seem to have less of an impact.

Quickman bench.log:
Code:
3.567s - 4 threads
1.835s - 8 threads
2.031s - 16 threads
2.014s - 32 threads    
...says ~50 GFlops double precision.

*16 threads actually run faster without SetThreadAffinityMask!
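The post does not spell out what "interleave" means in this build. One plausible reading, and it is only an assumption, is that image rows are dealt out to the threads round-robin in blocks of `interleave` consecutive lines. A sketch of that scheme, with `rows_for_thread` as a hypothetical helper:

```python
def rows_for_thread(thread, n_threads, interleave, height):
    """Rows handled by `thread` when rows are dealt out round-robin in
    blocks of `interleave` consecutive lines (assumed scheme, not
    confirmed by the benchmark source)."""
    block = n_threads * interleave  # rows covered by one full round
    return [y for y in range(height)
            if (y % block) // interleave == thread]


if __name__ == "__main__":
    # 8 threads, interleave 2, a tiny 32-row image for illustration
    for t in range(3):
        print(t, rows_for_thread(t, 8, 2, 32))
```

With interleave 1 neighbouring rows always belong to different threads; larger values give each thread longer runs of adjacent rows, at the price of coarser load balancing.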

Post 07 Aug 2008, 23:10
Kuemmel
Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
bitRAKE wrote:
2329.580 / 4276.216 ; 16 threads, interleave 2
2392.059 / 4459.482 ; 8 threads, interleave 1
2429.290 / 4676.610 ; 8 threads, interleave 2

...hm, very interesting! Did you check whether a higher interleave, like 10 or so, has an even better effect?
Post 13 Aug 2008, 03:47
bitRAKE
Joined: 21 Jul 2003
Posts: 2887
Location: [RSP+8*5]
Kuemmel wrote:
bitRAKE wrote:
2329.580 / 4276.216 ; 16 threads, interleave 2
2392.059 / 4459.482 ; 8 threads, interleave 1
2429.290 / 4676.610 ; 8 threads, interleave 2

...hm, very interesting! Did you check whether a higher interleave, like 10 or so, has an even better effect?
...only gets worse.

Edit: just rechecked...have many processes running, too:
Code:
; 4247.126 ; 16 threads, interleave 1
; 4396.672 ; 16 threads, interleave 2
; 4427.855 ; 16 threads, interleave 3
; 4459.482 ; 16 threads, interleave 4
; 4427.854 ; 16 threads, interleave 5
; 4427.855 ; 16 threads, interleave 6
; 4427.855 ; 16 threads, interleave 8
; 4365.926 ; 16 threads, interleave 10
; 3865.805 ; 16 threads, interleave 25

; 4335.608 ; 8 threads, interleave 1
; 4557.135 ; 8 threads, interleave 2
; 4524.112 ; 8 threads, interleave 3
; 4540.564 ; 8 threads, interleave 4
; 4524.112 ; 8 threads, interleave 5
; 4491.565 ; 8 threads, interleave 6
; 4491.565 ; 8 threads, interleave 8
; 4459.482 ; 8 threads, interleave 10
; 4135.619 ; 8 threads, interleave 25    
...hopefully a better picture of how it changes.

Edit again: Sorry, those figures (the new ones above) had SetThreadAffinityMask commented out throughout!

Using SetThreadAffinityMask:
Code:
; 4396.672 ; 8 threads, interleave 1
; 4607.583 ; 8 threads, interleave 2
; 4607.583 ; 8 threads, interleave 3
; 4607.583 ; 8 threads, interleave 4
; 4607.583 ; 8 threads, interleave 5
; 4607.583 ; 8 threads, interleave 6
; 4524.112 ; 8 threads, interleave 8
; 4507.780 ; 8 threads, interleave 10
; 4190.117 ; 8 threads, interleave 25    
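For readers unfamiliar with it: SetThreadAffinityMask takes a bit mask in which bit n stands for logical processor n, and the thread may only run on processors whose bit is set. How the benchmark builds its masks is not shown in the thread; the sketch below assumes the simplest scheme, one processor per thread assigned round-robin, and only computes the mask without calling the Win32 function. `affinity_mask` and the count of 8 processors are illustrative assumptions:

```python
def affinity_mask(thread, n_cpus):
    """Bit mask with exactly one logical processor set, chosen
    round-robin (assumed policy); bit n stands for processor n."""
    return 1 << (thread % n_cpus)


if __name__ == "__main__":
    # 8 logical processors is a hypothetical value for illustration
    for t in range(4):
        print(t, hex(affinity_mask(t, 8)))
```

With more threads than processors the masks wrap around, so several threads end up pinned to the same core.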

Post 13 Aug 2008, 04:03
Kuemmel
Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Got the first VIA Nano (1.8 GHz) result:

FPU: 112.380 MIter (Efficiency: 62.4)
SSE2: 379.070 MIter (Efficiency: 210.6)

...not bad, I would say, especially SSE2, compared to that old VIA stuff...
Post 20 Aug 2008, 05:07
Madis731
Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Has anyone got an Atom? I can't find any Eee PC 900 that has one. They keep coming with Celerons and whatnot.
EDIT: Oh, I just found N270 entries on Kümmel's site, so no prob ;)

Btw, here are E8400@3GHz times with XP 32-bit:
Code:
FPU      857.004 / Eff. 142.8
SSE2    2010.717 / Eff. 335.1
SSE2PM  1926.937 / Eff. 321.2
    

And the Q6600@2.4GHz with Server 2003 64-bit:
Code:
FPU     1363.161 / Eff. 142.0
SSE2    3121.637 / Eff. 325.2
SSE2PM  3030.716 / Eff. 315.7
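The "Eff." column is not defined in this part of the thread, but the posted numbers are consistent with MIter/s divided by clock speed in GHz and by the number of cores. A small check of that inferred formula, using the fact that the E8400 is a dual-core and the Q6600 a quad-core part; `efficiency` is an illustrative name:

```python
def efficiency(miter_per_s, clock_ghz, cores):
    """MIter/s per GHz per core (formula inferred from the posted
    figures, not taken from the benchmark's documentation)."""
    return miter_per_s / (clock_ghz * cores)


if __name__ == "__main__":
    print(round(efficiency(857.004, 3.0, 2), 1))   # E8400, FPU
    print(round(efficiency(3121.637, 2.4, 4), 1))  # Q6600, SSE2
```

The same formula reproduces the other rows of both tables, which is why per-core, per-GHz figures can be compared across very different machines.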
    
Post 20 Aug 2008, 10:34
MCD
Joined: 21 Aug 2004
Posts: 604
Location: Germany
The reason AMD CPUs do so badly in this benchmark is that they are much more optimized for 3DNow! than for SSE, even the newer ones that got SSE3 and SSSE3.

So I would like to see someone make a comparative benchmark, one that uses the SSE1/2/3 code for Intel CPUs and the 3DNow!, 3DNow!+, MMX, MMX+ code for the AMD CPUs (you can take the Mandelbrot benchmark code I posted earlier in this thread for that). Does anyone have enough CPUs of both brands?
Post 25 Aug 2008, 04:29
Madis731
Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
@MCD: The problem is that AMD's 3DNow! (jeesh, how hard it is to type this thing :P no-caps, caps, caps, no-caps, no-caps, caps) uses only a 64-bit data type, same as MMX, and this means it has to be at least 2x faster than Intel on any SSE calculation to be faster overall...
I think if we can prove that AMD's 3DNow! can reach about half the speed of Intel's SSE, then we can agree that they have done a good job!
Post 25 Aug 2008, 08:45
bitRAKE
Joined: 21 Jul 2003
Posts: 2887
Location: [RSP+8*5]
Wow, big change with new video card:

0.53H-32b-MT_FPU : 2562.961
0.53H-32b-MT_SSE2 : 5525.022

(with 8 threads 5889.822)

Post 17 Sep 2008, 04:52
Kuemmel
Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Hi guys,

I got the first i7 Nehalem result (Intel Core i7 920 / 4000 MHz / 4 cores / HT on):
FPU: 2221.806 MIter/s - Efficiency: 138.9
SSE2: 6151.010 MIter/s - Efficiency: 384.4

It was just one run, so some inaccuracy is possible. Anyway, the result means about +13% for SSE2 and -4% for FPU compared to a Core2Duo at the same clock.

The FPU result seems a bit strange, but the SSE2 gain is more or less on the level of other floating-point-intense benchmarks... so another nice achievement by Intel... and with memory-intense benches that thing seems to fly... I guess spring '09 is time for shopping ;)

@Bitrake, just discovered your mail now, really interesting with the graphics card, that bottleneck didn't show up at all on 1 CPU systems...strange.
Post 13 Nov 2008, 19:39
Ivan2k2
Joined: 08 Sep 2004
Posts: 80
Location: Russia, Angarsk
p8400 - 2.26 GHz - vista 32bit
sse2 1508.037
fpu 653.404
Post 14 Nov 2008, 11:24
adnimo
Joined: 18 Jul 2008
Posts: 49
Kuemmel wrote:
adnimo wrote:
Did you guys benchmark on a P2? I have one gathering dust at home, a 333 MHz Pentium II; I could set it up if there's still a need for it.

Hi Adnimo,

if you've got time, why not... I just hope you can get WinXP running on it, if that's possible at all? ...because I got one report of the bench failing to run on a Pentium II with Win98...


Sorry for the delay!

I tried it on the P2 today, and sadly it's running on 9x - the screen just went black and I can see the hourglass cursor but nothing seems to be going on... in fact I couldn't return to the system, oh well.

How long do you think it would take to benchmark an Athlon XP 2600? I could try on that one (didn't see any in your table).
Post 17 Nov 2008, 05:17
Kuemmel
Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
...no problem about being late... seems to be a problem with the OS on the PII, I guess.

An Athlon XP 2600 would be interesting! From my Athlon result I would expect it to deliver about 227.xxx MIter/s for FPU. SSE2 isn't supported anyway.
Post 17 Nov 2008, 18:00
adnimo
Joined: 18 Jul 2008
Posts: 49
How long do you think it would take to finish the benchmark on that Athlon?

Regarding the OS issue, I don't have any spare license of XP so I can't really do much on that side, sadly.
Post 18 Nov 2008, 11:49
kalambong
Joined: 08 Nov 2008
Posts: 165
Have you tried running the benchmark on the new Windows 7 beta?
Post 25 Nov 2008, 03:53

Copyright © 1999-2020, Tomasz Grysztar.
