flat assembler
Message board for the users of flat assembler.

Mandelbrot Benchmark FPU/SSE2 released

bitRAKE



Joined: 21 Jul 2003
Posts: 2940
Location: vpcmipstrm
bitRAKE
Kümmel Mandelbrot Benchmark V 0.53H-32b-MT_FPU
Speed [Million Iterations / Second] : 191.600

Kümmel Mandelbrot Benchmark V 0.53H-32b-MT_SSE2
Speed [Million Iterations / Second] : 177.114

Kümmel Mandelbrot Benchmark V 0.53H-32b-MT_SSE2_PM
Speed [Million Iterations / Second] : 178.890

Not only impressive, but personally unexpected FPU performance!
(1.6 GHz Pentium M (Dothan) as before, but I haven't rebooted in maybe a month, which usually affects results.)

_________________
¯\(°_o)/¯ unlicense.org
Post 02 Jun 2008, 01:58
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
Madis731
Liek WOW!

EDIT: Sorry, I lied :P I first tested a T7200 and thought it was a T9300; here are the corrected stats:
Code:
;T7200 / 64-bit 2003 Server / 1gig of RAM / integrated graphics
Kümmel Mandelbrot Benchmark V 0.53H-32b-MT_FPU
Speed [Million Iterations / Second] : 571,990

Kümmel Mandelbrot Benchmark V 0.53H-32b-MT_SSE2
Speed [Million Iterations / Second] : 1338,323

Kümmel Mandelbrot Benchmark V 0.53H-32b-MT_SSE2_PM
Speed [Million Iterations / Second] : 1304,760

FPU eff. like on your homepage: 142,998
SSE2 eff. 334,581
SSE2:FPU == 2,34:1
    

Now the real T9300:
Code:
;T9300 / 64-bit 2003 Server / 4gigs of RAM / integrated graphics
Kümmel Mandelbrot Benchmark V 0.53H-32b-MT_FPU
Speed [Million Iterations / Second] : 697,183

Kümmel Mandelbrot Benchmark V 0.53H-32b-MT_SSE2
Speed [Million Iterations / Second] : 1607,021

Kümmel Mandelbrot Benchmark V 0.53H-32b-MT_SSE2_PM
Speed [Million Iterations / Second] : 1549,200

FPU eff. like on your homepage: 139,437
SSE2 eff. 321,404
SSE2:FPU == 2,305:1
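
For reference, the "eff." values above are consistent with speed divided by (cores × clock in GHz). That formula is inferred from the numbers, not stated in the thread, and the core counts and clocks below (T7200: 2 cores at 2.0 GHz, T9300: 2 cores at 2.5 GHz) are assumptions taken from the chips' published specifications. A minimal Python sketch:

```python
def efficiency(speed, cores, clock_ghz):
    """Million iterations per second, per core, per GHz (inferred formula)."""
    return speed / (cores * clock_ghz)


# name, cores, clock in GHz (both assumed), FPU speed, SSE2 speed (from the posts)
results = [
    ("T7200", 2, 2.0, 571.990, 1338.323),
    ("T9300", 2, 2.5, 697.183, 1607.021),
]

for name, cores, ghz, fpu, sse2 in results:
    print(f"{name}: FPU eff. {efficiency(fpu, cores, ghz):.2f}, "
          f"SSE2 eff. {efficiency(sse2, cores, ghz):.2f}, "
          f"SSE2:FPU {sse2 / fpu:.2f}:1")
```

Per core and per GHz the two chips land close together, which is what the near-identical SSE2:FPU ratios suggest as well.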
    


Last edited by Madis731 on 02 Jun 2008, 11:46; edited 2 times in total
Post 02 Jun 2008, 09:06
Ivan2k2



Joined: 08 Sep 2004
Posts: 80
Location: Russia, Angarsk
Ivan2k2
penryn t8100, vista 32bit sp1

fpu - 609
sse2 - 1430
sse2pm - 1372
Post 02 Jun 2008, 10:08
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
...thanks for all the testing! All in line with my other results!

Here are some comparison graphs showing what has been achieved with the latest evolution, KMB V0.53H (with SSE2 'G' for Pentium M), compared to my very first non-optimized code with single instruction lines, no different exits and no loop unrolling, released almost 2 years earlier (KMB V0.53):

FPU-Version:
[Image: comparison graph, FPU version]
The verdict here: at first I thought neither Intel nor AMD had improved their FPU and that all were on the same level except the Pentium 4...which was clearly wrong after seeing the latest results. Intel made a hell of an improvement with the Core2Duo once you find out what this CPU needs...different instruction lines and loop unrolling to make full use of the out-of-order architecture and its execution units. Apart from having 4 cores, the Phenom shows no improvement.
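
To illustrate what "different instruction lines" could mean: one point's iteration is a single long dependency chain, so advancing two independent points in the same loop body gives an out-of-order core work it can overlap. The Python sketch below only shows the structure (Python itself gains nothing from it). The function names, the pairing of two points, and the exit handling are assumptions for illustration, not KMB's actual code.

```python
def iterate_one(cr, ci, max_iter):
    """Plain escape-time loop: one dependency chain."""
    zr = zi = 0.0
    for n in range(max_iter):
        zr2, zi2 = zr * zr, zi * zi
        if zr2 + zi2 > 4.0:
            return n
        zi = 2.0 * zr * zi + ci
        zr = zr2 - zi2 + cr
    return max_iter


def iterate_two(c0, c1, max_iter):
    """Two independent points advanced in the same loop body."""
    (ar, ai), (br, bi) = c0, c1
    zar = zai = zbr = zbi = 0.0
    na = nb = None  # escape counts; None while a point is still iterating
    for n in range(max_iter):
        if na is None:  # chain A
            zar2, zai2 = zar * zar, zai * zai
            if zar2 + zai2 > 4.0:
                na = n
            else:
                zai = 2.0 * zar * zai + ai
                zar = zar2 - zai2 + ar
        if nb is None:  # chain B, independent of chain A
            zbr2, zbi2 = zbr * zbr, zbi * zbi
            if zbr2 + zbi2 > 4.0:
                nb = n
            else:
                zbi = 2.0 * zbr * zbi + bi
                zbr = zbr2 - zbi2 + br
        if na is not None and nb is not None:
            break  # both points have escaped
    return (max_iter if na is None else na,
            max_iter if nb is None else nb)


print(iterate_two((0.3, 0.5), (2.0, 2.0), 100))
```

In assembly the two chains would sit in separate registers with their instructions interleaved; the separate exit checks per point roughly correspond to the "different exits" mentioned above.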

SSE2-Version:
[Image: comparison graph, SSE2 version]
Again the Core2Duo is in the lead. AMD couldn't keep up even with the same extension to 128-bit SSE2 bandwidth, though it is of course still much better than the AMD 64 design. Strange that the Pentium M is even a little slower here than with the FPU version.

I'm really keen on seeing results from the upcoming stuff like VIA Nano, Intel Nehalem (Hyperthreading) and, a long time later, Intel Sandy Bridge with 256-bit vector width (AVX)...in the meantime I'm still looking for a Pentium 4 result with Hyperthreading to show its benefit again. I guess with non-optimized code Hyperthreading would help on a Core2Duo.
Post 02 Jun 2008, 18:23
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
A guy with a 16 core (4 CPU Quad Core) was testing my benchmark, have a look at:
http://forums.2cpu.com/showthread.php?t=76178&page=8

The problem is that even when I made my benchmark run 10 times longer, his cores are not utilized to their full extent. I think everything works fine with single-CPU quad cores, though.

Any clues from you guys...some problem with the threading code here, or something else about the 4-CPU machine? These problems were seen before on some dual-CPU machines...
Post 01 Jul 2008, 22:51
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
rugxulo
Quote:

v0.53H 32bit
SSE2 9755.117
FPU 4858.580

System spec 4 Intel E7330 quad core 2.4Ghz, Win2k3 x64 SP2, 32GB of ram (running at 677).


Holy crap, Batman! :o (And you're wondering why it isn't faster??)
Post 07 Jul 2008, 09:16
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17473
Location: In your JS exploiting you and your system
revolution
Kuemmel wrote:
The problem is that even when I made my benchmark run 10 times longer, his cores are not utilized to their full extent. I think everything works fine with single-CPU quad cores, though.

Any clues from you guys...some problem with the threading code here, or something else about the 4-CPU machine? These problems were seen before on some dual-CPU machines...
I expect you may be hitting the memory bottleneck. 16 cores all trying to access memory will easily saturate the FSB. Wait for Nehalem with its integrated triple-channel memory controller; that might help to improve things.
Post 07 Jul 2008, 10:03
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Sorry for posting without checking, but does your program write to the video buffer directly? Since writes to video card memory must not be cached, it is possible that the reason is what revolution says; otherwise the cache memory of each core should help to prevent such memory bottlenecks.
Post 07 Jul 2008, 20:41
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
...thanks for the comments...no clue yet. I don't write directly to the video buffer, but also in the past doing it or not didn't have any effect...so I still wait for some tests of the guy with the huge machine...until now at least with a single quad core there was no trouble at all and all cores at 100 % load...

...in the meantime I also got a result for the Intel Atom on my webpage...what a huge step back in CPU technology...okay, it wasn't meant to be very fast, just to save power, but still...why go back to this in-order architecture...is that really the way to save power !???
Post 10 Jul 2008, 18:21
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17473
Location: In your JS exploiting you and your system
revolution
Kuemmel wrote:
in-order architecture...is that really the way to save power !???
Yes. All those extra transistors used to keep track of the instructions need power to work. Not every application needs a super-fast CPU to do its job.
Post 11 Jul 2008, 00:41
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
...well, for me the Atom is really a joke...at the link below (sorry, it's in German) one can see that an AMD 64 at 1 GHz is faster and consumes less power:
http://www.tomshardware.com/de/athlon-2000-Atom-230-Undervolting,testberichte-240084.html

The real star among energy-saving platforms is probably Tegra (Nvidia + ARM11), if only somebody would make a small notebook with it...sorry for being off topic ;)
http://sg.nvidia.com/page/handheld.html
Post 11 Jul 2008, 17:09
bitRAKE



Joined: 21 Jul 2003
Posts: 2940
Location: vpcmipstrm
bitRAKE
Might I suggest another project of a similar nature?

Barnes-Hut N-body algorithm would be an interesting challenge/benchmark.
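
As a starting point, here is a minimal 2D Barnes-Hut sketch in Python: bodies go into a quadtree, and a cell that looks small from the query point (width / distance < theta) is replaced by its centre of mass. Everything here is a simplification for illustration: the names are hypothetical, G is set to 1, a softening term eps is added, theta = 0.5 is just a commonly used value, and bodies are assumed to have distinct positions inside the root square.

```python
import math


class Node:
    """Square quadtree cell holding total mass and mass-weighted position."""

    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half  # centre and half-width
        self.mass = self.mx = self.my = 0.0         # m, sum(m*x), sum(m*y)
        self.body = None                            # (x, y, m) in a one-body leaf
        self.kids = None                            # four children once split

    def _child(self, x, y):
        return self.kids[(x >= self.cx) + 2 * (y >= self.cy)]

    def insert(self, x, y, m):
        if self.kids is None and self.body is None:  # empty leaf: store body
            self.body = (x, y, m)
        else:
            if self.kids is None:                    # occupied leaf: split it
                h = self.half / 2
                self.kids = [Node(self.cx + dx * h, self.cy + dy * h, h)
                             for dy in (-1, 1) for dx in (-1, 1)]
                old, self.body = self.body, None
                self._child(old[0], old[1]).insert(*old)
            self._child(x, y).insert(x, y, m)
        self.mass += m
        self.mx += m * x
        self.my += m * y

    def accel(self, x, y, theta=0.5, eps=1e-3):
        """Approximate acceleration at (x, y)."""
        if self.mass == 0.0:
            return 0.0, 0.0
        if self.kids is None:                        # leaf: exact body position
            px, py = self.body[0], self.body[1]
        else:                                        # cell: centre of mass
            px, py = self.mx / self.mass, self.my / self.mass
        dx, dy = px - x, py - y
        d = math.hypot(dx, dy)
        far = d > 0.0 and 2 * self.half / d < theta
        if self.kids is None or far:
            if d == 0.0:                             # the query body itself
                return 0.0, 0.0
            f = self.mass / (d * d + eps * eps) ** 1.5
            return f * dx, f * dy
        ax = ay = 0.0
        for kid in self.kids:                        # too close: open the cell
            kx, ky = kid.accel(x, y, theta, eps)
            ax += kx
            ay += ky
        return ax, ay


def build_tree(bodies, half=1.0):
    root = Node(0.0, 0.0, half)
    for x, y, m in bodies:
        root.insert(x, y, m)
    return root


def direct_accel(bodies, x, y, eps=1e-3):
    """O(N) reference sum over all other bodies."""
    ax = ay = 0.0
    for bx, by, m in bodies:
        dx, dy = bx - x, by - y
        d = math.hypot(dx, dy)
        if d > 0.0:
            f = m / (d * d + eps * eps) ** 1.5
            ax += f * dx
            ay += f * dy
    return ax, ay
```

With theta = 0 every cell is opened and the result matches the direct sum; larger theta trades accuracy for speed, which is the knob that would make it interesting as a benchmark.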

_________________
¯\(°_o)/¯ unlicense.org
Post 15 Jul 2008, 05:57
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
bitRAKE wrote:
Might I suggest another project of a similar nature?

Barnes-Hut N-body algorithm would be an interesting challenge/benchmark.

...why not...I googled around a bit and found some stuff about gravitational attraction between stars and the like regarding that n-body thing...do you know any good C-code implementation with visualisation to start with !?

...and yes, I'm still thinking what to code next, I found also some nice stuff like singularities: http://www.imaginary2008.de/surfer.php
(just a small formula describes these surfaces)
...though I wonder if it's visualized by a raytracing algorithm...has anybody ever done a simple raytracer in ASM?
Post 15 Jul 2008, 21:55
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
So they undervolt the AMD64 but keep the ATOM running at stock voltage? Perhaps the ATOM could run undervolted as well? And what about long-term stability? Interesting test anyway, too bad it's in German (WHEN will people learn to only publish in English? ;)).

I still think the idea behind the ATOM is OK, and considering it's basically first-gen, it's not too bad. With a 2nd-gen ATOM (we'll see...) and a more optimized chipset, it doesn't seem like a bad idea to me. Also, in-order CPUs are easier to hand-optimize for than OOO.
Post 16 Jul 2008, 00:15
bitRAKE



Joined: 21 Jul 2003
Posts: 2940
Location: vpcmipstrm
bitRAKE
Kuemmel wrote:
bitRAKE wrote:
Barnes-Hut N-body algorithm would be an interesting challenge/benchmark.

...why not...I googled around a bit and found some stuff about gravitational attraction between stars and the like regarding that n-body thing...do you know any good C-code implementation with visualisation to start with !?
http://gravit.slowchop.com/

http://www.amara.com/papers/nbody.html

_________________
¯\(°_o)/¯ unlicense.org
Post 16 Jul 2008, 18:08
adnimo



Joined: 18 Jul 2008
Posts: 49
adnimo
Did you guys benchmark on a P2? I have one gathering dust at home, a 333 MHz Pentium II; I could set it up if there's still a need for it.
Post 31 Jul 2008, 14:00
bitRAKE



Joined: 21 Jul 2003
Posts: 2940
Location: vpcmipstrm
bitRAKE
2x L5410 (8 cores)

4190.117 - SSE2
2312.324 - FPU
(running on Vista x64)

Should add a column in the stats for power efficiency (Million Iterations / Watts TDP). :D
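
A sketch of what that column could contain: speed divided by total TDP. The 50 W per L5410 used here is an assumption taken from Intel's published specification, and TDP is only a rough stand-in for real power draw.

```python
def iterations_per_watt(speed, tdp_watts, cpus=1):
    """Million iterations per second per watt of total TDP."""
    return speed / (tdp_watts * cpus)


L5410_TDP_WATTS = 50.0  # assumed per-CPU TDP, not stated in the thread

sse2_per_watt = iterations_per_watt(4190.117, L5410_TDP_WATTS, cpus=2)
fpu_per_watt = iterations_per_watt(2312.324, L5410_TDP_WATTS, cpus=2)

print(f"2x L5410 SSE2: {sse2_per_watt:.2f} MIter/s per watt")
print(f"2x L5410 FPU:  {fpu_per_watt:.2f} MIter/s per watt")
```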

_________________
¯\(°_o)/¯ unlicense.org
Post 02 Aug 2008, 09:07
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
adnimo wrote:
Did you guys benchmark on a P2? I have one gathering dust at home, a 333 MHz Pentium II; I could set it up if there's still a need for it.

Hi Adnimo,

if you've got time, why not...I just hope you can get WinXP running on it, if that's possible at all? ...because I got one report of the bench failing to run on a Pentium II with Win98...
Post 03 Aug 2008, 17:10
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
bitRAKE wrote:
2x L5410 (8 cores)

4190.117 - SSE2
2312.324 - FPU
(running on Vista x64)

Should add a column in the stats for power efficiency (Million Iterations / Watts TDP). :D

...regarding the efficiency per core per MHz, this machine seems to have the same problem as the one I reported here before from that www.2cpu.com forum. Do you have any idea why the full potential (100% load) isn't used on those 2-CPU machines !?
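
A rough check of that per-core, per-clock gap, assuming efficiency means SSE2 speed / (cores × GHz). The clocks are assumptions: 2.33 GHz for the L5410 from Intel's specification, 2.4 GHz for the E7330 as quoted earlier in the thread, and 2.5 GHz for the T9300. The chips are not identical designs, so the ratios are indicative only.

```python
def per_core_per_ghz(speed, cores, clock_ghz):
    """Million iterations per second, per core, per GHz."""
    return speed / (cores * clock_ghz)


# SSE2 speed from the posts, total cores, assumed clock in GHz
machines = {
    "T9300 (1 CPU)": (1607.021, 2, 2.5),
    "2x L5410": (4190.117, 8, 2.33),
    "4x E7330": (9755.117, 16, 2.4),
}

eff = {name: per_core_per_ghz(*spec) for name, spec in machines.items()}
baseline = eff["T9300 (1 CPU)"]

for name, value in eff.items():
    print(f"{name}: {value:.1f} per core per GHz, "
          f"{value / baseline:.0%} of the T9300")
```

Under these assumptions both multi-socket machines fall well short of the single-CPU laptop chip per core and per GHz, which matches the incomplete load described above.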
Post 03 Aug 2008, 19:56
bitRAKE



Joined: 21 Jul 2003
Posts: 2940
Location: vpcmipstrm
bitRAKE
Currently, I'm using the onboard video, and doubt that has any effect on the results. My guess would be memory contention over the thread data between the two CPUs. This could easily be tested by having threads select a data area based on which CPU is being used, cacheline aligned and all that goodness. Eh, I'm lazy though, so maybe just 16 copies of the work you've already done; that should show a change. Dirty cachelines going across the bus have to slow things down.
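
A small sketch of the cacheline-aligned layout suggested above: each thread gets its own slot, padded to a whole cache line, so no line holds data from two threads. The 64-byte line size, the 24-byte per-thread state, and all function names are assumptions for illustration; the benchmark's real data layout is not shown in the thread. Offsets are relative to a buffer assumed to start on a cache-line boundary.

```python
CACHE_LINE = 64  # bytes; assumed line size


def padded_size(size, line=CACHE_LINE):
    """Round size up to a whole number of cache lines."""
    return (size + line - 1) // line * line


def slot_offsets(n_threads, slot_size, line=CACHE_LINE):
    """Byte offset of each thread's private slot, one slot per line group."""
    stride = padded_size(slot_size, line)
    return [i * stride for i in range(n_threads)]


def shared_lines(offsets, size, line=CACHE_LINE):
    """Count cache lines that contain data of more than one thread."""
    owners = {}
    for thread, off in enumerate(offsets):
        first, last = off // line, (off + size - 1) // line
        for ln in range(first, last + 1):
            owners.setdefault(ln, set()).add(thread)
    return sum(1 for threads in owners.values() if len(threads) > 1)


STATE_SIZE = 24  # hypothetical bytes of per-thread state
packed = [i * STATE_SIZE for i in range(16)]   # slots back to back
padded = slot_offsets(16, STATE_SIZE)          # one cache line per slot

print("packed layout, shared lines:", shared_lines(packed, STATE_SIZE))
print("padded layout, shared lines:", shared_lines(padded, STATE_SIZE))
```

Every shared line in the packed layout is a candidate for bouncing between CPUs whenever two threads write to it; the padded layout costs some memory but removes that sharing.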

_________________
¯\(°_o)/¯ unlicense.org
Post 04 Aug 2008, 04:20

Copyright © 1999-2020, Tomasz Grysztar.