flat assembler
Message board for the users of flat assembler.

Mandelbrot Benchmark FPU/SSE2 released

f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 15 Apr 2008, 18:44
It takes a short while for them to power up though, so for benchmarks one should generally include some "burning rubber" warm-up code...
Ivan2k2



Joined: 08 Sep 2004
Posts: 80
Location: Russia, Angarsk
Ivan2k2 16 Apr 2008, 14:31
penryn t8100 2.1 GHz
fpu 439
sse 1372
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20453
Location: In your JS exploiting you and your system
revolution 16 Apr 2008, 14:38
f0dder wrote:
It takes a short while for them to power up though, so for benchmarks one should generally include some "burning rubber" warm-up code...
This burn-code should be included in the test proggy, spend 1/2 a second (or so) burning your rubber[1] then start timing. If you do it outside with a different proggy then you might have to start playing with priorities to make sure your other proggy is not stealing clocks from your test proggy.

[1] NOT the U.S. meaning of 'rubber' okay, stop thinking that!
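
A minimal sketch of the warm-up idea, written in Python only to show the structure (the benchmark itself is assembler). The names `warm_up` and `timed` and the 0.5-second default are illustrative assumptions, not part of KMB:

```python
import time

def warm_up(seconds=0.5):
    """Keep the CPU busy for roughly `seconds` so power management
    has a chance to raise the clock before any timing starts."""
    deadline = time.perf_counter() + seconds
    x = 0.0
    spins = 0
    while time.perf_counter() < deadline:
        x = x * 0.5 + 1.0   # throwaway floating-point work
        spins += 1
    return spins

def timed(work, warm_up_seconds=0.5):
    """Warm up inside the same process, then time `work()` on its own."""
    warm_up(warm_up_seconds)
    start = time.perf_counter()
    result = work()
    elapsed = time.perf_counter() - start
    return result, elapsed

if __name__ == "__main__":
    value, seconds = timed(lambda: sum(i * i for i in range(1_000_000)))
    print(value, seconds)
```

Doing the warm-up in the same process as the measurement avoids the priority juggling mentioned above, because no second program competes for clocks.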
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 16 Apr 2008, 16:42
Hm, is 1/2 second really enough with all these different technologies, SpeedStep, Cool'n'Quiet, and whatever? Is there no common way to tell the CPU that full speed is required...?

...how are these things controlled anyway? Who is telling the CPU that, for example, Excel is running at the moment but not being used?

I guess the guys from the overclocking benchmarking faculty strip down their OS to a minimum of running tasks and turn any energy saving completely off...
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20453
Location: In your JS exploiting you and your system
revolution 16 Apr 2008, 16:56
Kuemmel wrote:
Hm, is 1/2 second really enough with all these different technologies, SpeedStep, Cool'n'Quiet, and whatever? Is there no common way to tell the CPU that full speed is required...?
1/2 a second (or so) should be plenty.
Kuemmel wrote:
...how are these things controlled anyway? Who is telling the CPU that, for example, Excel is running at the moment but not being used?
The OS just uses the CPU use percentage to decide when to bump up/down a step.
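
As a toy model of that load-based stepping (not how any particular OS implements it), the sketch below moves one frequency step up or down per sampling interval. The thresholds, the number of steps and the function names are all hypothetical:

```python
def next_step(step, load_pct, n_steps, up_at=80.0, down_at=30.0):
    """Return the frequency step for the next interval.
    up_at / down_at are made-up thresholds for illustration only."""
    if load_pct >= up_at and step < n_steps - 1:
        return step + 1
    if load_pct <= down_at and step > 0:
        return step - 1
    return step

def simulate(loads, n_steps=4, start=0):
    """Feed one CPU-load sample per interval and record the step chosen."""
    step = start
    history = []
    for load in loads:
        step = next_step(step, load, n_steps)
        history.append(step)
    return history

if __name__ == "__main__":
    # An idle machine that suddenly runs a benchmark at 100 % load.
    print(simulate([100, 100, 100, 100, 100]))
```

In this model the top step is only reached after a few sampling intervals of full load, which is the reason for burning some cycles before the timed part starts.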
Alphonso



Joined: 16 Jan 2007
Posts: 295
Alphonso 16 Apr 2008, 17:24
Kuemmel wrote:
Alphonso wrote:
PIII Tualatin 6-B-1 1 GHz, FPU 0.53G Speed 69.0
Hm, quite low. I was expecting something on the level of 90 or 100, as I thought the PIII design was almost the same as the Pentium-M. Hm, got to read more about the design...
Well it is the cheaper Celeron type so I wouldn't expect the results to be as good as a 'real' PIII. Sorry, I should have labeled it a 'Celeron (PIII variant)' instead of PIII.
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
rugxulo 21 Apr 2008, 17:49
revolution wrote:

[1] NOT the U.S. meaning of 'rubber' okay, stop thinking that!


You'd have to ask Dex, but I think rubber also means eraser. Wink

Alphonso wrote:
Well it is the cheaper Celeron type so I wouldn't expect the results to be as good as a 'real' PIII. Sorry, I should have labeled it a 'Celeron (PIII variant)' instead of PIII.


Isn't the only difference like half the cpu cache of its big brother?
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 21 Apr 2008, 18:06
...it doesn't seem to matter much if PIII or Celeron type, as you can see with another PIII test on my updated results list...as that small code runs totally in the 1st level cache anyway and doesn't care much about anything else Very Happy

...by the way, some findings on the results:

1.) Hyper-Threading doesn't really help, as you can see when you compare the efficiencies shown...the P4 just sucks on FPU and Hyper-Threading doesn't help at all...so if you've got more or less well-optimized code, the future upgrade from Core 2 Duo to Nehalem won't do much good either if they don't improve Hyper-Threading...

2.) Before Core 2 Duo the AMD FPU was really good, but they didn't improve it for the Phenom, and the Phenom is also bad regarding clock rate and SSE2 efficiency compared to Core 2 Duo, though of course it may be a bit cheaper...

3.) Core 2 Duo took over the better FPU of the Pentium M, and its SSE2 implementation is superior to everything else.

4.) Will AMD ever have a chance to get back to the Core 2 Duo level? It seems they are way back in time, in the cheap CPU segment...sad...
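
One way to compare such results is to normalise the speed to clock rate and core count. The sketch below does that for two results posted in this thread. The normalisation (million iterations per second, per GHz, per core) is an assumption and may differ from the efficiency figure in the results list; the core counts (2 for the T8100, 4 for the Q6600) are added here, and the benchmark version behind the T8100 numbers is not stated in the post:

```python
def per_core_ghz(speed_mips, clock_ghz, cores):
    """Million iterations per second, per GHz, per core (assumed metric)."""
    return speed_mips / (clock_ghz * cores)

# (name, clock in GHz, cores, FPU speed, SSE2 speed) as posted in the thread;
# the core counts are an addition, not part of the posts.
RESULTS = [
    ("Penryn T8100", 2.1, 2, 439.0, 1372.0),
    ("Q6600", 2.4, 4, 985.521, 3030.716),
]

def efficiency_table(results):
    """Return (name, FPU efficiency, SSE2 efficiency) for each result."""
    rows = []
    for name, ghz, cores, fpu, sse in results:
        rows.append((name,
                     per_core_ghz(fpu, ghz, cores),
                     per_core_ghz(sse, ghz, cores)))
    return rows

if __name__ == "__main__":
    for name, fpu_eff, sse_eff in efficiency_table(RESULTS):
        print(f"{name:14s} FPU {fpu_eff:6.1f}  SSE2 {sse_eff:6.1f}")
```

With this normalisation the two Core 2 based chips land close to each other (roughly 103 to 105 for FPU and 316 to 327 for SSE2), which fits the observation that this small loop scales with clock and core count.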
Alphonso



Joined: 16 Jan 2007
Posts: 295
Alphonso 22 Apr 2008, 17:18
rugxulo wrote:
Isn't the only difference like half the cpu cache of its big brother?
AFAIK the cache size is the same L2=256k except for the PIII-S which has 512k. The difference between the Celeron & normal PIII tualatin is that cache latency is supposedly one cycle slower in L2 and FSB is 100MHz vs 133MHz.
Kuemmel wrote:
...it doesn't seem to matter much if PIII or Celeron type
Ok, that's good, I just didn't want you chasing bogus info if it made a difference.
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 27 Apr 2008, 22:00
I forgot to update my quad results for your newest release:
Q6600 (that is 2.4GHz for the lazys)
FPU(53G) 985.521
SSE(53G) 3030.716

btw, it has a new BIOS too, the 975XBX2 board I mean Smile How much that helps, though...

Anyway, on the topic of throttling CPUs. It's all good there and shouldn't matter. (The following is very dependent on the CPU at hand.) It takes 10ns from C1 to C0. From C2 it's about 100ns. The longest known time to C0 is from C5, around 200µs. That's far less than start-up code would take. What I'm more concerned about is the mode-switching. It takes a full two-second period on some GFX/LCD combinations, and with SSE2 I can't even see a pixel of the Mandelbrot drawn on the screen before it's finished and returned to the desktop.

Does mode-switching interfere with the timings?
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20453
Location: In your JS exploiting you and your system
revolution 28 Apr 2008, 00:21
Madis731 wrote:
Anyway, on the topic of throttling CPUs. It's all good there and shouldn't matter. (The following is very dependent on the CPU at hand.) It takes 10ns from C1 to C0. From C2 it's about 100ns. The longest known time to C0 is from C5, around 200µs. That's far less than start-up code would take. What I'm more concerned about is the mode-switching. It takes a full two-second period on some GFX/LCD combinations, and with SSE2 I can't even see a pixel of the Mandelbrot drawn on the screen before it's finished and returned to the desktop.
Yes, it is quite fast for the hardware to switch power modes. But the main concern is whether the OS is fast enough to detect the need for a power change.
penang



Joined: 01 Oct 2004
Posts: 59
penang 29 Apr 2008, 12:12
Kuemmel wrote:
Hi people,

due to a lot of help here and spending various hours looking at lots of source code and finding fast routines, I coded a Mandelbrot fractal benchmark to see how good or bad recent CPUs are at double precision calculation. You can download the executable and source from:
http://www.mikusite.de/pages/x86.htm
It's the 'KMB 0.3' file.

What is interesting is that the FPU of the Pentium 4 is really bad, considering the clock speed, but on SSE2 it flies. On the other hand, normalised to the clock speed, the efficiency of AMD is the same. Another interesting finding is the good performance of the Pentium M, and that AMD hasn't changed anything on their FPU for years, when I look at the result of my old Athlon.

Does anybody have a PII or PIII or K6 to test?

Would be interesting. The FPU code should work starting from Pentium Pro level (I used a command not in the instruction set of earlier Pentiums). The SSE2 code of course only works on compatible CPUs. I didn't implement any SSE2 detection; that should be done in the future. The graphics display is written in DirectDraw and consumes less than 1 or 2% of the computation time.

Any comments or results welcome. It's my first ever x86 assembler coded stuff, so don't kick me too hard Wink


KMB_V0.53G-32b-MT_SSE2

Speed [Million Iterations / Second]: 594.315


KMB_V0.53G-32b-MT_FPU


Speed [Million Iterations / Second]: 303.145

CPU Pentium-D 920.
penang



Joined: 01 Oct 2004
Posts: 59
penang 29 Apr 2008, 12:19
Just posted the result on my old Pentium-D 920 machine.

I like fractals too. Wonder when you'd come out with a stand-alone version of the fractal program and let everybody play with it??
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 29 Apr 2008, 13:03
Okay, tested - had a "normal-priority" program run 100% in the background and then added benchmark to this. No measurable difference.

Maybe slightly higher: 503.084/1564.731 (+0.4%/+1.5%)
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 29 Apr 2008, 21:29
penang wrote:
Just posted the result on my old Pentium-D 920 machine.

I like fractals too. Wonder when you'd come out with a stand-alone version of the fractal program and let everybody play with it??

Hm, your result seems to be very slow...does your CPU run at 2.8 GHz? Any applications running on the side?

Regarding a stand-alone version, just use Quickman ->
http://sourceforge.net/projects/quickman/

It's almost as fast as my code and has quite a nice interface...
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 29 Apr 2008, 22:04
2.5GHz exactly!
penang



Joined: 01 Oct 2004
Posts: 59
penang 02 May 2008, 14:23
Kuemmel wrote:
penang wrote:
Just posted the result on my old Pentium-D 920 machine.

I like fractals too. Wonder when you'd come out with a stand-alone version of the fractal program and let everybody play with it??
Hm, your result seems to be very slow...does your CPU run at 2.8 GHz? Any applications running on the side?

Regarding a stand-alone version, just use Quickman ->
http://sourceforge.net/projects/quickman/

It's almost as fast as my code and has quite a nice interface...


Hmmm.... I was wondering that too. Nothing running on the side, except of course the TCP/IP stuff for being online, and Firefox.
penang



Joined: 01 Oct 2004
Posts: 59
penang 02 May 2008, 14:53
Kuemmel wrote:

Regarding a stand-alone version, just use Quickman ->
http://sourceforge.net/projects/quickman/

It's almost as fast as my code and quite nice interface...


It's many orders of magnitude slower than your code, man. Just tried it on my old Pentium-D machine; the best it can get is 1834M Iterations / second.

When you manage to get yours into the stand-alone style, please tell me. Wanna try yours out too !! :)

Thanks in advance !!
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 01 Jun 2008, 12:24
Finally, after more or less 20 test versions of another enhanced FPU version and finding the optimum of loop unrolling, I can release a new version called V0.53H (as usual on http://www.mikusite.de/pages/x86.htm).

The speed-up compared to the last version is up to 40 % for the FPU and up to 14 % for SSE2. Changes:

- FPU: optimized iteration code, 3 independent points iteration, 3 times loop unrolling.
- SSE2: just loop unrolling changed from 1 time to 3 times.
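
To illustrate what "3 independent points iteration" means, here is a minimal sketch that advances three Mandelbrot points in lockstep, so that three independent dependency chains are in flight at once. It is written in Python for readability and will not show the speed-up itself; that comes from keeping the FPU pipeline busy in the assembler version. The function names, the iteration limit of 256 and the escape radius of 2 are assumptions, not taken from the KMB source:

```python
def iterate_single(cr, ci, max_iter=256):
    """Plain escape-time loop for one point c = cr + i*ci."""
    zr = zi = 0.0
    for n in range(max_iter):
        zr2, zi2 = zr * zr, zi * zi
        if zr2 + zi2 > 4.0:          # |z| > 2, the point has escaped
            return n
        zi = 2.0 * zr * zi + ci
        zr = zr2 - zi2 + cr
    return max_iter

def iterate_three(c0, c1, c2, max_iter=256):
    """Advance three independent points together, one step each per pass."""
    cs = (c0, c1, c2)
    zr = [0.0, 0.0, 0.0]
    zi = [0.0, 0.0, 0.0]
    counts = [max_iter] * 3
    done = [False] * 3
    for n in range(max_iter):
        if all(done):
            break
        for k in range(3):           # the three chains do not depend on each other
            if done[k]:
                continue
            zr2, zi2 = zr[k] * zr[k], zi[k] * zi[k]
            if zr2 + zi2 > 4.0:
                counts[k] = n
                done[k] = True
                continue
            zi[k] = 2.0 * zr[k] * zi[k] + cs[k][1]
            zr[k] = zr2 - zi2 + cs[k][0]
    return counts

if __name__ == "__main__":
    print(iterate_three((0.0, 0.0), (2.0, 2.0), (-0.75, 0.1)))
```

Each point yields the same iteration count as the single-point loop; only the order in which the work is issued changes.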

For the SSE2 version the further loop unrolling wasn't liked by Pentium-M and Core1 CPUs, so I also included the former version and called it 'V0.53H_SSE2_PM'.

For the FPU version I have to revise my former statement that a loop over 3 independent points isn't good...after optimising the basic iteration code and going to further loop unrolling it proves to be better. It really flies on Core2Duo, just proving what Agner Fog says in his manuals: "The capacity of the execution ports and execution units is impressive."...and also showing that the AMD Phenom FPU isn't improved compared to the AMD 64/Athlon design; just SSE2 was enhanced.

I'll soon make some graphs about what has been achieved compared to my very first non-optimized single iteration loop. For Core2Duo I'm already at a plus of 120 to 150 %...it seems to have so much in reserve when the coding is done right, both on FPU and SSE2. I think it's really worth looking at all old floating point code and probably changing it. As one can see from the Pentium III results, the enhancements don't show much on that old architecture.
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 02 Jun 2008, 01:09
KMB_V0.53H-32b-MT_FPU.exe 783.839
KMB_V0.53H-32b-MT_SSE2.exe 1838.962
KMB_V0.53H-32b-MT_SSE2_PM.exe 1791.471
All on my Core 2 Duo E6700 2.66 GHz 1 socket 2 cores XP-x64.

Quite impressive that you got the FPU speed above what the SSE2 speed was when I first looked at your benchmark in late 2006. I know how hard it is to actually implement some of the ideas you have included in your latest efforts. Even harder given that you are working with fewer registers. The Core 2 Duo is indeed powerful but, as we know, it is quite challenging to get it operating anywhere near its full potential.

Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.