flat assembler
Message board for the users of flat assembler.
Mandelbrot Benchmark FPU/SSE2 released
f0dder 15 Apr 2008, 18:44
It takes a short while for them to power up though, so for benchmarks one should generally include some "burning rubber" warm-up code...
Ivan2k2 16 Apr 2008, 14:31
Penryn T8100, 2.1 GHz
FPU 439, SSE 1372
Kuemmel 16 Apr 2008, 16:42
Hm, is 1/2 second really enough with all these different technologies, SpeedStep, Cool'n'Quiet, and whatever? Isn't there a common way to tell the CPU that full speed is required...?
...how are these things controlled anyway, i.e. who tells the CPU that, for example, Excel is running at the moment but not being used? I guess the guys from the overclocking benchmarking faculty strip down their OS to a minimum of running tasks and turn any energy saving completely off...
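One common way to keep clock ramp-up out of a measurement is to run the workload untimed for a fixed period first and only then start the clock. A minimal sketch in Python, not the benchmark's actual code: `run_with_warmup`, `busy_work` and the 0.5 s default are illustrative assumptions, and whether half a second is enough depends on the CPU and the OS power settings.

```python
import time

def busy_work(n=20000):
    """Hypothetical stand-in workload: a small floating-point loop."""
    x = 0.0
    for i in range(n):
        x = x * 0.5 + i * 1e-6
    return x

def run_with_warmup(work, warmup_seconds=0.5, timed_runs=5):
    """Run `work` untimed for about `warmup_seconds`, then time it.

    Returns the smallest wall-clock time of `timed_runs` runs, in seconds.
    """
    # Warm-up phase: keep the CPU busy so frequency scaling can ramp up.
    deadline = time.perf_counter() + warmup_seconds
    while time.perf_counter() < deadline:
        work()

    # Timed phase: take the minimum to reduce noise from other tasks.
    best = float("inf")
    for _ in range(timed_runs):
        start = time.perf_counter()
        work()
        best = min(best, time.perf_counter() - start)
    return best

if __name__ == "__main__":
    print(f"best run: {run_with_warmup(busy_work) * 1e3:.3f} ms")
```

Taking the minimum of several timed runs is a design choice here: background tasks can only make a run slower, so the fastest run is usually the least disturbed one.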
revolution 16 Apr 2008, 16:56
Kuemmel wrote: Hm, is 1/2 second really enough with all these different technologies, SpeedStep, Cool'n'Quiet, and whatever? Isn't there a common way to tell the CPU that full speed is required...?
Kuemmel wrote: ...how are these things controlled anyway, i.e. who tells the CPU that, for example, Excel is running at the moment but not being used?
Alphonso 16 Apr 2008, 17:24
Kuemmel wrote:
rugxulo 21 Apr 2008, 17:49
revolution wrote:
You'd have to ask Dex, but I think rubber also means eraser.
Alphonso wrote: Well it is the cheaper Celeron type so I wouldn't expect the results to be as good as a 'real' PIII.
Sorry, I should have labeled it a 'Celeron (PIII variant)' instead of PIII. Isn't the only difference like half the CPU cache of its big brother?
Kuemmel 21 Apr 2008, 18:06
...it doesn't seem to matter much whether it's a PIII or a Celeron type, as you can see from another PIII test in my updated results list...that small code runs entirely in the 1st level cache anyway and doesn't care much about anything else.
...by the way, some findings on the results:
1.) Hyper-Threading doesn't really help when you compare the efficiencies shown...P4 just sucks on FPU and Hyper-Threading doesn't help at all...so if you have more or less well-optimized code, the future upgrade from Core 2 Duo to Nehalem won't do much good either if they don't improve Hyper-Threading...
2.) Before Core 2 Duo the FPU of AMD was really good; they didn't improve it for the Phenom, and the Phenom is also bad regarding clock rate and SSE2 efficiency compared to Core 2 Duo, but of course it may be a bit cheaper...
3.) Core 2 Duo took advantage of the better FPU of the Pentium M, and its SSE2 implementation is superior to everything else.
4.) Will AMD ever have a chance to get back to the level of Core 2 Duo? It seems they are way back in time, in the cheap CPU segment...sad...
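For anyone who wants to redo the efficiency comparison: the exact definition used in the results list isn't given in this thread, so the following is only a sketch under the assumption that efficiency means iterations per clock cycle per core. The function name `iterations_per_clock` and the `cores` parameter are made up for illustration; the input figures are the Penryn T8100 numbers posted earlier in this thread.

```python
def iterations_per_clock(million_iter_per_s, clock_mhz, cores=1):
    """Iterations per clock cycle per core.

    Million iterations/s divided by MHz: the factor of 1e6 cancels.
    """
    return million_iter_per_s / (clock_mhz * cores)

# Penryn T8100 at 2.1 GHz, figures as posted in this thread.
# Two cores is an assumption about how the multi-threaded run was spread.
fpu = iterations_per_clock(439, 2100, cores=2)
sse = iterations_per_clock(1372, 2100, cores=2)

print(f"FPU  {fpu:.3f} iterations/clock/core")
print(f"SSE2 {sse:.3f} iterations/clock/core")
print(f"SSE2/FPU ratio {sse / fpu:.2f}")
```

Normalising this way makes CPUs with different clock rates and core counts comparable, at the cost of hiding things like Hyper-Threading, which adds logical but not physical cores.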
Alphonso 22 Apr 2008, 17:18
rugxulo wrote: Isn't the only difference like half the cpu cache of its big brother?
Kuemmel wrote: ...it doesn't seem to matter much if PIII or Celeron type
Madis731 27 Apr 2008, 22:00
I forgot to update my quad results for your newest release:
Q6600 (that is 2.4 GHz for the lazy ones)
FPU(53G) 985.521
SSE(53G) 3030.716
By the way, it has a new BIOS too, the 975XBX2 board I mean. How much that helps, though...
Anyway, on the topic of throttling CPUs: it's all good there and shouldn't matter. (The following is very dependent on the CPU at hand.) It takes 10 ns to get from C1 to C0. From C2 it's about 100 ns. The longest known time to get back to C0 is from C5, around 200 µs. That's far less than any start-up code would take.
What I'm more concerned about is the mode switching. It takes a full two-second period on some GFX/LCD combinations, and with SSE2 I can't even see a pixel of the Mandelbrot drawn on the screen before it's finished and returned to the desktop. Does mode switching interfere with the timings?
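To put the wake-up figures above in proportion, here is the arithmetic as a small Python sketch. The latencies are the ones quoted in this post; the 0.5 s warm-up length is an assumption taken from the "1/2 second" mentioned earlier in the thread, and `share_percent` is just an illustrative helper name. Note that this only covers waking from idle states (C-states); clock frequency scaling is a separate mechanism and may ramp on a different schedule.

```python
def share_percent(latency_s, warmup_s):
    """Latency as a percentage of the warm-up period."""
    return latency_s / warmup_s * 100

# Exit latencies quoted above (CPU dependent), in seconds.
exit_latency = {"C1": 10e-9, "C2": 100e-9, "C5": 200e-6}

# Assumed warm-up length in seconds.
warmup = 0.5

for state, latency in exit_latency.items():
    share = share_percent(latency, warmup)
    print(f"{state} -> C0: {latency * 1e6:g} us, {share:.6f} % of the warm-up")
```

Even the slowest case, 200 µs out of 0.5 s, comes to 0.04 % of the warm-up period.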
revolution 28 Apr 2008, 00:21
Madis731 wrote: Anyway, on the topic of throttling CPUs: it's all good there and shouldn't matter. (The following is very dependent on the CPU at hand.) It takes 10 ns to get from C1 to C0. From C2 it's about 100 ns. The longest known time to get back to C0 is from C5, around 200 µs. That's far less than any start-up code would take. What I'm more concerned about is the mode switching. It takes a full two-second period on some GFX/LCD combinations, and with SSE2 I can't even see a pixel of the Mandelbrot drawn on the screen before it's finished and returned to the desktop.
penang 29 Apr 2008, 12:12
Kuemmel wrote: Hi people,
due to a lot of help here and spending various hours looking at lots of source code and finding fast routines, I coded a Mandelbrot fractal benchmark to see how good or bad recent CPUs are at double-precision calculation. You can download the executable and source from: http://www.mikusite.de/pages/x86.htm It's the 'KMB 0.3' file. What is interesting is that the FPU of the Pentium 4 is really bad relative to its clock speed, but on SSE2 it flies. On the other hand, normalised to the clock speed the efficiency of AMD is the same. Another interesting finding is the good performance of the Pentium M, and that AMD hasn't changed anything in their FPU for years, judging by the result of my old Athlon. Does anybody have a PII, PIII or K6 to test? Would be interesting. The FPU code should work starting from Pentium Pro level (I used a command not in the instruction set of earlier Pentiums). The SSE2 code of course only works on compatible CPUs. I didn't implement any SSE2 detection; that should be done in the future. The graphics display is written in DirectDraw and consumes less than 1 or 2% of the computation time. Any comments or results welcome. It's my first ever x86 assembler coded stuff, so don't kick me too hard ;)
KMB_V0.53G-32b-MT_SSE2 Speed [Million Iterations / Second]: 594.315
KMB_V0.53G-32b-MT_FPU Speed [Million Iterations / Second]: 303.145
CPU: Pentium-D 920.
penang 29 Apr 2008, 12:19
Just posted the result from my old Pentium-D 920 machine.
I like fractals too. I wonder when you'll come out with a stand-alone version of the fractal program and let everybody play with it?
Madis731 29 Apr 2008, 13:03
Okay, tested: I had a "normal-priority" program running at 100% in the background and then added the benchmark on top. No measurable difference.
Maybe slightly higher: 503.084/1564.731 (+0.4%/+1.5%)
Kuemmel 29 Apr 2008, 21:29
penang wrote: Just posted the result from my old Pentium-D 920 machine.
Hm, your result seems to be very slow...does your CPU run at 2.8 GHz? Any applications running on the side? Regarding a stand-alone version, just use Quickman -> http://sourceforge.net/projects/quickman/ It's almost as fast as my code and has quite a nice interface...
Madis731 29 Apr 2008, 22:04
2.5GHz exactly!
penang 02 May 2008, 14:23
Kuemmel wrote:
penang wrote: Just posted the result from my old Pentium-D 920 machine. I like fractals too. I wonder when you'll come out with a stand-alone version of the fractal program and let everybody play with it?
Hm, your result seems to be very slow...does your CPU run at 2.8 GHz? Any applications running on the side? Regarding a stand-alone version, just use Quickman -> http://sourceforge.net/projects/quickman/ It's almost as fast as my code and has quite a nice interface...
Hmmm... I was wondering that too. Nothing running on the side, except of course the TCP/IP stuff for being online, and Firefox.
penang 02 May 2008, 14:53
Kuemmel wrote: Regarding a stand-alone version, just use Quickman -> http://sourceforge.net/projects/quickman/ It's almost as fast as my code and has quite a nice interface...
It's many magnitudes slower than your code, man. Just tried it on my old Pentium-D machine; the best it can get is 1834M Iterations / second. When you manage to get yours into stand-alone style, please tell me. Wanna try yours out too!! :) Thanks in advance!!
Kuemmel 01 Jun 2008, 12:24
Finally, after more or less 20 test versions of another enhanced FPU routine and finding the optimum amount of loop unrolling, I can release a new version called V0.53H (as usual on http://www.mikusite.de/pages/x86.htm).
The speed-up is up to 40 % for the FPU compared to the last version and up to 14 % for SSE2.
Changes:
- FPU: optimized iteration code, 3 independent points iteration, 3 times loop unrolling.
- SSE2: just loop unrolling changed from 1 time to 3 times.
For the SSE2 version the further loop unrolling wasn't liked by Pentium-M and Core1 CPUs, so I also included the former version and called it 'V0.53H_SSE2_PM'. For the FPU version I have to revise my former statement that a 3 independent points loop isn't good...after optimising the basic iteration code and going to further loop unrolling it proves to be better. It really flies on Core2Duo, just proving what Agner Fog says in his manuals, "The capacity of the execution ports and execution units is impressive."...and also showing that the AMD Phenom's FPU isn't improved over the AMD 64/Athlon design; just SSE2 was enhanced.
I'll soon make some graphs of what has been achieved compared to my very first non-optimized single iteration loop. For Core2Duo I'm already at a plus of 120 to 150 %...it seems to have so much in reserve when the coding is done right, both on FPU and SSE2. I think it's really worth looking at all old floating point code and probably changing it. As one can see from the Pentium III results, the enhancements don't show much on that old architecture.
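For readers wondering what "3 independent points iteration" means structurally, here is a rough Python sketch. It is not the V0.53H code (that is x86 assembly, and its details aren't shown in this thread); `escape_single` and `escape_three` are hypothetical names, and Python won't show any speed-up. The point is only that the three updates inside one loop pass share no data, so a pipelined FPU can overlap them instead of waiting on one dependency chain.

```python
def escape_single(cr, ci, max_iter=256):
    """Reference: iterate one point z = z*z + c until |z|^2 > 4."""
    zr = zi = 0.0
    for n in range(max_iter):
        zr2, zi2 = zr * zr, zi * zi
        if zr2 + zi2 > 4.0:
            return n
        zr, zi = zr2 - zi2 + cr, 2.0 * zr * zi + ci
    return max_iter

def escape_three(c0, c1, c2, max_iter=256):
    """Iterate three independent points in lockstep.

    Returns a list with the escape iteration of each point
    (max_iter if the point never escaped).
    """
    cs = (c0, c1, c2)
    zr = [0.0, 0.0, 0.0]
    zi = [0.0, 0.0, 0.0]
    count = [max_iter, max_iter, max_iter]
    active = 3
    for n in range(max_iter):
        # One pass advances all three points. In assembly this inner
        # loop would be written out as straight-line, interleaved code.
        for k in range(3):
            if count[k] != max_iter:
                continue  # this point has already escaped
            zr2, zi2 = zr[k] * zr[k], zi[k] * zi[k]
            if zr2 + zi2 > 4.0:
                count[k] = n
                active -= 1
                continue
            zi[k] = 2.0 * zr[k] * zi[k] + cs[k][1]
            zr[k] = zr2 - zi2 + cs[k][0]
        if active == 0:
            break
    return count

points = [(0.0, 0.0), (2.0, 2.0), (-0.75, 0.1)]
print(escape_three(*points))
print([escape_single(cr, ci) for cr, ci in points])
```

Both print statements should give the same list, since the lockstep version performs exactly the same per-point arithmetic as the single-point reference.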
Xorpd! 02 Jun 2008, 01:09
KMB_V0.53H-32b-MT_FPU.exe 783.839
KMB_V0.53H-32b-MT_SSE2.exe 1838.962
KMB_V0.53H-32b-MT_SSE2_PM.exe 1791.471
All on my Core 2 Duo E6700 2.66 GHz 1 socket 2 cores XP-x64. Quite impressive that you got the FPU speed above what the SSE2 speed was when I first looked at your benchmark, late 2006. I know how hard it is to actually implement some of the ideas you have included in your latest efforts. Even harder given that you are working with fewer registers. The Core 2 Duo is indeed powerful but, as we know, it is quite challenging to get it operating anywhere near its full potential.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.