flat assembler
Message board for the users of flat assembler.
Windows > Mandelbrot Benchmark FPU/SSE2 released
Madis731 02 Jun 2008, 09:06
Liek WOW!
EDIT: Sorry, I lied: I first tested a T7200 and thought it was a T9300. Here are the corrected stats:

Code:
;T7200 / 64-bit 2003 Server / 1gig of RAM / integrated graphics
Kümmel Mandelbrot Benchmark V 0.53H-32b-MT_FPU
Speed [Million Iterations / Second] : 571,990
Kümmel Mandelbrot Benchmark V 0.53H-32b-MT_SSE2
Speed [Million Iterations / Second] : 1338,323
Kümmel Mandelbrot Benchmark V 0.53H-32b-MT_SSE2_PM
Speed [Million Iterations / Second] : 1304,760

FPU eff. like on your homepage: 142,998
SSE2 eff. 334,581
SSE2:FPU == 2,34:1

Now the real T9300:

Code:
;T9300 / 64-bit 2003 Server / 4gigs of RAM / integrated graphics
Kümmel Mandelbrot Benchmark V 0.53H-32b-MT_FPU
Speed [Million Iterations / Second] : 697,183
Kümmel Mandelbrot Benchmark V 0.53H-32b-MT_SSE2
Speed [Million Iterations / Second] : 1607,021
Kümmel Mandelbrot Benchmark V 0.53H-32b-MT_SSE2_PM
Speed [Million Iterations / Second] : 1549,200

FPU eff. like on your homepage: 139,437
SSE2 eff. 321,404
SSE2:FPU == 2,305:1

Last edited by Madis731 on 02 Jun 2008, 11:46; edited 2 times in total
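The "eff." figures above look like the raw speed divided by (cores x clock in GHz). A minimal sketch that recomputes them, assuming 2 cores at 2.0 GHz for the T7200 and 2 cores at 2.5 GHz for the T9300 (the clock speeds are not stated in the post, and the function names are made up for illustration):

```python
def efficiency(speed_mips, cores, clock_ghz):
    """Million iterations per second, per core, per GHz."""
    return speed_mips / (cores * clock_ghz)


def sse2_to_fpu_ratio(sse2_speed, fpu_speed):
    """How many times faster the SSE2 version runs than the FPU version."""
    return sse2_speed / fpu_speed


# (name, cores, clock in GHz [assumed], FPU speed, SSE2 speed)
results = [
    ("T7200", 2, 2.0, 571.990, 1338.323),
    ("T9300", 2, 2.5, 697.183, 1607.021),
]

for name, cores, ghz, fpu, sse2 in results:
    print(f"{name}: FPU eff. {efficiency(fpu, cores, ghz):.3f}, "
          f"SSE2 eff. {efficiency(sse2, cores, ghz):.3f}, "
          f"SSE2:FPU {sse2_to_fpu_ratio(sse2, fpu):.3f}:1")
```

Under those assumptions the computed values agree with the posted ones to within rounding, which suggests the per-core-per-GHz reading is the intended one.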
Ivan2k2 02 Jun 2008, 10:08
Penryn T8100, Vista 32-bit SP1
fpu - 609
sse2 - 1430
sse2pm - 1372
Kuemmel 02 Jun 2008, 18:23
...thanks for all the testing! All in line with my other results!

Here are some comparison graphs showing what has now been achieved with the latest evolution, KMB V0.53H (SSE2 'G' for Pentium M), compared to my very first non-optimized code (KMB V0.53), released almost 2 years earlier, with single instruction lines, no different exits and no loop unrolling:

FPU-Version:
My first verdict was that neither Intel nor AMD had improved their FPU and that all were on the same level except the Pentium 4...which was clearly wrong after seeing the latest results. Intel made a hell of an improvement with the Core2Duo once you find out what this CPU needs...different instruction lines and loop unrolling to make full use of the out-of-order architecture and those execution units. Apart from its 4 cores, the Phenom shows no improvement at all.

SSE2-Version:
Again the Core2Duo takes the lead. AMD couldn't keep up even with the same extension to 128-bit SSE2 bandwidth, but is still of course much better than the AMD 64 design. Strange that the Pentium M is even a little slower than in the FPU version.

I'm really keen on seeing results for the upcoming stuff like VIA Nano, Intel Core2Duo Nehalem (Hyperthreading) and, a long time later, Core2Duo Sandy with 256-bit SSE2 bandwidth...in the meantime I'm still searching for a result for a Pentium 4 with Hyperthreading to show its benefit again. I guess with non-optimized code Hyperthreading would help on the Core2Duo.
Kuemmel 01 Jul 2008, 22:51
A guy with a 16-core machine (4 quad-core CPUs) was testing my benchmark, have a look at:
http://forums.2cpu.com/showthread.php?t=76178&page=8

The problem is that even when I made my benchmark run 10 times longer, his cores are not utilized to their full extent. I think everything works fine with single-CPU quad cores, though. Any clues from you guys...some problem with the threading code here, or something to do with the 4-CPU machine? These problems were seen before on some dual-CPU machines...
rugxulo 07 Jul 2008, 09:16
Quote:
Holy crap, Batman! (And you're wondering why it isn't faster??)
revolution 07 Jul 2008, 10:03
Kuemmel wrote:
The problem is that even when I made my benchmark run 10 times longer, his cores are not utilized to their full extent. I think everything works fine with single-CPU quad cores, though.
LocoDelAssembly 07 Jul 2008, 20:41
Sorry for posting without checking, but does your program write to the video buffer directly? Since writes to video card memory must not be cached, it is possible that the reason is what revolution says; otherwise the cache memory of each core should help to prevent such memory bottlenecks.
Kuemmel 10 Jul 2008, 18:21
...thanks for the comments...no clue yet. I don't write directly to the video buffer, and in the past doing it or not didn't have any effect either...so I'm still waiting for some tests from the guy with the huge machine...until now, at least with a single quad core, there was no trouble at all and all cores were at 100% load...

...in the meantime I also got a result for the Intel Atom on my webpage...what a huge step back in CPU technology...okay, it wasn't meant to be very good, just to save power, but still...why go back to this in-order architecture...is that really the way to save power!???
revolution 11 Jul 2008, 00:41
Kuemmel wrote:
in-order-architecture...is that really the point to save power consumption !???
Kuemmel 11 Jul 2008, 17:09
...well, for me the Atom is really a joke...on the page linked below (sorry, it's German) one can see that an AMD 64 at 1 GHz is faster and consumes less power:
http://www.tomshardware.com/de/athlon-2000-Atom-230-Undervolting,testberichte-240084.html

The real star among energy-saving platforms is probably the Tegra platform (Nvidia + ARM11), if only somebody would make a small notebook with it...sorry for being off topic:
http://sg.nvidia.com/page/handheld.html
bitRAKE 15 Jul 2008, 05:57
Might I suggest another project of a similar nature?
The Barnes-Hut N-body algorithm would be an interesting challenge/benchmark.
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Kuemmel 15 Jul 2008, 21:55
bitRAKE wrote:
Might I suggest another project of a similar nature?

...why not...I googled around a bit and found some stuff about gravitational attraction between stars and the like regarding that n-body thing...do you know any good C-code implementation with visualisation to start with!?

...and yes, I'm still thinking about what to code next. I also found some nice stuff like singularities:
http://www.imaginary2008.de/surfer.php
(just a small formula describes these surfaces)
...though I wonder if it's visualized by a raytracing algorithm. Has anybody ever done a simple raytracer in ASM?
f0dder 16 Jul 2008, 00:15
So they undervolt the AMD64 but keep the ATOM running at stock voltage? Perhaps the ATOM could run undervolted as well? And what about long-term stability? Interesting test anyway, too bad it's in German (WHEN will people learn to only publish in English?).

I still think the idea behind the ATOM is OK, and considering it's basically first-gen, it's not too bad. With a 2nd-gen ATOM (we'll see...) and a more optimized chipset, it doesn't seem like a bad idea to me. Also, in-order CPUs are easier to hand-optimize for than OOO.
bitRAKE 16 Jul 2008, 18:08
Kuemmel wrote:
http://www.amara.com/papers/nbody.html
adnimo 31 Jul 2008, 14:00
Did you guys benchmark on a P2? I have one gathering dust at home, a 333 MHz Pentium II. I could set it up if there's still a need for it.
bitRAKE 02 Aug 2008, 09:07
2x L5410 (8 cores)
4190.117 - SSE2
2312.324 - FPU
(running on Vista x64)

Should add a column in the stats for power efficiency (Million Iterations / Watts TDP).
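A rough sketch of what such a column could contain, taking the speeds from this post and assuming a TDP of 50 W per L5410 (the TDP is not given in the thread and should be checked against Intel's spec sheet; the function name is hypothetical):

```python
def iterations_per_watt(speed_mips, tdp_watts_per_cpu, cpu_count=1):
    """Million iterations per second, per watt of total TDP."""
    return speed_mips / (tdp_watts_per_cpu * cpu_count)


ASSUMED_L5410_TDP = 50.0  # watts per CPU; an assumption, not from the thread

for label, speed in (("SSE2", 4190.117), ("FPU", 2312.324)):
    value = iterations_per_watt(speed, ASSUMED_L5410_TDP, cpu_count=2)
    print(f"2x L5410 {label}: {value:.2f} Million Iterations / s / Watt TDP")
```

TDP is only an upper bound on sustained power draw, so a column like this would rank CPUs by rated rather than measured efficiency.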
Kuemmel 03 Aug 2008, 17:10
adnimo wrote:
Did you guys benchmark on a P2? I have one dusting at home it's a 333mhz Pentium II, I could set it up if there's still a need for it.

Hi Adnimo, if you've got time, why not...I just hope you can get WinXP running on it, if that's possible at all? ...because I got one report of the bench failing to run on a Pentium II with Win98...
Kuemmel 03 Aug 2008, 19:56
bitRAKE wrote:
2x L5410 (8 cores)

...regarding the efficiency per core per MHz, this machine seems to have the same problem as the one from the www.2cpu.com forum that I reported here before. Do you have any idea why its full potential (100% load) isn't used on these 2-CPU machines!?
bitRAKE 04 Aug 2008, 04:20
Currently, I'm using the onboard video, and doubt that has any effect on the results. My guess would be memory contention on the thread data between the two CPUs. This could easily be tested by having threads select a data area based on which CPU is being used - cacheline aligned and all that goodness. Eh, I'm lazy though, so maybe just make 16 copies of the work you've already done - you should see a change. Dirty cachelines going across the bus have to slow things down.
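To illustrate the layout idea only (this is not the benchmark's code), here is a small sketch of the address arithmetic for giving each thread its own cache-line-aligned data area. It assumes a 64-byte cache line and a 24-byte per-thread block; both numbers and all function names are made up for the example:

```python
CACHE_LINE = 64  # bytes; assumed cache line size


def align_up(value, alignment):
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def thread_slot_offsets(base_addr, n_threads, data_size, line=CACHE_LINE):
    """Start addresses of per-thread areas, each beginning on its own line."""
    first = align_up(base_addr, line)
    stride = align_up(data_size, line)  # pad every area to whole cache lines
    return [first + i * stride for i in range(n_threads)]


def lines_touched(addr, size, line=CACHE_LINE):
    """Set of cache line indices covered by the byte range [addr, addr+size)."""
    return set(range(addr // line, (addr + size - 1) // line + 1))


def shares_any_line(addrs, size, line=CACHE_LINE):
    """True if two areas of the given size fall into the same cache line."""
    seen = set()
    for addr in addrs:
        lines = lines_touched(addr, size, line)
        if seen & lines:
            return True
        seen |= lines
    return False


# Four threads, 24 bytes of data each, starting at an unaligned base address.
packed = [0x1008 + i * 24 for i in range(4)]   # areas laid out back to back
padded = thread_slot_offsets(0x1008, 4, 24)    # one cache line per thread

print("packed areas share a line:", shares_any_line(packed, 24))
print("padded areas share a line:", shares_any_line(padded, 24))
```

With the packed layout, writes from different threads dirty the same cache line, which then has to bounce between cores or sockets. The padded layout trades a little memory for keeping each thread's writes on lines nobody else touches.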