flat assembler
Message board for the users of flat assembler.

Mandelbrot Benchmark FPU/SSE2 released

Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 17 Apr 2006, 19:05
Hi people,

Thanks to a lot of help here, and after spending many hours looking through lots of source code for fast routines, I coded a Mandelbrot fractal benchmark to see how good or bad recent CPUs are at double-precision calculation. You can download the executable and source from:
http://www.mikusite.de/pages/x86.htm
It's the 'KMB 0.3' file.

What is interesting is that the FPU of the Pentium 4 is really bad relative to its clock speed, but with SSE2 it flies. On the other hand, normalised to clock speed, AMD's SSE2 efficiency is the same. Another interesting finding is the good performance of the Pentium M, and that AMD hasn't changed anything in their FPU for years, judging by the result of my old Athlon.

Does anybody have a PII, PIII or K6 to test?

That would be interesting. The FPU code should work from the Pentium Pro onwards (I used an instruction that is not in the instruction set of earlier Pentiums). The SSE2 code of course only runs on compatible CPUs. I didn't implement any SSE2 detection; that should be done in the future. The graphics display is written in DirectDraw and consumes less than 1 or 2% of the computation time.

Any comments or results are welcome. It's my first ever x86 assembler code, so don't kick me too hard Wink
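
For readers who have not seen the inner loop that a benchmark like this times, here is a minimal Python sketch of the standard escape-time iteration. It is not the KMB code: the function name, the iteration limit of 256 and the escape test against |z| > 2 are assumptions chosen for illustration.

```python
def mandel_iterations(cr, ci, max_iter=256):
    """Count escape-time iterations of z = z*z + c for c = cr + ci*i.

    Hypothetical helper for illustration; max_iter is an assumed limit.
    """
    zr = zi = 0.0
    for n in range(max_iter):
        zr2, zi2 = zr * zr, zi * zi
        if zr2 + zi2 > 4.0:          # |z| > 2: the point has escaped
            return n
        zi = 2.0 * zr * zi + ci      # imaginary part of z*z + c (uses old zr)
        zr = zr2 - zi2 + cr          # real part of z*z + c
    return max_iter                  # never escaped within the limit


print(mandel_iterations(0.0, 0.0))   # inside the set: hits the limit
print(mandel_iterations(2.0, 2.0))   # far outside: escapes almost at once
```

Each pass through the loop costs a handful of multiplies and adds in double precision, which is why the result depends almost entirely on the floating-point units and not on memory.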
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid 17 Apr 2006, 19:34
notebook with AMD Turion 64

FPU - 99.941, then 2 times 102.something
SSE - 139.518, then 140

good?

ps: good start Wink



Kuemmel 17 Apr 2006, 20:04
Thanks, vid!

It seems you have a 1.6 GHz Turion!? Normally the result scales exactly with clock speed, so I expect the same values for the Turion and my Sempron when normalized to clock speed, as you can see from the table on the web page. This benchmark is totally independent of main memory, cache size or 64-bit support. Ah yes, I forgot to say that it's a good idea to run it 3 times or so and take the best value... and, especially on notebooks, to switch off all the power-saving stuff.


vid 17 Apr 2006, 20:50
Yes, your calculation is precise, it's 1.6 GHz.
madmatt



Joined: 07 Oct 2003
Posts: 1045
Location: Michigan, USA
madmatt 17 Apr 2006, 22:02
Hi Kuemmel,
Here are my results on a 2.7ghz Celeron Processor (Desktop Computer):
FPU -> 95.314
SSE2 -> 223.737

I have a question: are you including video writes in your timing? Do you have a version that plots in system memory? This may improve speed on some systems.
r22



Joined: 27 Dec 2004
Posts: 805
r22 18 Apr 2006, 04:46
AMD x2 3800+ 2ghz per core, 1gb ddr lowend ram
132,005 FPU
179,135 SSE

I'm running Windows XP 64-bit, so I think my results are skewed a little because the app is running under WOW64 for 32-bit compatibility.



Kuemmel 18 Apr 2006, 05:47
Thanks guys, I'll include your results in my table soon!

@madmatt: Yes, I include video times. But on my Sempron 3100+ I found that it takes less than 1% of the computation time, so I think it's okay to include it. Of course, for a truly fair comparison it would be better to take it out, but I'd rather not make the benchmark, let's say, 'too synthetic'; I want it closer to the real world. But thanks for the hint about the memory. At the moment I plot directly to the screen, as far as I understand DirectDraw.
cod3b453



Joined: 25 Aug 2004
Posts: 618
cod3b453 18 Apr 2006, 07:10
I keep getting a memory access violation Sad

...
004011BB 8B00 mov eax, dword ptr [eax] ; from cominvk ???
...
chris



Joined: 05 Jan 2006
Posts: 62
Location: China->US->China->?
chris 18 Apr 2006, 11:13
My result:
FPU 102.89
SSE2 134.817
on Pentium M 1.6GHz

The strange thing is that the precompiled program runs OK, but if I compile it on my machine I also get an access violation in MessageBoxA. It looks like stack corruption, since EIP is loaded with some value from the stack.

Code:
...
    invoke  CloseWindow,[mainhwnd]
    ;int3
    invoke  MessageBox,NULL,text_result,caption,[flags] ; <-- 0xc0000005
    invoke  ExitProcess, [msg.wParam]
...
    
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 18 Apr 2006, 13:52
AMD Athlon64 3200+ (socket 939, clock 2.0 GHz)
FPU 129.594
SSE 176.149

My SSE performance is worse than a Celeron's Sad
Vasilev Vjacheslav



Joined: 11 Aug 2004
Posts: 392
Vasilev Vjacheslav 18 Apr 2006, 14:43
Intel Northwood (socket 478, clock 2.3 GHz)

fpu: 92.141
sse: 220.615

ps. approx. the same as madmatt has



r22 18 Apr 2006, 17:24
Interesting results: AMD processors seem to have better FPU but worse SSE performance. I wonder if this holds true in 64-bit as well, or if it's just a side effect of 32-bit computing on a 64-bit processor?



Kuemmel 18 Apr 2006, 17:53
@r22: Basically you are right. On the other hand, it looks like both have the same SSE2 unit, and Intel can just clock their processors higher... for 64-bit I don't know... I don't expect any change. Only for the Intel Conroe CPU, coming in the 3rd quarter of this year, has Intel promised almost double the SSE2 performance at the same clock speed as a Pentium M, on whose architecture it's based. Early benchmarks seem to prove this. There are even rumors that the whole FPU will be kicked out in some years and everything switched completely to SSE. The benefit is obvious when you program it... the FPU has this stupid 8-register stack and SSE has 8 'real' registers. I hope to manage to code a multi-threaded version of the benchmark, so that dual-core processors can really reach double the speed.

Regarding the compiler errors and other errors... I don't know yet. My include directory is a mess of many different include files from when I started with flat assembler, from older versions, and from other users here. I will try compiling it with a 'fresh' latest FASM installation... maybe I will come to some conclusion about it... or has anybody already managed to compile it on a 'fresh' FASM installation?
vbVeryBeginner



Joined: 15 Aug 2004
Posts: 884
Location: \\world\asia\malaysia
vbVeryBeginner 18 Apr 2006, 18:22
My result, on a P4 2.66:
fpu: 94+-
sse2: 195+-
UCM



Joined: 25 Feb 2005
Posts: 285
Location: Canada
UCM 18 Apr 2006, 20:48
hmm, odd...
on my Athlon 64 X2 4200+ 2.2ghz:
Test 1
FPU 142.006
SSE2 195.621

Test 2
FPU 142.964
SSE2 195.621



Kuemmel 18 Apr 2006, 22:29
I've put all your new results on my webpage; have a look if you like! The list is ordered by maximum SSE2 performance. But for the FPU you can easily see that the Athlons take the crown. All of your results are more or less as expected, except:

@Vasilev Vjacheslav: Are you sure your CPU runs at 2.3 GHz? If I'm not wrong, it should be more like 2.5 GHz, or else they did some nice enhancements for the Northwood P4.

@vbVeryBeginner: Your results seem too slow... are you sure you aren't running other applications, or in power-save mode or something?



vbVeryBeginner 18 Apr 2006, 23:21
@Kuemmel
Is the DirectX version an issue?
dxdiag shows 8.1 (4.08.01.0810), and I am using onboard VGA (VIA/S3G UniChrome IGP).
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 19 Apr 2006, 08:39
http://enos.itcollege.ee/~mkalme/PAHN/Up/cpu-2992.png
FPU: 87.704
SSE2: 212.521

The Pentium IIIs were the best CPUs ever - I've still got one at home - and I'm not planning on letting go Smile
http://enos.itcollege.ee/~mkalme/PAHN/Up/cpu-697.png - too bad this hasn't got an SSE2/3 instruction set, but at least it has SSE. I will test for FPU on it when I get home...

I'm a bit worried about the switch between FPU and SSE in your SSE-code:
Code:
         fstp  qword[rz_temp_fpu]
;Here it must wait for the FPU to finish what it's doing
         xorpd     xmm2,xmm2               ; xmm2:    0    |   0       (zeros xmm2) 
    

Code:
       shufpd    xmm5,xmm5,0               ; xmm5:    iz   |  iz

       MOV [plot_x],0
       x_loop:
;Here it's the other way round - although SSE is MUCH faster, so I'm not too worried about this...
         fld   qword[dz] 
    


Is there a possibility to make all the code in SSE?
My second suggestion: since Pentium 4s are so bad with branch misprediction penalties and lots of other stuff Very Happy, you can make minor replacements such as swapping all INCs/DECs for ADD ...,1 / SUB ...,1 Wink

Maybe then (hopefully) Pentium 4s won't lose *that* much Twisted Evil



Kuemmel 19 Apr 2006, 17:11
Hi guys,

I uploaded an INCLUDE directory to my webpage, so for those who want to recompile it, this should help. Get it from the same page as the benchmark. It works with the latest FASM version. But I haven't had time to check what's really wrong or different. It didn't work with a normal new installation of the latest FASM.

@vbVeryBeginner: Hm, don't know, I also run it on a very cheap shared memory graphics card on my notebook with an AMD Sempron. Here the results are consistent...any virus scanner, whatever resource-stealing-software running ?

@Madis731:
I'll stay tuned for your PIII result! Your P4 result is also strange (check my website, I included it). It seems too slow compared to the others, and the P4 that has the lead has a really shitty graphics card!

About adapting it to SSE for the PIII: the problem is that SSE has only single precision; SSE2 introduced double precision. So theoretically I could even calculate 4 pixels at a time rather than 2 as now (SSE registers are always 128 bits, so either 4 packed single-precision values, 4x32 bits, or 2 packed doubles, 2x64 bits)... it's just that when you go very 'deep' into a Mandelbrot fractal, single precision isn't precise enough.
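
The precision limit can be shown with a small Python sketch. It rounds pixel coordinates to single precision and counts how many distinct values survive. The centre -0.75 and the pixel step 1e-9 are made-up values for a deep zoom, and both helper names are hypothetical. Note that it only tests whether the coordinates are representable; it does not emulate SSE arithmetic.

```python
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def distinct_columns(x0, step, count, single):
    """Count distinct pixel x-coordinates x0 + k*step for k in range(count)."""
    xs = [x0 + k * step for k in range(count)]
    if single:
        xs = [to_f32(x) for x in xs]   # what a 32-bit float could store
    return len(set(xs))


x0, step = -0.75, 1e-9                 # assumed deep-zoom centre and pixel width
print(distinct_columns(x0, step, 8, single=False))  # 8: doubles keep every column apart
print(distinct_columns(x0, step, 8, single=True))   # 1: all columns collapse to -0.75
```

Near 0.75 adjacent single-precision values are about 6e-8 apart, so once the pixel width drops well below that, neighbouring pixels get the same starting coordinate and the picture turns blocky.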

You are right about the mix of FPU and SSE2. This could be optimized to SSE2 only, but at the moment it's not really an issue, as the routines spend up to 2400 times more time in the SSE2-only or FPU-only iteration loop than in the part of the code you mentioned. But I'll keep it in mind for the next release.
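
To put the "up to 2400 times" figure in perspective, here is a rough Python estimate of the most that could be gained by making the setup code free. It assumes the ratio applies to the whole run; the function name is hypothetical.

```python
def max_speedup_from_removing_setup(loop_to_setup_ratio):
    """Upper bound on the overall speedup if the setup part took zero time."""
    total = loop_to_setup_ratio + 1.0   # loop time + setup time, in setup units
    return total / loop_to_setup_ratio  # old total time / new total time


gain = max_speedup_from_removing_setup(2400)
print(f"{(gain - 1) * 100:.3f}% faster at best")   # 0.042% faster at best
```

A gain of that size is far below the run-to-run variation reported in this thread, so it would not show up in the benchmark results.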

I'm still worried about some P4 results... there seems to be a strange variation even though I thought the CPU core is really the same, and the graphics normally shouldn't matter, as you can see on the AMD systems...



Madis731 19 Apr 2006, 20:47
Hi, I'm back with my PIII results:
the exact specs are the ones that I already linked http://enos.itcollege.ee/~mkalme/PAHN/Up/cpu-697.png
Code:
FPU: 43.731
;When you calculate the FPU speed per clock, it's 62473 i/MHz
;which is almost as good as other CPUs in that category
SSE2: crashed of course Wink
    

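The per-MHz figure above can be reproduced with a few lines of Python. This is a sketch under two assumptions that the thread does not state outright: that the benchmark score is in millions of iterations per second, and that the PIII runs at a nominal 700 MHz. The helper name is made up for illustration; the other scores and clock speeds are taken from earlier posts in this thread.

```python
def iterations_per_mhz(score_millions_per_s, clock_mhz):
    """Normalise a benchmark score to clock speed (iterations per second per MHz)."""
    return score_millions_per_s * 1e6 / clock_mhz


# (label, FPU score, clock in MHz); the 700 MHz for the PIII is an assumption
fpu_results = [
    ("Pentium III", 43.731, 700),
    ("Turion 64", 99.941, 1600),
    ("Celeron (P4 class)", 95.314, 2700),
]

for label, score, mhz in fpu_results:
    print(f"{label}: {iterations_per_mhz(score, mhz):.0f} i/MHz")

print(round(iterations_per_mhz(43.731, 700)))   # 62473
```

Normalised this way, the PIII and the Turion come out almost identical per clock, while the P4-class Celeron is far behind on the FPU, which matches the observation in the first post.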

Last edited by Madis731 on 20 Apr 2006, 12:21; edited 1 time in total
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.