flat assembler
Message board for the users of flat assembler.

Mandelbrot Benchmark FPU/SSE2 released

Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 07 Jun 2006, 17:44
Sorry for the disturbance, Madis. I didn't really think about your old post, and now that I've read Intel's specs on Hyper-Threading it's clearer.

I just wanted proof that your single CPU with 2 virtual threads doesn't perform any better when you run the 4-thread version of the benchmark, or one with even more threads. Some reviews of Hyper-Threading don't make it clear whether it only has 2 virtual CPUs or could handle more threads internally and benefit from that. Maybe the pipeline can make use of more than 2 threads even on a single HT core... I don't know... according to Intel it seems not... there's just no proof. Sorry, maybe it's not needed.

I already knew about the 4 possible threads of the Prescott DC EE; that was actually why I went for a 4-thread version, and also to make use of, or test, dual-CPU boards with dual cores.
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 07 Jun 2006, 17:51
Quote:

but as there have been some tests by Intel - some applications can perform well with 150-160 threads

Those can hardly be compute-intensive threads, though?
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 07 Jun 2006, 18:49
OH DEAR! Please, do not hold that thought - I lied - that was a BIG lie :D

http://www.ddj.com/dept/64bit/184417069;jsessionid=3ZGTBVOLHTNZWQSNDBCSKHSCJUMEKJVN

This is the article - I mixed up the milliseconds and the threads, and I hadn't found out how many processors they used. Actually there were 4 sockets that each had 2 cores with HT - the system saw it as 16xCPU :P

Forget everything about the hundreds of threads :S - they got it down to hundreds of milliseconds instead.

I'm sorry!
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 07 Jun 2006, 19:57
Madis731 wrote:
http://www.ddj.com/dept/64bit/184417069;jsessionid=3ZGTBVOLHTNZWQSNDBCSKHSCJUMEKJVN

...interesting article though!!! They did almost the same as we did here, except they had a 4-CPU machine * 2 cores each * 2 due to HT = 16 ... so they created 16 threads, each calculating a line of the fractal :) I just wonder how slow the code ended up being, especially without SSE2 and assembler ;) Their speed gain was a factor of 7.26. I wonder how the KMB would perform on this machine... I don't expect anybody here to have a machine like this ;)

EDIT: In a quick hack I set up a 16-thread version and sent an e-mail to the authors, two guys from Intel... let's see if we get a result... I don't expect it, but if you don't try you can't win anyway ;)


Last edited by Kuemmel on 07 Jun 2006, 20:31; edited 1 time in total
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 07 Jun 2006, 20:03
It would be nice to test on such a machine :)

The fractal code *is* a pretty optimal (which is a bit rare) example of parallelism... there aren't any resources you have to wait on (the inter-thread "communication" is handled by a single LOCK XADD), it doesn't access memory very much, etc.
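
As a rough illustration of the LOCK XADD scheme described above: each worker can claim the next image line from a shared counter with one atomic fetch-and-add, which compilers for x86 typically implement as LOCK XADD. This is a minimal C11 sketch, not the benchmark's actual code; the names line_queue, line_queue_init and line_queue_claim are hypothetical.

```c
#include <stdatomic.h>

/* Shared work counter: the index of the next image line nobody has claimed yet.
   (Hypothetical structure, for illustration only.) */
typedef struct {
    atomic_int next_line;
    int height;              /* total number of lines in the image */
} line_queue;

void line_queue_init(line_queue *q, int height)
{
    atomic_init(&q->next_line, 0);
    q->height = height;
}

/* Claim one line for the calling thread.
   atomic_fetch_add returns the value held before the increment, so two
   threads can never receive the same line. Returns -1 once all lines
   have been handed out. */
int line_queue_claim(line_queue *q)
{
    int line = atomic_fetch_add(&q->next_line, 1);
    return line < q->height ? line : -1;
}
```

Each worker thread would loop on line_queue_claim() and render the returned line until it gets -1. Since the counter is the only shared state, the threads never block on each other, which fits the good scaling reported in this thread.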
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 15 Jun 2006, 16:08
Some new results published, including some dual-CPU board systems and dual-CPU boards with dual-core chips... scaling is still really good:
http://www.mikusite.de/pages/x86.htm
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 15 Jun 2006, 21:41
*drool* :)
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 11 Jul 2006, 16:15
New version, 0.53 MT released
http://www.mikusite.de/pages/x86.htm
Major bug fix for systems with more than 4 cores: the thread CPU affinity bit mask was being set incorrectly. Lots of new results. Scaling is still great even for 8 cores... and I think it would scale the same way up to 64 cores, which would be the maximum for WinXP ;)
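
The post doesn't say what exactly was wrong with the mask, so the following is only a sketch of how such a mask is usually built. The Win32 call SetThreadAffinityMask takes a bit mask in which bit n stands for logical processor n, so pinning a worker to one processor means passing a single set bit rather than the processor index itself. The helper name affinity_mask_for is hypothetical, and cpu_count is assumed to be between 1 and 64.

```c
#include <stdint.h>

/* Build an affinity mask that selects exactly one logical processor for the
   worker with the given index. Workers beyond cpu_count wrap around.
   Assumes 1 <= cpu_count <= 64 (hypothetical helper, not from KMB). */
uint64_t affinity_mask_for(unsigned thread_index, unsigned cpu_count)
{
    unsigned cpu = thread_index % cpu_count;
    /* Shift a 64-bit one: a plain `1 << cpu` is an int shift and is
       not valid for cpu >= 31. */
    return (uint64_t)1 << cpu;
}
```

For example, worker 5 on an 8-processor machine gets mask 0x20 (bit 5), not the value 5, which would select processors 0 and 2.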
sylwek32



Joined: 27 Apr 2006
Posts: 339
sylwek32 31 Oct 2006, 21:41
P4 2.4 GHz FPU 84.998
P4 2.4 GHz SSE2 219.483 :O

Good?
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 01 Nov 2006, 16:56
sylwek32 wrote:
P4 2.4 GHz FPU 84.998
P4 2.4 GHz SSE2 219.483 :O

Good?


Compared to the top positions in the table on my site, of course not ;) ...but it's a standard result for a single-core P4 without Hyper-Threading enabled.
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 21 Dec 2006, 10:11
I have made some minor modifications to this benchmark. I am not sure whether they are correct or not.
madmatt



Joined: 07 Oct 2003
Posts: 1045
Location: Michigan, USA
madmatt 22 Dec 2006, 01:11
Hello Kuemmel. Just when you thought things couldn't get more interesting, we now have dual- and quad-core processors, SSE3, (SSE4?) :o Keep up the good work making this code work on all these different processors :D Anyway, I had to buy a new computer after my power supply blew and all the parts in my computer got fried :'( But luckily I found the money for a new computer, and now I'm back in business again 8) Here are the results from my new system.

CPU: Intel Celeron D 3.2 GHz, 512 MB DDR SDRAM

FPU: 101.087
SSE2: 235.109
SSE3: 235.287
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
asmfan 22 Dec 2006, 10:25
Extremely BAD source!!!
So many errors and mistakes
The critical ones:
Code:
proc WindowProc  hwnd, wmsg, wparam, lparam
        mov             eax, [wmsg]

        .if eax, e, WM_CLOSE
                invoke  DestroyWindow,[hwnd]
                ret     0
        .elseif eax, e, WM_DESTROY
                invoke  PostQuitMessage, 0
                ret     0
        .elseif eax, e, WM_PAINT
                invoke  BeginPaint, [hwnd], ps
                invoke  EndPaint, [hwnd], ps
                ret             0
        .endif

        invoke  DefWindowProc, [hwnd], [wmsg], [wparam], [lparam]
        ret
endp    

Stack corruption: the procedure is STDCALL, so the handlers should not use ret 0 but
Code:
xor eax,eax
ret    

I don't see the message loop anywhere...
The call
Code:
invoke  CloseWindow, [mainhwnd]

at the end just minimizes the window and does not destroy it...
The logic is terrible... creating threads every time through the loop is terrible overhead.

_________________
Any offers?
AsmER



Joined: 25 Mar 2006
Posts: 64
Location: England
AsmER 22 Dec 2006, 13:26
Here is mine:

AMD Athlon 64 3500+ 2.19GHz (1GB RAM)

FPU: 145.429 [million iterations/sec]
SSE2: 197.678 [million iterations/sec]

Well, I didn't expect more from my old laptop :roll:


PS
asmfan's right

_________________
;\\ http://theasmer.spaces.live.com \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 23 Dec 2006, 12:34
asmfan wrote:
Extremely BAD source!!!
So many errors and mistakes


I agree the code looks a little rough and ready. Perhaps the message loop got deleted at some point during development because invoking the non-default cases in the WindowProc could cause a crash. There probably was one to start with, because there is an MSG structure in the code.

The creation of threads in the loop was not nearly the biggest performance problem with the code. The biggest performance problem was pointed out at this page. I have addressed that problem, done some load balancing, and obtained a significant performance increase. It's still a little slow, though; perhaps the biggest problem left is what happens if an exceptional value is generated, as seems quite likely. Might this explain most of the remaining performance deficit?
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 23 Dec 2006, 16:51
The threading stuff was retrofitted by me, and I opted for the "path of least resistance" :)
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 24 Dec 2006, 09:53
My concern that exceptional values might slow the calculation down seems to have been unfounded. I tried guarding against exceptional values in the single instruction stream version and only got a 3% improvement, so I didn't implement this in the dual and triple instruction stream versions. I did get some improvement by changing some movapd instructions to load/store sequences, and by unrolling the inner loop by 4. The effect of the last few revisions has been to move my machine up about 13 places on the chart, and since beating the next computer would require new hardware, I think I'm done for the year. Enjoy the holidays if you're fortunate enough to have holidays.
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 28 Dec 2006, 05:46
asmfan wrote:
Extremely BAD source!!!
So many errors and mistakes
...
create threads every time in loop - terrible overhead


Hi asmfan... sorry, this was my first try at FASM, and at x86 assembler at all, and in the end it was the work of 3 different people...

...are you sure about the overhead being that large? Xorpd! didn't find much of a gain... could you point out the errors, and how and where to fix the code? I've still got a lot to learn!

In the meantime I found a quite fast fractal generator:
http://sourceforge.net/projects/quickman/

I think its SSE2 version has almost double the performance of mine on one core. I had a talk with the author: he calculates 4 pixels at a time (in two separate SSE registers, 2 pixels per register) by interleaving the instructions for both registers in one loop, plus some other minor stuff.
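
The interleaving idea can be sketched in plain C. This is not quickman's or KMB's code: it is a scalar stand-in, assuming the usual escape test |z|^2 <= 4, with two independent points advanced in the same loop so that their dependency chains can overlap in the pipeline. An SSE2 version would hold 2 pixels per register and interleave two such registers; the function name mandel_pair is hypothetical.

```c
/* Iterate two Mandelbrot points in one loop (scalar sketch of interleaving).
   cr, ci : real and imaginary parts of the two points c0 and c1
   iters  : receives the iteration count per point (max_iter if it never escapes) */
void mandel_pair(const double cr[2], const double ci[2], int max_iter, int iters[2])
{
    double x0 = 0.0, y0 = 0.0;   /* z for point 0 */
    double x1 = 0.0, y1 = 0.0;   /* z for point 1 */
    int n0 = 0, n1 = 0;
    int live0 = 1, live1 = 1;    /* cleared once a point has escaped */

    for (int n = 0; n < max_iter && (live0 || live1); n++) {
        /* The two streams share no data, so these multiplies are independent. */
        double xx0 = x0 * x0, yy0 = y0 * y0, xy0 = x0 * y0;
        double xx1 = x1 * x1, yy1 = y1 * y1, xy1 = x1 * y1;

        live0 = live0 && (xx0 + yy0 <= 4.0);
        live1 = live1 && (xx1 + yy1 <= 4.0);

        /* z = z^2 + c, applied only to points that are still inside. */
        if (live0) { x0 = xx0 - yy0 + cr[0]; y0 = 2.0 * xy0 + ci[0]; n0++; }
        if (live1) { x1 = xx1 - yy1 + cr[1]; y1 = 2.0 * xy1 + ci[1]; n1++; }
    }
    iters[0] = n0;
    iters[1] = n1;
}
```

The loop runs until both points have escaped or max_iter is reached, so a pair costs as many iterations as its slower point. That is one reason such schemes benefit from the load balancing mentioned earlier in the thread.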

Unfortunately there's no multi-core support; he didn't have the time. He also gained speed by using global variables (why is that?)... maybe I'll take some time and try to work his code into my benchmark...

Cheers, and a happy new year in advance!
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 28 Dec 2006, 08:35
I tried the quickman code on my home PC with Max Iters = 2048, Palette = 3 Muted, Precision = Double, and got Exact, Intel = 710.5 and Exact, AMD = 670.0. For comparison, using KMB_V0.53_MT.zip from your website, I got FPU = 337.049, SSE2 = 739.972, SSE3 = 727.863, Vodnaya = 679.789, while KMB_V0.56_MT.ZIP (hope you don't mind me calling it this; I could rename it if you would like) from my website gets X1 = 787.480, X2 = 1190.608, X3 = 1653.883. So it can be seen that the strategy of creating more instruction streams, rather than carefully interweaving only 2 instruction streams, is coming out slightly ahead (only slightly, taking into account quickman's lack of threading).
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
asmfan 28 Dec 2006, 20:14
Sorry Kuemmel, I didn't want to offend you by saying that.
I like the idea of the program, but there is still a lot to improve (I still think that suspending/resuming threads is better than creating/destroying them).
Also, are there newer *fixed* versions for Win32, not 64? I found the mods by Xorpd!, but only for 64-bit Windows... too bad :(

_________________
Any offers?
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.