flat assembler
Message board for the users of flat assembler.
Windows > Mandelbrot Benchmark FPU/SSE2 released
f0dder 07 Jun 2006, 17:51
Quote:
Those can hardly be compute-intensive threads, though?
Madis731 07 Jun 2006, 18:49
OH DEAR! Please do not hold that thought. I lied, and it was a BIG lie.
http://www.ddj.com/dept/64bit/184417069;jsessionid=3ZGTBVOLHTNZWQSNDBCSKHSCJUMEKJVN
This is the article. I mixed up the milliseconds and the threads, and I hadn't checked how many processors they used. Actually there were 4 sockets, each with 2 cores with HT, so the system saw it as 16 CPUs. Forget everything about the hundreds of threads :S They got it down to hundreds of milliseconds instead. I'm sorry!
Kuemmel 07 Jun 2006, 19:57
Madis731 wrote: http://www.ddj.com/dept/64bit/184417069;jsessionid=3ZGTBVOLHTNZWQSNDBCSKHSCJUMEKJVN
...interesting article though! They did almost the same as we did here, except that they had a machine with 4 CPUs * 2 cores each * 2 due to HT = 16 logical CPUs, so they created 16 threads, each calculating a line of the fractal. I just wonder how slow the code ended up, especially without SSE2 and assembler. Their speed gain was a factor of 7.26. I wonder how the KMB would perform on this machine... I don't expect anybody here to have a machine like this.
EDIT: As a quick hack I set up a 16-thread version and sent an e-mail to the authors, two guys from Intel. Let's see if we get a result. I don't expect one, but if you don't try you can't win anyway.
Last edited by Kuemmel on 07 Jun 2006, 20:31; edited 1 time in total
f0dder 07 Jun 2006, 20:03
It would be nice to test on such a machine.
The fractal code *is* a pretty optimal (which is a bit rare) example of parallelism... there aren't any resources you have to wait on (the inter-thread "communication" is handled by a single LOCK XADD), it doesn't access memory very much, etc.
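To illustrate the point about a single LOCK XADD, here is a minimal C sketch of that kind of work distribution. It assumes the shared counter hands out image lines, which the thread does not spell out; the names `next_line`, `claim_line`, `worker` and `mark_line` are made up for the example and are not taken from the benchmark source.

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared counter: index of the next image line nobody has claimed yet.
   (Hypothetical name; the benchmark's own variable is not shown in the thread.) */
static atomic_int next_line = 0;

/* Atomically claim one line. Returns -1 once all lines are taken.
   On x86, atomic_fetch_add with a used result typically compiles to LOCK XADD. */
static int claim_line(int total_lines)
{
    int line = atomic_fetch_add(&next_line, 1);
    return line < total_lines ? line : -1;
}

/* Sample callback: records that a line was rendered. ctx is an array of counters. */
static void mark_line(int line, void *ctx)
{
    ((unsigned char *)ctx)[line]++;
}

/* Loop that each worker thread would run. Returns how many lines this worker rendered.
   No locks and no waiting: the only shared write is the fetch-and-add in claim_line. */
static int worker(int total_lines, void (*render_line)(int, void *), void *ctx)
{
    int done = 0;
    int line;

    assert(total_lines >= 0);
    while ((line = claim_line(total_lines)) >= 0) {
        render_line(line, ctx);
        done++;
    }
    return done;
}
```

Each thread would simply call `worker` with the same `total_lines`. Faster threads claim more lines, so the load balances itself without any further synchronisation.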
Kuemmel 15 Jun 2006, 16:08
Some new results published: some dual-board systems and dual-board dual-core combinations. Scaling is still really good:
http://www.mikusite.de/pages/x86.htm
f0dder 15 Jun 2006, 21:41
*drool*
Kuemmel 11 Jul 2006, 16:15
New version, 0.53 MT released
http://www.mikusite.de/pages/x86.htm
Major bug fix for systems with more than 4 cores: the thread CPU affinity bit mask was set incorrectly. Lots of new results. Scaling is still great even for 8 cores, and I think it would scale the same way up to 64 cores, which would be the maximum for WinXP.
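The post does not say what exactly was wrong with the mask, so the following is only a sketch of how a per-thread affinity mask is usually built: one bit per logical CPU, with thread n pinned to CPU n modulo the CPU count. The function name `thread_affinity_mask` is an assumption for illustration. On Windows the result would be passed to SetThreadAffinityMask, which is deliberately not called here so the snippet stays portable.

```c
#include <assert.h>
#include <stdint.h>

/* Build the affinity mask for one worker thread.
   Bit n set means "this thread may run on logical CPU n".
   thread_index: 0-based number of the worker thread
   num_cpus:     logical CPUs available (1..64 for a 64-bit mask) */
static uint64_t thread_affinity_mask(unsigned thread_index, unsigned num_cpus)
{
    assert(num_cpus >= 1 && num_cpus <= 64);

    unsigned cpu = thread_index % num_cpus;  /* wrap around if threads > CPUs */

    /* Widen to 64 bits before shifting: a plain `1 << cpu` is an int shift
       and is not valid for the high CPU numbers. */
    return (uint64_t)1 << cpu;
}
```

For example, thread 5 on an 8-CPU machine gets mask 0x20, i.e. only bit 5 set.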
sylwek32 31 Oct 2006, 21:41
PIV 2.4 GHz FPU: 84.998
PIV 2.4 GHz SSE2: 219.483
:O Good?
Kuemmel 01 Nov 2006, 16:56
sylwek32 wrote: PIV 2.4 GHz FPU 84.998
Compared to the top positions in the table on my site, of course not... but it is a standard result for a single-core P4 without Hyper-Threading enabled.
Xorpd! 21 Dec 2006, 10:11
I have made some minor modifications to this benchmark. I am not sure whether they are correct or not.
madmatt 22 Dec 2006, 01:11
Hello Kuemmel,
Just when you thought things couldn't get more interesting, we now have dual- and quad-core processors, SSE3, (SSE4?). Keep up the good work making this code work on all these different processors. Anyway, I had to buy a new computer after my power supply blew and all the parts in my computer got fried. But luckily I found the money for a new computer and am now back in business. Here are the results from my new system:
CPU: Intel Celeron-D 3.2 GHz, 512MB DDR SDRAM
FPU: 101.087
SSE2: 235.109
SSE3: 235.287
asmfan 22 Dec 2006, 10:25
Extremely BAD source!!!
So many errors and mistakes. The critical ones:
Code:
proc WindowProc hwnd, wmsg, wparam, lparam
    mov eax, [wmsg]
    .if eax, e, WM_CLOSE
        invoke DestroyWindow,[hwnd]
        ret 0
    .elseif eax, e, WM_DESTROY
        invoke PostQuitMessage, 0
        ret 0
    .elseif eax, e, WM_PAINT
        invoke BeginPaint, [hwnd], ps
        invoke EndPaint, [hwnd], ps
        ret 0
    .endif
    invoke DefWindowProc, [hwnd], [wmsg], [wparam], [lparam]
    ret
endp
Stack corruption: with STDCALL it should not be ret 0 but
Code:
xor eax,eax
ret
I don't see the message loop anywhere...
Code:
invoke CloseWindow, [mainhwnd]
at the end just minimizes the window and does not destroy it...
The logic is terrible... Creating threads every time in the loop is terrible overhead.
_________________
Any offers?
AsmER 22 Dec 2006, 13:26
Here is mine:
AMD Athlon 64 3500+ 2.19GHz (1GB RAM)
FPU: 145.429 [million iterations/sec]
SSE2: 197.678 [million iterations/sec]
Well, I didn't expect more from my old laptop.
PS: asmfan's right.
Xorpd! 23 Dec 2006, 12:34
asmfan wrote: Extremely BAD source!!!
I agree the code looks a little rough and ready. Perhaps the message loop got deleted at some point during development because invoking the non-default cases in the WindowProc could cause a crash. There probably was one to start with, because there is an MSG structure in the code. The creation of threads in the loop was not nearly the biggest performance problem with the code. The biggest performance problem was pointed out on this page. I have addressed it, done some load balancing, and obtained a significant performance increase. It is still a little slow, though. Perhaps the biggest problem left is what happens if an exceptional value is generated, as seems quite likely. Might this explain most of the remaining performance deficit?
f0dder 23 Dec 2006, 16:51
The threading stuff was retrofitted by me, and I opted for the "path of least resistance".
Xorpd! 24 Dec 2006, 09:53
My concern that exceptional values might slow the calculation down seems to have been unfounded. I tried guarding against exceptional values in the single instruction stream version and got only a 3% improvement, so I didn't implement this in the dual and triple instruction stream versions. I did get some improvement by changing some movapd instructions to load/store sequences, and by unrolling the inner loop by 4. The effect of the last few revisions has been to move my machine up about 13 places on the chart, and since the next computer requires new hardware to attack, I think I'm done for the year. Enjoy the holidays if you're fortunate enough to have holidays.
Kuemmel 28 Dec 2006, 05:46
asmfan wrote: Extremely BAD source!!!
Hi asmfan... sorry, it was my first try at FASM and at x86 assembler at all, and in the end it was the work of 3 different people... Are you sure about the large overhead? Xorpd! didn't find much of a gain. Could you point out the errors, and how and where to fix the code? I've still got a lot to learn!
In the meantime I found a quite fast fractal generator: http://sourceforge.net/projects/quickman/ I think its SSE2 version has almost double the performance of mine on one core. I had a talk with the author: he calculates 4 pixels at a time (in two separate SSE registers) by interleaving the instructions for 2 pixels (in one SSE register) within one loop, plus other minor stuff. Unfortunately there is no multi-core support; he didn't have the time. He also gained speed by using global variables (why is that?)... Maybe I'll take some time and try to integrate his code into my benchmark...
Cheers, and a happy new year in advance!
Xorpd! 28 Dec 2006, 08:35
I tried the quickman code on my home PC, Max Iters = 2048, Palette = 3 Muted, Precision = Double, and for Exact, Intel = 710.5; Exact, AMD = 670.0. For comparison, using KMB_V0.53_MT.zip from your website, I got FPU = 337.049, SSE2 = 739.972, SSE3 = 727.863, Vodnaya = 679.789, while KMB_V0.56_MT.ZIP (hope you didn't mind me calling it this; I could rename it if you would like) from my website gets X1 = 787.480, X2 = 1190.608, X3 = 1653.883. So it can be seen that the strategy of creating more instruction streams rather than carefully interweaving only 2 instruction streams is coming out slightly ahead (only slightly, taking into account quickman's lack of threading).
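For readers wondering what "instruction streams" means here, the following scalar C sketch iterates two independent points in one loop. It is only an illustration of the idea, not the benchmark's or quickman's actual code (both use SSE2 assembler); the function names `mandel_one` and `mandel_two` are made up. Because the two dependency chains do not feed into each other, an out-of-order CPU may overlap their multiplies and adds. A point that has escaped is frozen rather than iterated further, which is one simple way to keep values from running off to infinity.

```c
#include <assert.h>

/* Reference: iterate a single point c = (cr, ci).
   Returns the number of iterations done before |z|^2 > 4, capped at max_iter. */
static int mandel_one(double cr, double ci, int max_iter)
{
    double zr = 0.0, zi = 0.0;
    int n = 0;

    while (n < max_iter) {
        double zr2 = zr * zr, zi2 = zi * zi;
        if (zr2 + zi2 > 4.0)
            break;
        zi = 2.0 * zr * zi + ci;
        zr = zr2 - zi2 + cr;
        n++;
    }
    return n;
}

/* Two independent streams (a and b) advanced in the same loop. */
static void mandel_two(const double cr[2], const double ci[2],
                       int max_iter, int out[2])
{
    double ar = 0.0, ai = 0.0, br = 0.0, bi = 0.0;
    int na = 0, nb = 0;
    int a_live = 1, b_live = 1;

    assert(max_iter >= 0);
    while (a_live || b_live) {
        /* Squares for both streams are computed side by side. */
        double ar2 = ar * ar, ai2 = ai * ai;
        double br2 = br * br, bi2 = bi * bi;

        a_live = a_live && na < max_iter && ar2 + ai2 <= 4.0;
        b_live = b_live && nb < max_iter && br2 + bi2 <= 4.0;

        /* An escaped or finished stream is left untouched from here on. */
        if (a_live) { ai = 2.0 * ar * ai + ci[0]; ar = ar2 - ai2 + cr[0]; na++; }
        if (b_live) { bi = 2.0 * br * bi + ci[1]; br = br2 - bi2 + cr[1]; nb++; }
    }
    out[0] = na;
    out[1] = nb;
}
```

The X2 and X3 versions mentioned above presumably take the same idea further with packed SSE2 registers, but that is a guess from the post, not something the code here demonstrates.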
asmfan 28 Dec 2006, 20:14
Sorry Kuemmel, I didn't want to offend you by saying that.
I like the idea of the program, but there is still a lot to improve (I still think that suspending/resuming threads is better than creating/destroying them). Also, are there newer *fixed* versions for Win32 rather than Win64? I found the mods by Xorpd!, but they are only for 64-bit Windows... too bad.
_________________
Any offers?
Copyright © 1999-2025, Tomasz Grysztar.