flat assembler
Message board for the users of flat assembler.

Mandelbrot Benchmark FPU/SSE2 released
vbVeryBeginner



Joined: 15 Aug 2004
Posts: 884
Location: \\world\asia\malaysia
vbVeryBeginner 20 Apr 2006, 01:14
Here is my CPU info:
http://sulaiman.netadvant.com/fasm/cpu.png

It supports MMX, SSE, SSE2, SSE3 and EM64T.
EM64T sounds like it could emulate 64-bit instructions? :o

I just googled for 64-bit WinXP, and it seems that
http://www.microsoft.com/windowsxp/64bit/facts/trial.mspx
Supported processors: AMD Athlon 64, AMD Opteron, Intel Xeon with Intel EM64T support, Intel Pentium 4 with Intel EM64T support

Maybe I am in luck and can run 64-bit WinXP?
Post 20 Apr 2006, 01:14
r22



Joined: 27 Dec 2004
Posts: 805
r22 20 Apr 2006, 02:44
I rewrote the FPU code in the SSE source to use only SSE. There was NO performance gain. I think running the 32-bit DirectDraw code under WOW64 on my WinXP64 is hindering my results. I'd port it to 64-bit, but I don't have any examples or includes for DirectDraw on Win64.

By using SSE the source can be optimized, because the XMM registers retain their values, unlike the FPU stack.


Attachment: KMB_sse2_v0.3.ASM (19.84 KB)

Post 20 Apr 2006, 02:44
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 20 Apr 2006, 03:00
r22, your version is a little worse: I now get 173 instead of 176. I'm using WinXP32 SP2 (but my CPU is an Athlon 64 at 2 GHz).
Post 20 Apr 2006, 03:00
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 20 Apr 2006, 20:55
Hi people,

I put all your results on my homepage. It would still be cool to get results for the following processors, so if you know somebody who has one, it would be nice if I could get a result:

- Intel Pentium Pro
- Intel Pentium II
- Intel Core Solo/Duo

It won't work on any AMD K6 or earlier, and the classic Pentium and Pentium MMX won't work either, as I use the 'FCOMIP' instruction. Without that instruction, the performance loss on the Intel Pentium 4 was even greater than it is now.

The operating system could also be a problem. My benchmark didn't work on a Pentium III running W98SE... I don't know why... on Madis731's PIII there wasn't any problem...!?

For the Pentium 4 it's really funny to see how a whole engineering group at the market leader did a lot of shit on the FPU side while doing something nice with SSE... but the FPU problems seem to justify throwing the whole concept in the trash and going back to a PIII/M-style design coupled with SSE in the upcoming Intel Conroe ;)
Post 20 Apr 2006, 20:55
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 21 Apr 2006, 05:35
I've got Windows 2000 on both of the machines I've tried, P!!! and P4 :) I think I can manage to find a W98 machine if I try. I've had some P4 dual cores from 2.8 to 3.2 GHz, but they are all sold out :( People tend to buy them like warm bread :)
Post 21 Apr 2006, 05:35
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 01 May 2006, 16:26
Hi guys,

After reading Iczelion's Win32 MultiThreading Tutorial (http://win32assembly.online.fr/tut15.html) I tried to apply it to my benchmark to see the benefit of a dual-core system. The multithreading seems to work, but now it's really slow on a single core: about 40 percent performance loss.
You can get it from:
www.mikusite.de/x86/kmb_mt_test.zip

Can anybody who has a dual core test it and see whether there is any improvement compared to the original code?
Or even have a look at the code?

I think I didn't manage to set it up correctly. In my inner loop, for each frame of the fractal, I create one thread for the upper half and one for the lower half of the screen. But to detect when they are done I just came up with a dummy kind of loop (the finish counter is incremented by the upper_half and lower_half routines):

Code:
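   frame_counter_loop:
   ...
         ; start one thread per half of the screen
         ; (the 5th argument is dwCreationFlags, so it is 0 here; REALTIME_PRIORITY_CLASS is a
         ;  process priority class, and a thread priority would be set with SetThreadPriority)
         mov   eax,thread_procedure_upper_half
         invoke CreateThread,NULL,0,eax,NULL,0,tId

         mov   eax,thread_procedure_lower_half
         invoke CreateThread,NULL,0,eax,NULL,0,tId

         ; busy-wait until both thread procedures have incremented [finish]
         wait_until_threads_are_finished:
           cmp [finish],2
           jne wait_until_threads_are_finished
           mov [finish],0

     dec [frame_counter]
     jns frame_counter_loop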
    


I think this should rather be done in a window message loop that watches for the threads to end...!?
Post 01 May 2006, 16:26
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 01 May 2006, 19:56
AMD64x2 4400+ - all four memory banks loaded but still running DDR400 speed, but fortunately not having system instability (a tiny bit slower than two banks, though... yay for bugged on-cpu memory controllers to drag down an otherwise nice chip).

Results are the best of three runs:

Quote:

Kümmel Mandelbrot Benchmark V 0.3 FPU: Speed [million iterations / second] : 142.964

Kümmel Mandelbrot Benchmark V 0.3 SSE2: Speed [million iterations / second] : 194.400
Post 01 May 2006, 19:56
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 01 May 2006, 19:59
I'm getting 195fps on the multithread version.

Some tips: use SetThreadAffinityMask() to limit each thread to a separate core (not *really* needed, especially with time-critical priority, but it can still help a bit against cache thrashing). Also, to check whether the threads are done, use WaitForMultipleObjectsEx on the thread handles... faster than your manual sync.
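
A rough sketch of the "wait on the threads instead of polling a counter" idea. It is written in C with POSIX threads rather than the Win32 calls named above, purely to keep it short and portable; pthread_join stands in for a blocking wait on the thread handles. The names half_job, render_half and render_frame are made up for illustration, and the per-line work is a placeholder.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical job description for one half of the frame. */
typedef struct { int first, last; long result; } half_job;

/* Thread body: placeholder work, just sums the line numbers it owns. */
static void *render_half(void *arg) {
    half_job *job = arg;
    job->result = 0;
    for (int y = job->first; y < job->last; ++y)
        job->result += y;
    return NULL;
}

/* Start both halves, then block until both are done (no polling loop). */
long render_frame(int height) {
    assert(height >= 0);
    half_job jobs[2] = { {0, height / 2, 0}, {height / 2, height, 0} };
    pthread_t threads[2];
    for (int i = 0; i < 2; ++i) {
        int rc = pthread_create(&threads[i], NULL, render_half, &jobs[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < 2; ++i)
        pthread_join(threads[i], NULL);   /* sleeps until thread i exits */
    return jobs[0].result + jobs[1].result;
}
```

The waiting thread sleeps inside the join, so it does not compete with the workers for CPU time the way a `cmp`/`jne` spin loop does.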


Last edited by f0dder on 01 May 2006, 20:21; edited 2 times in total
Post 01 May 2006, 19:59
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 01 May 2006, 20:06
Dual-core people, check this out... I get ~256fps now :)

PS: (and now finally as an edit instead of a new post, doh!) the upper/lower threads are *very* similar in their code. It would be better to pass the threads an initialization struct with limits, and to use local instead of global variables. That would also make it easier to expand this to 'n' threads, for people with SMP systems, SMP systems with dual-core CPUs, or SMP systems with dual-core+HT CPUs :P
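
A minimal sketch of the "one thread body plus an initialization struct with limits" suggestion, again in C with POSIX threads as a portable stand-in for CreateThread and its lpParameter argument. band_params, band_worker, render_bands and MAX_WORKERS are hypothetical names, and the pixel work is only a counter.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_WORKERS 16

/* Per-thread parameters: each worker gets its own copy, so no globals. */
typedef struct {
    int y_begin, y_end;   /* half-open range of screen lines */
    int width;
    long pixels_done;     /* result written by the worker */
} band_params;

/* Single thread body shared by all workers. */
static void *band_worker(void *arg) {
    band_params *p = arg;
    long count = 0;                       /* local (stack) variable */
    for (int y = p->y_begin; y < p->y_end; ++y)
        for (int x = 0; x < p->width; ++x)
            ++count;                      /* placeholder for pixel work */
    p->pixels_done = count;
    return NULL;
}

/* Split the frame into nthreads horizontal bands and wait for all of them. */
long render_bands(int width, int height, int nthreads) {
    assert(nthreads >= 1 && nthreads <= MAX_WORKERS);
    pthread_t tid[MAX_WORKERS];
    band_params params[MAX_WORKERS];
    for (int i = 0; i < nthreads; ++i) {
        params[i].y_begin = (int)((long)height * i / nthreads);
        params[i].y_end   = (int)((long)height * (i + 1) / nthreads);
        params[i].width = width;
        params[i].pixels_done = 0;
        pthread_create(&tid[i], NULL, band_worker, &params[i]);
    }
    long total = 0;
    for (int i = 0; i < nthreads; ++i) {
        pthread_join(tid[i], NULL);
        total += params[i].pixels_done;
    }
    return total;
}
```

Because the limits travel in the struct, going from two threads to 'n' only changes the nthreads argument, not the thread code.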


Attachment: f0dder_kmb.zip (6.21 KB)


Post 01 May 2006, 20:06
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 01 May 2006, 21:14
Hey f0dder!

WOW, it basically seems to work!

The upper/lower threads are in fact almost the same; I just thought it might be better to give each its own set of variables, which is why I created them this way. The basic problem is that the upper and lower parts of the screen may take different times to compute anyway, so it might be better overall to give each thread a single pixel to compute... instead of creating just two threads per frame... I'll try experimenting with this too!

I'll try to give each calculated line its own thread; this should reduce the time a core sits idle with nothing to do, as happens now... I'm sure we can then gain even more, almost double speed, I suspect!

Now I'll check out your version...

Hey... great... there's no more penalty on my single-core Sempron! Looks good for the next version of my benchmark, thanks to you :)
Post 01 May 2006, 21:14
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 02 May 2006, 07:55
Quote:

The upper/lower threads are in fact almost the same; I just thought it might be better to give each its own set of variables, which is why I created them this way.

Local (stack) variables will solve this 100% automatically.

Quote:

The basic problem is that the upper and lower parts of the screen may take different times to compute anyway, so it might be better overall to give each thread a single pixel to compute... instead of creating just two threads per frame... I'll try experimenting with this too!

It probably won't be worth it: threads are pretty "cheap", but more than one thread per core most likely won't do you any good, and can even harm performance.

I'll try playing around a bit and see what I can come up with :)

EDIT: updated code, which only has one thread body now. Removed a lot of superfluous dword/qword specifiers, and reformatted the code (4-tab indent etc.). Runs 5-10fps faster on my system than my previous edit...
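
A small sketch of picking the worker count from the number of logical processors, so that there is never more than one thread per core. It assumes a POSIX-like system where sysconf(_SC_NPROCESSORS_ONLN) is available (a common extension, not strict POSIX); on Windows the same number would come from GetSystemInfo. pick_worker_count is a made-up helper name.

```c
#include <assert.h>
#include <unistd.h>

/* Number of worker threads to start: one per logical processor,
   clamped to the range [1, max_workers]. */
int pick_worker_count(int max_workers) {
    assert(max_workers >= 1);
    long n = sysconf(_SC_NPROCESSORS_ONLN);  /* may return -1 on failure */
    if (n < 1)
        n = 1;
    if (n > max_workers)
        n = max_workers;
    return (int)n;
}
```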


Attachment: f0dder_kmb_2.zip (7.78 KB)

Post 02 May 2006, 07:55
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 02 May 2006, 10:06
Here's a new version that can do both FPU and SSE2 multi-threaded (compile-time option). Just for fun, I assembled a version where one thread does FPU and another SSE2: pretty useless, but oh well :)

I get around 360fps from the SSE2-mt version.

Might be worthwhile using the write-through-cache instructions when storing since you're drawing directly to video memory?


Attachment: f0dder_kmb_3.zip (12.9 KB)


Post 02 May 2006, 10:06
GMAN



Joined: 02 May 2006
Posts: 5
Location: UK
GMAN 02 May 2006, 11:01
Aha! The Release and Kümmel versions of IF.INC and WIN32A.INC are different.

In particular, K's WIN32A includes the macro files 'proc.inc', 'com.inc' and 'import.inc', whereas the release version includes the versions with 32 on the end. That seems strange. The crash takes place at

Code:
cominvk DDraw, Release        
(line 303-ish)

I'll explore further, starting with a diff of COM.INC and COM32.INC.

The IF.INC files are hugely different - Release is 9K, K's version is 2K. What's the difference between JCOND and JNCOND?

(Not that it seems to make any difference to this program, but to my NooB ear those two sound like they do the opposite thing...)

And finally... Celeron 2.4, 512 MB RAM @ 266 MHz:

    SSE2 - 211.843 (single thread and f0dder's multithreaded version)
    Mixed :D - 127.781
    FPU - 88.5 (single thread and f0dder's multithreaded version)


[EDIT] - The problem is in the difference between the proc and proc32 macro libraries. If I had an IQ in three figures I could probably work out just what that difference is. The WinDiff of those two files is just boggling.
Post 02 May 2006, 11:01
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 02 May 2006, 13:22
I get 176 on my single core!! The same as the original code. You must have optimized the code quite a lot then :D
Post 02 May 2006, 13:22
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 02 May 2006, 14:09
I'm quite amazed that you're getting the same speed with the multithreaded version as with the single-threaded one! Might have something to do with the thread priorities, hmm.
Post 02 May 2006, 14:09
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 02 May 2006, 14:33
Same speed as Kuemmel's code, but since you made some optimizations, maybe that compensates for the overhead caused by using two threads on a single-core computer.

Another test that would be great to do is to stop writing to video memory. I think there may be a problem with write combining, especially with my core (Venice DH-E3), which has a bug related to that.

Regards

PS: The bug
Revision Guide for AMD Athlon™ 64 and AMD Opteron™ Processors wrote:

113 Enhanced Write-Combining Feature Causes System Hang
Description: The enhanced write-combining feature provides up to four write-combining buffers, but a potential stall condition can occur when write combining into all four buffers with this feature enabled.
Potential Effect on System: System hang.
Suggested Workaround: Disable the enhanced write-combining feature by setting BU_CFG.WbEnhWsbDis (bit 48 of MSR C001_1023h). This reduces the number of available write-combining buffers from four to one.
Fix Planned: Yes
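
One way the "not writing to video memory" test suggested in this post could be set up, sketched in C under loud assumptions: the `surface` pointer is a hypothetical stand-in for a locked DirectDraw surface, draw_frame is a placeholder for the fractal loop, and render_offscreen is a made-up name. The frame is drawn into ordinary memory first and copied in one sequential pass; passing NULL skips the copy, which isolates the pure compute time.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Fill a frame with a placeholder pattern (stand-in for the fractal loop). */
static void draw_frame(uint32_t *dst, int width, int height) {
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            dst[y * width + x] = (uint32_t)(x ^ y);
}

/* Draw into system memory, then optionally copy to the surface.
   Returns 0 on success, -1 if the back buffer cannot be allocated. */
int render_offscreen(uint32_t *surface, int width, int height) {
    assert(width > 0 && height > 0);
    size_t bytes = (size_t)width * (size_t)height * sizeof(uint32_t);
    uint32_t *back = malloc(bytes);
    if (!back)
        return -1;
    draw_frame(back, width, height);
    if (surface)
        memcpy(surface, back, bytes);   /* sequential writes only */
    free(back);
    return 0;
}
```

Comparing timings with and without the copy would show how much of the frame time goes into the video-memory writes rather than the iteration loop.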
Post 02 May 2006, 14:33
madmatt



Joined: 07 Oct 2003
Posts: 1045
Location: Michigan, USA
madmatt 02 May 2006, 17:18
Hello everyone, here are my latest results [Celeron PIV 2.7 GHz w/hyperthreading]:
r22 -> 221.4
f0dder v1 -> 90.1
f0dder v2 -> 98.704
f0dder v3 -> 237.045 (!) (I'll have to remember that real-time priority trick!)
Kuemmel -> 58.607 (?)
Post 02 May 2006, 17:18
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 02 May 2006, 17:21
madmatt, is the "f0dder v3" the FPU or the SSE2 version? My guess is SSE2, since you get that high a speed :)
Post 02 May 2006, 17:21
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 02 May 2006, 17:47
Hi f0dder! Nice work! I can learn a lot from that!

I still think we can gain even more by rearranging the calculation, due to the nature of the fractal itself. I think on some frames (almost all, unless totally symmetric) the thread for the first half now has to wait for the second half to finish calculating. That is more or less the case for all the frames I've chosen.

So I propose not to split it into upper/lower half (2 threads) per frame; instead we could try putting the threading in the 'plot_y' loop:

screen line 0 : thread1
screen line 1 : thread2
wait for them to finish
screen line 2 : thread1
screen line 3 : thread2
wait for them to finish
...and so on... As two adjacent screen lines of the fractal are 'more' equal in computation time, this should boost the computation even more... provided that this many more threads (300*2 lines = 600 threads per frame) don't create more administration overhead; that's what I don't know.

I think it's really worth a try!

I just might not have time this week to implement it...
Post 02 May 2006, 17:47
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 02 May 2006, 18:03
Yes, you could do scan-line interleaving - that might be worth a try. But you do NOT want too many "wait for them to finish", as that's a relatively expensive operation - either you need a user->kernel->usermode switch (exxxxpennnnsive), or you need a usermode spinlock (burns CPU cycles for just about nothing).
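
A sketch of scan-line interleaving that pays for only one wait per frame: worker k draws lines k, k+n, k+2n, ... and the main thread joins once at the end. It is plain C with POSIX threads as a portable stand-in for the Win32 threads discussed here. The frame size, iteration limit, coordinate mapping and all names (stripe_params, stripe_worker, render_interleaved) are assumptions for illustration, not taken from the benchmark source.

```c
#include <assert.h>
#include <pthread.h>

enum { WIDTH = 64, HEIGHT = 48, MAX_ITER = 100 };

typedef struct {
    int first_line, stride;     /* worker k handles lines k, k+stride, ... */
    unsigned char *frame;       /* WIDTH*HEIGHT iteration counts */
} stripe_params;

/* Escape-time iteration count for the point c = (cr, ci). */
static int mandel_iters(double cr, double ci) {
    double zr = 0.0, zi = 0.0;
    int n = 0;
    while (n < MAX_ITER && zr * zr + zi * zi <= 4.0) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        ++n;
    }
    return n;
}

static void *stripe_worker(void *arg) {
    stripe_params *p = arg;
    for (int y = p->first_line; y < HEIGHT; y += p->stride)
        for (int x = 0; x < WIDTH; ++x)
            p->frame[y * WIDTH + x] = (unsigned char)mandel_iters(
                -2.0 + 3.0 * x / WIDTH, -1.0 + 2.0 * y / HEIGHT);
    return NULL;
}

/* Render one frame with nthreads interleaved workers; one join per frame. */
void render_interleaved(unsigned char *frame, int nthreads) {
    assert(nthreads >= 1 && nthreads <= 8);
    pthread_t tid[8];
    stripe_params params[8];
    for (int k = 0; k < nthreads; ++k) {
        params[k] = (stripe_params){ k, nthreads, frame };
        pthread_create(&tid[k], NULL, stripe_worker, &params[k]);
    }
    for (int k = 0; k < nthreads; ++k)
        pthread_join(tid[k], NULL);
}
```

Neighbouring lines cost roughly the same, so the interleaved workers should finish at about the same time without any per-line synchronisation.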

And you really don't want more threads than you have processors (or rather "logical processors", to cover SMP, multicore, multithreading in one term). The problem with more threads is that you'll get more context switches, and those are relatively expensive (although not *extremely* expensive as long as it's threads within a single process - switching to a thread in another process is more expensive because of pagetable issues).

One thing that might be worth it is to create the two worker threads only once, and re-use them. This does complicate the program logic, though, as you need to signal "start drawing frame", then wait for the frame to be drawn, etc.

EDIT: As it is now, I think the biggest gain will come from optimizing the actual drawing algorithm. The threading stuff is relatively solid, and schemes like "create threads once and reuse" require more complex logic and might not be faster in the long run. Scan-line interleave would be worth checking out, though.
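
To give an idea of the extra logic involved, here is a sketch of "create the workers once and reuse them", in C with a POSIX mutex and two condition variables playing the role that Win32 event objects would play. Everything in it is illustrative: worker_pool, pool_init, pool_draw_frame and pool_shutdown are made-up names, and the drawing itself is a placeholder that just counts lines.

```c
#include <assert.h>
#include <pthread.h>

enum { NWORKERS = 2, HEIGHT = 600 };

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  start;    /* "start drawing frame" (or quit) */
    pthread_cond_t  done;     /* last worker finished the frame */
    int frame_no;             /* bumped once per frame */
    int pending;              /* workers still busy with this frame */
    int quit;
    long lines_drawn;         /* placeholder result, summed over frames */
    pthread_t tid[NWORKERS];
} worker_pool;

static void *pool_worker(void *arg) {
    worker_pool *p = arg;
    int seen = 0;                          /* last frame this worker drew */
    pthread_mutex_lock(&p->lock);
    for (;;) {
        while (!p->quit && p->frame_no == seen)
            pthread_cond_wait(&p->start, &p->lock);
        if (p->quit)
            break;
        seen = p->frame_no;
        pthread_mutex_unlock(&p->lock);

        long lines = HEIGHT / NWORKERS;    /* placeholder for real drawing */

        pthread_mutex_lock(&p->lock);
        p->lines_drawn += lines;
        if (--p->pending == 0)
            pthread_cond_signal(&p->done);
    }
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

void pool_init(worker_pool *p) {
    assert(p != NULL);
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->start, NULL);
    pthread_cond_init(&p->done, NULL);
    p->frame_no = p->pending = p->quit = 0;
    p->lines_drawn = 0;
    for (int i = 0; i < NWORKERS; ++i)
        pthread_create(&p->tid[i], NULL, pool_worker, p);
}

/* Wake all workers for one frame and block until the last one reports in. */
void pool_draw_frame(worker_pool *p) {
    pthread_mutex_lock(&p->lock);
    p->pending = NWORKERS;
    ++p->frame_no;
    pthread_cond_broadcast(&p->start);
    while (p->pending > 0)
        pthread_cond_wait(&p->done, &p->lock);
    pthread_mutex_unlock(&p->lock);
}

void pool_shutdown(worker_pool *p) {
    pthread_mutex_lock(&p->lock);
    p->quit = 1;
    pthread_cond_broadcast(&p->start);
    pthread_mutex_unlock(&p->lock);
    for (int i = 0; i < NWORKERS; ++i)
        pthread_join(p->tid[i], NULL);
}
```

A caller would run pool_init once, pool_draw_frame once per frame, and pool_shutdown at exit. Each frame still costs two kernel-level waits, which is why this may not beat plain per-frame thread creation by much.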
Post 02 May 2006, 18:03
Copyright © 1999-2024, Tomasz Grysztar.