flat assembler
Message board for the users of flat assembler.
Windows > Mandelbrot Benchmark FPU/SSE2 released
vbVeryBeginner 20 Apr 2006, 01:14
Here is my CPU info: http://sulaiman.netadvant.com/fasm/cpu.png
It supports MMX, SSE, SSE2, SSE3 and EM64T. EM64T sounds like it could emulate 64-bit instructions? I just googled for 64-bit WinXP, and according to http://www.microsoft.com/windowsxp/64bit/facts/trial.mspx the supported processors are: AMD Athlon 64, AMD Opteron, Intel Xeon with Intel EM64T support, Intel Pentium 4 with Intel EM64T support. Maybe I'm in luck and can run 64-bit WinXP?
r22 20 Apr 2006, 02:44
I rewrote the FPU code in the SSE source to use only SSE. There was NO performance gain. I think the 32-bit DirectDraw running under WOW64 on my WinXP64 is hindering my results. I'd port it to 64-bit, but I don't have any examples/includes of DirectDraw for Win64.
By using SSE the source can be optimized, because the XMM registers retain their values, unlike the FPU stack.
LocoDelAssembly 20 Apr 2006, 03:00
r22, your version is a little worse: I now get 173 instead of 176. I'm using WinXP32 SP2 (but my CPU is an Athlon 64 at 2 GHz).
Madis731 21 Apr 2006, 05:35
I've got Windows 2000 on both of the machines I've tried on - P!!! and P4. I think I can manage to find a W98 machine if I try. I've had some P4 dual cores from 2.8 to 3.2GHz, but they are all sold out. People tend to buy them like warm bread.
Kuemmel 01 May 2006, 16:26
Hi guys,
Reading Iczelion's Win32 multithreading tutorial (http://win32assembly.online.fr/tut15.html), I tried to apply it to my benchmark to see the benefit of a dual-core system. The multithreading seems to work, but it's now really slow on a single core: about a 40 percent performance loss. You can get it from: www.mikusite.de/x86/kmb_mt_test.zip

Can anybody who has a dual core test it and see whether there is any improvement compared to the original code? Or even have a look at the code? I think I didn't manage to set it up correctly. In my inner loop, for each frame of the fractal, I start one thread for the upper half and one for the lower half of the screen. But to detect when they are done I just came up with a dummy kind of loop (the finish counter is incremented by the upper_half and the lower_half routines):

Code:
frame_counter_loop:
        ...
        ; start one thread per screen half
        mov     eax,thread_procedure_upper_half
        invoke  CreateThread,NULL,0,eax,NULL,REALTIME_PRIORITY_CLASS,tId
        mov     eax,thread_procedure_lower_half
        invoke  CreateThread,NULL,0,eax,NULL,REALTIME_PRIORITY_CLASS,tId
wait_until_threads_are_finsihed:
        ; spin until both thread procedures have incremented [finish]
        cmp     [finish],2
        jne     wait_until_threads_are_finsihed
        mov     [finish],0
        dec     [frame_counter]
        jns     frame_counter_loop

I think this should rather be done in a window message loop that looks for the end of the threads...!?
f0dder 01 May 2006, 19:56
AMD64x2 4400+ - all four memory banks loaded but still running DDR400 speed, but fortunately not having system instability (a tiny bit slower than two banks, though... yay for bugged on-cpu memory controllers to drag down an otherwise nice chip).
Results are the best of three runs: Quote:
f0dder 01 May 2006, 19:59
I'm getting 195fps on the multithread version.
Some tips: use SetThreadAffinityMask() to limit each thread to a separate core (not *really* needed, especially with time-critical priority, but it can still help a bit against cache thrashing). Also, to check whether the threads are done, use WaitForMultipleObjectsEx on the thread handles... faster than your manual sync.
f0dder 01 May 2006, 20:06
Dualcore people, check this out... I get ~256fps now
PS: (and now finally as an edit instead of a new post, doh!) the upper/lower threads are *very* similar in their code. It would be better to pass the threads an initialization struct with limits, and to use local instead of global variables. That would also make it easier to expand this to 'n' threads, for people with SMP systems, SMP systems with dual-core CPUs, or SMP systems with dual-core+HT CPUs.
_________________
- carpe noctem
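A minimal sketch of the "initialization struct with limits" idea, written in C rather than the thread's fasm so the arithmetic is easy to follow. The names `band_limits` and `band_for_worker` are hypothetical and do not come from f0dder's or Kuemmel's sources; the sketch only shows how one thread body could receive its row range for any number of workers.

```c
#include <assert.h>

/* Hypothetical per-thread init struct: the half-open row range
   [y_start, y_end) that one worker is responsible for. */
typedef struct {
    int y_start;
    int y_end;
} band_limits;

/* Compute the band for worker `index` out of `count` workers over
   `height` screen rows. Leftover rows (height % count) are handed
   out one each to the lowest-numbered workers. */
band_limits band_for_worker(int height, int count, int index)
{
    assert(height >= 0 && count > 0 && index >= 0 && index < count);

    int base  = height / count;   /* rows every worker gets */
    int extra = height % count;   /* rows left over */

    band_limits b;
    b.y_start = index * base + (index < extra ? index : extra);
    b.y_end   = b.y_start + base + (index < extra ? 1 : 0);
    return b;
}
```

With 600 rows and 2 workers this reproduces the upper/lower split (rows 0 to 299 and 300 to 599). Each worker would get a pointer to its own struct through the thread parameter, so no per-thread globals are needed.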
Kuemmel 01 May 2006, 21:14
Hey f0dder!
WOW, it basically seems to work! The upper/lower threads are in fact almost the same; I just thought it might be better to give each its own set of variables, which is why I created them this way. The basic problem is that the upper and lower parts of the screen might take different times to compute anyway, so it might be better overall to give each thread its own single pixel to compute... instead of creating just two threads per frame... I'll experiment with this too! I'll try to give each calculated line its own thread; this should reduce the time a core sits with nothing to do, like now... I'm sure we can then gain even more, almost double the speed, I suspect!
Now checking out your version... hey... great... there's no more penalty on my single-core Sempron! Looks good for the next version of my benchmark, thanks to you.
f0dder 02 May 2006, 07:55
Quote:
Local (stack) variables will solve this 100% automatically.
Quote:
It probably won't be worth it - threads are pretty "cheap", but more than one thread per core most likely won't do you any good, and can even harm performance. I'll try playing around a bit and see what I can come up with.
EDIT: updated code, only has one thread body now. Removed a lot of superfluous dword/qword specifiers, and reformatted (4tab indent etc.) the code. Runs 5-10fps faster on my system than my previous edit...
f0dder 02 May 2006, 10:06
Here's a new version that can do both FPU and SSE2 multi-threaded (compile-time option). Just for fun, I assembled a version where one thread does FPU and another SSE2 - pretty useless, but oh well
I get around 360fps from the SSE2-mt version. Might be worthwhile using the write-through-cache instructions when storing since you're drawing directly to video memory?
GMAN 02 May 2006, 11:01
Aha. The Release and Kümmel versions of IF.INC and WIN32A.INC are different.
In particular, K's WIN32A includes the macro files 'proc.inc', 'com.inc' and 'import.inc', whereas the release version includes the ones with 32 on the end. That seems strange. The crash takes place at

Code:
cominvk DDraw, Release

I'll explore further, starting with a diff of COM.INC and COM32.INC. The IF.INC files are hugely different - Release is 9K, K's version is 2K. What's the difference between JCOND and JNCOND? (Not that it seems to make any difference to this program, but to my NooB ear those two sound like they do the opposite thing...)

& Finally... Celeron 2.4, 512MB RAM @ 266MHz:
SSE2 - 211.843 (single and f0dder's multithreaded version)
Mixed - 127.781
FPU - 88.5 (single thread and f0dder's multithreaded version)

[EDIT] - The problem is in the difference between the proc and proc32 macro libraries. If I had an IQ in three figures I could probably work out just what that difference is. The Windiff of those two files is just boggling.
LocoDelAssembly 02 May 2006, 13:22
I get 176 on my single core!! The same as the original code. You must have optimized the code a lot, then.
f0dder 02 May 2006, 14:09
I'm quite amazed if you're getting same speeds with the multithreaded as the singlethreaded versions! Might have something to do with the thread priorities, hmm.
LocoDelAssembly 02 May 2006, 14:33
Same speed as Kuemmel's code, but since you made some optimizations, maybe that compensates for the overhead caused by using two threads on a single-core computer.
Another test worth doing would be to stop writing to video memory; I think write combining may be a problem, especially with my core (Venice DH-E3), which has a bug there. Regards
PS: The bug: Revision Guide for AMD Athlon™ 64 and AMD Opteron™ Processors wrote:
madmatt 02 May 2006, 17:18
Hello everyone, here are my latest results [Celeron PIV 2.7GHz w/hyperthreading]:
r22 -> 221.4
f0dder v1 -> 90.1
f0dder v2 -> 98.704
f0dder v3 -> 237.045 (I'll have to remember that real time priority trick!)
Kuemmel -> 58.607
f0dder 02 May 2006, 17:21
madmatt, is the "f0dder v3" the FPU or the SSE2 version? My guess is SSE2, since you get such a high speed.
Kuemmel 02 May 2006, 17:47
Hi f0dder! Nice work! I can learn a lot from that!
I still think we can gain even more by rearranging the calculation, due to the nature of the fractal itself. I think that on some frames (almost all, unless totally symmetric) the thread for the first half now has to wait for the second half to finish calculating. That's more or less the case for all the frames I've chosen. So instead of splitting each frame into upper/lower halves (2 threads per frame), I propose putting the threading into the 'plot_y' loop:
screen line 0 : thread1
screen line 1 : thread2
wait for them to finish
screen line 2 : thread1
screen line 3 : thread2
wait for them to finish
...and so on... As two adjacent screen lines of the fractal are 'more' equal in computation time, this should boost the computation even more... provided these many more threads (300*2 lines = 600 threads per frame) don't create more administration overhead; that's what I don't know. I think it's really worth a try! I just might not have time this week to implement it...
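To make the balance argument concrete, here is a small C sketch (C rather than fasm, purely for readability) that compares the busiest thread's load under a contiguous band split and under scan-line interleaving. The per-row costs are made-up inputs supplied by the caller, not measurements from the benchmark, and both function names are hypothetical. The sketch only looks at load balance and ignores any per-line waiting.

```c
#include <assert.h>
#include <stddef.h>

/* Busiest thread's total cost when the rows are split into `threads`
   contiguous bands (the upper/lower-half scheme for threads == 2). */
long max_load_bands(const long *row_cost, size_t rows, size_t threads)
{
    assert(threads > 0);
    long worst = 0;
    for (size_t t = 0; t < threads; t++) {
        size_t begin = rows * t / threads;
        size_t end   = rows * (t + 1) / threads;
        long load = 0;
        for (size_t y = begin; y < end; y++)
            load += row_cost[y];
        if (load > worst)
            worst = load;
    }
    return worst;
}

/* Busiest thread's total cost when row y goes to thread y % threads
   (scan-line interleaving). */
long max_load_interleaved(const long *row_cost, size_t rows, size_t threads)
{
    assert(threads > 0);
    long worst = 0;
    for (size_t t = 0; t < threads; t++) {
        long load = 0;
        for (size_t y = t; y < rows; y += threads)
            load += row_cost[y];
        if (load > worst)
            worst = load;
    }
    return worst;
}
```

For eight rows with invented costs 1 to 8 (total 36, so a perfect split would be 18 per thread), the band split leaves one thread with 26 while interleaving leaves the busier thread with 20. The frame takes as long as its busiest thread, so the smaller maximum is the better one.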
f0dder 02 May 2006, 18:03
Yes, you could do scan-line interleaving - that might be worth a try. But you do NOT want too many "wait for them to finish" points, as that's a relatively expensive operation - either you need a user->kernel->usermode switch (exxxxpennnnsive), or you need a usermode spinlock (burns CPU cycles for just about nothing).
And you really don't want more threads than you have processors (or rather "logical processors", to cover SMP, multicore and multithreading in one term). The problem with more threads is that you'll get more context switches, and those are relatively expensive (although not *extremely* expensive as long as it's threads within a single process - switching to a thread in another process is more expensive because of pagetable issues).
One thing that might be worth it is to create the two worker threads only once, and re-use them. This does complicate the program logic, though, as you need to signal "start drawing frame", then wait for the frame to be drawn, etc.
EDIT: As it is now, I think the biggest gain will come from optimizing the actual drawing algorithm - the threading stuff is relatively solid, and schemes like "create threads once and reuse" require more complex logic and might not be faster in the long run. Scan-line interleave would be worth checking out, though.
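A rough C sketch of the bookkeeping behind this advice: how many thread creations and main-thread wait points each scheme discussed here needs per frame. It is only arithmetic under stated assumptions; the struct and function names are hypothetical, and it says nothing about how expensive a single creation or wait actually is on a given system.

```c
#include <assert.h>

/* Hypothetical per-frame bookkeeping for one threading scheme. */
typedef struct {
    int threads_created;  /* thread creation calls per frame */
    int waits;            /* points where the main thread waits per frame */
} frame_overhead;

/* One thread per band, created every frame and waited on once. */
frame_overhead overhead_bands(int threads)
{
    assert(threads > 0);
    frame_overhead o = { threads, 1 };
    return o;
}

/* One thread per scan line, `threads` lines at a time, with a wait
   after each group (the line-pair scheme for threads == 2). */
frame_overhead overhead_per_line(int height, int threads)
{
    assert(height >= 0 && threads > 0);
    frame_overhead o;
    o.threads_created = height;
    o.waits = (height + threads - 1) / threads;  /* round up */
    return o;
}

/* Workers created once at startup and reused: nothing is created
   per frame, and the main thread waits once for the frame to finish. */
frame_overhead overhead_reused_workers(void)
{
    frame_overhead o = { 0, 1 };
    return o;
}
```

For 600 lines and two threads, the per-line scheme comes to 600 creations and 300 waits per frame, against 2 creations and 1 wait for the half split. Interleaving the lines inside two long-lived threads keeps the better balance without the extra waits.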
Copyright © 1999-2024, Tomasz Grysztar.