flat assembler
Message board for the users of flat assembler.
Mandelbrot Benchmark FPU/SSE2 released
GMAN 02 May 2006, 22:33
In my humble opinion the best route is:
a) Separate threads for adjacent pixels (number of processors, and hence threads, to be determined at run time?).
b) I think LocoDelAssembly hints at using offscreen drawing and DirectX BitBlit, for instance, to transfer the image to the screen in one large spoonful. There's the Russian chap, Peter Kankowski, who built an SSE/SSE2 engine which allows screen manipulation and zooming in real time (code available). Link: http://www.codeproject.com/cpp/fractalssse.asp
c) There are numerous ways of building the algorithm which reduce the amount of brute-force processing required; in my time I've tinkered with both statistical and edge-finder methods for fractal drawing, but the best of all is a derivative of the statistical method which involves a simple assumption about the behaviour of exponentiated numbers.
More to follow...
madmatt 03 May 2006, 01:30
f0dder: looks like the SSE2 version.
Kuemmel: First, I forgot to mention that I have only a single CPU core (Celeron PIV) running at 2.7 GHz, with 1 GB of DDR SDRAM (266 MHz, I think). I'm not sure whether it's a better algorithm, hyperthreading, improved Celeron performance, better CPU -> video memory access, or a combination. When I ran your original code I got SSE2 -> 223.737, not too far from f0dder's results.
Kuemmel 03 May 2006, 05:35
GMAN wrote: In my humble opinion the best route is
Hi GMAN,
about b) Actually I'm already using Peter Kankowski's algorithm; it's his MOV-ADD version, and I refer to it in my credits! He also has other versions of the code that gain a small percentage on the Pentium-M/Core architecture. About offscreen drawing... I don't expect much from it; at least on my AMDs the penalty for drawing the fractal is very low, less than 1 or 2 percent of the total computation time.
c) That's true of course. On my old Acorn Risc Machine I implemented a boundary tracing algorithm, like checking squares and filling them if they are 'black' (= maximum iterations)... but here my main focus was simply on checking out the different FPUs of current and past AMD and Intel CPUs... for a true Mandelbrot app there's really a lot of optimization to think about, of course.
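The square-checking idea mentioned in c) can be sketched roughly as follows. This is not Kuemmel's Acorn code, which isn't shown in the thread; it is a minimal C illustration that assumes only the border of a square is tested, and the names `mandel_iter` and `square_border_is_black` are made up for this example.

```c
#include <stdbool.h>

/* Escape-time iteration count for c = cr + i*ci, capped at max_iter. */
int mandel_iter(double cr, double ci, int max_iter)
{
    double zr = 0.0, zi = 0.0;
    for (int i = 0; i < max_iter; i++) {
        double zr2 = zr * zr, zi2 = zi * zi;
        if (zr2 + zi2 > 4.0)
            return i;                 /* escaped: the point is not 'black' */
        zi = 2.0 * zr * zi + ci;      /* uses the old zr and zi */
        zr = zr2 - zi2 + cr;
    }
    return max_iter;                  /* never escaped within the cap */
}

/* Returns true when every pixel on the border of an n x n square (top-left
   corner at x0,y0, pixel spacing `step`) reaches max_iter. The assumption of
   this sketch is that such a square can then be filled without iterating
   its interior pixels. */
bool square_border_is_black(double x0, double y0, double step, int n, int max_iter)
{
    double x1 = x0 + (n - 1) * step;  /* right edge */
    double y1 = y0 + (n - 1) * step;  /* bottom edge */
    for (int k = 0; k < n; k++) {
        double x = x0 + k * step;
        double y = y0 + k * step;
        if (mandel_iter(x, y0, max_iter) != max_iter) return false;  /* top */
        if (mandel_iter(x, y1, max_iter) != max_iter) return false;  /* bottom */
        if (mandel_iter(x0, y, max_iter) != max_iter) return false;  /* left */
        if (mandel_iter(x1, y, max_iter) != max_iter) return false;  /* right */
    }
    return true;
}
```

For an n x n square this tests on the order of 4n pixels instead of n*n, which is where the saving would come from in the black regions. With a finite iteration cap the result is only an approximation of set membership.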
LocoDelAssembly 03 May 2006, 12:08
Code:
; mov ebx,[ddsd.lpSurface] ;get address to surface
invoke VirtualAlloc, NULL, $1000000, MEM_COMMIT, PAGE_READWRITE
mov ebx, eax
Even with that I get 176, like always.
LocoDelAssembly 03 May 2006, 13:33
I get up to 182 with this "blind" version
GMAN 03 May 2006, 15:21
Kuemmel wrote: About offscreen drawing...I don't expect much of it, at least on my AMD's the spent penalty for drawing the fractal overall is very low, less than 1 or 2 percent of total time of computation.
If I ever manage to get the hang of assembly writing I'll post some of my brainfruit for everyone's delectation. I have one written in VB with a VC6 DLL that uses arbitrary powers of 'n' for |z|=|z^n|+|c|. It's incredibly slow in FPU, though it does produce fantastic animations. First job: I'll try to modify that to use SSE and see how it performs.
Kuemmel 03 May 2006, 19:35
Hi f0dder or anybody else with a dual core...
just to prove myself wrong (or right) I modified f0dder's code so that I can decide how many lines each thread calculates before it waits for the other. You can find it at www.mikusite.de/x86/KMB_lines_test.zip
There are 4 versions (SSE2) to test:
1 line per thread
5 lines per thread
30 lines per thread
300 lines per thread (this should be the same as f0dder's original)
You can experiment yourself with other values. Just change the value 'LINES' at the beginning to any other number. The number must divide 600 evenly. Can you post some timings for them on your dual cores? On my single core I can see a slowdown for 1 and 5 lines due to the threading overhead, but I still hope the calculation overall benefits on a dual core, because the threads spend less time waiting for each other on non-symmetric Mandelbrot frames... but I could be wrong, too...
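Why the LINES value might matter on a dual core can be illustrated with a small cost model. The C sketch below is not taken from KMB_lines_test; it assumes two threads that each compute `lines` rows and then wait for each other, and the per-row costs in `cost[]` are hypothetical numbers supplied by the caller.

```c
/* Modelled frame time for two threads working in rounds: thread 0 takes
   `lines` rows, thread 1 takes the next `lines` rows, then both wait for
   the slower one. cost[i] is a hypothetical cost of row i.
   Assumption of this model: n is a multiple of 2 * lines. */
double two_thread_frame_time(const double *cost, int n, int lines)
{
    double total = 0.0;
    for (int start = 0; start < n; start += 2 * lines) {
        double t0 = 0.0, t1 = 0.0;
        for (int i = 0; i < lines; i++) {
            t0 += cost[start + i];          /* rows of thread 0 */
            t1 += cost[start + lines + i];  /* rows of thread 1 */
        }
        total += (t0 > t1) ? t0 : t1;       /* a round ends when both are done */
    }
    return total;
}
```

With eight rows costing 1, 1, 1, 1, 3, 3, 3, 3, splitting the frame into two halves (`lines = 4`) gives a modelled time of 12, because one thread gets all the expensive rows, while `lines = 1` or `lines = 2` gives 8. The model ignores the per-round threading overhead, which is what penalises very small values in practice.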
madmatt 03 May 2006, 19:56
Do any of these tests draw in system memory and then copy to DDraw video memory? I wonder how this would affect the timings.
f0dder 03 May 2006, 19:57
Interesting... and doh, hadn't thought about the asymmetric nature of the sets.
Code:
1line: 350fps
5line: 375fps
10line: 375fps
30line: 357fps
300line: 360fps
Hm. So there's an advantage in having a finer "granularity" for rendering. 5-10 lines seem good - you don't want to go lower anyway because there'll be too much overhead per line then.
Having to create/destroy threads all the time isn't nice, though. It should be possible to set up one thread for each CPU (or stick with 2 threads as now), and still have the main code only create threads once and do the WaitForXxx. The threads would then handle the scan-line interleaving manually. We could probably use "lock xadd" to get-and-update the current line draw-offset in an atomic way, without having to use more-expensive user->kernel->user transitions and synchronization objects.
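The get-and-update idea can be sketched in C with a C11 atomic fetch-and-add, which x86 compilers typically implement with `lock xadd`. This is only an illustration, not the code from the later upload; `claim_lines`, `step` and `height` are hypothetical names.

```c
#include <stdatomic.h>

/* Atomically claims the next block of `step` scanlines from a shared
   counter. Returns the first line of the claimed block, or -1 once all
   `height` lines have been handed out. */
int claim_lines(atomic_int *next_line, int step, int height)
{
    /* Returns the old value and adds `step` in one indivisible operation,
       so no two threads can receive the same block. */
    int first = atomic_fetch_add(next_line, step);

    /* Compare with '<' rather than testing for equality: with several
       threads the counter can run past `height` by more than one step. */
    return (first < height) ? first : -1;
}
```

Each worker thread would call `claim_lines` in a loop, render lines `first` to `first + step - 1`, and exit when it gets -1, so the main code creates the threads once and waits for them once. This sketch assumes `height` is a multiple of `step`; otherwise the last block would need clamping.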
LocoDelAssembly 03 May 2006, 20:21
f0dder, do you mean this?
Code:
threadLoop:
mov eax, 1
lock xadd [scanline], eax
cmp eax, MAX_SCANLINE+1
je exitThreadLoop
; Work on the scanline that EAX says
jmp threadLoop
f0dder 03 May 2006, 20:33
No loco, check this upload (but it was close).
Performance when configured for 5-line interleave:
Code:
fpu: 285 fps
sse2: 395 fps
EDIT: whoa, it just occurred to me that this is pretty close to perfect parallelism! The single-thread version gave me 142/194 for fpu/sse2. My v5 code gives 285/395 fps with two threads and 5-line interleave. That's very close to 2x original speed.
_________________ - carpe noctem |
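The "close to 2x" estimate can be checked with a little arithmetic. The helper names below are made up for this example; the figures are the ones quoted in the post above.

```c
/* Speedup of a multi-threaded run relative to the single-threaded run. */
double speedup(double fps_multi, double fps_single)
{
    return fps_multi / fps_single;
}

/* Parallel efficiency: speedup divided by the number of threads
   (1.0 would be perfect scaling). */
double efficiency(double fps_multi, double fps_single, int threads)
{
    return speedup(fps_multi, fps_single) / threads;
}
```

With the quoted numbers, `speedup(285, 142)` is about 2.01 for the FPU version and `speedup(395, 194)` is about 2.04 for SSE2. Values slightly above 2 on two cores are presumably within run-to-run variation and rounding of the fps figures.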
LocoDelAssembly 03 May 2006, 20:56
178-179 on my single core, nice
I want a dual-core
Kuemmel 03 May 2006, 21:21
f0dder wrote: EDIT: whoa, it just occurred to me that this is pretty close to perfect parallelism! The single-thread version gave me 142/194 for fpu/sse2. My v5 code gives 285/395 fps with two threads and 5-line interleave. That's very close to 2x original speed.
Great!!! I didn't even expect that it would reach 200%; before it was around 180%, if I'm right... perfect... joint forces and ideas seem to work. I also checked it on my single core... no penalty at 5 lines! Would you mind if I make a new public release of the KMB (with credits of course) on my webpage, based on your last version... or do you still have other ideas in mind?
P.S.: Now I've got to look up the 'lock' command... never heard of that...! Man, it was so easy to have a simple RISC instruction set on my old StrongARM Acorn Risc PC.
f0dder 03 May 2006, 21:26
I don't have any other ideas right now. The routines could probably be cleaned up though, go through local variables and register usage. Would also be nice if a little more "decoupling" was done, so more code could be reused in the threads (even if only as a macro) - I mean, once I had fixed up the SSE2 version, I had to make the exact same changes to the FPU version.
In the future, some runtime control of the number of threads, line interleave value etc. might be interesting, but that's more work. And sure, public release is just fine.
GMAN 10 May 2006, 02:49
OK. Latest fractal benchmark test.
SSE2 211.843
FPU 88.318
The measurement seems to be very consistent from run to run.
f0dder 10 May 2006, 12:12
Quote: The measurement seems to be very consistent from run to run
They really should be, since the process/thread priorities are boosted way high so that other programs don't interfere too much with the results. This is okay for short-running tests, but shouldn't be done for "real-world" code, btw.
Kuemmel 10 May 2006, 14:42
The most interesting result at the moment, besides the almost perfect parallelisation, is the one from Madis731 with his single-core P4 Prescott with Hyper-Threading (FPU / SSE2):
New KMB 0.5 MT: 174.009 / 321.134
Old KMB 0.3: 87.704 / 212.521
The MT version doubled the (very bad) FPU performance and multiplied the SSE2 performance by 1.5! Is it really the Hyper-Threading!? (I read the specs on Wikipedia and it says there's also a longer pipeline, raised from 20 to 30 stages)... amazing... I wonder what a Pentium Extreme dual core with Hyper-Threading can do... per MHz the FPU values above are still worse than AMD/P3, but the SSE2 is now 20% better. Until now the effect hasn't shown up on any P4/Celeron without Hyper-Threading... I will include his results on my homepage later in the evening.
f0dder 10 May 2006, 14:46
Hm, that is very interesting indeed, since HyperThreading can sometimes make performance *worse* when enabled. I find it very interesting in this situation because the two threads are *both* doing floating-point stuff; the best situation for HT (as far as I have understood), a bit simplified, is one thread doing FPU and another doing Integer.
Kuemmel 10 May 2006, 15:05
...and it even seems to have multiplied the performance so cleanly... like *2.0 for FPU and *1.5 for SSE2... maybe it can somehow fill the stalled pipeline with the other thread (just speculating...)!?
@Madis731... maybe you can post some results with Hyper-Threading enabled and disabled with KMB 0.5 MT? (I don't know if it is possible to switch it off, but I thought I heard it is...)
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.