flat assembler
Message board for the users of flat assembler.

Index > Windows > Mandelbrot Benchmark FPU/SSE2 released

Kuemmel
Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Some more timings from my single cores with your Version 3:

Athlon 1000 MHz:
KMB (original version) FPU: 65.324
fOdder3fpu: 65.711

Sempron 1800 MHz:
KMB (original version) FPU: 117.865
KMB (original version) SSE: 161.422
fOdder3fpu: 118.299
fOdder3SSE: 161.031

So there is no real difference on a single core! The extra threading code doesn't affect the results, which is nice!
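To put a number on "no real difference", here is a minimal Python sketch that computes the relative change between the original and the threaded build. The figures are copied from the timings above, with the decimal commas read as decimal points; the pairing of the rows and the helper name `relative_change` are assumptions made for illustration.

```python
# Benchmark scores copied from the post (original build, threaded build).
# The pairing of the rows is an assumption based on the labels.
timings = {
    "Athlon 1000 MHz, FPU": (65.324, 65.711),
    "Sempron 1800 MHz, FPU": (117.865, 118.299),
    "Sempron 1800 MHz, SSE": (161.422, 161.031),
}


def relative_change(original, threaded):
    """Return (threaded - original) / original as a fraction."""
    return (threaded - original) / original


for label, (original, threaded) in timings.items():
    change = relative_change(original, threaded) * 100
    print(f"{label}: {change:+.2f} %")
```

All three differences come out below one percent, which is probably within normal run-to-run variation.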

@madmatt: I wonder about your results. Do the threading or the other improvements only push the Intel cores forward? Maybe it's the Hyper-Threading? I'm getting confused...
Post 02 May 2006, 20:36

GMAN
Joined: 02 May 2006
Posts: 5
Location: UK
In my humble opinion the best route is:
a) separate threads for adjacent pixels (with the number of processors, and hence threads, determined at run time?)
b) I think LocoDelAssembly hints at using offscreen drawing and a DirectX BitBlit, for instance, to transfer the image to the screen in one large spoonful. There's the Russian chap, Peter Kankowski, who built an SSE/SSE2 engine which allows screen manipulation and zooming in real time (code available). Link: http://www.codeproject.com/cpp/fractalssse.asp
c) There are numerous ways of building the algorithm which reduce the amount of brute-force processing required; in my time I've tinkered with both statistical and edge-finder methods for fractal drawing, but the best of all is a derivative of the statistical method which involves a simple assumption about the behaviour of exponentiated numbers. More to follow... ;)
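Point a) asks whether the thread count can be decided at run time. A minimal Python sketch of that idea follows; the helper name `worker_count` and its `limit` parameter are hypothetical, and the benchmark itself would have to query the operating system from assembly instead.

```python
import os


def worker_count(limit=None):
    """Pick a thread count from the number of logical processors."""
    detected = os.cpu_count() or 1  # os.cpu_count() may return None
    if limit is None:
        return detected
    return max(1, min(detected, limit))


print("threads to start:", worker_count())
```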
Post 02 May 2006, 22:33

madmatt
Joined: 07 Oct 2003
Posts: 1045
Location: Michigan, USA
f0dder: looks like the sse2 version.

Quote:
Kuemmel:
@madmatt...I wonder about your results...does the threading or the other improvements only push the Intel cores forward ? May be the hyper threading ? Getting confused...


First, I forgot to mention that I have only a single CPU core (Celeron P4) running at 2.7 GHz, with 1 GB of DDR SDRAM (266 MHz, I think). I'm not sure whether it's a better algorithm, Hyper-Threading, improved Celeron performance, better CPU -> video memory access, or some combination. When I ran your original code I got SSE2 -> 223.737, not too far from f0dder's results.
Post 03 May 2006, 01:30

Kuemmel
GMAN wrote:
In my humble opinion the best route is:
a) separate threads for adjacent pixels (with the number of processors, and hence threads, determined at run time?)
b) I think LocoDelAssembly hints at using offscreen drawing and a DirectX BitBlit, for instance, to transfer the image to the screen in one large spoonful. There's the Russian chap, Peter Kankowski, who built an SSE/SSE2 engine which allows screen manipulation and zooming in real time (code available). Link: http://www.codeproject.com/cpp/fractalssse.asp
c) There are numerous ways of building the algorithm which reduce the amount of brute-force processing required; in my time I've tinkered with both statistical and edge-finder methods for fractal drawing, but the best of all is a derivative of the statistical method which involves a simple assumption about the behaviour of exponentiated numbers. More to follow... ;)

Hi GMAN, about

b) Actually, I'm already using Peter Kankowski's algorithm; it's his MOV-ADD version, and I refer to it in my credits! He also has other versions of the code that gain a few percent on the Pentium M/Core architecture.
About offscreen drawing: I don't expect much from it. At least on my AMDs, the penalty for drawing the fractal is very low, less than 1 or 2 percent of the total computation time.

c) That's true, of course. On my old Acorn RISC machine I implemented a boundary tracing algorithm that checks squares and fills them if they are 'black' (= maximum iterations). But here my main focus was simply on comparing the FPUs of current and past AMD and Intel CPUs. For a true Mandelbrot app there's a lot of optimization to think about, of course.
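The square-checking idea in c) can be sketched roughly as follows. This is a simplified fixed-grid illustration in Python, not the original Acorn code: the function names, the `block` parameter and the `pixel_to_c` mapping are all hypothetical, and the shortcut is a heuristic that can miss fine detail inside a square at coarse resolutions.

```python
def escape_time(c, max_iter):
    """Iterations until |z| > 2 for z -> z*z + c, capped at max_iter."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter


def render_with_square_fill(width, height, block, max_iter, pixel_to_c):
    """Render block by block and skip the interior of all-'black' squares.

    Returns (image, evaluated), where evaluated counts escape_time calls.
    """
    image = [[None] * width for _ in range(height)]
    evaluated = 0

    def pixel(x, y):
        nonlocal evaluated
        if image[y][x] is None:
            image[y][x] = escape_time(pixel_to_c(x, y), max_iter)
            evaluated += 1
        return image[y][x]

    for y0 in range(0, height, block):
        for x0 in range(0, width, block):
            x1 = min(x0 + block, width) - 1
            y1 = min(y0 + block, height) - 1
            # Top/bottom rows and left/right columns of this square.
            border = [(x, y) for x in range(x0, x1 + 1) for y in (y0, y1)]
            border += [(x, y) for y in range(y0, y1 + 1) for x in (x0, x1)]
            black = all(pixel(x, y) == max_iter for x, y in border)
            for y in range(y0, y1 + 1):
                for x in range(x0, x1 + 1):
                    if black:
                        image[y][x] = max_iter  # fill without iterating
                    else:
                        pixel(x, y)  # compute whatever is still missing
    return image, evaluated
```

For a view that lies entirely inside the set, only the border pixels of each square are iterated, which is where the savings come from; for a benchmark that is meant to stress the FPU, skipping work like this would of course defeat the purpose.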
Post 03 May 2006, 05:35

LocoDelAssembly
Joined: 06 May 2005
Posts: 4633
Location: Argentina
Code:
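;     mov     ebx,[ddsd.lpSurface]                ;get address to surface
     ; render into a 16 MB buffer in system memory instead of the DirectDraw surface
     invoke  VirtualAlloc, NULL, $1000000, MEM_COMMIT, PAGE_READWRITE
     mov     ebx, eax                            ; ebx = base address of the buffer

Even with that I get 176 like always :(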
Post 03 May 2006, 12:08

LocoDelAssembly
I get up to 182 with this "blind" version
Post 03 May 2006, 13:33

GMAN
Kuemmel wrote:

Hi GMAN, about b) Actually I'm using Peter Kankowskis algorithm already
Derr! (GMAN slaps forehead.) His version only uses 64 iterations out of the box. When I upped it to 512 it was a lot slower.
Quote:
About offscreen drawing...I don't expect much of it, at least on my AMD's the spent penalty for drawing the fractal overall is very low, less than 1 or 2 percent of total time of computation.
Hmm. The blind SSE version scored exactly the same on my rig
Quote:

c) That's true of course. On my old Acorn Risc Machine I implemented a boundary tracing algorithm, like checking squares and filling them if they are 'black' (=maximum iterations)...but here my main focus was on simply checking out the different FPU's of AMD and INTEL cpu's of current and the past...for a true mandelbrot app there's really a lot of optimization to think about, of course.
Ah, the good old days. Mine was on an Atari with HiSoft BASIC. Fair enough point, Kümmel.

If I ever manage to get the hang of writing assembly I'll post some of my brainfruit for everyone's delectation. I have one written in VB with a VC6 DLL that uses arbitrary powers of 'n' for |z|=|z^n|+|c|. It's incredibly slow on the FPU, though it does produce fantastic animations. First job: I'll try to modify that to use SSE and see how it performs.
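For readers unfamiliar with the arbitrary-power variant, here is a minimal Python sketch. It reads the formula above as the usual iteration z -> z^n + c, which is an assumption; the function name, the default of 512 iterations and the bailout radius of 2 (adequate for n >= 2) are illustrative choices, not taken from the VB/VC6 program.

```python
def escape_time_power(c, n, max_iter=512, bailout=2.0):
    """Escape time for z -> z**n + c, starting from z = 0.

    n may be any real power >= 2; non-integer n uses the principal
    branch of the complex power, as Python's ** operator does.
    """
    z = 0j
    for i in range(max_iter):
        z = z ** n + c
        if abs(z) > bailout:
            return i
    return max_iter


# n = 2 reproduces the classic Mandelbrot iteration.
print(escape_time_power(-0.75 + 0.1j, 2))
print(escape_time_power(-0.75 + 0.1j, 3.5))
```

The general complex power is much more expensive than the multiply-and-add of the n = 2 case, which fits the observation that the arbitrary-power version is slow.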
Post 03 May 2006, 15:21

Kuemmel
Hi f0dder, or anybody else with a dual core...

Just to prove myself wrong (or right ;)), I modified f0dder's code so that I can decide how many lines each thread calculates before it waits for the other. You can find it at
www.mikusite.de/x86/KMB_lines_test.zip

There are 4 versions (SSE2) to test:
1 line per thread
5 lines per thread
30 lines per thread
300 lines per thread (this should be the same as f0dder's original)

You can experiment with other values yourself. Just change the value 'LINES' at the beginning to any other number. The number must divide 600 evenly.
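The constraint on 'LINES' can be illustrated with a small Python sketch that cuts a 600-line frame into bands and hands them out alternately to two threads. The alternating assignment is an assumption about how the test version distributes work, and the names `split_into_bands` and `assign_round_robin` are hypothetical.

```python
HEIGHT = 600  # frame height implied by "divide 600 by it"


def split_into_bands(lines, height=HEIGHT):
    """Return (first_line, end_line) pairs, end exclusive, of `lines` rows each."""
    if lines <= 0 or height % lines != 0:
        raise ValueError(f"{lines} does not divide {height} evenly")
    return [(start, start + lines) for start in range(0, height, lines)]


def assign_round_robin(bands, threads=2):
    """Give band 0 to thread 0, band 1 to thread 1, band 2 to thread 0, ..."""
    return [bands[t::threads] for t in range(threads)]


for lines in (1, 5, 30, 300):
    per_thread = assign_round_robin(split_into_bands(lines))
    print(lines, [len(bands) for bands in per_thread])
```

With 300 lines per band each thread gets exactly one half of the frame, which matches the remark that this setting should behave like the original two-thread version.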

Can you post some timings for them on your dual cores?

On my single core I can see a slowdown for 1 and 5 lines due to the threading overhead, but I still hope the calculation benefits overall on a dual core, due to different waiting times for the threads on particular non-symmetric Mandelbrot frames... but I could be wrong, too.
Post 03 May 2006, 19:35

madmatt
Do any of these tests draw in system memory and then copy to DirectDraw video memory? I wonder how this would affect the timings.
Post 03 May 2006, 19:56

f0dder
Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
Interesting... and doh, I hadn't thought about the asymmetric nature of the sets.

Code:
1line:   350fps
5line:   375fps
10line:  375fps
30line:  357fps
300line: 360fps
    


Hm. So there's an advantage in having a finer "granularity" for rendering. 5-10 lines seem good - you don't want to go lower anyway because there'll be too much overhead per line then. Having to create/destroy threads all the time isn't nice, though.
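Why finer granularity helps on a lopsided frame can be shown with a toy model in Python. The per-line costs below are synthetic (row y costs y units), not measured from the benchmark, and the model deliberately ignores the per-band threading overhead mentioned above; `makespan` is a hypothetical helper name.

```python
def makespan(line_costs, lines, threads=2):
    """Total cost of the busiest thread when bands of `lines` rows alternate."""
    totals = [0] * threads
    for band, start in enumerate(range(0, len(line_costs), lines)):
        totals[band % threads] += sum(line_costs[start:start + lines])
    return max(totals)


# Synthetic, deliberately asymmetric frame: row y costs y units of work.
costs = list(range(600))
ideal = sum(costs) / 2  # perfect split between two threads

for lines in (1, 5, 30, 300):
    ratio = makespan(costs, lines) / ideal
    print(f"{lines:3d} lines per band: {ratio:.3f} x ideal")
```

In this model the 300-line split leaves one thread with far more work than the other, while small bands come close to the ideal. Real timings also pay a fixed cost per band, which would explain why 1 line per band measures slower than 5 to 10.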

It should be possible to set up one thread for each CPU (or stick with 2 threads as now), and still have the main code only create threads once and do the WaitForXxx. The threads would then handle the scan-line interleaving manually.

We could probably use "lock xadd" to get-and-update the current line draw-offset in an atomic way, without having to use more-expensive user->kernel->user transitions and synchronization objects.
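The scheme described above (threads created once, each grabbing the next chunk of scanlines from a shared counter) can be modelled in Python. Python has no equivalent of "lock xadd", so a lock-protected counter stands in for the atomic fetch-and-add; the class and function names are hypothetical, and this only shows the control flow, not the performance.

```python
import threading


class LineCounter:
    """Hands out scanline numbers; models an atomic fetch-and-add."""

    def __init__(self):
        self._next = 0
        self._lock = threading.Lock()

    def fetch_add(self, step):
        """Return the old value and advance the counter by `step`."""
        with self._lock:
            value = self._next
            self._next += step
            return value


def render(height, step, threads, draw_line):
    """Start `threads` workers once; each pulls chunks until none remain."""
    counter = LineCounter()

    def worker():
        while True:
            first = counter.fetch_add(step)
            if first >= height:  # '>=': late threads overshoot the end
                return
            for y in range(first, min(first + step, height)):
                draw_line(y)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()


drawn = []
render(600, 5, 2, drawn.append)
print(len(drawn), "lines drawn")
```

Because every thread asks for more work only when it has finished its chunk, a thread stuck on expensive lines simply takes fewer chunks, and no per-chunk thread creation or kernel synchronization is needed.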
Post 03 May 2006, 19:57

LocoDelAssembly
f0dder, do you mean this?

Code:
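threadLoop:
  mov   eax, 1
  lock xadd [scanline], eax     ; eax = old [scanline]; [scanline] += 1, atomically
  cmp   eax, MAX_SCANLINE+1
  jae   exitThreadLoop          ; unsigned >=: with several threads the counter overshoots,
                                ; so a test for equality alone could be skipped
; Work on the scanline that EAX says

  jmp   threadLoop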
Post 03 May 2006, 20:21

f0dder
No, Loco, check this upload :) (but it was close).

Performance when configured for 5-line interleave:
Code:
fpu:  285 fps
sse2: 395 fps
    


EDIT: whoa, it just occurred to me that this is pretty close to perfect parallelism! The singlethread version gave me 142/194 for fpu/sse2. My v5 code gives 285/395fps with two threads and 5-line interleave. That's very close to 2x original speed.
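The "very close to 2x" claim can be checked with a few lines of Python, using only the four figures quoted above. The helper names are hypothetical, and the figures are rounded frame rates, so the last digits should not be over-interpreted.

```python
def speedup(single_thread_fps, multi_thread_fps):
    """How many times faster the multi-threaded run is."""
    return multi_thread_fps / single_thread_fps


def efficiency(single_thread_fps, multi_thread_fps, threads):
    """Speedup divided by thread count; 1.0 means perfect scaling."""
    return speedup(single_thread_fps, multi_thread_fps) / threads


results = {"fpu": (142, 285), "sse2": (194, 395)}  # (single thread, v5)

for name, (single, multi) in results.items():
    s = speedup(single, multi)
    e = efficiency(single, multi, threads=2)
    print(f"{name}: speedup {s:.3f}, efficiency {e:.3f}")
```

Both ratios land marginally above 2, which is most likely rounding or run-to-run noise rather than genuinely superlinear scaling.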


Attachment: f0dder_kmb_5.zip (10.94 KB)
Post 03 May 2006, 20:33

LocoDelAssembly
178-179 on my single core, nice :D

I want a dual-core :(
Post 03 May 2006, 20:56

Kuemmel
f0dder wrote:
EDIT: whoa, it just occurred to me that this is pretty close to perfect parallelism! The singlethread version gave me 142/194 for fpu/sse2. My v5 code gives 285/395fps with two threads and 5-line interleave. That's very close to 2x original speed.

Great!!! I didn't even expect it to reach 200%; before, it was about 180%, if I'm right. Perfect... joint forces and ideas seem to work ;). I also checked it on my single core: no penalty at 5 lines!

Would you mind if I make a new public release of the KMB (with credits, of course) on my webpage, based on your latest version? Or do you still have other ideas in mind?

P.S.: Now I've got to look up the 'lock' command... never heard of it! Man, it was so easy to have a simple RISC instruction set on my old StrongARM Acorn Risc PC ;)
Post 03 May 2006, 21:21

f0dder
I don't have any other ideas right now. The routines could probably be cleaned up though, go through local variables and register usage. Would also be nice if a little more "decoupling" was done, so more code could be reused in the threads (even if only as a macro) - I mean, once I had fixed up the SSE2 version, I had to make the exact same changes to the FPU version.

In the future, some runtime control of amount of threads, line interleave value etc. might be interesting, but that's more work.

And sure, a public release is just fine :)
Post 03 May 2006, 21:26

GMAN
OK. Latest fractal benchmark test.

SSE2 211.843
FPU 88.318

The measurement seems to be very consistent from run to run
Post 10 May 2006, 02:49

f0dder
Quote:

The measurement seems to be very consistent from run to run


They really should be, since the process/thread priorities are boosted way up so that other programs don't interfere too much with the results. This is okay for short-running tests, but shouldn't be done in "real-world" code, btw :)
Post 10 May 2006, 12:12

Kuemmel
The most interesting result at the moment, besides the almost perfect parallelisation, is the one from Madis731 with his single-core P4 Prescott with Hyper-Threading:

New KMB 0.5 MT:
174.009 / 321.134
Old KMB 0.3:
87.704 / 212.521

The MT version doubled the (very bad) FPU performance and multiplied the SSE2 performance by 1.5! Is it really the Hyper-Threading!? (I read the specs on Wikipedia, and it says the pipeline was also lengthened from 20 to 30 stages.) Amazing... I wonder what a dual-core Pentium Extreme with Hyper-Threading can do. Per MHz, the FPU values above are still worse than AMD/P3, but SSE2 is now 20% better.
Until now, this effect hasn't shown up on any P4/Celeron without Hyper-Threading...

I will include his results on my homepage later in the evening.
Post 10 May 2006, 14:42

f0dder
Hm, that is very interesting indeed, since HyperThreading can sometimes make performance *worse* when enabled. I find it very interesting in this situation because the two threads are *both* doing floating-point stuff; the best situation for HT (as far as I have understood), a bit simplified, is one thread doing FPU and another doing Integer.
Post 10 May 2006, 14:46

Kuemmel
...and it even seems to have multiplied the performance so cleanly: about *2.0 for FPU and *1.5 for SSE2. Maybe it can somehow fill the stalled pipeline with the other thread (just speculating...)!?

@Madis731: maybe you can post some results for KMB 0.5 MT with Hyper-Threading enabled and disabled? (I don't know if it's possible to switch it off, but I think I heard it is...)
Post 10 May 2006, 15:05