Mandelbrot Benchmark FPU/SSE2 released


Kuemmel

Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany

Kuemmel 02 May 2006, 20:36

Some more timings from my single cores with your Version 3:

Athlon 1000 MHz:
KMB (original version) FPU: 65,324
f0dder3fpu: 65,711

Sempron 1800 MHz:
KMB (original version) FPU: 117,865
KMB (original version) SSE: 161,422
f0dder3fpu: 118,299
f0dder3SSE: 161,031

...so no real difference on a single core! The extra threading code doesn't affect the results, which is nice!

@madmatt... I wonder about your results... do the threading or the other improvements only push the Intel cores forward? Maybe it's the Hyper-Threading? I'm getting confused...

02 May 2006, 20:36

GMAN

Joined: 02 May 2006
Posts: 5
Location: UK

GMAN 02 May 2006, 22:33

In my humble opinion the best route is
a) separate threads for adjacent pixels, (number of processors and hence threads to be determined at run time?)
b) I think locodelassembly hints at using offscreen drawing and DirectX BitBlit, for instance, to transfer the image to screen in one large spoonful. There's the Russian chap, Peter Kankowski, who built an SSE/SSE2 engine which allows screen manipulation and zooming in realtime. (code available) Link: http://www.codeproject.com/cpp/fractalssse.asp
c) There are numerous ways of building the algorithm which reduce the amount of brute force processing required; in my time I've tinkered with both statistical and edge finder methods for fractal drawing; but the best of all is a derivative of the statistical method which involves a simple assumption about the behaviour of exponentiated numbers. More to follow... ;)

02 May 2006, 22:33

madmatt

Joined: 07 Oct 2003
Posts: 1045
Location: Michigan, USA

madmatt 03 May 2006, 01:30

f0dder: looks like the sse2 version.

Quote:

Kuemmel:
@madmatt... I wonder about your results... do the threading or the other improvements only push the Intel cores forward? Maybe it's the Hyper-Threading? I'm getting confused...

First, I forgot to mention that I have only a single CPU core (Celeron P4) running at 2.7 GHz, with 1 GB of DDR SDRAM (266 MHz, I think). I'm not sure whether it's a better algorithm, Hyper-Threading, improved Celeron performance, better CPU -> video memory access, or some combination. When I ran your original code I got SSE2 -> 223.737, not too far from f0dder's results.

03 May 2006, 01:30

Kuemmel

Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany

Kuemmel 03 May 2006, 05:35

GMAN wrote:

In my humble opinion the best route is
a) separate threads for adjacent pixels, (number of processors and hence threads to be determined at run time?)
b) I think locodelassembly hints at using offscreen drawing and DirectX BitBlit, for instance, to transfer the image to screen in one large spoonful. There's the Russian chap, Peter Kankowski, who built an SSE/SSE2 engine which allows screen manipulation and zooming in realtime. (code available) Link: http://www.codeproject.com/cpp/fractalssse.asp
c) There are numerous ways of building the algorithm which reduce the amount of brute force processing required; in my time I've tinkered with both statistical and edge finder methods for fractal drawing; but the best of all is a derivative of the statistical method which involves a simple assumption about the behaviour of exponentiated numbers. More to follow...

Hi GMAN, about
b) Actually I'm already using Peter Kankowski's algorithm, his MOV-ADD version of it; I refer to it in my credits! He also has other versions of the code that gain a small percentage on the Pentium M/Core architecture.
About offscreen drawing... I don't expect much from it; at least on my AMDs the cost of drawing the fractal is very low, less than 1 or 2 percent of the total computation time.

c) That's true, of course. On my old Acorn Risc Machine I implemented a boundary tracing algorithm, i.e. checking squares and filling them if they are 'black' (= maximum iterations)... but here my main focus was simply on comparing the FPUs of current and past AMD and Intel CPUs... for a true Mandelbrot app there's really a lot of optimization to think about, of course.
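The square check mentioned in c) can be sketched roughly as follows. This is only an illustration in Python, not the Acorn code: it assumes that a square counts as 'black' when every pixel on its border reaches the maximum iteration count, and the names `escape_count`, `render` and `block` are made up for the example. Because the border is only sampled at pixel positions, the fill is a heuristic and can in principle miss a very thin feature.

```python
def escape_count(cx, cy, max_iter):
    """Iterations until |z| > 2 for z -> z*z + c, capped at max_iter."""
    zx = zy = 0.0
    for i in range(max_iter):
        if zx * zx + zy * zy > 4.0:
            return i
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return max_iter


def render(width, height, x0, y0, step, max_iter, block=8):
    """Render iteration counts, skipping the inside of all-black squares.

    Returns (image, evaluated), where evaluated is the number of pixels
    that actually went through the iteration loop.
    """
    image = [[None] * width for _ in range(height)]
    evaluated = 0

    def pixel(px, py):
        # Compute a pixel once and cache it in the image.
        nonlocal evaluated
        if image[py][px] is None:
            image[py][px] = escape_count(x0 + px * step, y0 + py * step, max_iter)
            evaluated += 1
        return image[py][px]

    for by in range(0, height, block):
        for bx in range(0, width, block):
            x1 = min(bx + block, width) - 1
            y1 = min(by + block, height) - 1
            # Top and bottom rows, then left and right columns of the square.
            border = [(x, y) for x in range(bx, x1 + 1) for y in (by, y1)]
            border += [(x, y) for y in range(by, y1 + 1) for x in (bx, x1)]
            # A list (not a generator) so that every border pixel gets computed.
            all_black = all([pixel(x, y) == max_iter for x, y in border])
            for y in range(by + 1, y1):
                for x in range(bx + 1, x1):
                    if all_black:
                        image[y][x] = max_iter   # fill without iterating
                    else:
                        pixel(x, y)
    return image, evaluated


image, evaluated = render(64, 48, -2.0, -1.2, 0.05, 64)
print(f"iterated {evaluated} of {64 * 48} pixels")
```

For a benchmark this kind of shortcut would defeat the purpose, since the point is to keep the FPU/SSE2 units busy; it only pays off in a real viewer.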

03 May 2006, 05:35

LocoDelAssembly
Your code has a bug

Joined: 06 May 2005
Posts: 4623
Location: Argentina

LocoDelAssembly 03 May 2006, 12:08

Code:

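;     mov     ebx,[ddsd.lpSurface]                ;get address to surface
     ; render into a private 16 MB buffer instead of the DirectDraw surface
     invoke  VirtualAlloc, NULL, $1000000, MEM_COMMIT, PAGE_READWRITE
     mov     ebx, eax                            ;EBX = base address of the buffer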

Even with that I get 176, like always :(

03 May 2006, 12:08

LocoDelAssembly
Your code has a bug

Joined: 06 May 2005
Posts: 4623
Location: Argentina

LocoDelAssembly 03 May 2006, 13:33

I get up to 182 with this "blind" version

03 May 2006, 13:33

GMAN

Joined: 02 May 2006
Posts: 5
Location: UK

GMAN 03 May 2006, 15:21

Kuemmel wrote:

Hi GMAN, about b) Actually I'm already using Peter Kankowski's algorithm

Derr! (GMan slaps forehead) His version only uses 64 iterations out of the box. When I upped it to 512 it was a lot slower.

Quote:

About offscreen drawing... I don't expect much from it; at least on my AMDs the cost of drawing the fractal is very low, less than 1 or 2 percent of the total computation time.

Hmm. The blind SSE version scored exactly the same on my rig

Quote:

c) That's true, of course. On my old Acorn Risc Machine I implemented a boundary tracing algorithm, i.e. checking squares and filling them if they are 'black' (= maximum iterations)... but here my main focus was simply on comparing the FPUs of current and past AMD and Intel CPUs... for a true Mandelbrot app there's really a lot of optimization to think about, of course.

Ah the good old days. Mine was on an Atari with Hisoft Basic. Fair enough point, Kümmel.

If I ever manage to get the hang of assembly writing I'll post some of my brainfruit for everyone's delectation. I have one written in VB with a VC6 DLL that uses arbitrary powers of 'n' for |z|=|z^n|+|c|. It's incredibly slow on the FPU, though it does produce fantastic animations. First job: I'll try to modify that to use SSE and see how it performs.

03 May 2006, 15:21

Kuemmel

Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany

Kuemmel 03 May 2006, 19:35

Hi f0dder or anybody else with a dual core...

just to prove myself wrong (or right ;) ), I modified f0dder's code so that I can choose how many lines each thread calculates before it waits for the other. You can find it at
www.mikusite.de/x86/KMB_lines_test.zip

There are 4 versions (SSE2) to test:
1 line per thread
5 lines per thread
30 lines per thread
300 lines per thread (this should be the same as f0dder's original)

You can experiment with other values yourself. Just change the value 'LINES' at the beginning to any other number. The number must divide 600 evenly.

Can you post some timings for them on your dual cores?

On my single core I can see a slowdown for 1 and 5 lines due to the threading overhead, but I still hope the calculation benefits overall on a dual core, because the threads spend less time waiting for each other on non-symmetric Mandelbrot frames... but I could be wrong too...
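The reasoning about waiting times can be illustrated with a toy model in Python. It is only a sketch under assumptions: each thread takes one chunk of `lines_per_chunk` lines per round and both wait for the slower one before the next round starts, the per-line cost `1 + y` is an invented, deliberately lopsided frame, and `frame_time` and `overhead_per_round` are hypothetical names. Real line costs depend on the frame being rendered.

```python
def frame_time(line_costs, lines_per_chunk, threads=2, overhead_per_round=0.0):
    """Time for one frame when the threads sync after every chunk.

    Each round hands one chunk to each thread; the round lasts as long
    as its most expensive chunk, plus a fixed synchronisation overhead.
    """
    n = len(line_costs)
    if n % lines_per_chunk != 0:
        raise ValueError("lines_per_chunk must divide the number of lines")
    chunks = [sum(line_costs[i:i + lines_per_chunk])
              for i in range(0, n, lines_per_chunk)]
    total = 0.0
    for r in range(0, len(chunks), threads):
        total += max(chunks[r:r + threads]) + overhead_per_round
    return total


# Invented asymmetric frame: lines get more expensive towards the bottom.
costs = [1 + y for y in range(600)]
ideal = sum(costs) / 2          # perfect split between two threads

for lines in (1, 5, 30, 300):
    t = frame_time(costs, lines)
    print(f"{lines:3d} lines per chunk: {t:9.0f}  (ideal {ideal:.0f})")
```

In this model the 300-line split is bounded by the expensive half of the frame, while small chunks stay close to the ideal. Setting `overhead_per_round` to a non-zero value models the threading overhead and pushes the best chunk size back up from 1.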

03 May 2006, 19:35

madmatt

Joined: 07 Oct 2003
Posts: 1045
Location: Michigan, USA

madmatt 03 May 2006, 19:56

Do any of these tests draw in system memory and then copy to DirectDraw video memory? I wonder how this would affect the timings.

03 May 2006, 19:56

f0dder

Joined: 19 Feb 2004
Posts: 3174
Location: Denmark

f0dder 03 May 2006, 19:57

Interesting... and doh, hadn't thought about the asymmetric nature of the sets.

Code:

1line:   350fps
5line:   375fps
10line:  375fps
30line:  357fps
300line: 360fps

Hm. So there's an advantage in having a finer "granularity" for rendering. 5-10 lines seem good - you don't want to go lower anyway because there'll be too much overhead per line then. Having to create/destroy threads all the time isn't nice, though.

It should be possible to set up one thread for each CPU (or stick with 2 threads as now), and still have the main code only create threads once and do the WaitForXxx. The threads would then handle the scan-line interleaving manually.

We could probably use "lock xadd" to get-and-update the current line draw-offset in an atomic way, without having to use more-expensive user->kernel->user transitions and synchronization objects.
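The get-and-update idea can be sketched in Python to show the control flow. Python has no `lock xadd`, so a `threading.Lock` stands in for the atomic instruction here, which is exactly the kind of heavier synchronisation the assembly version would avoid. `LineCounter`, `render_frame` and `draw_line` are hypothetical names for this illustration.

```python
import threading


class LineCounter:
    """Hands out scanline offsets; the lock stands in for `lock xadd`."""

    def __init__(self, step):
        self._next = 0
        self._step = step
        self._lock = threading.Lock()

    def fetch_add(self):
        # Return the old value and advance by step, as one indivisible action.
        with self._lock:
            old = self._next
            self._next += self._step
            return old


def render_frame(height, step, thread_count, draw_line):
    """Start the workers once; each one pulls blocks of lines until none are left."""
    counter = LineCounter(step)

    def worker():
        while True:
            first = counter.fetch_add()
            # >= rather than ==: every worker overshoots the limit once.
            if first >= height:
                return
            for y in range(first, min(first + step, height)):
                draw_line(y)

    workers = [threading.Thread(target=worker) for _ in range(thread_count)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


drawn = []
render_frame(600, 5, 2, drawn.append)   # 600 lines, 5-line blocks, 2 threads
print(len(drawn), "lines drawn")
```

Because a fast thread simply fetches the next block, nobody waits at a per-block barrier, and the threads only need to be created once per run.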

03 May 2006, 19:57

LocoDelAssembly
Your code has a bug

Joined: 06 May 2005
Posts: 4623
Location: Argentina

LocoDelAssembly 03 May 2006, 20:21

f0dder, do you mean this?

Code:

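threadLoop:
  mov     eax, 1
  lock xadd [scanline], eax    ; EAX = old value of [scanline], [scanline] += 1 (atomic)
  cmp     eax, MAX_SCANLINE
  ja      exitThreadLoop       ; ja, not je: with several threads each one overshoots the limit

; Work on the scanline that EAX says

  jmp     threadLoop

exitThreadLoop: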

03 May 2006, 20:21

f0dder

Joined: 19 Feb 2004
Posts: 3174
Location: Denmark

f0dder 03 May 2006, 20:33

No loco, check this upload :) (but it was close).

Performance when configured for 5-line interleave:

Code:

fpu:  285 fps
sse2: 395 fps

EDIT: whoa, it just occurred to me that this is pretty close to perfect parallelism! The single-threaded version gave me 142/194 for fpu/sse2. My v5 code gives 285/395fps with two threads and 5-line interleave. That's very close to 2x original speed.
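As a quick check of the "close to 2x" remark, the quoted frame rates can be put through a few lines of Python. Only the four fps figures come from the post; the `speedup` helper is made up for the example.

```python
# Frame rates quoted in the post (fps).
single = {"fpu": 142.0, "sse2": 194.0}   # single-threaded version
dual = {"fpu": 285.0, "sse2": 395.0}     # v5, two threads, 5-line interleave
threads = 2


def speedup(before, after):
    """How many times faster the new result is."""
    return after / before


for name in single:
    s = speedup(single[name], dual[name])
    print(f"{name}: {s:.2f}x speedup, {s / threads:.0%} parallel efficiency")
```

Both ratios land marginally above 2.0, which presumably just reflects rounding and run-to-run variation in the measured frame rates.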

Attachment: f0dder_kmb_5.zip (10.94 KB)

_________________
carpe noctem

03 May 2006, 20:33

LocoDelAssembly
Your code has a bug

Joined: 06 May 2005
Posts: 4623
Location: Argentina

LocoDelAssembly 03 May 2006, 20:56

178-179 on my single core, nice :D

I want a dual-core :(

03 May 2006, 20:56

Kuemmel

Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany

Kuemmel 03 May 2006, 21:21

f0dder wrote:

EDIT: whoa, it just occurred to me that this is pretty close to perfect parallelism! The single-threaded version gave me 142/194 for fpu/sse2. My v5 code gives 285/395fps with two threads and 5-line interleave. That's very close to 2x original speed.

Great!!! I didn't even expect it to reach 200%; before it was around 180%, if I'm right... perfect... joint forces and ideas seem to work ;). I also checked it on my single core... no penalty at 5 lines!

Would you mind if I make a new public release of the KMB (with credits, of course) on my webpage, based on your latest version... or do you still have other ideas in mind?

P.S.: Now I have to look up the 'lock' prefix... never heard of it! Man, it was so easy with the simple RISC instruction set on my old StrongARM Acorn Risc PC ;)

03 May 2006, 21:21

f0dder

Joined: 19 Feb 2004
Posts: 3174
Location: Denmark

f0dder 03 May 2006, 21:26

I don't have any other ideas right now. The routines could probably be cleaned up though, go through local variables and register usage. Would also be nice if a little more "decoupling" was done, so more code could be reused in the threads (even if only as a macro) - I mean, once I had fixed up the SSE2 version, I had to make the exact same changes to the FPU version.

In the future, some runtime control of the number of threads, the line interleave value etc. might be interesting, but that's more work.

And sure, public release is just fine :)

03 May 2006, 21:26

GMAN

Joined: 02 May 2006
Posts: 5
Location: UK

GMAN 10 May 2006, 02:49

OK. Latest fractal benchmark test.

SSE2 211.843
FPU 88.318

The measurement seems to be very consistent from run to run

10 May 2006, 02:49

f0dder

Joined: 19 Feb 2004
Posts: 3174
Location: Denmark

f0dder 10 May 2006, 12:12

Quote:

The measurement seems to be very consistent from run to run

They really should be, since the process/thread priorities are boosted way high, so that other programs don't interfere too much with the results. This is okay for short-running tests, but shouldn't be done for "real-world" code, btw :)

10 May 2006, 12:12

Kuemmel

Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany

Kuemmel 10 May 2006, 14:42

The most interesting result at the moment, besides the almost perfect parallelisation, is the one from Madis731 with his single-core P4 Prescott with Hyper-Threading:

New KMB 0.5 MT:
174,009 / 321,134
Old KMB 0.3:
87,704 / 212,521

The MT version doubled the (very bad) FPU performance and multiplied the SSE2 performance by 1.5! Is it really the Hyper-Threading!? (I read the specs on Wikipedia, and they say the pipeline was also lengthened from 20 to 30 stages)... amazing... I wonder what a Pentium Extreme dual core with Hyper-Threading can do... Per MHz, the FPU value above is still worse than AMD/P3, but the SSE2 value is now 20% better.
So far the effect hasn't shown up on any P4/Celeron without Hyper-Threading...

I will include his results on my homepage later in the evening.

10 May 2006, 14:42

f0dder

Joined: 19 Feb 2004
Posts: 3174
Location: Denmark

f0dder 10 May 2006, 14:46

Hm, that is very interesting indeed, since HyperThreading can sometimes make performance *worse* when enabled. I find it very interesting in this situation because the two threads are *both* doing floating-point stuff; the best situation for HT (as far as I have understood), a bit simplified, is one thread doing FPU and another doing Integer.

10 May 2006, 14:46

Kuemmel

Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany

Kuemmel 10 May 2006, 15:05

...and it even seems to multiply the performance so cleanly... like x2.0 for FPU and x1.5 for SSE2... maybe it can somehow fill the stalled pipeline with work from the other thread (just speculating...)!?

@Madis731... maybe you can post some results with Hyper-Threading enabled and disabled with KMB 0.5 MT? (I don't know whether it's possible to switch it off, but I think I heard it is...)

10 May 2006, 15:05
