Mandelbrot Benchmark FPU/SSE2 released




tthsqe

Joined: 20 May 2009
Posts: 767

tthsqe 16 Oct 2009, 01:53

Hm, your original version seems to be the one that executes the fastest. I tried lots of rearrangements of what I had and what you had, and none of them came close. When arranged properly, my version doesn't really contain more dependencies than your version, so I'm wondering what exactly it is. Also, how does using xmm3 and xmm2 more often decrease the benefit of working on three pixels in the same loop? I'm not sure what you mean by this.


tthsqe

Joined: 20 May 2009
Posts: 767

tthsqe 16 Oct 2009, 04:04

Kuemmel,
I'm getting some interesting results. The 11-instruction method is executing consistently 5% better in the simple test I attached. What do you think could be the reason that your way performs much better in the actual code? I don't see how the extra access to xmm2 has any effect.

Attachment: test.ASM (2.71 KB)


tthsqe

Joined: 20 May 2009
Posts: 767

tthsqe 16 Oct 2009, 05:47

Whoa, I tripled the size of the loop, but it executes in only a fraction more of the time. Is the CPU really able to look ahead all the way to the next block and rename xmm2 and xmm3? This is quite shocking to me - I had no idea it could look so far ahead. In this case, the interaction between the blocks is more important than the length of the blocks.


revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20689
Location: In your JS exploiting you and your system

revolution 16 Oct 2009, 07:00

IIRC the Core2 can have 120+ instructions inside the CPU at any one time. However unrolling loops is probably not benefiting from register renaming as much as it is benefiting from lower branch prediction overhead.


Kuemmel

Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany

Kuemmel 16 Oct 2009, 15:41

tthsqe wrote:

Whoa, I tripled the size of the loop, but it executes in only a fraction more of the time. Is the CPU really able to look ahead all the way to the next block and rename xmm2 and xmm3? This is quite shocking to me - I had no idea it could look so far ahead. In this case, the interaction between the blocks is more important than the length of the blocks.

It was also very surprising to me to learn about this. The code contains three blocks covering 3 points, which takes 3 x 2 = 6 registers, with the remaining 2 registers used as auxiliaries; each block works on its own pair of point registers. It seems that the C2D can arrange and execute this kind of code sequence much better. The loop unrolling itself (4 times this 3-point block) doesn't help that much; it's the block itself doing the magic.
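
For readers who want to see the structure being described, here is a minimal sketch in Python rather than the thread's FASM/SSE2 code. It advances three independent points in one loop body, so no statement for one point needs a result from another point; that independence is what lets an out-of-order CPU overlap the three chains. The function names, the escape radius of 2, and the sample coordinates are assumptions for illustration, and Python itself will not show any speedup.

```python
def iterate_one(cx, cy, max_iter):
    """Reference version: iterate z = z*z + c for one point, return the escape count."""
    x = y = 0.0
    for n in range(max_iter):
        if x * x + y * y > 4.0:          # |z| > 2, the point has escaped
            return n
        x, y = x * x - y * y + cx, 2.0 * x * y + cy
    return max_iter


def iterate_three(points, max_iter):
    """Advance three points in lockstep, one independent 'instruction line' per point."""
    (ca_x, ca_y), (cb_x, cb_y), (cc_x, cc_y) = points
    ax = ay = bx = by = cx = cy = 0.0    # two state values per point, six in total
    count = [max_iter, max_iter, max_iter]
    done = [False, False, False]
    for n in range(max_iter):
        # Escape tests, one per point
        if not done[0] and ax * ax + ay * ay > 4.0:
            count[0], done[0] = n, True
        if not done[1] and bx * bx + by * by > 4.0:
            count[1], done[1] = n, True
        if not done[2] and cx * cx + cy * cy > 4.0:
            count[2], done[2] = n, True
        if all(done):
            break
        # Three updates with no data dependencies between them
        if not done[0]:
            ax, ay = ax * ax - ay * ay + ca_x, 2.0 * ax * ay + ca_y
        if not done[1]:
            bx, by = bx * bx - by * by + cb_x, 2.0 * bx * by + cb_y
        if not done[2]:
            cx, cy = cx * cx - cy * cy + cc_x, 2.0 * cx * cy + cc_y
    return count
```

Calling `iterate_three([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)], 100)` returns `[100, 2, 1]`, the same counts `iterate_one` gives for each point separately. In the assembly version each of the three update lines would map to its own pair of xmm registers.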

The performance gain on other/older CPUs isn't that big. You can see this in the performance gain graphs on page 14 of the thread, where I measured it in comparison to my first, totally unoptimized version (single block, no loop unrolling, no separate exit code, etc.).


tthsqe

Joined: 20 May 2009
Posts: 767

tthsqe 17 Oct 2009, 03:12

Kuemmel,
If you are looking for another benchmarking idea, I think seeing how fast one can solve a few hundred thousand sudokus would be an interesting challenge - it really tests the integer/branching unit. Would you be up for this?


Kuemmel

Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany

Kuemmel 17 Oct 2009, 06:25

...sounds interesting...I was thinking about some 2D image transformation algorithms to benchmark, so a combination of FPU/SSE2 and memory access. Especially covering memory access in a kind of "real world" scenario with multiple CPUs, and seeing how it scales, would be interesting I think...I guess the sudoku would do it without much memory access, or not?


tthsqe

Joined: 20 May 2009
Posts: 767

tthsqe 17 Oct 2009, 19:59

Oh, ok. It's just that I have been trying to make one, and I would like to push the limits in much the same way as your Mandelbrot benchmark pushes the limits, but it looks like I don't know these CPUs well enough to do it quite right.
The integer part of SSE can be used and multiple cores can be utilized. But even for the hardest 16x16 sudokus that require a lot of backtracking, I don't see it using over 1 megabyte. Did you have more in mind?
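
As a rough plausibility check of the 1 megabyte figure, the sketch below estimates the worst-case memory of a backtracking solver that saves a full copy of the grid at every search level. The layout (one 16-bit candidate mask per cell, one saved grid per filled cell) is an assumption made for illustration and not a description of the actual solver.

```python
def backtracking_stack_bytes(side, bytes_per_cell=2):
    """Worst-case bytes if every search level stores a full grid copy.

    Hypothetical layout: one candidate bitmask of `bytes_per_cell` bytes
    per cell, and at most one search level per cell of the grid.
    """
    cells = side * side                  # 16x16 grid -> 256 cells
    grid_bytes = cells * bytes_per_cell  # one saved grid state
    max_depth = cells                    # at most one guess per cell
    return grid_bytes * max_depth


# 16x16: 512 bytes per saved grid, 256 levels deep
print(backtracking_stack_bytes(16))      # 131072 bytes = 128 KiB
```

Under these assumptions the working set stays well below 1 MiB, which is consistent with the estimate in the post.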


Kuemmel

Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany

Kuemmel 18 Oct 2009, 16:00

...actually I didn't think about how much memory would be useful, but 1 MByte should be okay and could lead to interesting results, as I think AMD uses non-shared and C2D shared L2 cache; at least it should be bigger than the L1 cache...anyway, maybe you can start a new thread and post your example code or C code of the algorithm?


tthsqe

Joined: 20 May 2009
Posts: 767

tthsqe 18 Oct 2009, 17:15

Sure. I started a thread, Projects and Ideas > fast sudoku solver in windows, a while ago. I will post the code and algorithm in a few days.


tthsqe

Joined: 20 May 2009
Posts: 767

tthsqe 04 Nov 2009, 07:27

Kuemmel,
Could FPU code (I mean fld, fstp, fmul, ...) benefit in the same way as the SSE code did from the loop unrolling? That is, is the FPU unit pipelined and capable of out-of-order execution? I would say that being stack based would make these predictions very difficult and that this is probably not implemented, but I'm not sure...


rugxulo

Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)

rugxulo 04 Nov 2009, 17:59

Not sure if / when FPU is out-of-order (if so, not until 686+ perhaps), but I know the FPU became fully pipelined with Intel's Pentium 1. This is why the FDIV bug appeared, because they heavily optimized it to be (at best) 5x faster. Also why Cyrix 6x86 didn't run Quake nearly as well as Intel's cpus.


Kuemmel

Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany

Kuemmel 06 Nov 2009, 16:26

tthsqe wrote:

Kuemmel,
Could FPU code (I mean fld, fstp, fmul, ...) benefit in the same way as the SSE code did from the loop unrolling? That is, is the FPU unit pipelined and capable of out-of-order execution? I would say that being stack based would make these predictions very difficult and that this is probably not implemented, but I'm not sure...

The same way of coding was applied to the FPU as to the SSE code. The benefit wasn't as large, but it was still visible: something like a factor of 2 for the FPU and a factor of 3 overall for SSE. So I would say yes, it helps there too. EDIT: But it depends on the CPU model. The old Pentium III doesn't benefit at all, which is quite interesting; mostly the C2D benefits.

To make it clear, the benefit is mostly not due to basic "loop unrolling"; it comes from creating the "instruction lines" with the different registers. So I also use the 8 stacked FPU registers in almost the same way as the 8 SSE2 registers, though of course the stack is not as nice and convenient to code for as non-stacked SSE.


kalambong

Joined: 08 Nov 2008
Posts: 165

kalambong 14 Nov 2009, 04:53

Just in case you guys are interested, someone just rendered some 3D Mandelbrot at

http://www.skytopia.com/project/fractal/mandelbulb.html

and

http://www.bugman123.com/Hypercomplex/index.html

Go see them for yourself !


Kuemmel

Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany

Kuemmel 16 Nov 2009, 19:02

...thanks for the link! If anybody is interested, just follow the thread at:

http://www.fractalforums.com/3d-fractal-generation/true-3d-mandlebrot-type-fractal/

The pictures they produce are truly amazing. If I manage to get into the iteration code used, I'll try something in asm, though I think I'm mainly lacking knowledge of the rendering techniques...

Also, Iñigo Quílez has now jumped in and is experimenting along in the forum there. In my opinion he's one of the best coders for procedural graphic content; he won this year's Breakpoint demo compo for 4 KByte intros.
http://iquilezles.org
http://www.pouet.net/prod.php?which=52938


LocoDelAssembly
Your code has a bug

Joined: 06 May 2005
Posts: 4623
Location: Argentina

LocoDelAssembly 16 Nov 2009, 20:03

Tried the demo but showed only 3 or 4 frames in the entire run on my GeForce 6600. For those with obsolete hardware like me see this instead: http://www.youtube.com/watch?v=_YWMGuh15nE


kalambong

Joined: 08 Nov 2008
Posts: 165

kalambong 22 Nov 2009, 12:20

Believe me, even with an ATI 4870, it's still very slow !!

I think I need to save more money for Nvidia's GTX 300 based card next year, or the next generation of ATI-GPU.


f0dder

Joined: 19 Feb 2004
Posts: 3174
Location: Denmark

f0dder 22 Nov 2009, 16:38

LocoDelAssembly wrote:

Tried the demo but showed only 3 or 4 frames in the entire run on my GeForce 6600. For those with obsolete hardware like me see this instead: http://www.youtube.com/watch?v=_YWMGuh15nE

Thanks for the YouTube link, the demo doesn't even want to start on my system (Win7) - pretty damn amazing for 4 KB.

_________________
carpe noctem


tthsqe

Joined: 20 May 2009
Posts: 767

tthsqe 30 Nov 2009, 02:05

Kuemmel,
I've got a version of my own working using the three instruction lines and tested it on a 2.33 GHz C2Q and a 1.60 GHz Celeron M. Both processors can issue 4 flops per cycle, and the program consistently maxes out at only 66-68% of the processor's peak flops (with a large depth on a completely black area to minimize the effect of the reloading mechanism). Would you mind converting your results to % of peak flops and telling me if you get a similar figure? Instead of using the same two registers for the local variables of each point, we could go 64-bit and see if there is a rearrangement of the instructions that helps the CPU get closer to 4 flops per cycle.
I've attached the program and my include files.
***edit***
With 4 instruction lines and all 16 registers in use, a single thread should get more than 80% of the peak flops. It's just a prediction, but it looks like the lack of registers is holding back the performance a bit.

Attachment: MandelbrotPlot.zip (166.74 KB)

Last edited by tthsqe on 30 Nov 2009, 11:51; edited 1 time in total


Madis731

Joined: 25 Sep 2003
Posts: 2138
Location: Estonia

Madis731 30 Nov 2009, 07:11

@tthsqe:
The stats are flying all over the place, so I don't know what to read from them. The dialog box that pops up is really awkward too. Anyway, my 2.5 GHz Core 2 got about 5-6 GFLOPS; one time I was able to get 11 GFLOPS. According to 7-Zip, it's capable of 4.761 GFLOPS.
Your program gives me 50 or 60% and sometimes 110%, while 7-Zip gives only 47.61%.

Is 4 FLOP issue really possible? Wasn't the bottleneck 3?
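
The percentages quoted in the last two posts are consistent with a simple model in which single-core peak throughput is clock rate times flops issued per cycle. The sketch below works through that arithmetic; the helper names are made up for illustration, and the default of 4 flops per cycle is the assumption under discussion, not an established figure.

```python
def peak_gflops(clock_ghz, flops_per_cycle=4):
    """Assumed single-core peak: clock rate (GHz) times flops issued per cycle."""
    return clock_ghz * flops_per_cycle


def percent_of_peak(measured_gflops, clock_ghz, flops_per_cycle=4):
    """Measured throughput as a percentage of the assumed peak."""
    return 100.0 * measured_gflops / peak_gflops(clock_ghz, flops_per_cycle)


# 2.5 GHz Core 2 with an assumed 4 flops/cycle -> 10 GFLOPS peak
for measured in (5.0, 6.0, 11.0, 4.761):
    print(measured, "GFLOPS ->", round(percent_of_peak(measured, 2.5), 2), "%")

# 2.33 GHz C2Q at 67% of peak corresponds to roughly 6.2 GFLOPS per core
print(round(0.67 * peak_gflops(2.33), 2))
```

With a peak of 10 GFLOPS, readings of 5, 6 and 11 GFLOPS map to 50%, 60% and 110%. Any result above 100% means either the measurement or the assumed peak is off, and if the real limit were 3 flops per cycle the same readings would come out proportionally higher.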

