flat assembler
Message board for the users of flat assembler.
Windows > Mandelbrot Benchmark FPU/SSE2 released
tthsqe 16 Oct 2009, 04:04
Kuemmel,
I'm getting some interesting results. The 11-instruction method consistently executes about 5% faster in the simple test I attached. What do you think could be the reason that your way performs much better in the actual code? I don't see how the extra access to xmm2 has any effect.
tthsqe 16 Oct 2009, 05:47
Whoa, I tripled the size of the loop, but it executes in only a fraction more time. Is the CPU really able to look ahead all the way to the next block and rename xmm2 and xmm3? This is quite shocking to me - I had no idea it could look so far ahead. In this case, the interaction between the blocks is more important than the length of the blocks.
revolution 16 Oct 2009, 07:00
IIRC the Core2 can have 120+ instructions in flight inside the CPU at any one time. However, unrolling loops probably benefits less from register renaming than from lower branch prediction overhead.
Kuemmel 16 Oct 2009, 15:41
tthsqe wrote: Whoa, I tripled the size of the loop, but it executes in only a fraction more time. Is the CPU really able to look ahead all the way to the next block and rename xmm2 and xmm3? This is quite shocking to me - I had no idea it could look so far ahead. In this case, the interaction between the blocks is more important than the length of the blocks.

It was also very surprising to me to learn about this. So in the code there are these three blocks, using 3 points = 3 x 2 = 6 registers, with the 2 leftover registers as auxiliaries; each block uses 2 different point registers. It seems the C2D can rearrange and execute this kind of code sequence much better. The loop unrolling itself (4 times this 3-point block) doesn't help so much; it's the block structure itself doing the magic. The performance gain on other/older CPUs isn't that big - you can see this in the performance gain graphs on page 14 of the thread, where I measured it in comparison to my first totally unoptimized version: a single block, no loop unrolling, no different exit code, etc.
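Kuemmel's kernel is fasm assembly, but the block structure he describes can be sketched in C with SSE2 intrinsics. This is my own illustration, not the benchmark's actual code: three iteration "blocks", each updating its own pair of points through its own set of variables, so the three dependency chains are independent and the CPU can overlap their multiply latencies. (Register allocation is left to the compiler here, unlike the hand-allocated asm.)

```c
#include <emmintrin.h>

/* Iterate 6 Mandelbrot points (3 blocks x 2 points per __m128d).
   Each block's z = z*z + c update touches only its own variables. */
static void mandel6(const double *cr, const double *ci,
                    int maxiter, double *iters)
{
    const __m128d four = _mm_set1_pd(4.0), one = _mm_set1_pd(1.0);
    __m128d cr0 = _mm_loadu_pd(cr),     ci0 = _mm_loadu_pd(ci);
    __m128d cr1 = _mm_loadu_pd(cr + 2), ci1 = _mm_loadu_pd(ci + 2);
    __m128d cr2 = _mm_loadu_pd(cr + 4), ci2 = _mm_loadu_pd(ci + 4);
    __m128d zr0 = _mm_setzero_pd(), zi0 = zr0, n0 = zr0;
    __m128d zr1 = zr0, zi1 = zr0, n1 = zr0;
    __m128d zr2 = zr0, zi2 = zr0, n2 = zr0;

    for (int i = 0; i < maxiter; i++) {
        /* block 0: count lanes still satisfying |z|^2 <= 4 */
        __m128d a0 = _mm_mul_pd(zr0, zr0), b0 = _mm_mul_pd(zi0, zi0);
        __m128d m0 = _mm_cmple_pd(_mm_add_pd(a0, b0), four);
        __m128d t0 = _mm_mul_pd(zr0, zi0);
        zi0 = _mm_add_pd(_mm_add_pd(t0, t0), ci0);
        zr0 = _mm_add_pd(_mm_sub_pd(a0, b0), cr0);
        n0  = _mm_add_pd(n0, _mm_and_pd(m0, one));
        /* block 1: identical update on its own registers */
        __m128d a1 = _mm_mul_pd(zr1, zr1), b1 = _mm_mul_pd(zi1, zi1);
        __m128d m1 = _mm_cmple_pd(_mm_add_pd(a1, b1), four);
        __m128d t1 = _mm_mul_pd(zr1, zi1);
        zi1 = _mm_add_pd(_mm_add_pd(t1, t1), ci1);
        zr1 = _mm_add_pd(_mm_sub_pd(a1, b1), cr1);
        n1  = _mm_add_pd(n1, _mm_and_pd(m1, one));
        /* block 2: identical update on its own registers */
        __m128d a2 = _mm_mul_pd(zr2, zr2), b2 = _mm_mul_pd(zi2, zi2);
        __m128d m2 = _mm_cmple_pd(_mm_add_pd(a2, b2), four);
        __m128d t2 = _mm_mul_pd(zr2, zi2);
        zi2 = _mm_add_pd(_mm_add_pd(t2, t2), ci2);
        zr2 = _mm_add_pd(_mm_sub_pd(a2, b2), cr2);
        n2  = _mm_add_pd(n2, _mm_and_pd(m2, one));
        /* stop once every lane of every block has escaped */
        if ((_mm_movemask_pd(m0) | _mm_movemask_pd(m1) |
             _mm_movemask_pd(m2)) == 0)
            break;
    }
    _mm_storeu_pd(iters,     n0);  /* per-point iteration counts */
    _mm_storeu_pd(iters + 2, n1);
    _mm_storeu_pd(iters + 4, n2);
}
```

Once a lane escapes, its z overflows harmlessly (the `cmple` mask goes false and its count stops), so no per-lane branching is needed inside the loop.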
tthsqe 17 Oct 2009, 03:12
Kuemmel,
If you are looking for another benchmarking idea, I think seeing how fast one can solve a few hundred thousand sudokus would be an interesting challenge - it really tests the integer/branching units. Would you be up for this?
Kuemmel 17 Oct 2009, 06:25
...sounds interesting... I was thinking about benchmarking some 2D image transformation algorithms, so a combination of FPU/SSE2 and memory access. Especially covering memory access in a kind of "real world" scenario with multiple CPUs, and how it scales, would be interesting I think... I guess the sudoku would do without much memory access, or?
tthsqe 17 Oct 2009, 19:59
Oh, OK. It's just that I have been trying to make one, and I would like to push the limits in much the same way as your Mandelbrot benchmark pushes the limits, but it looks like I don't know these CPUs well enough to do it quite right.
The integer part of SSE can be used and multiple cores can be utilized. But even for the hardest 16x16 sudokus that require a lot of backtracking, I don't see it using over 1 megabyte. Did you have more in mind?
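The "integer/branching" flavor of such a solver can be hinted at with a small sketch. These helper names are my own invention, not code from the thread: each cell's remaining candidates are one bit per digit, so constraint propagation becomes AND/OR masking with almost no data and almost no memory traffic - which is why the working set stays tiny.

```c
#include <stdint.h>

enum { ALL9 = 0x1FF };  /* bits 0..8 stand for digits 1..9 */

/* Candidates left for a cell, given the digits already used in its
   row, column, and box (each a 9-bit mask). */
static inline uint16_t candidates(uint16_t row_used,
                                  uint16_t col_used,
                                  uint16_t box_used)
{
    return (uint16_t)(ALL9 & ~(row_used | col_used | box_used));
}

/* True when exactly one candidate remains (a "naked single"):
   a power of two has no bits in common with itself minus one. */
static inline int is_single(uint16_t m)
{
    return m != 0 && (m & (m - 1)) == 0;
}
```

A 16x16 variant just widens the masks to 16 bits; even deep backtracking only has to checkpoint a few hundred bytes of masks per level, consistent with the ~1 MB estimate above.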
Kuemmel 18 Oct 2009, 16:00
...actually I didn't think about how much memory would be useful, but 1 MByte should be okay and could lead to interesting results, as I think AMD uses non-shared and the C2D shared L2 cache; at least it should be bigger than the L1 cache... anyway, maybe you can start a new thread and post your example code or C code of the algorithm?
tthsqe 18 Oct 2009, 17:15
Sure. I started a thread (Projects and Ideas > fast sudoku solver in windows) a while ago. I will post the code and algorithm in a few days.
tthsqe 04 Nov 2009, 07:27
Kuemmel,
Could FPU code (I mean fld, fstp, fmul, ...) benefit in the same way as the SSE code did from the loop unrolling? That is, is the FPU unit pipelined and capable of out-of-order execution? I would say that being stack-based would make these predictions very difficult and that this is probably not implemented, but I'm not sure....
rugxulo 04 Nov 2009, 17:59
Not sure if/when the FPU became out-of-order (if so, not until 686+ perhaps), but I know the FPU became fully pipelined with Intel's Pentium 1. This is why the FDIV bug appeared: they heavily optimized it to be (at best) 5x faster. It's also why the Cyrix 6x86 didn't run Quake nearly as well as Intel's CPUs.
Kuemmel 06 Nov 2009, 16:26
tthsqe wrote: Could FPU code (I mean fld, fstp, fmul, ...) benefit in the same way as the SSE code did from the loop unrolling?

The way of coding was applied in the same way for the FPU as for the SSE. The benefit wasn't of the same magnitude, but still visible - something like a factor of 2 for the FPU, and a factor of 3 overall for the SSE. So I would say, yes, it helps also. EDIT: But it depends on the CPU model. The old Pentium III doesn't benefit at all - quite interesting - it's mostly the C2D that benefits. To make it clear, the benefit is not mostly due to basic "loop unrolling"; it is the creation of the "instruction lines" with the different registers. So I also use the 8 stacked FPU registers in almost the same way as the 8 SSE2 registers, though of course the stack is not as nice and convenient to code with as the non-stacked SSE.
kalambong 14 Nov 2009, 04:53
Just in case you guys are interested, someone just rendered some 3D Mandelbrots at
http://www.skytopia.com/project/fractal/mandelbulb.html and http://www.bugman123.com/Hypercomplex/index.html
Go see them for yourself!
Kuemmel 16 Nov 2009, 19:02
...thanks for the link! If anybody is interested, just follow the thread at:
http://www.fractalforums.com/3d-fractal-generation/true-3d-mandlebrot-type-fractal/
The pictures they produce are truly amazing. If I manage to get into the iteration code used, I'll try something in asm, though I think I'm mainly lacking knowledge of the rendering techniques... Also, Iñigo Quílez has now jumped in and is experimenting along in the forum there. He's one of the best coders for procedural graphic content in my opinion; he won this year's Breakpoint demo compo for 4 KByte intros. http://iquilezles.org http://www.pouet.net/prod.php?which=52938
LocoDelAssembly 16 Nov 2009, 20:03
Tried the demo, but it showed only 3 or 4 frames in the entire run on my GeForce 6600. For those with obsolete hardware like me, see this instead: http://www.youtube.com/watch?v=_YWMGuh15nE
kalambong 22 Nov 2009, 12:20
Believe me, even with an ATI 4870 it's still very slow!
I think I need to save more money for Nvidia's GTX 300 based card next year, or the next generation of ATI GPU.
f0dder 22 Nov 2009, 16:38
LocoDelAssembly wrote: Tried the demo, but it showed only 3 or 4 frames in the entire run on my GeForce 6600. For those with obsolete hardware like me, see this instead: http://www.youtube.com/watch?v=_YWMGuh15nE
_________________
- carpe noctem
tthsqe 30 Nov 2009, 02:05
Kuemmel,
I've got a version of my own working using the three instruction lines and tested it on a 2.33 GHz C2Q and a 1.60 GHz Celeron M. Both processors can issue 4 flops per cycle, and the program consistently maxes out at only 66-68% of the processor's peak flops (with a large depth on a completely black area, to minimize the effect of the reloading mechanism). Would you mind converting your results to % of peak flops and telling me if you get a similar figure? Instead of using the same two registers for the local variables of each point, we could go 64-bit and see if there is a rearrangement of the instructions that helps the CPU get closer to 4 flops per cycle. I've attached the program and my include files. ***edit*** With 4 instruction lines and all 16 registers in use, a single thread should get more than 80% of the peak flops. It's just a prediction, but it looks like the lack of registers is holding back the performance a bit.
Last edited by tthsqe on 30 Nov 2009, 11:51; edited 1 time in total
Madis731 30 Nov 2009, 07:11
@tthsqe:
The stats are flying all over the place; I don't know what to read from them. The dialog box that pops up is really awkward too. Anyway, my 2.5 GHz Core 2 got about 5-6 GFLOPS; one time I was able to get 11 GFLOPS. According to 7-Zip, it's capable of 4.761 GFLOPS. Your program gives me 50 or 60%, and sometimes 110%, while 7-Zip gives only 47.61%. Is 4-FLOP issue really possible? Wasn't the bottleneck 3?
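For the "% of peak" conversion discussed here, the arithmetic can be pinned down. To my understanding, the Core 2 can issue one 2-wide SSE2 double-precision add and one 2-wide multiply per cycle, i.e. 4 DP flops/cycle per core, which is where tthsqe's figure comes from; the helper below is just that back-of-envelope formula, not anything from the attached program.

```c
/* Measured throughput as a percentage of theoretical peak.
   peak GFLOPS = clock (GHz) x cores x flops issued per cycle. */
static double percent_of_peak(double measured_gflops,
                              double ghz, int cores,
                              int flops_per_cycle)
{
    double peak_gflops = ghz * cores * flops_per_cycle;
    return 100.0 * measured_gflops / peak_gflops;
}
```

On those assumptions a single 2.5 GHz Core 2 core peaks at 10 GFLOPS, so Madis731's 5-6 GFLOPS would indeed be 50-60% - consistent with what the program reports.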
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.