flat assembler
Message board for the users of flat assembler.
Mandelbrot renderer
randall 18 Mar 2012, 19:51
I wrote a little program which renders the Mandelbrot fractal. The output is saved in a TGA file. You can change the g_center and g_zoom variables to render different areas of the fractal. By changing the g_size variable you can change the output file resolution. The program uses no external libraries (only Linux syscalls). I hope it will be useful for someone.
Thanks.
Last edited by randall on 20 Mar 2012, 15:51; edited 1 time in total
Endre 18 Mar 2012, 20:31
Beautiful! On my machine it takes 1.5s to generate the TGA file (2560018 bytes) with your default settings.
real 0m1.498s
user 0m1.492s
sys 0m0.004s
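The 2560018 bytes reported above are consistent with an uncompressed 8-bit grayscale TGA: an 18-byte header followed by one byte per pixel, and 18 + 1600 × 1600 = 2560018. The 1600 × 1600 resolution is only inferred from that arithmetic (other width/height pairs with the same product would give the same size), and the helper names below are hypothetical, not taken from randall's program. A minimal C++ sketch of such a file layout:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Builds an uncompressed 8-bit grayscale TGA image in memory.
// Hypothetical helper for illustration; not the thread author's code.
std::vector<std::uint8_t> make_gray_tga(std::uint16_t width, std::uint16_t height,
                                        const std::vector<std::uint8_t>& pixels) {
    std::vector<std::uint8_t> out(18, 0);              // 18-byte header, zero-filled
    out[2]  = 3;                                       // image type 3: uncompressed grayscale
    out[12] = static_cast<std::uint8_t>(width & 0xFF); // width, little-endian
    out[13] = static_cast<std::uint8_t>(width >> 8);
    out[14] = static_cast<std::uint8_t>(height & 0xFF);// height, little-endian
    out[15] = static_cast<std::uint8_t>(height >> 8);
    out[16] = 8;                                       // bits per pixel
    // Descriptor byte 17 stays 0, so rows are stored bottom-up.
    out.insert(out.end(), pixels.begin(), pixels.end());
    return out;
}

// Writes the whole buffer with a single fwrite call; returns true on success.
bool write_file(const char* path, const std::vector<std::uint8_t>& data) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    const bool ok = std::fwrite(data.data(), 1, data.size(), f) == data.size();
    return std::fclose(f) == 0 && ok;
}
```

Calling `make_gray_tga(1600, 1600, pixels)` with a 2560000-byte pixel buffer yields a 2560018-byte image, which can then be passed to `write_file`.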
randall 18 Mar 2012, 20:36
What CPU do you have? On my Core2 Duo 1.86 GHz it takes 1.9s.
real 0m1.919s
user 0m1.910s
sys 0m0.000s
typedef 19 Mar 2012, 03:38
Can you convert this to Windows somehow?
randall 19 Mar 2012, 07:23
typedef wrote: Can you convert this to Windows somehow?
No, sorry. I do not use Windows.
Endre 19 Mar 2012, 09:22
It's an old (nearly 5 years old) Core2 Quad (Q6600 or something like that), 2.4 GHz.
typedef 19 Mar 2012, 12:53
randall wrote:
Then can you share the algo?
randall 19 Mar 2012, 12:59
typedef wrote:
Sure. http://www.iquilezles.org/www/articles/distancefractals/distancefractals.htm
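The linked article covers distance rendering for fractals. As a rough illustration of the idea (not randall's actual code), the sketch below iterates z = z² + c together with the derivative dz/dc and, once z escapes, estimates the distance to the set as |z|·log|z| / |dz|. The function name, the iteration limit and the bailout radius are assumptions, and some references scale the estimate by a different constant factor.

```cpp
#include <cmath>
#include <complex>

// Estimates the distance from c to the boundary of the Mandelbrot set.
// Returns 0 for points that did not escape within max_iter iterations.
// Hypothetical helper; parameters are illustrative defaults.
double mandelbrot_distance(std::complex<double> c, int max_iter = 256) {
    const double bailout2 = 1.0e6;             // squared escape radius (assumed)
    std::complex<double> z(0.0, 0.0);
    std::complex<double> dz(0.0, 0.0);         // derivative of z with respect to c
    for (int i = 0; i < max_iter; ++i) {
        dz = 2.0 * z * dz + 1.0;               // chain rule, using the old z
        z = z * z + c;
        if (std::norm(z) > bailout2) {         // std::norm is |z|^2
            const double r = std::abs(z);
            return r * std::log(r) / std::abs(dz);
        }
    }
    return 0.0;                                // treated as inside the set
}
```

A grayscale value can then be derived from the estimate, for example by scaling it by the zoom level and clamping to 0..255; that mapping is a design choice the thread does not specify.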
typedef 19 Mar 2012, 14:32
thanks
Coty 19 Mar 2012, 16:24
Where does it display its run time?
randall 19 Mar 2012, 16:53
Coty wrote: Where does it display its run time?
Output is saved in a TGA file.
LocoDelAssembly 19 Mar 2012, 17:04
"time ./mandelbrot"
Coty 19 Mar 2012, 17:22
Wow, I've been using Linux how long? 3 years now? And I was not aware of the time command? You know how much time that would have saved me instead of doing it myself... >.<
Anyway, here are the results for a Core i3 @ 2.3 GHz:
real 0m1.268s
user 0m1.264s
sys 0m0.004s
randall 19 Mar 2012, 17:47
I have also written a C++ version, and the compiler-generated code is faster than mine. I wonder how this is happening. In my code I have implemented complex number multiplication using only one mulps, one addps and a couple of shufps. The compiler generated lots of mulss and addss and is still faster... This is kind of depressing... How hard is it to beat the compiler?
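For readers who want to see what a packed complex multiply of this kind can look like, here is a sketch with SSE intrinsics in C++. It is a guess at the general shape, not randall's routine: it uses one multiply (mulps), one add (addps) and shuffles, plus an extra xor to flip one sign, which the post does not mention. The `Complex` struct and `cmul_sse` name are hypothetical, and the code assumes an x86 target with SSE.

```cpp
#include <xmmintrin.h>   // SSE intrinsics (x86 only)

struct Complex { float re, im; };   // hypothetical helper type

// (a + bi)(c + di) = (ac - bd) + (ad + bc)i
Complex cmul_sse(Complex x, Complex y) {
    const __m128 vx = _mm_setr_ps(x.re, x.im, 0.0f, 0.0f);              // [a, b, 0, 0]
    const __m128 vy = _mm_setr_ps(y.re, y.im, 0.0f, 0.0f);              // [c, d, 0, 0]
    const __m128 lhs = _mm_shuffle_ps(vx, vx, _MM_SHUFFLE(1, 1, 0, 0)); // [a, a, b, b]
    __m128 rhs = _mm_shuffle_ps(vy, vy, _MM_SHUFFLE(0, 1, 1, 0));       // [c, d, d, c]
    // Flip the sign bit of lane 2 only: [c, d, -d, c]
    rhs = _mm_xor_ps(rhs, _mm_setr_ps(0.0f, 0.0f, -0.0f, 0.0f));
    const __m128 prod = _mm_mul_ps(lhs, rhs);                           // [ac, ad, -bd, bc]
    const __m128 hi = _mm_movehl_ps(prod, prod);                        // [-bd, bc, -bd, bc]
    const __m128 sum = _mm_add_ps(prod, hi);                            // [ac-bd, ad+bc, ...]
    float out[4];
    _mm_storeu_ps(out, sum);
    return Complex{out[0], out[1]};
}
```

Note that half of the vector lanes do no useful work here, and the shuffles form a dependency chain around the multiply, which may be part of why scalar mulss/addss code can keep up with it.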
gunblade 19 Mar 2012, 18:16
You should use PNG for the uploaded image - it stores greyscale very well (also images which are "patterned"/generated rather than a photo) without a loss in quality.
But very neat little program, thanks.
As for the speed improvements: are you using any optimization (-O1, -O2, -O3) with GCC/G++? It might be worth comparing both unoptimized (-O0) and max optimization, which might break the program (-O3; GCC treats higher levels such as -O4 the same as -O3), plus things like -march=native -mtune=native to optimize for the CPU you're compiling on (this allows gcc to use things like SSE/MMX/etc.).
As for the speed gain by the compiler, it may have more to do with things like aligned accesses and multiple pipelines (do Intel CPUs still have that U/V pipeline, or is that Pentium 3 technology?). But yeah, things like making sure that you don't access the same register in two consecutive instructions mean that the processor can run them in parallel (without the need for threads etc.).
Another optimization you could maybe do (although GCC won't do this by default, even at the highest -O level) would be adding threading support (in assembly, look at the clone() syscall and its manpage; in C you could do the same, but normally you'd use pthreads or something to make it easier on yourself). At the moment your program will only ever use a single core, so this could easily double (nearly) the speed of processing on a dual-core, quadruple it on a 4-core, etc. You just have to find a way to parallelize the work without causing issues. You'll probably need locks to make sure you don't write to the same memory at the same time, but it depends how you do it: if you know how big the image is going to be, you should be able to split the image into multiple sections and process those sections simultaneously. Then the only "lock" you'll need will be at the end of the program, to make sure that every thread has finished (in case one runs slower than others) before you return to the user.
I'll take a closer look at the code at some point and see if there are any obvious optimizations I can see.
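The band-splitting idea can be sketched in C++ with std::thread (which sits on top of pthreads on Linux). Each thread fills its own range of rows, so no two threads write the same memory, and joining the threads is the only synchronization. Everything here is illustrative: the function names, the fixed [-2, 2] × [-2, 2] view and the plain escape-time shading are assumptions, not randall's code. Older toolchains may need `-pthread` when compiling.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Plain escape-time iteration count; a stand-in for the real per-pixel work.
static std::uint8_t shade(double cx, double cy) {
    double x = 0.0, y = 0.0;
    int i = 0;
    while (i < 255 && x * x + y * y <= 4.0) {
        const double t = x * x - y * y + cx;
        y = 2.0 * x * y + cy;
        x = t;
        ++i;
    }
    return static_cast<std::uint8_t>(i);
}

// Renders rows [y0, y1) of a size x size image covering [-2, 2] x [-2, 2].
static void render_rows(std::vector<std::uint8_t>& img, int size, int y0, int y1) {
    for (int py = y0; py < y1; ++py)
        for (int px = 0; px < size; ++px)
            img[static_cast<std::size_t>(py) * size + px] =
                shade(-2.0 + 4.0 * px / size, -2.0 + 4.0 * py / size);
}

std::vector<std::uint8_t> render_parallel(int size, int num_threads) {
    std::vector<std::uint8_t> img(static_cast<std::size_t>(size) * size);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        const int y0 = size * t / num_threads;        // bands are disjoint,
        const int y1 = size * (t + 1) / num_threads;  // so no locks are needed
        workers.emplace_back(render_rows, std::ref(img), size, y0, y1);
    }
    for (auto& w : workers) w.join();                 // wait for every band to finish
    return img;
}
```

Equal-height bands are the simplest split, but rows that cross the set take longer than rows of quickly escaping points, so the bands can finish at quite different times; interleaving rows across threads is one common way to even that out.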
randall 19 Mar 2012, 18:32
gunblade wrote: You should use png for the uploaded image - stores greyscale very well (also images which are "patterned"/generated rather than a photo) without a loss in quality
Thanks for the reply. I want to keep the program as simple (pure) as possible. No external libraries, only one thread.
I have compiled the C++ version with:
g++ mandelbrot.cpp -o mandelbrot -O3
clang++ mandelbrot.cpp -o mandelbrot -O3
(clang is a bit faster in this case)
It would be great if you could find some way to improve the performance of my single-threaded assembly code.
Madis731 20 Mar 2012, 09:58
There's a tradeoff in MMX/SSE code. You can choose to calculate 4 pixels at once and code SSE to do 4 parallel calculations, piping the same algorithm through SSE instructions.
Another approach (to do *one* pixel very quickly) is to use SSE to calculate 4 parts of your algorithm in parallel, but to operate on a single pixel at a time. The first approach is usually simpler and faster, but the second approach is good for situations where the second pixel depends on the first etc. In that situation, optimizing to make a single 'pipe' faster is more profitable.
Now the optimizations:
EDIT2 - now I see what you've done, I've answered my own question.
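The first approach (four pixels per SSE register) might look roughly like the C++ sketch below. Each lane holds a different point c, all four are iterated in lockstep, and a comparison mask decides which lanes still count iterations. The function name `mandel4`, the escape radius of 2 and the float iteration counters are assumptions made for illustration; an x86 target with SSE is required.

```cpp
#include <xmmintrin.h>   // SSE intrinsics (x86 only)

// Iterates z = z*z + c for four points at once.
// cx/cy hold the real and imaginary parts of four c values;
// counts receives the per-lane iteration counts (stored as floats).
void mandel4(const float cx[4], const float cy[4], int max_iter, float counts[4]) {
    const __m128 vcx = _mm_loadu_ps(cx);
    const __m128 vcy = _mm_loadu_ps(cy);
    const __m128 four = _mm_set1_ps(4.0f);
    const __m128 one = _mm_set1_ps(1.0f);
    __m128 x = _mm_setzero_ps();
    __m128 y = _mm_setzero_ps();
    __m128 n = _mm_setzero_ps();
    for (int i = 0; i < max_iter; ++i) {
        const __m128 x2 = _mm_mul_ps(x, x);
        const __m128 y2 = _mm_mul_ps(y, y);
        // Lanes still within the escape radius get an all-ones mask.
        const __m128 active = _mm_cmple_ps(_mm_add_ps(x2, y2), four);
        if (_mm_movemask_ps(active) == 0) break;      // every lane has escaped
        n = _mm_add_ps(n, _mm_and_ps(active, one));   // count active lanes only
        const __m128 xy = _mm_mul_ps(x, y);
        x = _mm_add_ps(_mm_sub_ps(x2, y2), vcx);      // re: x^2 - y^2 + cx
        y = _mm_add_ps(_mm_add_ps(xy, xy), vcy);      // im: 2xy + cy
    }
    _mm_storeu_ps(counts, n);
}
```

Escaped lanes keep being iterated and may overflow to infinity or NaN, but the ordered comparison then yields false, so their counts stay frozen. The cost of this layout is that the loop runs as long as the slowest of the four pixels.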
randall 20 Mar 2012, 11:18
I have done one optimization in the innermost loop (use ucomiss instead of cmpps).
Now the results are (Core2 Duo 6300 @ 1.86 GHz):
ASM version:
real 0m1.841s
user 0m1.830s
sys 0m0.010s
C++ version (with -O3 flag):
real 0m1.133s
user 0m1.120s
sys 0m0.010s
gunblade 20 Mar 2012, 11:37
Was looking at the code last night, but it looks very neat as it is, and there are no obvious CPU stall points that I can see. It might be worth removing the TGA output code temporarily and comparing them again, to make sure it's not just that you're writing the output file in a suboptimal way. You're doing one big write, which should be fast, but it's worth testing to rule it out. Just remember to take the TGA output out of both the C++ and asm versions (it wouldn't be fair to only take it out of the asm version).
I was going to use callgrind (from the valgrind suite of tools) to try to profile it and see where the slowest sections are. You may want to try it on both the C++ and asm versions and see what it says about execution times of loops/functions. I'll let you know if I find anything obvious, but as I say, the code looks really good. I worry it might be an alignment thing since it's SSE, but you seem to have done all the alignment required, so it would be weird for it not to work well.
EDIT: Well, valgrind wasn't useful. It only separates at call level, not loop level, so for your program (which has very few calls) it's way too vague. You might have to do it manually by adding calls to the clock_gettime syscall to read the CLOCK_PROCESS_CPUTIME_ID clock, which gives a "High-resolution per-process timer from the CPU." That measures the same thing as the time command, but it can be inserted in various places in your code to find what's taking time. Just be careful, because this syscall itself adds to the process's execution time, so you may want to insert only one call, move it around, and only count the time UP to that call.
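On the C++ side, manual timing with that clock could be sketched as follows. `clock_gettime` and `CLOCK_PROCESS_CPUTIME_ID` are the POSIX names mentioned in the post; the wrapper names `cpu_seconds` and `time_section` are hypothetical. In the assembly version the same clock would be read through the corresponding syscall instead.

```cpp
#include <time.h>   // clock_gettime, CLOCK_PROCESS_CPUTIME_ID (POSIX)

// Returns the CPU time consumed by this process so far, in seconds,
// or -1.0 if the clock could not be read.
double cpu_seconds() {
    timespec ts{};
    if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) != 0) return -1.0;
    return static_cast<double>(ts.tv_sec) + static_cast<double>(ts.tv_nsec) * 1e-9;
}

// Runs `work` once and returns the CPU seconds it consumed.
// The two clock reads themselves add a small overhead to the result.
template <typename F>
double time_section(F&& work) {
    const double start = cpu_seconds();
    work();
    return cpu_seconds() - start;
}
```

For example, wrapping only the render loop in `time_section` and then only the file write would show how the total reported by `time ./mandelbrot` divides between the two.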
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.