flat assembler
Message board for the users of flat assembler.

Index > Linux > Mandelbrot renderer

randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall
I wrote a little program that renders the Mandelbrot fractal. The output is saved in a TGA file. You can change the g_center and g_zoom variables to render different areas of the fractal, and the g_size variable to change the output resolution. The program uses no external libraries (only Linux syscalls). I hope it will be useful to someone.

Thanks.


Attachments:
mandelbrot0.jpg (68.04 KB, viewed 5807 times)
mandelbrot.asm (9.46 KB, downloaded 293 times)

Last edited by randall on 20 Mar 2012, 15:51; edited 1 time in total
Post 18 Mar 2012, 19:51
Endre



Joined: 29 Dec 2003
Posts: 212
Location: Budapest, Hungary
Endre
Beautiful! On my machine it takes 1.5 s to generate the TGA file (2560018 bytes) with your default settings.

real 0m1.498s
user 0m1.492s
sys 0m0.004s
Post 18 Mar 2012, 20:31
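The 2560018-byte file size reported above is consistent with an 18-byte TGA header followed by 2,560,000 bytes of pixel data, for example 1600x1600 pixels at one byte each. The sketch below shows what such a header could look like in C++. It is an illustration only, not the layout used by mandelbrot.asm: the function names `make_tga_header` and `tga_file_size`, the choice of image type 3 (uncompressed greyscale) and the 1600x1600 resolution are all assumptions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Build the 18-byte header of an uncompressed 8-bit greyscale TGA image.
// Assumption: image type 3 (greyscale); the original program may use another type.
std::array<std::uint8_t, 18> make_tga_header(std::uint16_t width, std::uint16_t height) {
    std::array<std::uint8_t, 18> h{};                    // every field starts at zero
    h[2]  = 3;                                           // image type: uncompressed greyscale
    h[12] = static_cast<std::uint8_t>(width & 0xFF);     // width, low byte first
    h[13] = static_cast<std::uint8_t>(width >> 8);
    h[14] = static_cast<std::uint8_t>(height & 0xFF);    // height, low byte first
    h[15] = static_cast<std::uint8_t>(height >> 8);
    h[16] = 8;                                           // bits per pixel
    return h;
}

// Total file size: 18-byte header plus one byte per pixel.
std::size_t tga_file_size(std::uint16_t width, std::uint16_t height) {
    return 18 + static_cast<std::size_t>(width) * height;
}
```

With these assumptions, `tga_file_size(1600, 1600)` gives the same 2560018 bytes. Writing the header and then the whole pixel buffer in a single call keeps the I/O cost negligible next to the rendering loop.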
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall
What CPU do you have? On my Core 2 Duo at 1.86 GHz it takes 1.9 s.

real 0m1.919s
user 0m1.910s
sys 0m0.000s
Post 18 Mar 2012, 20:36
typedef



Joined: 25 Jul 2010
Posts: 2913
Location: 0x77760000
typedef
Can you convert this to Windows somehow?
Post 19 Mar 2012, 03:38
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall
typedef wrote:
Can you convert this to Windows somehow?


No, sorry. I do not use Windows.
Post 19 Mar 2012, 07:23
Endre



Joined: 29 Dec 2003
Posts: 212
Location: Budapest, Hungary
Endre
It's an old (nearly 5 years old) Core 2 Quad (Q6600 or similar) at 2.4 GHz.
Post 19 Mar 2012, 09:22
typedef



Joined: 25 Jul 2010
Posts: 2913
Location: 0x77760000
typedef
randall wrote:
typedef wrote:
Can you convert this to Windows somehow?


No, sorry. I do not use Windows.


Then can you share the algo?
Post 19 Mar 2012, 12:53
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall
typedef wrote:
randall wrote:
typedef wrote:
Can you convert this to Windows somehow?


No, sorry. I do not use Windows.


Then can you share the algo?


Sure. http://www.iquilezles.org/www/articles/distancefractals/distancefractals.htm
Post 19 Mar 2012, 12:59
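For readers who want the gist of the linked article without leaving the thread, here is a minimal C++ sketch of a distance-estimate Mandelbrot iteration. It is not taken from mandelbrot.asm: the function name `mandelbrot_distance`, the bailout value of 1024 and the default of 256 iterations are assumptions chosen for illustration, and the constant factor in the estimate differs between write-ups. The idea is to iterate the derivative dz alongside z and, once the point escapes, estimate the distance to the set as |z| * ln|z| / |dz|.

```cpp
#include <cassert>
#include <cmath>

// Approximate distance from c = (cx, cy) to the Mandelbrot set.
// Returns 0.0 for points that do not escape within max_iter iterations.
// Assumptions: bailout |z|^2 > 1024 and the |z| * ln|z| / |dz| form of the estimate.
double mandelbrot_distance(double cx, double cy, int max_iter = 256) {
    assert(max_iter > 0);
    double zx = 0.0, zy = 0.0;   // z, starts at 0
    double dx = 0.0, dy = 0.0;   // dz/dc, starts at 0
    for (int i = 0; i < max_iter; ++i) {
        // dz <- 2 * z * dz + 1 (uses the old z)
        const double ndx = 2.0 * (zx * dx - zy * dy) + 1.0;
        const double ndy = 2.0 * (zx * dy + zy * dx);
        // z <- z^2 + c
        const double nzx = zx * zx - zy * zy + cx;
        const double nzy = 2.0 * zx * zy + cy;
        dx = ndx; dy = ndy; zx = nzx; zy = nzy;

        const double m2 = zx * zx + zy * zy;
        if (m2 > 1024.0) {
            const double zmag = std::sqrt(m2);
            const double dmag = std::sqrt(dx * dx + dy * dy);
            return zmag * std::log(zmag) / dmag;
        }
    }
    return 0.0;  // treated as inside the set
}
```

A renderer would call this once per pixel and map the returned distance to a grey level, which gives sharp boundary detail compared with plain iteration counting.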
typedef



Joined: 25 Jul 2010
Posts: 2913
Location: 0x77760000
typedef
thanks
Post 19 Mar 2012, 14:32
Coty



Joined: 17 May 2010
Posts: 554
Location: ␀
Coty
Where does it display its run time?
Post 19 Mar 2012, 16:24
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall
Coty wrote:
Where does it display its run time?


Output is saved in a TGA file.
Post 19 Mar 2012, 16:53
Coty



Joined: 17 May 2010
Posts: 554
Location: ␀
Coty
Sorry, I found "mandelbrot.tga" in my home folder, and it generated it fine...

But I mean this right here:
Quote:
real 0m1.919s
user 0m1.910s
sys 0m0.000s

I am not seeing this displayed anywhere. I launch it from my home folder as "./mandelbrot"; is there a command parameter I need?

_________________
http://codercat.org/
Post 19 Mar 2012, 17:00
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
"time ./mandelbrot"
Post 19 Mar 2012, 17:04
Coty



Joined: 17 May 2010
Posts: 554
Location: ␀
Coty
Wow, I've been using Linux for how long? 3 years now? And I was not aware of the time command. You know how much time that would have saved me instead of doing it myself... >.<

Anyway, here are the results for a Core i3 @ 2.3 GHz...
Code:
real    0m1.268s
user    0m1.264s
sys     0m0.004s

_________________
http://codercat.org/
Post 19 Mar 2012, 17:22
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall
I have also written a C++ version, and the compiler-generated code is faster than mine. I wonder how this is happening. In my code I implemented complex number multiplication using only one mulps, one addps and a couple of shufps. The compiler generated lots of mulss and addss, and it is still faster... This is kind of depressing... How hard is it to beat the compiler?
Post 19 Mar 2012, 17:47
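To make the "one mulps, one addps and a couple of shufps" idea concrete, here is a hedged C++ sketch using SSE intrinsics, assuming an x86 target with SSE. It is not the code from mandelbrot.asm, whose exact register layout is not shown in the thread; the `ComplexF` struct, the function name `complex_mul_sse` and the lane arrangement are assumptions. This variant also uses one xorps to flip a sign, which the original may handle differently.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

struct ComplexF { float re, im; };

// (a + bi) * (c + di) = (ac - bd) + (ad + bc)i, computed with a single packed multiply.
ComplexF complex_mul_sse(ComplexF p, ComplexF q) {
    // Lane order 0..3: L = (a, b, a, b), R = (c, d, d, c)
    const __m128 L = _mm_setr_ps(p.re, p.im, p.re, p.im);
    const __m128 R = _mm_setr_ps(q.re, q.im, q.im, q.re);
    // Flip the sign bit of lane 1 so the product there becomes -bd (xorps).
    const __m128 sign = _mm_setr_ps(0.0f, -0.0f, 0.0f, 0.0f);
    const __m128 P = _mm_mul_ps(L, _mm_xor_ps(R, sign));              // mulps: (ac, -bd, ad, bc)
    // Swap neighbouring lanes (shufps): (-bd, ac, bc, ad)
    const __m128 Q = _mm_shuffle_ps(P, P, _MM_SHUFFLE(2, 3, 0, 1));
    const __m128 S = _mm_add_ps(P, Q);                                // addps: (ac-bd, ac-bd, ad+bc, ad+bc)
    alignas(16) float out[4];
    _mm_store_ps(out, S);
    return {out[0], out[2]};
}
```

Note that the shuffle and the add both depend on the multiply, so each Mandelbrot iteration forms one long dependency chain. A compiler emitting scalar mulss/addss can schedule the independent products side by side, which may be one reason the packed version is not automatically faster.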
gunblade



Joined: 19 Feb 2004
Posts: 209
gunblade
You should use PNG for the uploaded image - it stores greyscale very well (also images which are "patterned"/generated rather than a photo) without a loss in quality :)

But very neat little program, thanks.

As for the speed improvements: are you using any optimization (-O{1,2,3,4}) with GCC/G++? It might be worth comparing both unoptimized (-O0) and max optimization (-O4, which might break the program), and also things like -march=native -mcpu=native to optimize for the CPU you're compiling on (this will allow gcc to use things like SSE/MMX/etc.).

As for the speed gain by the compiler, it may have more to do with things like aligned accesses and multiple pipelines (do Intel CPUs still have that U/V pipeline, or is that Pentium 3 technology? ;) ). But yeah, things like making sure that you don't access the same register in two consecutive instructions will mean that the processor can run those in parallel (without the need for threads, etc.).

Another optimization you could maybe do (although GCC won't do this by default, even with -O4) would be adding threading support. In assembly, look at the clone() syscall and its man page; in C you could do the same, but normally you'd use pthreads or something similar to make it easier on yourself. At the moment your program will only ever use a single core, so this could nearly double the speed of processing on a dual-core, nearly quadruple it on a 4-core, etc. You just have to find a way to parallelize the work without causing issues. You'll probably need locks to make sure you don't write to the same memory at the same time, but that depends on how you do it: if you know how big the image is going to be, you should be able to split the image into multiple sections and process those sections simultaneously. Then the only "lock" you'll need will be at the end of the program, to make sure that every thread has finished (in case one runs slower than the others) before you return to the user.

I'll take a closer look at the code at some point and see if there are any obvious optimizations I can see.
Post 19 Mar 2012, 18:16
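The "split the image into sections and wait for every thread at the end" suggestion can be sketched as follows. This uses C++ `std::thread` rather than the clone() syscall or pthreads mentioned above, and it is not part of the posted program. The names `shade` and `render`, the fixed view window and the 255-iteration limit are hypothetical. Each thread writes only to its own band of rows, so no lock is needed; the joins at the end are the only synchronization.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical per-pixel function: plain escape-time count used as a grey level.
static std::uint8_t shade(double cx, double cy) {
    double zx = 0.0, zy = 0.0;
    int i = 0;
    while (i < 255 && zx * zx + zy * zy <= 4.0) {
        const double t = zx * zx - zy * zy + cx;
        zy = 2.0 * zx * zy + cy;
        zx = t;
        ++i;
    }
    return static_cast<std::uint8_t>(i);
}

// Render a size x size image, splitting the rows into one band per thread.
std::vector<std::uint8_t> render(int size, unsigned num_threads) {
    std::vector<std::uint8_t> image(static_cast<std::size_t>(size) * size);
    auto band = [&image, size](int y0, int y1) {
        for (int y = y0; y < y1; ++y) {
            for (int x = 0; x < size; ++x) {
                const double cx = -2.0 + 3.0 * x / size;   // assumed view window
                const double cy = -1.5 + 3.0 * y / size;
                image[static_cast<std::size_t>(y) * size + x] = shade(cx, cy);
            }
        }
    };
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const int y0 = static_cast<int>(static_cast<long long>(size) * t / num_threads);
        const int y1 = static_cast<int>(static_cast<long long>(size) * (t + 1) / num_threads);
        workers.emplace_back(band, y0, y1);   // bands never overlap
    }
    for (auto& w : workers) w.join();         // the only synchronization point
    return image;
}
```

Equal-height bands are the simplest split, but rows near the set take far longer than rows far from it, so handing out smaller interleaved chunks would balance the load better. On older toolchains the program may need to be built with -pthread.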
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall
Thanks for the reply.
I want to keep the program as simple (pure) as possible. No external libraries, only one thread.

I compiled the C++ version with:
g++ mandelbrot.cpp -o mandelbrot -O3
clang++ mandelbrot.cpp -o mandelbrot -O3

(clang is a bit faster in this case)

It would be great if you could find some way to improve the performance of my single-threaded assembly code.
Post 19 Mar 2012, 18:32
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
Madis731
There's a tradeoff in MMX/SSE code. You can choose to calculate 4 pixels at once, writing the SSE code so that the same algorithm runs on 4 independent pixels in parallel.
Another approach (to do *one* pixel very quickly) is to use SSE to calculate 4 parts of your algorithm in parallel, but to operate on a single pixel at a time.

The first approach is usually simpler and faster, but the second approach is good for situations where the second pixel depends on the first, etc. In that situation, optimizations that make a single 'pipe' faster are more profitable.

Now the optimizations:
EDIT2 - now I see what you've done, I've answered my own question.
Post 20 Mar 2012, 09:58
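The first approach (four pixels at once) might look like the sketch below, written with SSE intrinsics in C++ and assuming an x86 target with SSE. It is only an illustration of the idea, not a rewrite of mandelbrot.asm: the function name `mandelbrot4`, the plain escape-time test with |z|^2 <= 4 and the float iteration counters are assumptions.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Iterate z <- z^2 + c for four points at once.
// counts[k] receives the number of iterations for which point k still had |z|^2 <= 4.
// Counters are kept as floats, so max_iter should stay well below 2^24.
void mandelbrot4(const float cx[4], const float cy[4], int max_iter, int counts[4]) {
    const __m128 cr = _mm_loadu_ps(cx);
    const __m128 ci = _mm_loadu_ps(cy);
    const __m128 four = _mm_set1_ps(4.0f);
    const __m128 one = _mm_set1_ps(1.0f);
    __m128 zr = _mm_setzero_ps();
    __m128 zi = _mm_setzero_ps();
    __m128 n = _mm_setzero_ps();   // per-lane iteration counters

    for (int i = 0; i < max_iter; ++i) {
        const __m128 zr2 = _mm_mul_ps(zr, zr);
        const __m128 zi2 = _mm_mul_ps(zi, zi);
        // All-ones mask in lanes that have not escaped yet (NaN compares as false).
        const __m128 inside = _mm_cmple_ps(_mm_add_ps(zr2, zi2), four);
        if (_mm_movemask_ps(inside) == 0) break;       // every lane has escaped
        n = _mm_add_ps(n, _mm_and_ps(inside, one));    // count only live lanes
        const __m128 zri = _mm_mul_ps(zr, zi);
        zr = _mm_add_ps(_mm_sub_ps(zr2, zi2), cr);     // re: zr^2 - zi^2 + cr
        zi = _mm_add_ps(_mm_add_ps(zri, zri), ci);     // im: 2*zr*zi + ci
    }

    float tmp[4];
    _mm_storeu_ps(tmp, n);
    for (int k = 0; k < 4; ++k) counts[k] = static_cast<int>(tmp[k]);
}
```

Here every multiply does useful work for four pixels and no shuffles are needed, at the price of iterating until the slowest of the four pixels escapes.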
randall



Joined: 03 Dec 2011
Posts: 155
Location: Poland
randall
I have made one optimization in the innermost loop (using ucomiss instead of cmpps).

Now the results are (Core2 Duo 6300 @ 1.86 GHz):

ASM version:

real 0m1.841s
user 0m1.830s
sys 0m0.010s

C++ version (with -O3 flag):

real 0m1.133s
user 0m1.120s
sys 0m0.010s
Post 20 Mar 2012, 11:18
gunblade



Joined: 19 Feb 2004
Posts: 209
gunblade
I was looking at the code last night, but it looks very neat as it is, and there are no obvious CPU stall points that I can see. It might be worth removing the TGA output code temporarily and comparing them again, to make sure it's not just that you're writing the output file in a suboptimal way. You're doing one big write, which should be fast, but it's worth testing to make sure that this is not the cause. Just remember to take the TGA output out of both the C and asm versions (it wouldn't be fair to only take it out of the asm version :) ).

I was going to use callgrind (from the valgrind package/suite of tools) to try to profile it and see where the slowest sections are. You may want to try it on both the C and asm versions, and see what it says about execution times of loops/functions.

I'll let you know if I find anything obvious, but as I say, the code looks really good. I worry it might be an alignment thing since it's SSE, but you seem to have done all the alignment required, so it would be weird for it not to work well.

EDIT: Well, valgrind wasn't useful. It only separates at call level, not loop level, so for your program (which has very few calls) it's way too vague. You might have to do it manually by adding calls to the clock_gettime syscall to read the CLOCK_PROCESS_CPUTIME_ID clock, which gives a "High-resolution per-process timer from the CPU." That is the same as the time command, but it can be inserted in various places in your code to find what's taking time. Just be careful, because this syscall itself will add to the process's execution time, so you may want to insert only one call, move it around, and only count UP to this syscall.
Post 20 Mar 2012, 11:37
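For the C++ version, the manual timing described above could be sketched like this. `clock_gettime` and `CLOCK_PROCESS_CPUTIME_ID` are the POSIX names quoted in the post; the helper name `process_cpu_seconds` is hypothetical. In the assembly version the same measurement would be a direct clock_gettime syscall, which is not shown here.

```cpp
#include <time.h>  // POSIX clock_gettime, CLOCK_PROCESS_CPUTIME_ID

// CPU time consumed by the current process so far, in seconds.
double process_cpu_seconds() {
    timespec ts{};
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return static_cast<double>(ts.tv_sec) + static_cast<double>(ts.tv_nsec) * 1e-9;
}
```

Calling it once before and once after the section of interest and subtracting the two values gives the CPU time spent in between, which should be comparable with the "user" plus "sys" figures printed by the time command.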
Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.