flat assembler
Message board for the users of flat assembler.

Index > Linux > Mandelbrot renderer




randall 20 Mar 2012, 15:55
I have uploaded a new version. It has a faster compare and a faster sign change (it was mulps, now it is just xorps).

real 0m1.764s
user 0m1.750s
sys 0m0.010s
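A minimal sketch of the sign change described above, written in C with SSE intrinsics rather than fasm so it can be compiled and checked on its own. The helper names are hypothetical, and which lanes the real code flips is an assumption (only lane 1 here). `_mm_mul_ps` corresponds to mulps and `_mm_xor_ps` to xorps.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: _mm_mul_ps (mulps), _mm_xor_ps (xorps) */

/* Hypothetical helper: negate lane 1 by multiplying with (1, -1, 1, 1). */
static __m128 negate_lane1_mul(__m128 v)
{
    /* _mm_set_ps takes its arguments from lane 3 down to lane 0. */
    const __m128 signs = _mm_set_ps(1.0f, 1.0f, -1.0f, 1.0f);
    return _mm_mul_ps(v, signs);
}

/* Same result by flipping the sign bit only: -0.0f has just the sign
   bit set and 0.0f has no bits set, so the XOR changes lane 1 alone. */
static __m128 negate_lane1_xor(__m128 v)
{
    const __m128 mask = _mm_set_ps(0.0f, 0.0f, -0.0f, 0.0f);
    return _mm_xor_ps(v, mask);
}

/* Check that both variants agree on one sample vector. */
static void sign_flip_selfcheck(void)
{
    float a[4], b[4];
    /* lanes 0..3 = 1.5, 2.0, -3.0, 4.0 */
    const __m128 v = _mm_set_ps(4.0f, -3.0f, 2.0f, 1.5f);

    _mm_storeu_ps(a, negate_lane1_mul(v));
    _mm_storeu_ps(b, negate_lane1_xor(v));
    for (int i = 0; i < 4; ++i)
        assert(a[i] == b[i]);
}
```

The XOR variant only touches the sign bit, so it does not go through the floating-point multiplier at all.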



randall 20 Mar 2012, 16:00
gunblade wrote:
Was looking at the code last night, but it looks very neat as it is, and there are no obvious CPU stall points that I can see. It might be worth temporarily removing the TGA output code and comparing them again, to make sure it's not just that you're writing the output file in a suboptimal way. You're doing one big write, which should be fast, but it's worth testing to make sure this is not the cause. Just remember to take the TGA output out of both the C and asm versions; it wouldn't be fair to take it out of only the asm version :)

I was going to use callgrind (from the valgrind suite of tools) to try to profile it and see where the slowest sections are. You may want to try it on both the C and asm versions and see what it says about the execution times of loops/functions.

I'll let you know if I find anything obvious, but as I say, the code looks really good. I worry it might be an alignment thing since it's SSE, but you seem to have done all the alignment required, so it would be weird for it not to work well.

EDIT: Well, valgrind wasn't useful. It only separates at call level, not loop level, so for your program (which has very few calls) it's way too vague. You might have to do it manually by adding calls to the clock_gettime syscall to read the CLOCK_PROCESS_CPUTIME_ID clock, which gives a "High-resolution per-process timer from the CPU.", so the same as the time command, but it can be inserted in various places in your code to find what's taking time. Just be careful, because this syscall itself will add time to the process's execution time, so you may want to insert only one call, move it around, and only count UP to this syscall.


Thanks. TGA saving isn't a bottleneck.
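For reference, the manual timing suggested in the quoted post could look roughly like this on the C side. This is only a sketch: `process_cpu_seconds`, `busy_work` and `time_busy_work` are hypothetical names, and `busy_work` stands in for whatever section is being measured.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime in strict C modes */
#endif
#include <assert.h>
#include <time.h>

/* Hypothetical helper: CPU time used by this process so far, in seconds. */
static double process_cpu_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    assert(rc == 0);  /* the clock is expected to be available on Linux */
    (void)rc;         /* avoid an unused-variable warning under NDEBUG */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in for the section being measured. */
static double busy_work(int n)
{
    volatile double acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc += (double)i * 0.5;
    return acc;
}

/* CPU seconds spent inside busy_work(n). */
static double time_busy_work(int n)
{
    double t0 = process_cpu_seconds();
    (void)busy_work(n);
    double t1 = process_cpu_seconds();
    return t1 - t0;
}
```

Reading the clock twice around one region keeps the overhead of the call itself small compared with the region, which is the concern raised in the quote.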



Madis731 21 Mar 2012, 06:45
How much faster is the compiled version? Twice as fast or 10% faster? Depending on that, we can start to search for reasons.



randall 21 Mar 2012, 09:13
Madis731 wrote:
How much faster is the compiled version? Twice as fast or 10% faster? Depending on that, we can start to search for reasons.


Now the results are (Core2 Duo 6300 @ 1.86 GHz):

ASM version:

real 0m1.764s
user 0m1.830s
sys 0m0.010s

C++ version (with -O3 flag):

real 0m1.133s
user 0m1.120s
sys 0m0.010s



tthsqe 22 Mar 2012, 17:21
randall, thanks for the info on the GPU.
I feel I should help you a little with your CPU version.
First I should mention
http://board.flatassembler.net/topic.php?t=12722
where Kuemmel and I have created a super fast Mandelbrot renderer for Windows.
For example, the image shown on that page is 1080x1920 and took just 0.05 sec.
If you want to increase the speed you have three independent options:
- multithread (use multiple cores)
- vectorize (use mulps instead of mulss)
- parallelize (unroll loops and handle multiple points per loop)
The downside is that this is going to increase the complexity of your code...
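To illustrate the third option, here is a rough scalar C sketch that iterates two independent points in the same loop, so the CPU can overlap the two dependency chains. The function name `mandel_iter2`, the escape radius of 2 and the way iterations are counted are assumptions made for this sketch, not taken from either renderer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: iterate z = z*z + c for two points at once.
   out[0] and out[1] receive the iteration counts (at most max_iter). */
static void mandel_iter2(float cx0, float cy0, float cx1, float cy1,
                         int max_iter, int out[2])
{
    assert(out != NULL);
    float x0 = 0.0f, y0 = 0.0f, x1 = 0.0f, y1 = 0.0f;
    int n0 = 0, n1 = 0;
    int live0 = 1, live1 = 1;  /* 1 while the point has not escaped */

    for (int i = 0; i < max_iter; ++i) {
        float xx0 = x0 * x0, yy0 = y0 * y0;
        float xx1 = x1 * x1, yy1 = y1 * y1;

        /* Once a point has escaped it stays dead, even if its value
           later overflows to inf or NaN. */
        live0 &= (xx0 + yy0 <= 4.0f);
        live1 &= (xx1 + yy1 <= 4.0f);
        if (!(live0 | live1))
            break;

        /* Branch-free counting: add 1 only for points still alive. */
        n0 += live0;
        n1 += live1;

        /* The two updates do not depend on each other. */
        y0 = 2.0f * x0 * y0 + cy0;
        x0 = xx0 - yy0 + cx0;
        y1 = 2.0f * x1 * y1 + cy1;
        x1 = xx1 - yy1 + cx1;
    }
    out[0] = n0;
    out[1] = n1;
}
```

The loop only stops early when both points have escaped, so some work is wasted on a point that escaped first. That is the usual trade-off of handling several points per loop.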



tthsqe 22 Mar 2012, 17:26
For example, this looks especially slow:
Code:
                ; dz = 2.0 * z * dz + (1.0,0.0)
                movaps      xmm0,xmm14
                movaps      xmm1,xmm15
                shufps      xmm0,xmm0,01000100b
                shufps      xmm1,xmm1,00010100b
                mulps       xmm0,xmm1
                xorps       xmm0,dqword [g_inv_y_sign]
                movaps      xmm1,xmm0
                shufps      xmm0,xmm0,00001000b
                shufps      xmm1,xmm1,00001101b
                addps       xmm0,xmm1
                addps       xmm0,xmm0
                addss       xmm0,[g_1_0]
                movaps      xmm15,xmm0
                ; z = z * z + c
                movaps      xmm0,xmm14
                movaps      xmm1,xmm0
                shufps      xmm0,xmm0,00000100b
                shufps      xmm1,xmm1,01010100b
                mulps       xmm0,xmm1
                xorps       xmm0,dqword [g_inv_y_sign]
                movaps      xmm1,xmm0
                shufps      xmm0,xmm0,00001000b
                shufps      xmm1,xmm1,00001101b
                addps       xmm0,xmm1
                addps       xmm0,xmm13
                movaps      xmm14,xmm0    


I would NOT store the real and imaginary parts in the same vector, because then shufps is needed to move things around.
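A hedged sketch of that layout in C with SSE intrinsics: the real parts of four points live in one register and the imaginary parts in another, so z = z * z + c needs only mulps/addps/subps and no shuffles. The name `mandel4`, the escape test and the float-based counting are assumptions for illustration, and the dz update is left out.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical kernel: iterate four points at once.
   cx[] holds the four real parts of c, cy[] the four imaginary parts. */
static void mandel4(const float cx[4], const float cy[4],
                    int max_iter, int counts[4])
{
    assert(max_iter >= 0);
    const __m128 cr   = _mm_loadu_ps(cx);
    const __m128 ci   = _mm_loadu_ps(cy);
    const __m128 four = _mm_set1_ps(4.0f);
    const __m128 one  = _mm_set1_ps(1.0f);
    __m128 zr = _mm_setzero_ps();
    __m128 zi = _mm_setzero_ps();
    __m128 n  = _mm_setzero_ps();      /* per-lane counts, kept as floats */
    __m128 alive = _mm_cmpeq_ps(zr, zr);  /* 0 == 0: all bits set */

    for (int i = 0; i < max_iter; ++i) {
        __m128 rr = _mm_mul_ps(zr, zr);
        __m128 ii = _mm_mul_ps(zi, zi);

        /* A lane stays alive only while |z|^2 <= 4 has always held. */
        alive = _mm_and_ps(alive, _mm_cmple_ps(_mm_add_ps(rr, ii), four));
        if (_mm_movemask_ps(alive) == 0)
            break;
        n = _mm_add_ps(n, _mm_and_ps(alive, one));

        /* z = z*z + c with real and imaginary parts kept apart. */
        __m128 ri = _mm_mul_ps(zr, zi);
        zr = _mm_add_ps(_mm_sub_ps(rr, ii), cr);
        zi = _mm_add_ps(_mm_add_ps(ri, ri), ci);
    }

    float tmp[4];
    _mm_storeu_ps(tmp, n);
    for (int k = 0; k < 4; ++k)
        counts[k] = (int)tmp[k];  /* exact for counts below 2^24 */
}
```

Compared with the packed (x, y) layout in the listing above, every lane does useful work on a different pixel, and the sign flip disappears because xx - yy is a plain subps.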



randall 22 Mar 2012, 19:43
tthsqe wrote:
randall, thanks for the info on the GPU.
I feel I should help you a little with your CPU version.
First I should mention
http://board.flatassembler.net/topic.php?t=12722
where Kuemmel and I have created a super fast Mandelbrot renderer for Windows.
For example, the image shown on that page is 1080x1920 and took just 0.05 sec.
If you want to increase the speed you have three independent options:
- multithread (use multiple cores)
- vectorize (use mulps instead of mulss)
- parallelize (unroll loops and handle multiple points per loop)
The downside is that this is going to increase the complexity of your code...


Impressive work. And thanks for the tips.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
