flat assembler
Message board for the users of flat assembler.

Help with converting Mandelbrot C to assembly (speed optim.)

tthsqe



Joined: 20 May 2009
Posts: 730
tthsqe
hopcode, it's time to actually test some code.
Here I have a reasonable and readable implementation of points (1)-(4), and I have completely ignored memory considerations.
Could you do whatever you think you can to speed it up?
The source should not be difficult to modify.
Let me know if it crashes on your machine, for I may have used some instructions not supported by your CPU.
Post 07 May 2013, 03:17
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode
ok, first of all i think this is great stuff, as i have told you already.
i am not a graphics geek, as you know. i experienced some crashes: an Access Violation after resizing the window, presumably in the .colorloop at
; write to ppvBits right here
movd dword[rbx],xmm7
but that is not the issue at the moment.
now i will do my best to make some suggestions, and if i am wrong or what i say does not seem to fit :) , please dont hesitate to tell me.

1) compact types of memory
Code:
 a) constants like const_f8_1 should ALL be hardcoded and aligned in the code
 b) colors, being 1024 bytes, can stay on the thread stack (there you have enough room with sub rsp,32*64) or in OutputBuffer
 c) static variables like Bitmap.x go on the stack or in OutputBuffer, to avoid cache invalidation every time part of OutputBuffer is discarded
 d) coords, being 8kb and accessed frequently, may stay in OutputBuffer. allocate it larger
 e) other user variables like zoom should be copied onto the thread stack or into OutputBuffer. find a suitable way to modify them
    (example: a freeze/update/resume loop)
    
2) coding
Code:
 a) for such an output display size, Single Precision should be enough (tell me here, please).
    the instructions are DP but the calculations are SP. also, as you know, it may work on 4x SP, and 8x with AVX

 b) please solve once and for all those 2 x FYL2X. they take 1/3 of all cycles.
    unrolling is required, or call 2x an alternative function using polynomials or a table,
    example from here
    http://jrfonseca.blogspot.de/2008/09/fast-sse2-pow-tables-or-polynomials.html
    or
    http://devmaster.net/forums/topic/6679-approximate-math-library/
    but i suggest the open-source Intel Approximate Math Library. i cannot
    find it ATM on the net for download (and i think i know why :( ). i have it; once i find it i may share it via PM.

 c) FPU functions should not survive in an SSE context, imo. you use them already in the FPU version of
    FractalExplorer and ok, they do not interfere with XMM but... it is a matter
    of design. and it seems you cannot get rid of that FPU stuff here using SSE ;)
after this, you may think about how to lay out/access the data between L1 and L2. you know the L1 is limited to
128kb or 256kb. at 256 x 256 we already have a vector of 256kb, already unmanageable
because of too frequent invalidation of the cache. it warms the CPU too much imo, already at a depth of 800.
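
A minimal sketch of the idea in point 2b, written in Python only to show the arithmetic (an SSE version would do the same steps with integer shifts and masks on the XMM registers). It splits a single-precision value into exponent and mantissa and approximates log2 of the mantissa with a quadratic. The function name `fast_log2` and the coefficient 0.34 are assumptions picked by hand for illustration; they are not taken from the Intel library or from the linked articles, and the error of this particular polynomial is a little below 0.01.

```python
import math
import struct

def fast_log2(x: float) -> float:
    """Approximate log2(x) for a positive, normal single-precision x."""
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Bits 23..30 hold the biased exponent; remove the bias of 127.
    exponent = ((bits >> 23) & 0xFF) - 127
    # Keep the 23 mantissa bits and force the exponent field to 127,
    # which yields a float m in [1, 2).
    mant_bits = (bits & 0x007FFFFF) | 0x3F800000
    m = struct.unpack("<f", struct.pack("<I", mant_bits))[0]
    t = m - 1.0
    # Quadratic that is exact at t = 0 and t = 1 (hand-picked coefficient).
    return exponent + t + 0.34 * t * (1.0 - t)

if __name__ == "__main__":
    for value in (0.3, 1.0, 1.5, 8.0, 1000.0):
        print(value, fast_log2(value), math.log2(value))
```

A higher-degree polynomial or a small table, as in the linked articles, trades a few more multiplies or loads for a smaller error.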

Cheers,
:D

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 07 May 2013, 11:52
fredlllll



Joined: 17 Apr 2013
Posts: 56
fredlllll
@~@ pheeew
you are always talking about dram and vram. what exactly do you mean by dram? i thought the only thing i can use in user mode would be vram.


hopcode, you asked if single would be enough ("for such an output display size, Single Precision should be enough (tell me here, please)"). maybe in the case of the picture above, but not in general. i made a mandelbrot shader for DirectX which is limited to float (shader model 3), and after a short while of zooming in i get artifacts (at first they look like color splatters, then like rectangles). although the viewable space of the fractal is enormous, i really recommend double.
a gpu really is a monster at such things. the fractal runs at 60fps with ~400 iterations at 1680x1050 px. if there is interest i could upload it.

btw i want to set up a cloud service for calculating fractals (sometime), or only the mandelbrot, and use the google maps api to zoom in and stuff.
that is one reason why i want to optimize this. would someone be interested in this project?
Post 07 May 2013, 15:43
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode
freddy wrote:
i get artifacts(first looks like color splatters and then like rectangles)....
after 5,6 seconds i got the picture below
Quote:
...a gpu really is a monster in such things. the fractal runs at 60fps with ~400 iterations and 1680x1050 px...
can you tell me how the Mandelbrot Explorer runs in comparison?
i have found the Intel Approx Math lib. if someone is interested i can attach it to a PM. i will be away till tomorrow afternoon... for a, ehm... performance session... 8)
Cheers,
:D


[Attachment: artifacts.jpg, 88.04 KB]



_________________
⠓⠕⠏⠉⠕⠙⠑
Post 07 May 2013, 17:40
fredlllll



Joined: 17 Apr 2013
Posts: 56
fredlllll
seems i lied a little.
1000 iterations and still 60 fps if i dont have too many "max iteration" pixels on the screen. viewing the full mandelbrot at 1280x720 i get ~25 fps. so the gpu is ~150 times faster :P i didnt zoom in as far as i could, because i thought this would look better ;) but you can still see the rectangle shapes (2nd screenshot).

when calculating the julia fractal i noticed that sometimes the pattern is a bit skewed (1st screenshot).

with "i get artifacts(first looks like color splatters and then like rectangles)...." i meant when zooming in. as i use the gpu i can zoom in and out in realtime. changing the c of the julia fractal in realtime really looks cool (here is a video of the shader using randomly generated C's and changing its gradient: http://www.youtube.com/watch?v=5xj9qF5vNnM )


[Attachment: gpu2.png, 255.01 KB]

[Attachment: gpu.png, 175.32 KB]



_________________
--for science
Post 07 May 2013, 18:15
tthsqe



Joined: 20 May 2009
Posts: 730
tthsqe
fredlllll,
are you interested in writing code for CPUs or GPUs?
GPUs tend to be monsters at single precision and comparable with CPUs at double precision.

Here is a comparison of what I obtained with my own tests on a 2600K (SSE and AVX) and a GTX 580 (each iteration is about 8 flops):
Code:
double precision:
SSE: 6.4 giga-iterations per second
AVX: 12.7 giga-iterations per second
GTX 580: 13.8 giga-iterations per second

single precision:
GTX 580: 76.3 giga-iterations per second (about 0.6 tera-flops!)
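
The flops figure in the table follows directly from the stated "about 8 flops" per iteration. A small sketch of that arithmetic, using only the numbers quoted above (the dictionary layout and function name are illustrative assumptions):

```python
FLOPS_PER_ITERATION = 8  # "about 8 flops" per iteration, as stated in the post

# Giga-iterations per second, copied from the table above.
GIGA_ITERATIONS = {
    ("double", "SSE"): 6.4,
    ("double", "AVX"): 12.7,
    ("double", "GTX 580"): 13.8,
    ("single", "GTX 580"): 76.3,
}

def gflops(giga_iterations_per_second: float) -> float:
    """Convert giga-iterations/s into giga-flops/s."""
    return giga_iterations_per_second * FLOPS_PER_ITERATION

if __name__ == "__main__":
    for (precision, device), rate in GIGA_ITERATIONS.items():
        print(f"{precision:6} {device:8} {gflops(rate):7.1f} GFLOPS")
```

For the single-precision GTX 580 row this gives roughly 610 GFLOPS, which matches the "about 0.6 tera-flops" remark.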
    
Post 08 May 2013, 03:17
ASM-Man



Joined: 11 Jan 2013
Posts: 65
ASM-Man
You're fantastic guys.
Post 08 May 2013, 03:24
fredlllll



Joined: 17 Apr 2013
Posts: 56
fredlllll
tthsqe,

interested in both, but unfortunately writing code for gpus is not that easy. nvidia uses cuda, while ati uses opencl. i dont know which of them i should prefer (i once worked with cuda and didnt like it). i made my implementation with a shader (HLSL), so the interpolation between the 4 corners is done by the gpu's rasterizer. so maybe shaders are faster?

did you make this with cuda?
i dont know what the performance difference is between cuda and standard shaders, or if there is any.

btw i made my tests with an ati hd6950 2 GB

and i still dont know what you mean by dram ;)
Post 08 May 2013, 09:36
nop



Joined: 01 Sep 2008
Posts: 165
Location: right here left there
nop
fredlllll wrote:
and i still dont know what you mean with dram ;)
dram is dynamic ram, ordinary user memory. it can also be called vram if shared, but true vram is video ram on the video card, dual-ported for cpu access and gpu access at the same time
Post 08 May 2013, 10:02
fredlllll



Joined: 17 Apr 2013
Posts: 56
fredlllll
i dont think they mean video ram when talking about optimizing the memory access...
Post 09 May 2013, 16:44
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17717
Location: In your JS exploiting you and your system
revolution
VRAM (video ram) is generally slow, so optimising writes to VRAM is important if your goal is to make it as fast as possible.
Post 09 May 2013, 23:25
fredlllll



Joined: 17 Apr 2013
Posts: 56
fredlllll
but in the case of my code i'm not writing to vram. so why talk about it??

_________________
--for science
Post 10 May 2013, 07:57
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17717
Location: In your JS exploiting you and your system
revolution
Okay, if you are not writing to vram (i.e. you are not displaying the pictures) then why the optimisation? You can just do the rendering overnight(s) and forget about making it 1% faster.
Post 10 May 2013, 10:00
fredlllll



Joined: 17 Apr 2013
Posts: 56
fredlllll
first i just want to optimize for the fun of it.
and second i want to make a cloud service for calculating fractals and this should of course not waste time. because 1% with 20000 participants will do something.
Post 10 May 2013, 12:17
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17717
Location: In your JS exploiting you and your system
revolution
fredlllll wrote:
first i just want to optimize for the fun of it.
Okay.
fredlllll wrote:
and second i want to make a cloud service for calculating fractals and this should of course not waste time. because 1% with 20000 participants will do something.
20k users is a lot. Can you share the URL with us?
Post 10 May 2013, 12:20
fredlllll



Joined: 17 Apr 2013
Posts: 56
fredlllll
"i want to make" == future. but i expect this many users. and of course i'd like to involve you asm guys, because your skill is way higher than mine ^__^
Post 10 May 2013, 14:50
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode
colours map to the iterations on that pixel. having, say, 1024 bits of precision per pixel,
we would not be able to represent it "truly" on our normal display, which is normally max 32bpp.

a cloud server is of course not pointless (you can actually calculate the set)
but not useful, video cards being AFAIK incapable of colouring at such a high bpp without interpolation and,
whenever zooming, the color there may belong to a larger palette than 32bpp.

for the same reason, an explorer using, say, 256 colours in 16 iterations on the same surface
works without awareness of the mandelbrot properties.
Cheers,
:D

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 12 May 2013, 00:59
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
hopcode
i would like to stay with single precision and 24bpp, but that is IMO not possible without a huge approximation error.
Code:
+1.99999988 <- 0x3FFFFFFF
+2.00000000 <- 0x40000000
+2.00000024 <- 0x40000001
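
The three bit patterns above can be checked by reinterpreting each 32-bit integer as an IEEE 754 single. A short sketch (the helper name `bits_to_float32` is an assumption for illustration):

```python
import struct

def bits_to_float32(bits: int) -> float:
    """Reinterpret a 32-bit unsigned integer as an IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

if __name__ == "__main__":
    for pattern in (0x3FFFFFFF, 0x40000000, 0x40000001):
        print(f"0x{pattern:08X} -> {bits_to_float32(pattern):.8f}")
```

The step between neighbouring values is 2^-23 just below 2.0 and 2^-22 just above it, so the absolute resolution halves each time the magnitude crosses a power of two.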
    
then consider that those numbers in c will be squared lots of times. one solution may be
a custom decimal fixed-point format using the sign bit at bit 31, just like our dword, and a scaling factor
from 10^-1 to max 10^-9. this may give some advantages:

    - use of the full 24bpp palette unsigned, or the full but signed 32bpp
    - normal add/sub/div/mul operations with a min of -2.147483647 and a max of +2.147483646 (at scaling factor 10^-9),
    just for the purpose of the magnitude of a point in our set
    - zooming by increasing/decreasing the scaling factor accordingly
    - auto-carry on sign, axis
    - auto-overflow around 4.0
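
A rough sketch of what such a decimal fixed-point format could look like, assuming a two's-complement signed 32-bit value with a fixed scaling factor of 10^-9. Both of those choices, and all names below, are assumptions for illustration and may differ from the sign-bit layout the post has in mind. Python integers do not overflow, so the 32-bit range is checked explicitly; in assembly the product would need a 64-bit intermediate before dividing by the scale.

```python
SCALE = 10 ** 9                        # scaling factor 10^-9
INT32_MIN, INT32_MAX = -2 ** 31, 2 ** 31 - 1

def to_fixed(x: float) -> int:
    """Convert a real number to fixed point, refusing values out of range."""
    value = round(x * SCALE)
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError(f"{x} does not fit in 32 bits at scale 1e-9")
    return value

def to_float(value: int) -> float:
    return value / SCALE

def fx_mul(a: int, b: int) -> int:
    """Fixed-point multiply; floor division rounds toward minus infinity."""
    return (a * b) // SCALE

def step(zr: int, zi: int, cr: int, ci: int) -> tuple:
    """One Mandelbrot iteration z = z*z + c on fixed-point components."""
    new_zr = fx_mul(zr, zr) - fx_mul(zi, zi) + cr
    new_zi = 2 * fx_mul(zr, zi) + ci
    return new_zr, new_zi

if __name__ == "__main__":
    c = (to_fixed(-0.75), to_fixed(0.1))
    z = (0, 0)
    for _ in range(5):
        z = step(*z, *c)
        print(to_float(z[0]), to_float(z[1]))
```

Add and subtract are plain integer operations here; only multiply and divide need the rescale.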

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 12 May 2013, 08:34
fredlllll



Joined: 17 Apr 2013
Posts: 56
fredlllll
huh? i dont know how you map your colors, but i only take the number of iterations it took |Z| to get bigger than a limit. so with 1000 iterations i only have 1000 possible colors, which is easily possible for today's graphics cards and displays :P
and i want to use the google maps api for the cloud, so people can zoom in like in google maps. so for example you can view a fractal which uses 100000 iterations without any problems. of course we need some decent hard drives for that amount of data :P but i think software comes first ;)
should i post this in the idea forum?
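
The colouring scheme described above (count the iterations until |Z| exceeds a limit, then use that count as the colour) can be sketched as follows. This is a plain escape-time loop under common assumptions: a limit of 2.0, a grey-scale palette, and black for points that never escape. None of these details come from the shader itself.

```python
def escape_iterations(cr: float, ci: float, max_iter: int = 1000,
                      limit: float = 2.0) -> int:
    """Return how many iterations it takes |Z| to exceed `limit`,
    or `max_iter` if it never does within the budget."""
    zr, zi = 0.0, 0.0
    limit_sq = limit * limit
    for n in range(max_iter):
        if zr * zr + zi * zi > limit_sq:
            return n
        zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
    return max_iter

def build_palette(max_iter: int) -> list:
    """One grey level per possible iteration count, plus black for 'inside'."""
    palette = [(255 * n // max_iter,) * 3 for n in range(max_iter)]
    palette.append((0, 0, 0))
    return palette

if __name__ == "__main__":
    palette = build_palette(1000)
    for point in ((0.0, 0.0), (0.3, 0.5), (2.0, 2.0)):
        n = escape_iterations(*point)
        print(point, n, palette[n])
```

With this mapping the number of distinct colours is bounded by the iteration budget, not by the numeric precision of the coordinates.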
Post 12 May 2013, 09:07
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17717
Location: In your JS exploiting you and your system
revolution
fredlllll: The problem with single precision comes when you zoom in.

There are a number of programs available that do deep zooming on these sorts of fractals that use arbitrary precision computations to support the necessary level of detail separation. With single precision a small section when zoomed in can all end up being the same coordinates without any differentiation and the entire image is one uniform colour.
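
The loss of differentiation described above is easy to reproduce: once the distance between neighbouring pixels drops below the spacing of single-precision values, many columns round to the same coordinate. A small sketch (the centre, width and step sizes are arbitrary example values, and the helper names are assumptions):

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def distinct_columns(center: float, pixel_step: float, width: int) -> int:
    """Count how many different x coordinates survive rounding to float32."""
    coords = {
        to_float32(center + (i - width // 2) * pixel_step)
        for i in range(width)
    }
    return len(coords)

if __name__ == "__main__":
    for step in (1e-3, 1e-6, 1e-8, 1e-10):
        print(step, distinct_columns(-0.75, step, 800))
```

At a step of 1e-3 all 800 columns stay distinct, while at 1e-10 the whole row collapses onto a handful of values, which shows up on screen as blocks or a uniform colour.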
Post 12 May 2013, 09:14

Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.