flat assembler
Message board for the users of flat assembler.
Help with converting Mandelbrot C to assembly (speed optim.)
hopcode 07 May 2013, 11:52
ok, first of all, i think this is great stuff, and i told you so already.
as you already know, i am not a graphics geek. i experienced some crashes: an Access Violation after resizing the window, presumably in the .colorloop at
; write to ppvBits right here
movd dword[rbx],xmm7
but that is not the issue at the moment. now i will do my best to offer some suggestions, and if i am wrong or what i say doesn't seem to fit, please don't hesitate to tell me.
1) compact types of memory
Code:
a) constants like const_f8_1 should ALL be hardcoded and aligned in the code
b) colors, being 1024 bytes, can stay on the thread stack (you have enough room there: sub rsp,32*64) or in OutputBuffer
c) static variables like Bitmap.x should go on the stack or in OutputBuffer, to avoid cache invalidation every time part of OutputBuffer is discarded
d) coords, being 8kb and accessed frequently, may stay in OutputBuffer. allocate it larger
e) other user variables like zoom should be copied to the thread stack (find a suitable way to modify them, for example a freeze/update/resume loop) or to OutputBuffer
Code:
a) for such an output display size, single precision should be enough (tell me here, please). the instructions are DP but the calculations are SP. also, as you know, it could work on 4x SP, and 8x with AVX
b) please solve those 2 x FYL2X once and for all. they take 1/3 of all cycles. unrolling is required, or call 2x an alternative function using polynomials or tables, for example from http://jrfonseca.blogspot.de/2008/09/fast-sse2-pow-tables-or-polynomials.html or http://devmaster.net/forums/topic/6679-approximate-math-library/ . i would suggest the open-source Intel Approximate Math Library, but i cannot find it on the net for download ATM (and i think i know why). i have it; as soon as i find it i may share it via PM.
c) FPU functions should not survive in an SSE context, imo. you already use them in the FPU version of FractalExplorer and ok, they do not interfere with XMM, but... it is a matter of design. and it seems you cannot get rid of that FPU stuff here using SSE
128kb or 256kb.
at 256 x 256 we already have a vector of 256kb, which is already unmanageable because the cache gets invalidated too frequently. it warms the CPU too much imo, already at a depth of 800.
Cheers,
_________________ ⠓⠕⠏⠉⠕⠙⠑
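Point (b) above, replacing the two FYL2X calls with a polynomial-style routine, can be sketched in C as follows. This is only an illustration under stated assumptions: the helper name approx_log2 is made up, it is not the code from the linked article or from the Intel library, and it swaps in a truncated atanh series instead of a fitted minimax polynomial. It assumes a positive, normal, finite input.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the thread): log2(x) for positive, normal,
   finite doubles, using integer bit tricks, one division and a short
   polynomial. No x87 FYL2X involved. */
static double approx_log2(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);               /* reinterpret the double */

    int biased = (int)((bits >> 52) & 0x7FF);     /* 11-bit exponent field */
    assert((bits >> 63) == 0 && biased != 0 && biased != 0x7FF);
    int e = biased - 1023;

    /* Overwrite the exponent field with the bias so that m lies in [1, 2). */
    bits = (bits & 0x000FFFFFFFFFFFFFULL) | 0x3FF0000000000000ULL;
    double m;
    memcpy(&m, &bits, sizeof m);

    /* Re-centre m around 1 so that |s| below stays small:
       m ends up in (sqrt(1/2), sqrt(2)]. */
    if (m > 1.4142135623730951) {
        m *= 0.5;
        e += 1;
    }

    /* ln(m) = 2*atanh(s) with s = (m-1)/(m+1); series truncated after s^7. */
    double s  = (m - 1.0) / (m + 1.0);
    double s2 = s * s;
    double ln_m = 2.0 * s * (1.0 + s2 * (1.0 / 3.0 + s2 * (1.0 / 5.0 + s2 * (1.0 / 7.0))));

    return (double)e + ln_m * 1.4426950408889634; /* multiply by 1/ln(2) */
}
```

The exponent extraction and the Horner-style evaluation map naturally onto integer and SSE2 instructions, which is the reason for writing it this way. By a rough estimate the truncation error stays below about 1e-7, which may or may not be enough for smooth colouring; a fitted polynomial or a table, as in the links above, could trade the division for more coefficients.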
fredlllll 07 May 2013, 15:43
@~@ pheeew
you are always talking about dram and vram. what exactly do you mean by dram? i thought the only thing i can use in user mode is vram.
hopcode, you asked if single would be enough ("for such an output display size, Single Precision should be enough (tell me here, please)"). maybe in the case of the picture above, but not in general. i made a mandelbrot shader for DirectX which is limited to float (shader model 3), and after a short while of zooming in i get artifacts (first they look like color splatters and then like rectangles). although the viewable space of the fractal is enormous, i really recommend double.
a gpu really is a monster at such things: the fractal runs at 60fps with ~400 iterations at 1680x1050 px. if there is interest i could upload it.
btw i want to set up a cloud service for calculating fractals (sometime), or only the mandelbrot, and use the google maps api to zoom in and stuff. that is one reason why i want to optimize this. would someone be interested in this project?
hopcode 07 May 2013, 17:40
freddy wrote: i get artifacts (first looks like color splatters and then like rectangles)....
Quote: ...a gpu really is a monster in such things. the fractal runs at 60fps with ~400 iterations and 1680x1050 px...
i have found the Intel Approx Math lib. if someone is interested i can PM it as an attachment. i will be away till tomorrow afternoon... for a, ehm... performance session...
Cheers,
_________________ ⠓⠕⠏⠉⠕⠙⠑
fredlllll 07 May 2013, 18:15
seems i lied a little.
1000 iterations and still 60 fps if i dont have too many "max iteration" pixels on the screen. viewing the full mandelbrot at 1280x720 i get ~25 fps. so the gpu is ~150 times faster.
i didnt zoom in as far as i could, because i thought this would look better, but you can still see the rectangle shapes (2nd screenshot). when calculating the julia fractal i noticed that sometimes the pattern is a bit skewed (1st screenshot).
with "i get artifacts(first looks like color splatters and then like rectangles)...." i meant when zooming in. as i use the gpu i can zoom in and out in realtime. changing the c of the julia fractal in realtime looks really cool (here is a video of the shader using randomly generated C's and changing its gradient: http://www.youtube.com/watch?v=5xj9qF5vNnM )
_________________ --for science
tthsqe 08 May 2013, 03:17
fredlllll,
are you interested in writing code for CPUs or GPUs? GPUs tend to be monsters at single precision and comparable to CPUs at double precision. Here is a comparison of what I obtained with my own tests on a 2600K (SSE and AVX) and a GTX 580 (each iteration is about 8 flops):
Code:
double precision:
  SSE:     6.4 giga-iterations per second
  AVX:     12.7 giga-iterations per second
  GTX 580: 13.8 giga-iterations per second
single precision:
  GTX 580: 76.3 giga-iterations per second (about 0.6 tera-flops!)
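The "about 0.6 tera-flops" remark follows from the iteration rate and the 8 flops per iteration estimate. A small C sketch of that arithmetic is given here; the function names are made up for illustration, and the flop count in the comment is just one plausible way of counting, not necessarily how tthsqe counted.

```c
#include <assert.h>

/* Hypothetical helper: converts an iteration rate into GFLOP/s.
   One plausible count for z = z^2 + c per iteration is 8 flops:
   x*x, y*y, x*y, doubling, +cy, x2-y2, +cx, and x2+y2 for the escape test. */
static double gflops_from_iterations(double giga_iters_per_s, double flops_per_iter)
{
    assert(giga_iters_per_s >= 0.0 && flops_per_iter > 0.0);
    return giga_iters_per_s * flops_per_iter;
}

/* Ratio of two throughput figures, e.g. AVX versus SSE. */
static double throughput_ratio(double faster, double slower)
{
    assert(slower > 0.0);
    return faster / slower;
}
```

With the figures from the post, gflops_from_iterations(76.3, 8.0) is about 610 GFLOP/s, throughput_ratio(12.7, 6.4) is close to 2 (AVX registers are twice as wide as SSE registers), and throughput_ratio(76.3, 13.8) puts the GTX 580 at roughly 5.5 times its own double-precision rate.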
ASM-Man 08 May 2013, 03:24
You're fantastic guys.
fredlllll 08 May 2013, 09:36
tthsqe,
interested in both, but unfortunately writing code for gpus is not that easy. nvidia uses cuda, while ati uses opencl. i dont know which of them i should prefer (i once worked with cuda and didnt like it). i made my implementation with a shader (HLSL), so the interpolation between the 4 corners is done by the gpu's rasterizer. so maybe shaders are faster? did you make this with cuda? i dont know what the performance difference is between cuda and standard shaders, or if there is any. btw i made my tests with an ati hd6950 2 GB
and i still dont know what you mean with dram
nop 08 May 2013, 10:02
fredlllll wrote: and i still dont know what you mean with dram
fredlllll 09 May 2013, 16:44
i dont think they mean video ram when talking about optimizing the memory access...
revolution 09 May 2013, 23:25
VRAM (video ram) is generally slow, so optimising writes to VRAM is important if your goal is to make it as fast as possible.
fredlllll 10 May 2013, 07:57
but in the case of my code im not writing to vram. so why talk about it??
_________________ --for science
revolution 10 May 2013, 10:00
Okay, if you are not writing to vram (i.e. you are not displaying the pictures) then why the optimisation? You can just do the rendering overnight(s) and forget about making it 1% faster.
fredlllll 10 May 2013, 12:17
first i just want to optimize for the fun of it.
and second i want to make a cloud service for calculating fractals, and this should of course not waste time, because 1% across 20000 participants adds up.
revolution 10 May 2013, 12:20
fredlllll wrote: first i just want to optimize for the fun of it.
fredlllll wrote: and second i want to make a cloud service for calculating fractals and this should of course not waste time. because 1% with 20000 participants will do something.
fredlllll 10 May 2013, 14:50
"i want to make" == future. but i expect this many users. and of course i'd like to involve you asm guys, because your skill is way higher than mine ^__^
hopcode 12 May 2013, 00:59
colours map to the iterations on that pixel. with, say, 1024 bits of precision per pixel,
we would not be able to represent it "truly" on a normal display, which is normally max 32bpp. a cloud server is of course not pointless (you can actually calculate the set) but not very useful, since video cards are AFAIK incapable of such high-bpp colouring without interpolation and, whenever zooming, the colour there may belong to a larger palette than 32bpp. for the same reason, an explorer using, say, 256 colours in 16 iterations on the same surface works without awareness of the mandelbrot properties.
Cheers,
_________________ ⠓⠕⠏⠉⠕⠙⠑
hopcode 12 May 2013, 08:34
i would like to stay single precision and 24bpp, but that is IMO not possible without huge approx. error.
Code:
+1.99999988 <- 0x3FFFFFFF
+2.00000000 <- 0x40000000
+2.00000024 <- 0x40000001
a custom decimal fixed-point format using the sign bit at bit 31, just like our dword, and a scaling factor from 10^-1 to max 10^-9. this may give some advantages:
- use of the full 24bpp palette unsigned, or the full but signed 32bpp
- normal add/sub/div/mul operations with a min of -2.147483647 and a max of +2.147483646 (at scaling factor 10^-9), just for the purpose of the magnitude of a point in our set
- zooming by increasing/decreasing the scaling factor accordingly
- auto-carry on sign, axis
- auto-overflow around 4.0
_________________ ⠓⠕⠏⠉⠕⠙⠑
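The fixed-point idea above can be sketched in C roughly as follows. Everything here is an assumption made for illustration: the names fx_t, FX_SCALE, fx_mul and fx_add are invented, the scale is pinned to 10^-9, and overflow is reported to the caller instead of wrapping, which may differ from the "auto-overflow" behaviour hopcode has in mind.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed scaling factor 10^-9: the real value 1.0 is stored as 1000000000. */
#define FX_SCALE 1000000000LL

typedef int32_t fx_t;   /* signed dword, sign in bit 31 */

/* Multiply two fixed-point values through a 64-bit intermediate so the high
   half of the product is not lost. Returns 1 and stores the result on
   success; returns 0 and leaves *out untouched if the result does not fit. */
static int fx_mul(fx_t a, fx_t b, fx_t *out)
{
    assert(out != 0);
    int64_t wide = (int64_t)a * (int64_t)b / FX_SCALE;  /* truncates toward zero */
    if (wide > INT32_MAX || wide < INT32_MIN)
        return 0;
    *out = (fx_t)wide;
    return 1;
}

/* Add two fixed-point values with the same overflow convention as fx_mul. */
static int fx_add(fx_t a, fx_t b, fx_t *out)
{
    assert(out != 0);
    int64_t wide = (int64_t)a + (int64_t)b;
    if (wide > INT32_MAX || wide < INT32_MIN)
        return 0;
    *out = (fx_t)wide;
    return 1;
}
```

For example, 1.5 * 1.2 is fx_mul(1500000000, 1200000000, &r), which stores 1800000000, while 2.0 * 2.0 is reported as an overflow because 4.0 lies outside the roughly ±2.147 range. On x86-64 the multiply corresponds to a single imul with a 64-bit result followed by a division by the scale.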
fredlllll 12 May 2013, 09:07
huh? i dont know how you map your colors, but i only take the number of iterations it took |Z| to get bigger than a limit. so with 1000 iterations i only have 1000 possible colors, which is easily possible for today's graphics cards and displays.
and i want to use the google maps api for the cloud, so people can zoom in like in google maps. so for example you can view a fractal which uses 100000 iterations without any problems. of course we need some decent hard drives for that amount of data, but i think software comes first. should i post this in the idea forum?
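The colour mapping described above, counting the iterations until |Z| exceeds a limit, can be sketched in C like this. The function name escape_iterations and the limit of 2 (tested as |Z|^2 > 4) are assumptions for illustration, not fredlllll's actual code.

```c
#include <assert.h>

/* Hypothetical escape-time kernel: returns how many iterations of
   Z = Z^2 + C (starting from Z = 0) it takes until |Z| > 2, or max_iter
   if the point has not escaped by then. The return value can be used
   directly as a palette index. */
static int escape_iterations(double cx, double cy, int max_iter)
{
    assert(max_iter > 0);
    double zx = 0.0, zy = 0.0;
    int n = 0;
    while (n < max_iter) {
        double zx2 = zx * zx;
        double zy2 = zy * zy;
        if (zx2 + zy2 > 4.0)          /* |Z|^2 > 4 is the same as |Z| > 2 */
            break;
        zy = 2.0 * zx * zy + cy;      /* imaginary part of Z^2 + C */
        zx = zx2 - zy2 + cx;          /* real part of Z^2 + C */
        n++;
    }
    return n;
}
```

Since the result is an integer between 0 and max_iter, a run with 1000 iterations needs a palette of at most 1001 entries, which is consistent with the point that the colour count is bounded by the iteration limit and not by the numeric precision of Z.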
revolution 12 May 2013, 09:14
fredlllll: The problem with single precision comes when you zoom in.
There are a number of programs available that do deep zooming on these sorts of fractals that use arbitrary precision computations to support the necessary level of detail separation. With single precision a small section when zoomed in can all end up being the same coordinates without any differentiation and the entire image is one uniform colour.
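The collapse described above can be reproduced with a small C sketch. The function names and the sample values (a row starting at x = -0.75 with a per-pixel step of 1e-12) are assumptions picked for illustration; the sketch also assumes float arithmetic is really carried out in single precision, as on a typical x86-64 SSE build.

```c
#include <assert.h>

/* Counts how many distinct x coordinates a row of `width` pixels receives
   when each coordinate is computed as x0 + i*step in single precision. */
static int distinct_coords_float(float x0, float step, int width)
{
    assert(width > 0);
    int distinct = 1;
    float prev = x0;
    for (int i = 1; i < width; i++) {
        float x = x0 + (float)i * step;
        if (x != prev) {              /* a new representable coordinate */
            distinct++;
            prev = x;
        }
    }
    return distinct;
}

/* Same computation in double precision, for comparison. */
static int distinct_coords_double(double x0, double step, int width)
{
    assert(width > 0);
    int distinct = 1;
    double prev = x0;
    for (int i = 1; i < width; i++) {
        double x = x0 + (double)i * step;
        if (x != prev) {
            distinct++;
            prev = x;
        }
    }
    return distinct;
}
```

Near 0.75 neighbouring floats are about 6e-8 apart, so with a step of 1e-12 all 800 pixels of a row round to the same float coordinate, while the double version still separates every pixel. Double precision runs into the same wall a few zoom levels later, which is where the arbitrary precision programs mentioned above come in.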
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.