flat assembler
Message board for the users of flat assembler.

Index > Main > Help with converting Mandelbrot C to assembly(speed optim.)

fredlllll



Joined: 17 Apr 2013
Posts: 56
@_@ guys what r u doin?
@~@ guys stahp!!

I think I don't know enough about processors to understand what you are talking about T^T
Post 04 May 2013, 10:31
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
don't worry, such madness Very Happy is just a consequence of bad "exercises in cache programming"

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 04 May 2013, 10:49
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17278
Location: In your JS exploiting you and your system
Well, what's important to understand is that just trying to use "faster" instructions is not necessarily going to give you any speed improvement. There are many, many other things going on inside the CPU to consider before one can really start to make proper speed improvements that will last across multiple generations of CPUs.

The point being that avoiding DIV (just an example here), or similar tricks, is generally going to give only a very small improvement in speed and thus is not really useful in the whole scheme of things. There are probably much more productive and worthwhile savings to be made by analysing the data flow. Spending your time selecting different instruction sequences in the hope of speed improvements really just distracts you from making lasting improvements to the actual bottleneck: data transfers.
Post 04 May 2013, 10:53
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
revolution wrote:
...just trying to use "faster" instructions is not necessarily going to give you any speed improvement.

Exactly, for more than one reason. In the 64-bit output I posted above there is no awareness of vectorization for those double-precision operations in
Code:
 zrtmp = (zr*zr)-(zi*zi)+xx;
 zi = 2*(zr*zi)+yy;
 (zr*zr+zi*zi))
    
The compiler cannot adjust even such simple things. Ergo, whenever most scalar instructions take practically the same number of cycles as their vector counterparts, there is no design of the data management there. For this reason that code (practically identical to 32-bit code)
executes only ~3x faster than the first FPU version, and not because it is 64-bit! Also, never neglect the main rule:
ALGO + DATA STRUCTURE = PROGRAMMING Very Happy

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 04 May 2013, 11:11
tthsqe



Joined: 20 May 2009
Posts: 724
@ fredlllll,
You might have to learn a little more about your CPU if you want to squeeze something close to maximum performance out of it.
From what you have written so far, I would recommend the following:
1) be sure to eliminate redundant computations in the most critical sections
2) parallelize computations across multiple cores
3) vectorize the floating point computations into the xmm (or ymm) registers
4) add multiple vectors per loop iteration to fill in latencies and take advantage of out-of-order execution on your cpu
5) memory (or data flow) optimizations
I have listed them in order of increasing difficulty, though admittedly I have no good ideas on how (5) relates to Mandelbrot. Maybe revolution or hopcode could give you some specifics relating to the cache.
Post 04 May 2013, 12:51
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17278
Location: In your JS exploiting you and your system
tthsqe wrote:
I have no good ideas on how [memory (or data flow) optimizations] relates to Mandelbrot.
You already listed some in points 1, 2 and 3. This is all part of the data flow, both internally within the CPU and externally to RAM.

But another one to add is composing display frames in DRAM and then streaming the completed frame to VRAM. This one step alone can give 2 or 3 times more performance than compositing directly into VRAM (depending upon precisely what one is doing).
tthsqe wrote:
You might have to learn a little more about your CPU if you want to squeeze something close to maximum performance out of it.
A good idea, but ultimately not general-purpose enough for any long-term benefit. Also, the potential gain-to-effort ratio is often very small (i.e. it takes a long time to program a small gain) when dealing with individual CPU differences.
Post 04 May 2013, 13:07
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
Here is a suitable example resembling the one in this thread. In one iteration I write all the memory: USE-SSE, NO-PREFETCH, NO-CORE.
This is accomplished by breaking all the memory into blocks of 4 KB; the number of iterations is 1000.
The first section times the code as it is; the second section times only the data access on writing.

Section 1
Code:
 Accesses: 8
 Iteration AVE. Cycles: 4869
 Memory size: 32768 [0x8000]
 Block executes in 608.5 cycles

 Accesses: 16
 Iteration AVE. Cycles: 14612
 Memory size: 65536 [0x10000]
 Block executes in 913.4 cycles

 Accesses: 32
 Iteration AVE. Cycles: 29210
 Memory size: 131072 [0x20000]
 Block executes in 912.26 cycles

 Accesses: 256
 Iteration AVE. Cycles: 223357
 Memory size: 1048576 [0x100000]
 Block executes in 872.125 cycles

 Accesses: 512
 Iteration AVE. Cycles: 633177
 Memory size: 2097152 [0x200000]
 Block executes in 1236.345 cycles

 Accesses: 1024
 Iteration AVE. Cycles: 5555666
 Memory size: 4194304 [0x400000]
 Block executes in 5425.466 cycles

 Accesses: 2048
 Iteration AVE. Cycles: 11026524
 Memory size: 8388608 [0x800000]
 Block executes in 5384.92 cycles
    
Section 2
Code:
 Accesses: 8
 Iteration AVE. Cycles: 4035
 Memory size: 32768 [0x8000]
 Block executes in 504.3 cycles

 Accesses: 16
 Iteration AVE. Cycles: 13782
 Memory size: 65536 [0x10000]
 Block executes in 861.6 cycles

 Accesses: 32
 Iteration AVE. Cycles: 27732
 Memory size: 131072 [0x20000]
 Block executes in 866.20 cycles

 Accesses: 256
 Iteration AVE. Cycles: 211940
 Memory size: 1048576 [0x100000]
 Block executes in 827.228 cycles

 Accesses: 512
 Iteration AVE. Cycles: 603282
 Memory size: 2097152 [0x200000]
 Block executes in 1178.146 cycles

 Accesses: 1024
 Iteration AVE. Cycles: 5480058
 Memory size: 4194304 [0x400000]
 Block executes in 5351.634 cycles

 Accesses: 2048
 Iteration AVE. Cycles: 10939717
 Memory size: 8388608 [0x800000]
 Block executes in 5341.1349 cycles
    
comparisons
Code:
vector size |  code cycles  | data acc.cycles |  diff
------------+---------------+-----------------+---------
 32 kb      |    608.5      |     504.3       |  104.2
 64 kb      |    913.4      |     861.6       |  51.8
128 kb      |    912.26     |     866.20      |  65.06
------------+---------------+-----------------+---------
  1 Mb      |    872.125    |     827.228     |  44.9 
  2 Mb      |    1236.345   |     1178.146    |  58.199
------------+---------------+-----------------+---------
  4 Mb      |    5425.466   |     5351.634    |  73.832
  8 Mb      |    5384.92    |     5341.1349   |  43.78
--------------------------------------------------------
    
1) diff is the bare instructions' timing (practically insignificant for the whole)
2) beyond 32 KB the differences between code cycles and data-access cycles are relatively negligible
3) at 4 MB and at 8 MB the same code executes ~9x slower than at 32 KB because of data access !!!
4) best stable performance (access ratio) is between 1 MB (1/4 of my L2) and 2 MB
5) this differentiates 3 main zones, as in my formulation above:

mem < L1
mem < 1/2 L2
mem > 1/2 L2

Cheers,
Very Happy

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 05 May 2013, 09:45
El Tangas



Joined: 11 Oct 2003
Posts: 120
Location: Sunset Empire
Dude, you have set the optimization fanatics of the forum in motion. No sweat, I'll give you some simpler advice Cool

Let's see your inner loop...

Code:
MandIteration: 
        fld st3 ;zr*|yy|xx|zi|zr|R²|xxInc|... 
        fmul st0,st0 ; zr²|yy|xx|zi|zr|R²|xxInc|... 
        fld st3 ; zi*|zr²|yy|xx|zi|zr|R²|xxInc|... 
        fmul st0,st0 ; zi²|zr²|yy|xx|zi|zr|R²|xxInc|... 
        fsubp st1, st0 ; zr²-zi²|yy|xx|zi|zr|R²|xxInc|... 
        fadd st0,st2 ; zr²-zi²+xx|yy|xx|zi|zr|R²|xxInc|... 
         
        fld st4 ; zr*|zr²-zi²+xx|yy|xx|zi|zr|R²|xxInc|... 
        fmul st0, st4 ; zr*zi|zr²-zi²+xx|yy|xx|zi|zr|R²|xxInc|... 
        fadd st0, st0 ; zr*zi*2|zr²-zi²+xx|yy|xx|zi|zr|R²|xxInc|... 
        fadd st0, st2 ; zr*zi*2+yy|zr²-zi²+xx|yy|xx|zi|zr|R²|xxInc|... 
        fstp st4 ; zr²-zi²+xx|yy|xx|zi|zr|R²|xxInc|... 
        fstp st4 ; yy|xx|zi|zr|R²|xxInc|... 
         
        fld st3 ; zr*|yy|xx|zi|zr|R²|xxInc|... 
        fmul st0, st0 ; zr²|yy|xx|zi|zr|R²|xxInc|... 
        fld st3 ; zi*|zr²|yy|xx|zi|zr|R²|xxInc|... 
        fmul st0, st0 ; zi²|zr²|yy|xx|zi|zr|R²|xxInc|... 
        faddp st1, st0 ; zi²+zr²|yy|xx|zi|zr|R²|xxInc|... 
        fld st5 ; R²|zi²+zr²|yy|xx|zi|zr|R²|xxInc|... 
        fcomip st1 ; zi²+zr²|yy|xx|zi|zr|R²|xxInc|... 
        fstp st0 ; yy|xx|zi|zr|R²|xxInc|... (some FPUs may just pop here for speed?) 
        jb exitMandIteration ; exit loop if R < |Z| 

        loop MandIteration 
    


First thing: you are using fld and fstp to move registers around. Maybe you can rewrite your code to replace some of them with fxch, which is faster.

Second, and more important: notice that you calculate zr² and zi² twice, first to calculate the next point in the escape orbit, then to calculate R². This wastes time; you should reuse these values, saving 2 fmuls. (Edit: oops, the other way around: first R², then the next point. I mean, recycle the values of zr² and zi² for the next cycle.)

I think this should be enough to beat the compiler.

Loop unrolling (and/or vector instructions) would be very nice, but it is no trivial task in this case, because the points will not all escape on the same iteration.

Hope this is useful to you.


Last edited by El Tangas on 06 May 2013, 11:17; edited 1 time in total
Post 06 May 2013, 10:57
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17278
Location: In your JS exploiting you and your system
El Tangas wrote:
First thing: you are using fld and fstp to move registers around. Maybe you can rewrite your code to replace some of them with fxch, which is faster.

Second, and more important: notice that you calculate zr² and zi² twice, first to calculate the next point in the escape orbit, then to calculate R². This wastes time; you should reuse these values, saving 2 fmuls.
Hmm. I think you might not have fully understood some of the above posts. Hehe, it just shows how hard proper speed optimisation really is. Without the underlying understanding there is not going to be much chance of success.
Post 06 May 2013, 11:02
El Tangas



Joined: 11 Oct 2003
Posts: 120
Location: Sunset Empire
revolution wrote:
Hmm. I think you might not have fully understood some of the above posts. Hehe, it just shows how hard proper speed optimisation really is. Without the underlying understanding there is not going to be much chance of success.


Understand? Didn't even read them Twisted Evil

No, seriously, I know optimizing memory access and cache usage is often more important than the actual code. However, in this case, I would bet both are equally important, since Mandel is CPU-intensive. There are just some simple things in the code that can be optimized, and I think fredlllll wanted simple advice first.
Post 06 May 2013, 11:30
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17278
Location: In your JS exploiting you and your system
Okay, we will have to wait for fredlllll to respond. But for me a few percent improvement is not interesting when there are potentially much larger gains to be made elsewhere.
Post 06 May 2013, 12:13
El Tangas



Joined: 11 Oct 2003
Posts: 120
Location: Sunset Empire
Anyway, I have a couple of questions also regarding the memory access optimizations for revolution or hopcode.

The Mandel routine basically only writes to memory; it almost doesn't read any data. So, in this situation, is memory optimization really that important? If so, why? It's not very clear to me. I mean, sure, prefetching data that will be read and needed in the future makes perfect sense to me, but why prefetch writes?
Post 06 May 2013, 13:03
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17278
Location: In your JS exploiting you and your system
Streaming to VRAM can be much faster than a series of individual writes. Since VRAM is very slow (compared to the CPU and normal DRAM), reducing bus cycles can be one of the most important things for improving performance. You can try a simple test: write the computed output to DRAM first and then stream the completed frame to VRAM for display. Is there a large difference? Or is there no noticeable difference? I don't know which this particular code will get, but I think a simple test like this would be the most likely place to start when looking to improve performance. And the best part is that such a technique is valid for all CPU and mobo systems, and is not tied to whatever particular test machine the user may have in front of them.
Post 06 May 2013, 13:27
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
El Tangas wrote:
...is memory optimization really that important?...
...why prefetch writes?
Yes. Consider data access always slow, and slower the further down the cache hierarchy it goes; the whole thing depends mostly on data size.
Data size may lead you to rewrite the algorithm to improve performance, and the algorithm may be rewritten after structuring the data
(and the data accesses) in some way.
You don't "prefetch writes", obviously Wink; you prefetch data in order to bring it nearer to the processor,
whether you then read or write it, even just once. Yes, because a prefetch is a "hint" for speculative behaviour.
The results above using 2 MB:
Code:
vector size |  code cycles  | data acc.cycles |  diff
------------+---------------+-----------------+---------
  2 Mb      |    1236.345   |     1178.146    |  58.199
    
Using a non-temporal hint (prefetchnta) with a stride of 4 KB reduces the data-access cycles to ~730, thus ~40% faster !!
Because caches are limited containers, you need to do this in some structured way (for example, using "strides" or threads).
The examples in http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf
that use prefetch show 35% improvements in the best cases, 10-15% in the worst cases.
This is from my little experience, and I am in no way an Intel evangelist/fanatic, ehh Wink

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 06 May 2013, 13:54
El Tangas



Joined: 11 Oct 2003
Posts: 120
Location: Sunset Empire
Ok, I think I'll run a few tests on modern CPUs, then. The last time I gave some thought to this matter was in the Athlon / P4 era; it seems things are even worse now, memory speed is really lagging behind processor speed...

Edit:
In this Mandelbrot case, the data is written in a simple sequential way, so I thought maybe the hardware prefetcher would be able to predict the memory accesses correctly?
Post 06 May 2013, 16:21
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
El Tangas wrote:
In this Mandelbrot case, the data is written in a simple sequential way, so I thought maybe the hardware prefetcher would be able to predict the memory accesses correctly?
Yes, correctly! As per the manual: non-temporal = write once.
I cannot find ATM the data-access latencies for each cache; it's very difficult! The naming of the documents is chaos.

For cacheability, see Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture,
Chapter 10.4.6.2: http://download.intel.com/products/processor/manual/253665.pdf

Or Agner Fog's Optimizing Assembly:
http://www.agner.org/optimize/optimizing_assembly.pdf
See the end of chapter 12.10 about non-temporal prefetch, and don't forget to follow the directives on the cache line in chapter 11.1.
BTW, here is the example routine I used for the tests described above.
Code:
;---rax vector
;---ecx = 4096/64 = 64
...
 prefetchnta [rax+1024*4]
 movdqa [rax],xmm0   ;<--- normal move, i.e. not non-temporal move
 movdqa [rax+16],xmm1
 movdqa [rax+32],xmm2
 movdqa [rax+48],xmm3
 add rax,64
 dec ecx
 jnz    .othercalculationtobedone
....
    

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 06 May 2013, 17:14
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
OK, found the latencies table. I know it's difficult with all those ®TM names; a bit of patience Smile and a little organization may help.
The new manual is Intel® 64 and IA-32 Architectures Optimization Reference Manual, Volume A: Chapters 1-13:
http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf
Use CPU-Z to check the cache line, or follow the description in the manual. I quote from an old manual:

1) first check your microarchitecture: mine is Enhanced Intel Core Microarchitecture, chapter 2.1
2) read the features: Intel® Advanced Memory Access, chapter 2.1.4, plus Intel® Advanced Smart Cache
3) go to Load/Store, 2.1.5.1 / 2.1.5.2, for that architecture.

In Table 2-4 you read 14 for stores to Write-Back (WB) memory.
prefetchnta allows managing that memory with Write-Combining (WC) semantics, i.e. weakly ordered.

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 06 May 2013, 18:38
tthsqe



Joined: 20 May 2009
Posts: 724
If I am understanding correctly, in order to address point (5) that I made, hopcode is advocating that instead of doing

Race through the pixels in the picture and write the result to the main buffer as soon as the computation is done

we do

Set up a temporary ~4 KB intermediate buffer (on the stack?), write the results of the computations to this temporary buffer, and copy these 4 KB chunks from the temporary buffer to the main buffer as it becomes full.

It seems that El Tangas and I are not very impressed by this optimization in the case of Mandelbrot computations. Surely the potential speed gains from addressing points (1)-(4) correctly are much greater? (Again, I am talking only about the case of Mandelbrot computations!)
Post 06 May 2013, 23:59
hopcode



Joined: 04 Mar 2008
Posts: 563
Location: Germany
tthsqe wrote:
...Set up a temporary ~4kb intermediate buffer...
That would be a different matter.
Ehm... that is a real challenge for me, because I have no intention Smile of writing that code, and I will not test anything that is not worth optimizing. Of course, I didn't say El Tangas's or freddie's code is bad; I say they are both too complex: the FPU instruction set is complex for the case above (compared with SSE), the cache behaviour is being neglected, and the whole thing only gets more complex as you optimize. OK, I will do my best to explain what I mean, eventually Very Happy

_________________
⠓⠕⠏⠉⠕⠙⠑
Post 07 May 2013, 00:29
bitRAKE



Joined: 21 Jul 2003
Posts: 2914
Location: [RSP+8*5]
Code:
@@:                     ; zr|zi|zr²|zi²|cr|ci|4
    fmulp st1,st0       ; zi|zr²|zi²|cr|ci|4
    fadd st0,st4        ; zi|zr²|zi²|cr|ci|4
    fadd st0,st0        ; zi|zr²|zi²|cr|ci|4

    fxch st2            ; zi²|zr²|zi|cr|ci|4
    fsubp st1,st0       ; zr|zi|cr|ci|4
    fadd st0,st2        ; zr|zi|cr|ci|4

    fld st1             ; zi|zr|zi|cr|ci|4
    fmul st2,st0        ; zi|zr|zi²|cr|ci|4
    fld st1             ; zr|zi|zr|zi²|cr|ci|4
    fmul st2,st0        ; zr|zi|zr²|zi²|cr|ci|4
    ; check exit bounds
    fld st3             ; X|zr|zi|zr²|zi²|cr|ci|4
    fadd st0,st3        ; X|zr|zi|zr²|zi²|cr|ci|4
    fcomip st7          ; X|zr|zi|zr²|zi²|cr|ci|4
    jnc .color
    loop @B             ; zr|zi|zr²|zi²|cr|ci|4    
Minimal instruction count: it doesn't mean it's the fastest.

Never know where one might need a Mandelbrot:
http://rosettacode.org/wiki/Mandelbrot_set Razz

Edit: finally, had a chance to check code - fixed to something that actually works.

_________________
¯\(°_o)/¯ unlicense.org


Last edited by bitRAKE on 07 May 2013, 23:20; edited 1 time in total
Post 07 May 2013, 02:13


Copyright © 1999-2020, Tomasz Grysztar.

Powered by rwasa.