flat assembler
Message board for the users of flat assembler.
fredlllll 04 May 2013, 10:31
@_@ guys, what are you doing?
@~@ guys, stop!! i think i don't know enough about processors to understand what you are talking about T^T
hopcode 04 May 2013, 10:49
don't worry, such madness...
_________________
⠓⠕⠏⠉⠕⠙⠑
hopcode 04 May 2013, 11:11
revolution wrote: ...just trying to use "faster" instructions is not necessarily going to give you any speed improvement.

exactly, and for more than one reason. in the 64-bit output i posted above there is no awareness that the double-precision operations in

Code:
zrtmp = (zr*zr)-(zi*zi)+xx;
zi = 2*(zr*zi)+yy;
(zr*zr+zi*zi)

could be treated as vector operations; there is no design there for data management. for this reason that code (practically identical to the 32-bit code) executes only ~3x faster than the first FPU version, and not because it is 64-bit! also, never neglect the main rule: ALGO + DATA STRUCTURE = PROGRAMMING
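hopcode's point about treating those double-precision products as vector operations can be sketched in C. SSE2 packs exactly two doubles per xmm register, so iterating two independent points in lockstep lets each arithmetic line map onto one packed instruction. This is only an illustrative sketch — the function name and layout are mine, not code from this thread:

```c
#include <assert.h>

/* One escape-time iteration step for two independent points at once.
   Two doubles side by side mirror one SSE2 xmm register, so a compiler
   (or hand-written movapd/mulpd/addpd code) can map each line of the
   inner loop onto a single packed instruction. */
static void mandel_step2(double zr[2], double zi[2],
                         const double cr[2], const double ci[2])
{
    for (int k = 0; k < 2; k++) {          /* one "lane" per point */
        double zr2 = zr[k] * zr[k];        /* -> mulpd */
        double zi2 = zi[k] * zi[k];        /* -> mulpd */
        double t   = zr2 - zi2 + cr[k];    /* -> subpd + addpd */
        zi[k] = 2.0 * zr[k] * zi[k] + ci[k];
        zr[k] = t;
    }
}
```

The bailout test is the awkward part of a real vectorized loop, since the two lanes can escape on different iterations — which is exactly the difficulty El Tangas raises later in the thread.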
tthsqe 04 May 2013, 12:51
@ fredlllll,
You might have to learn a little more about your cpu if you want to squeeze something close to the maximum performance out of it. From what you have written so far, I would recommend the following:
1) be sure to eliminate redundant computations in the most critical sections
2) parallelize computations across multiple cores
3) vectorize the floating point computations into the xmm (or ymm) registers
4) process multiple vectors per loop iteration to fill in latencies and take advantage of out-of-order execution on your cpu
5) memory (or data flow) optimizations
I have listed them in order of increasing difficulty, and admittedly I have no good ideas on how (5) relates to Mandelbrot. Maybe revolution or hopcode could give some specifics for you relating to the cache.
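Point (1) can be made concrete with a minimal scalar sketch (the names are illustrative, not fredlllll's actual code): the squares zr² and zi² are needed both for the bailout test and for the next orbit point, so computing them once per iteration saves two multiplications compared with the FPU loop posted later in the thread:

```c
#include <assert.h>

/* Escape-time count for one point. zr*zr and zi*zi are computed once
   per iteration and reused for BOTH the bailout test (zr2+zi2 > 4)
   and the next orbit point (zr2-zi2+cr). */
static int mandel_iters(double cr, double ci, int max_iter)
{
    double zr = 0.0, zi = 0.0;
    double zr2 = 0.0, zi2 = 0.0;            /* the reused squares */
    int n = 0;
    while (n < max_iter && zr2 + zi2 <= 4.0) {
        zi = 2.0 * zr * zi + ci;
        zr = zr2 - zi2 + cr;
        zr2 = zr * zr;                      /* computed once ...   */
        zi2 = zi * zi;                      /* ... used twice each */
        n++;
    }
    return n;
}
```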
revolution 04 May 2013, 13:07
tthsqe wrote: I have no good ideas on how [memory (or data flow) optimizations] relates to Mandelbrot.

To add one: composing display frames in DRAM and only later streaming them to VRAM. Just this one step alone can give 2 or 3 times more performance than compositing directly to VRAM (depending upon precisely what one is doing).

tthsqe wrote: You might have to learn a little more about your cpu if you want to squeeze out of it something close to the maximum performance.
hopcode 05 May 2013, 09:45
here is a suitable example resembling the one in this thread. in one iteration i write all the memory: USE-SSE, NO-PREFETCH, NO-CORE.
this is accomplished by breaking all the memory into blocks of 4kb. the number of iterations is 1000. the first section times the code as it is. the second section times only the data access on writing.

Section 1
Code:
Accesses: 8     Iteration AVE. Cycles: 4869      Memory size: 32768   [0x8000]    Block executes in 608.5 cycles
Accesses: 16    Iteration AVE. Cycles: 14612     Memory size: 65536   [0x10000]   Block executes in 913.4 cycles
Accesses: 32    Iteration AVE. Cycles: 29210     Memory size: 131072  [0x20000]   Block executes in 912.26 cycles
Accesses: 256   Iteration AVE. Cycles: 223357    Memory size: 1048576 [0x100000]  Block executes in 872.125 cycles
Accesses: 512   Iteration AVE. Cycles: 633177    Memory size: 2097152 [0x200000]  Block executes in 1236.345 cycles
Accesses: 1024  Iteration AVE. Cycles: 5555666   Memory size: 4194304 [0x400000]  Block executes in 5425.466 cycles
Accesses: 2048  Iteration AVE. Cycles: 11026524  Memory size: 8388608 [0x800000]  Block executes in 5384.92 cycles

Section 2
Code:
Accesses: 8     Iteration AVE. Cycles: 4035      Memory size: 32768   [0x8000]    Block executes in 504.3 cycles
Accesses: 16    Iteration AVE. Cycles: 13782     Memory size: 65536   [0x10000]   Block executes in 861.6 cycles
Accesses: 32    Iteration AVE. Cycles: 27732     Memory size: 131072  [0x20000]   Block executes in 866.20 cycles
Accesses: 256   Iteration AVE. Cycles: 211940    Memory size: 1048576 [0x100000]  Block executes in 827.228 cycles
Accesses: 512   Iteration AVE. Cycles: 603282    Memory size: 2097152 [0x200000]  Block executes in 1178.146 cycles
Accesses: 1024  Iteration AVE. Cycles: 5480058   Memory size: 4194304 [0x400000]  Block executes in 5351.634 cycles
Accesses: 2048  Iteration AVE. Cycles: 10939717  Memory size: 8388608 [0x800000]  Block executes in 5341.1349 cycles

Code:
vector size | code cycles   | data acc.cycles | diff
------------+---------------+-----------------+---------
32 kb       | 608.5         | 504.3           | 104.2
64 kb       | 913.4         | 861.6           | 51.8
128 kb      | 912.26        | 866.20          | 46.06
------------+---------------+-----------------+---------
1 Mb        | 872.125       | 827.228         | 44.897
2 Mb        | 1236.345      | 1178.146        | 58.199
------------+---------------+-----------------+---------
4 Mb        | 5425.466      | 5351.634        | 73.832
8 Mb        | 5384.92       | 5341.1349       | 43.785
--------------------------------------------------------

2) above 32kb the differences between code cycles and data-access cycles are relatively negligible
3) at 4Mb and at 8Mb the same code executes ~9x slower than at 32kb, because of data access !!!
4) best stable performance (access ratio) is between 1Mb (1/4 of my L2) and 2Mb
5) this differentiates 3 main zones, as in my formulation above:
mem < L1
mem < 1/2 L2
mem > 1/2 L2

Cheers,
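For anyone wanting to reproduce this kind of measurement, here is a rough portable sketch of the methodology; hopcode's actual harness is not posted, so the shape of this code is an assumption. It writes the buffer in 4kb blocks, repeats the whole pass many times, and averages — but it uses clock_gettime nanoseconds rather than the rdtsc cycle counts reported above:

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <string.h>
#include <time.h>

/* Average wall time per full pass over `buf`: touch it in 4 KiB
   blocks, repeat `iters` times, divide.  The thread's numbers are in
   RDTSC cycles; this portable stand-in reports nanoseconds instead. */
static double avg_ns_per_pass(unsigned char *buf, size_t size, int iters)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        for (size_t off = 0; off < size; off += 4096)
            memset(buf + off, i & 0xff, 4096);   /* write one 4kb block */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / iters;
}
```

Run with buffer sizes stepping from well under L1 to well past L2 and the same three zones (mem < L1, mem < 1/2 L2, mem > 1/2 L2) should show up as jumps in the average.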
El Tangas 06 May 2013, 10:57
Dude, you have set the optimization fanatics in the forum in motion. No sweat, I'll give you some simpler advice
Let's see your inner loop...

Code:
MandIteration:
fld st3         ; zr*|yy|xx|zi|zr|R²|xxInc|...
fmul st0,st0    ; zr²|yy|xx|zi|zr|R²|xxInc|...
fld st3         ; zi*|zr²|yy|xx|zi|zr|R²|xxInc|...
fmul st0,st0    ; zi²|zr²|yy|xx|zi|zr|R²|xxInc|...
fsubp st1,st0   ; zr²-zi²|yy|xx|zi|zr|R²|xxInc|...
fadd st0,st2    ; zr²-zi²+xx|yy|xx|zi|zr|R²|xxInc|...
fld st4         ; zr*|zr²-zi²+xx|yy|xx|zi|zr|R²|xxInc|...
fmul st0,st4    ; zr*zi|zr²-zi²+xx|yy|xx|zi|zr|R²|xxInc|...
fadd st0,st0    ; zr*zi*2|zr²-zi²+xx|yy|xx|zi|zr|R²|xxInc|...
fadd st0,st2    ; zr*zi*2+yy|zr²-zi²+xx|yy|xx|zi|zr|R²|xxInc|...
fstp st4        ; zr²-zi²+xx|yy|xx|zi|zr|R²|xxInc|...
fstp st4        ; yy|xx|zi|zr|R²|xxInc|...
fld st3         ; zr*|yy|xx|zi|zr|R²|xxInc|...
fmul st0,st0    ; zr²|yy|xx|zi|zr|R²|xxInc|...
fld st3         ; zi*|zr²|yy|xx|zi|zr|R²|xxInc|...
fmul st0,st0    ; zi²|zr²|yy|xx|zi|zr|R²|xxInc|...
faddp st1,st0   ; zi²+zr²|yy|xx|zi|zr|R²|xxInc|...
fld st5         ; R²|zi²+zr²|yy|xx|zi|zr|R²|xxInc|...
fcomip st1      ; zi²+zr²|yy|xx|zi|zr|R²|xxInc|...
fstp st0        ; yy|xx|zi|zr|R²|xxInc|... (fpus may ignore this and just pop for speed??)
jb exitMandIteration ; if R² less than |Z|², exit loop
loop MandIteration

First thing, you are using fld and fstp to move registers around. Maybe you can rewrite your code to replace some of them with fxch, which is faster.

Second, and more important, notice that you calculate zr² and zi² twice: first to calculate R², then to calculate the next point in the escape orbit. This wastes time; you should reuse these values, saving 2 fmuls. (edit: oops, I originally wrote it the other way around. I mean: recycle the values of zr² and zi² for the next cycle.)

I think this should be enough to beat the compiler. Loop unrolling (and/or vector instructions) would be very nice, but is no trivial task in this case, because the points will not all escape in the same iteration.

Hope this is useful to you.

Last edited by El Tangas on 06 May 2013, 11:17; edited 1 time in total
revolution 06 May 2013, 11:02
El Tangas wrote: first thing, you are using fld and fstp to move registers around. Maybe you can rewrite your code to replace some of them by fxch, which is faster.

Hmm. I think you might not have fully understood some of the above posts. Hehe, just shows how hard proper speed optimisation really is. Without the underlying understanding there is not going to be much chance for success.
El Tangas 06 May 2013, 11:30
revolution wrote: Hmm. I think you might not have fully understood some of the above posts. Hehe, just shows how hard proper speed optimisation really is. Without the underlying understanding there is not going to be much chance for success.

Understand? Didn't even read them.

No, seriously, I know optimizing memory access and cache usage is often more important than the actual code. However, in this case, I would bet both are equally important, since Mandel is CPU intensive; there is just some simple stuff in the code that can be optimized, and I think fredlllll wanted simple advice first.
revolution 06 May 2013, 12:13
Okay, we will have to wait for fredlllll to respond. But for me a few percent improvement is not interesting when there are potentially much larger gains to be made elsewhere.
El Tangas 06 May 2013, 13:03
Anyway, I have a couple of questions also regarding the memory access optimizations for revolution or hopcode.
The Mandel routine basically only writes to memory; it almost doesn't read any data. So, in this situation, is memory optimization really that important? If so, why? It's not very clear to me. I mean, sure, prefetching data that will be read and needed in the future makes perfect sense to me, but why prefetch writes?
revolution 06 May 2013, 13:27
Streaming to VRAM can be much faster than a series of individual writes. Since VRAM is very slow (compared to the CPU and normal DRAM), reducing bus cycles can be one of the most important things for improving performance. You can try a simple test: write the computed output to DRAM first and then stream the completed frame to VRAM for display. Is there a large difference? Or is there no noticeable difference? I don't know which result this particular code will get, but I think a simple test like this would be the most likely place to start when looking to improve performance. And the best part is that such a technique is valid for all CPU and mobo systems and is not tied to any particular test machine that the user may have in front of them.
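The test described above can be sketched in C. The `vram` pointer here is only a stand-in for a mapped framebuffer (this sketch cannot touch real VRAM, and the plot expression is a dummy); the point is the single sequential bulk transfer at the end instead of many scattered writes during composition:

```c
#include <stdint.h>
#include <string.h>

/* Compose the whole frame into a system-RAM back buffer, then push it
   to the display in one sequential burst.  In a real program `vram`
   would be the mapped framebuffer, and the memcpy could be replaced by
   a rep movsd or movntdq streaming loop. */
static void render_frame(uint32_t *restrict backbuf,
                         uint32_t *restrict vram,
                         int width, int height)
{
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            backbuf[y * width + x] = (uint32_t)(x ^ y);  /* dummy pixel */

    /* one bulk transfer instead of width*height scattered VRAM writes */
    memcpy(vram, backbuf, (size_t)width * height * sizeof *backbuf);
}
```

Comparing this against a version that writes each pixel straight to `vram` is exactly the DRAM-vs-VRAM experiment being suggested.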
hopcode 06 May 2013, 13:54
El Tangas wrote: ...is memory optimization really that important ?...

data size may force you to rewrite the algo to improve performance. the algo may be rewritten after structuring the data (and the data accesses) in some way. you don't "prefetch writes", obviously — you prefetch the memory, whether reading or writing, even once. yes, it helps, because prefetch is a "hint" for speculative behaviour. see the above results using 2Mb:

Code:
vector size | code cycles   | data acc.cycles | diff
------------+---------------+-----------------+---------
2 Mb        | 1236.345      | 1178.146        | 58.199

because caches are limited containers, you need to do it in some way (for example, using "strides" or threads). the examples in http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf using prefetch lead to 35% improvements in the best cases, 10-15% in the worst cases. this is from my little experience, and i am in no way that Intel evangelist/fanatic, ehh
El Tangas 06 May 2013, 16:21
Ok, I think I'll run a few tests on modern CPUs, then. The last time I gave some thought to this matter was in the Athlon / P4 era; it seems things are even worse now, memory speed is really lagging behind processor speed...

Edit: In this Mandelbrot case, the data is written in a simple sequential way; I thought maybe the hardware prefetcher would be able to predict the memory accesses correctly?
hopcode 06 May 2013, 17:14
El Tangas wrote: In this Mandelbrot case, the data is writen in a simple sequential way, I thought maybe the hardware prefetcher would be able to predict memory accesses correctly?

i cannot find ATM the latencies of data access for each cache; it's very difficult! the naming of the documents is chaos. for cacheability, see the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture, chapter 10.4.6.2 http://download.intel.com/products/processor/manual/253665.pdf or A. Fog, Optimizing Assembly http://www.agner.org/optimize/optimizing_assembly.pdf — the end of chapter 12.10, about non-temporal prefetch. and don't forget to follow the directive on the cacheline, chapter 11.1.

btw, here is the example routine i used for the tests described above.

Code:
;--- rax = vector
;--- ecx = 4096/64 = 64
...
.block:
        prefetchnta [rax+1024*4]
        movdqa  [rax],xmm0      ; <--- normal move, i.e. not a non-temporal move
        movdqa  [rax+16],xmm1
        movdqa  [rax+32],xmm2
        movdqa  [rax+48],xmm3
        add     rax,64
        dec     ecx
        jnz     .block
.othercalculationtobedone:
...
hopcode 06 May 2013, 18:38
ok, found the latencies table. i know it's difficult because of all those ®/™ names. a bit of patience...

the new manual: Intel® 64 and IA-32 Architectures Optimization Reference Manual, Volume A: Chapters 1-13 http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf

use cpu-z to check the cacheline, or follow the description in the manual. i quote from an old manual:
1) first check your microarchitecture: mine is Enhanced Intel Core Microarch., chapter 2.1
2) read the features: Intel® Advanced Memory Access, chapter 2.1.4, + Intel® Advanced Smart Cache
3) go to Load/Store 2.1.5.1 / 2.1.5.2 for that architecture. in table 2-4 you read 14 for stores to WriteBack (WB) memory. prefetchnta allows that memory to be managed with WriteCombining (WC) semantics, i.e. weakly ordered.
_________________
⠓⠕⠏⠉⠕⠙⠑
tthsqe 06 May 2013, 23:59
If I am understanding correctly, in order to address point (5) that I made, hopcode is advocating that instead of

racing through the pixels in the picture and writing each result to the main buffer as soon as the computation is done,

we

set up a temporary ~4kb intermediate buffer (on the stack?), write the results of the computations to this temporary buffer, and copy 4kb chunks from it to the main buffer as the temporary buffer becomes full.

It seems that El Tangas and I are not very impressed by this optimization in the case of Mandelbrot computations. The potential speed gains from addressing points (1)-(4) correctly are surely much greater? (again, I am talking only about the case of Mandelbrot computations!)
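The second scheme can be sketched in C. This is an illustrative reading of hopcode's posts, not his code: compute() stands in for the per-pixel Mandelbrot iteration, and the chunk size mirrors the 4kb blocks from his benchmark:

```c
#include <stdint.h>
#include <string.h>

enum { CHUNK = 4096 / sizeof(uint32_t) };   /* ~4 KiB of pixel values */

/* Placeholder for the per-pixel escape-time computation. */
static uint32_t compute(size_t i) { return (uint32_t)(i * 2654435761u); }

/* Results go into a small chunk that stays cache-hot; only full chunks
   are copied out to the main buffer, so the big buffer is touched in
   sequential 4 KiB bursts rather than one scattered write per pixel. */
static void fill_buffered(uint32_t *dst, size_t n)
{
    uint32_t chunk[CHUNK];
    size_t filled = 0, base = 0;
    for (size_t i = 0; i < n; i++) {
        chunk[filled++] = compute(i);
        if (filled == CHUNK) {               /* flush a full chunk */
            memcpy(dst + base, chunk, sizeof chunk);
            base += CHUNK;
            filled = 0;
        }
    }
    if (filled)                              /* flush the partial tail */
        memcpy(dst + base, chunk, filled * sizeof *chunk);
}
```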
hopcode 07 May 2013, 00:29
tthsqe wrote: ...Set up a temporary ~4kb intermediate buffer...

ehm... that is a real challenge for me, because i have no intention...
bitRAKE 07 May 2013, 02:13
Code:
@@:                 ; zr|zi|zr²|zi²|cr|ci|4
fmulp st1,st0       ; zr*zi|zr²|zi²|cr|ci|4
fadd st0,st0        ; 2*zr*zi|zr²|zi²|cr|ci|4
fadd st0,st4        ; zi|zr²|zi²|cr|ci|4        new zi = 2*zr*zi+ci
fxch st2            ; zi²|zr²|zi|cr|ci|4
fsubp st1,st0       ; zr²-zi²|zi|cr|ci|4
fadd st0,st2        ; zr|zi|cr|ci|4             new zr = zr²-zi²+cr
fld st1             ; zi|zr|zi|cr|ci|4
fmul st2,st0        ; zi|zr|zi²|cr|ci|4
fld st1             ; zr|zi|zr|zi²|cr|ci|4
fmul st2,st0        ; zr|zi|zr²|zi²|cr|ci|4
; check exit bounds
fld st3             ; zi²|zr|zi|zr²|zi²|cr|ci|4
fadd st0,st3        ; zr²+zi²|zr|zi|zr²|zi²|cr|ci|4
fcomip st7          ; zr|zi|zr²|zi²|cr|ci|4
jnc .color          ; escape when zr²+zi² >= 4
loop @B             ; zr|zi|zr²|zi²|cr|ci|4

Never know where one might need a Mandelbrot: http://rosettacode.org/wiki/Mandelbrot_set

Edit: finally, had a chance to check the code - fixed to something that actually works.

Last edited by bitRAKE on 07 May 2013, 23:20; edited 1 time in total
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.