flat assembler
Message board for the users of flat assembler.
baldr 01 May 2013, 00:19
fredlllll,
Your initialization z = 0 is outside of the y and x iteration loops. Once z has escaped the |z| <= R disk, further iterations will only take it farther out (I'm assuming R >= 2).
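A minimal sketch of what I mean, with hypothetical variable names (your loop bounds and coordinate stepping will differ):

Code:
/* z must start at 0 for every pixel, so reset it inside the x/y loops */
for (int py = 0; py < height; py++) {
    for (int px = 0; px < width; px++) {
        double zr = 0.0, zi = 0.0;   /* reset here, once per pixel */
        /* ... iterate z = z^2 + c and test |z| <= R for this pixel ... */
    }
}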
fredlllll 01 May 2013, 09:02
Yep, you are right.
Funny thing is, I had the same idea while half asleep this morning. Thanks for the help. Ah, and did you see any possibilities to optimize my code for more speed?
/edit: By the way, it seems that the C code produces rubbish while the asm code produces the correct Mandelbrot fractal. Does anyone know why that could be?
baldr 01 May 2013, 17:12
fredlllll wrote: By the way, it seems that the C code produces rubbish while the asm code produces the correct Mandelbrot fractal. Does anyone know why that could be?
[attached image]
fredlllll 01 May 2013, 17:54
Uhm, I know what the problem in the C code is: I'm changing zr and then using it for the calculation of zi. For the proper result I put the new zr into a temp variable, then calculate the new zi, then put the temp into zr, and then it works.
But thanks for the picture, I already wondered what it would look like ;3
Now there is still a problem: my asm code is slower than what the compiler creates (I use the VS2010 compiler and compile it as C code, not C++, because I want to avoid name mangling so I can replace the DLL with my asm DLL, with the speed optimizations /O2 and /Ot). I updated the asm and C code above to my current state. Any ideas how to make it faster? I thought the memory accesses in the old version were the problem, so I eliminated them, but that doesn't seem to improve anything. Any ideas?
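In code, the difference looks roughly like this (variable names are my guesses, not the actual source):

Code:
/* wrong: zr is overwritten before zi is computed */
zr = zr*zr - zi*zi + cr;
zi = 2.0*zr*zi + ci;           /* uses the NEW zr -> rubbish */

/* right: keep the new zr in a temp until zi is done */
double t = zr*zr - zi*zi + cr;
zi = 2.0*zr*zi + ci;           /* still sees the old zr */
zr = t;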
baldr 01 May 2013, 19:54
fredlllll,
I overlooked this too. With that change the picture becomes clearer.
[attached image]
To speed things up, MMX/SSE/AVX/FMA could be used. Use the search, you'll find many examples here.
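For example, SSE2 lets you iterate two pixels at once on packed doubles. A minimal sketch in C intrinsics (the function and variable names are made up for illustration, not taken from the code in this thread):

Code:
#include <emmintrin.h>

/* Iterate two pixels side by side: cr/ci hold the two c values,
   count[] receives the per-pixel iteration counts. */
void mandel2(__m128d cr, __m128d ci, int maxiter, int count[2])
{
    __m128d zr = _mm_setzero_pd();
    __m128d zi = _mm_setzero_pd();
    const __m128d four = _mm_set1_pd(4.0);
    const __m128d one  = _mm_set1_pd(1.0);
    __m128d counts = _mm_setzero_pd();

    for (int i = 0; i < maxiter; i++) {
        __m128d zr2 = _mm_mul_pd(zr, zr);
        __m128d zi2 = _mm_mul_pd(zi, zi);
        /* lanes still inside |z|^2 <= 4 yield an all-ones mask */
        __m128d mask = _mm_cmple_pd(_mm_add_pd(zr2, zi2), four);
        if (!_mm_movemask_pd(mask))
            break;                               /* both lanes escaped */
        counts = _mm_add_pd(counts, _mm_and_pd(mask, one));
        __m128d t = _mm_add_pd(_mm_sub_pd(zr2, zi2), cr);         /* new zr in a temp */
        zi = _mm_add_pd(_mm_mul_pd(_mm_add_pd(zr, zr), zi), ci);  /* 2*zr*zi + ci */
        zr = t;
    }
    double c[2];
    _mm_storeu_pd(c, counts);
    count[0] = (int)c[0];
    count[1] = (int)c[1];
}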
fredlllll 02 May 2013, 16:01
Well, I hoped to do it with the FPU only. The VS2010 compiler also uses only the FPU and is 10% faster. But when I look at the disassembly I don't get what it's doing.
_________________
--for science
AsmGuru62 02 May 2013, 17:04
Nice.
There are rumours that compilers can't produce good code. We recently talked about this, didn't we?
fredlllll 02 May 2013, 17:26
Well, if good == human-readable, then yes.
But as it seems, the VS2010 compiler generates code that is 10% faster than mine, and that using only the FPU. Maybe because it takes the CPU pipeline into account? Or maybe because it uses only the fastest instructions? First I thought getting the memory accesses out of my loop would help, but that didn't do anything =/
tthsqe 02 May 2013, 21:20
What is your baseline speed in iterations per second? (Count each evaluation of z = z^2 + c as an iteration and see how many your program is doing per second.) I remember looking at VS's output for this kind of stuff and it was doing a 4x loop unroll; this could be the cause of the speed gain.
The real gain in Mandelbrot is obtained through multi-threading and vectorization...
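Roughly like this, as a sketch (render_frame and total_iterations are stand-ins for whatever your code actually exposes):

Code:
#include <stdio.h>
#include <time.h>

extern void render_frame(void);      /* assumed: renders one full frame */
extern long long total_iterations;   /* assumed: incremented per z = z^2 + c */

int main(void)
{
    clock_t t0 = clock();
    render_frame();
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("%.3g iterations per second\n", (double)total_iterations / secs);
    return 0;
}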
fredlllll 02 May 2013, 22:16
Uhm, at the moment I'm testing with 3000 runs of 100 iterations at 38x78 pixels (using console output), which takes around 0.85 to 0.9 seconds with my code and around 0.75 to 0.85 with the Visual Studio 2010 code.
So that should be around 1,000,000,000 iterations per second or more (3000 * 100 * 38 * 78 ≈ 8.9 * 10^8 iterations in ~0.85 s)? Got an i5 @ 3.3 GHz.
Hmm, yep, maybe there is an unroll in the asm.
So multithreading, okay (I want to use this as a "core", so one instance per physical/logical core should run), but what do you mean by vectorization? Unfortunately my math education doesn't reach that far. I'm "only" a computer science bachelor student =/ The only stuff we really learn is Java *pukes*, C++ only for some algorithms. I think I'm the only one who does asm in the whole course (50 people) =/
hopcode 03 May 2013, 00:48
Pelles C outputs better code using the same FPU instruction set; I estimate ~3x faster than yours above. After sorting through your source, taking a mean between reciprocal throughput and latency, I get an optimistic ~250 cycles, against an estimated 50 (pessimistic) using SSE2. Then, with vectors of double precision, that is reducible to ~35 cycles.
Considering a simple prefetch strategy on the vectors, 20 cycles; then, if you run it on 4 cores, 9 cycles. So, pessimistically, 25x faster than the code above... before speaking about "vectors" AND "cores" from a Java point of view.
_________________
⠓⠕⠏⠉⠕⠙⠑
fredlllll 03 May 2013, 06:51
So you still didn't tell me what you mean by that vector stuff (I hope you mean the x, y, z stuff).
And what do you mean by prefetch? Where can I prefetch here?
tthsqe 03 May 2013, 07:25
Vectors simply refer to using the SIMD instructions that operate on multiple channels of data.
Example:
Code:
fld   [b]
fld   [c]
faddp st1, st0   ; let's add numbers one at a time
fstp  [c]        ; c = b + c

compared with

Code:
vaddpd a, b, c   ; let's add numbers four at a time

I don't believe prefetching is relevant here to making your drawer faster.
fredlllll 03 May 2013, 08:42
What is vaddpd doing exactly with its 3 arguments? Adding 2 and 3 and putting the result into 1?
_________________
--for science
hopcode 03 May 2013, 10:54
tthsqe wrote: I don't believe prefetching is relevant here to making your drawer faster.

It is indeed, always. Even with non-x86 CPUs, but especially with newer processors. The number of iterations may be the reason the data doesn't fit the cache when outputting, and/or when outputting from different threads. Another reason is designing for multiprocessing up front, to avoid rewriting the code later (in most cases it means a complete rewrite). There is no multiprocessing without a cache strategy, even when only outputting, as in the case above. There are several other reasons.
First learn walking: http://www.songho.ca/misc/sse/sse.html
Then learn flying: http://www.agner.org/optimize/#vectorclass
_________________
⠓⠕⠏⠉⠕⠙⠑
revolution 03 May 2013, 11:16
Indeed. If you want to improve computation throughput then optimising data flow is generally far more important than instruction selection. If you ignore the data flow then you are probably wasting your time worrying about saving a clock tick, or two, by using different instructions when the data is not yet available for the CPU to process. CPUs are fast, DRAM is slow, VRAM is slower.
Use your cache and stream the data through the CPU.
Optimising for speed is hard. Really hard. A well optimised program will need a lot of work even for one system. And it gets even harder when you want it to run fast on all the different systems out there.
tthsqe 03 May 2013, 20:19
Wait - I was in no way discounting the need to consider optimizing data flow/organization. I was simply doubting that it plays a role in a double precision Mandelbrot fractal drawer.
(A) In double precision, we do a bunch of calculations in CPU registers and spend 0.0001% of the time writing results to memory; the instruction stream most likely fits into the instruction cache.
(B) In multiprecision, we do a bunch of calculations in the data cache because we have no choice; the instruction stream might not fit into the instruction cache.
I don't see how cache management plays a big role in (A), but I have definitely improved performance in (B) by optimizing data locations / reducing instruction stream length...
Also, vaddpd a, b, c does a = b + c.
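If it helps, the same operation expressed as a C intrinsic (illustrative only; the function name is made up):

Code:
#include <immintrin.h>

/* What vaddpd does: four packed doubles added lane by lane, a = b + c */
__m256d add4(__m256d b, __m256d c)
{
    return _mm256_add_pd(b, c);
}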
hopcode 04 May 2013, 00:35
eh eh eh, today hopcode is in a good mood
tthsqe wrote: ...a bunch of calculations in CPU registers and spend 0.0001% of the time writing results to memory...

Suppose you have a routine of 100 theoretical cycles with simple data movements, reading or writing, no cores, no prefetch. Consider the size of the input/output vector and apply this formula of my own, which I hereby place in the public domain:

Code:
theory     | factor    | actual cycles           | vector size
-----------+-----------+-------------------------+-----------------------
100 cycles | K=4, 2^2  | 100 + (100 / 4) = 125cy | fits the L1 cache
100 cycles | K=2, 2^1  | 100 + (100 / 2) = 150cy | fits half the L2 cache
100 cycles | K=1, 2^0  | 100 + (100 / 1) = 200cy | larger than L2

Then you can take the theoretical reciprocal throughput from the manual and subtract those cycles, and it works out:

a = (t - r) + t/k

The above may vary when trying prefetching/multi-core; factor K increases/decreases accordingly. I don't know about L3; AFAIK it is not that enthusiastic beyond the performance above.

Here follows the Pelles C output, 64-bit. As you can see it works scalar, and after all it's not that bad for what is its C job:

Code:
$getMandel:
        push     rbx
        push     rsi
        push     rdi
        push     r12
        sub      rsp,72
        movdqa   [rsp+(0)],xmm6
        movdqa   [rsp+(16)],xmm7
        movdqa   [rsp+(32)],xmm8
        movdqa   [rsp+(48)],xmm9
        mov      eax,dword [rsp+(144)]
        mov      r10d,dword [rsp+(152)]
        mov      r11,qword [rsp+(160)]
        movsd    xmm0,qword [rdx+(8)]
        subsd    xmm0,qword [rcx+(8)]
        cvtsi2sd xmm1,r10d
        divsd    xmm0,xmm1
        movsd    xmm1,qword [rdx]
        subsd    xmm1,qword [rcx]
        cvtsi2sd xmm2,eax
        divsd    xmm1,xmm2
        movsd    xmm2,qword [rdx+(8)]
        test     r10d,r10d
        jle      @13
        movsd    xmm3,qword [(@28) wrt rip]
        xor      edx,edx
        xor      ebx,ebx
@14:    movsd    xmm4,qword [rcx]
        test     eax,eax
        jle      @20
        xor      esi,esi
@18:    mov      edi,r8d
        test     edi,edi
        jle      @24
        xorpd    xmm5,xmm5
        xorpd    xmm6,xmm6
@22:    movapd   xmm7,xmm5
        mulsd    xmm7,xmm5
        movapd   xmm8,xmm6
        mulsd    xmm8,xmm6
        subsd    xmm7,xmm8
        movapd   xmm5,xmm7
        addsd    xmm5,xmm4
        movapd   xmm7,xmm5
        mulsd    xmm7,xmm6
        mulsd    xmm7,xmm3
        movapd   xmm6,xmm7
        addsd    xmm6,xmm2
        sub      edi,1
        movsd    xmm7,qword [r9]
        movapd   xmm8,xmm5
        mulsd    xmm8,xmm5
        movapd   xmm9,xmm6
        mulsd    xmm9,xmm6
        addsd    xmm8,xmm9
        comisd   xmm7,xmm8
        jb       @24
        test     edi,edi
        jg       @22
@24:    mov      r12d,ebx
        add      ebx,1
        movsxd   r12,r12d
        neg      edi
        add      edi,r8d
        mov      dword [r11+r12*4],edi
        add      esi,1
        addsd    xmm4,xmm1
        cmp      esi,eax
        jl       @18
@20:    add      edx,1
        subsd    xmm2,xmm0
        cmp      edx,r10d
        jl       @14
@13:    movdqa   xmm6,[rsp+(0)]
        movdqa   xmm7,[rsp+(16)]
        movdqa   xmm8,[rsp+(32)]
        movdqa   xmm9,[rsp+(48)]
        add      rsp,72
        pop      r12
        pop      rdi
        pop      rsi
        pop      rbx
        ret

Cheers,
_________________
⠓⠕⠏⠉⠕⠙⠑
hopcode 04 May 2013, 10:23
The numbers above need some improvement. As far as I have time, I will improve the formula with more parameters and polish it. Some examples against it:

a) A routine of 12 cycles, "pairable" down to 8; no SSE, no cores, no prefetch. The CPU does its best, renaming registers and writing to memory once.

Code:
larger than L2 -> 5.3 cy
fits L2        -> 4.5 cy
half L2        -> 4.2 cy
fits L1        -> 5.1 cy

b) A routine of 55 theoretical cycles, "pairable" down to 35 (after calculating reciprocal throughput), reading the cache line; SSE, no cores, no prefetch. Note how strangely it behaves for the same instruction stream.

Code:
larger than L2 -> 41 cy   data reads take 80% of cycles
fits L2        -> 23 cy   data reads take 53% of cycles
half L2        -> 23 cy   data reads take 51% of cycles
fits L1        -> 23 cy   data reads take 50% of cycles
1/4 L1         -> 33 cy   data reads take 36% of cycles  <---

Cheers,
_________________
⠓⠕⠏⠉⠕⠙⠑

Last edited by hopcode on 04 May 2013, 10:46; edited 2 times in total