flat assembler
Message board for the users of flat assembler.
randall 10 Jan 2014, 09:23
A simple Mandelbrot generation program using my new code framework. It requires an AVX2-capable CPU.
I have uploaded a second version of the program. The main loop has been rewritten to process 16 pixels per iteration, and eight threads are used. This version is two times faster than the previous one: it takes 20 ms (minimum time) to generate a 1024x1024 image on a Haswell i7.
Last edited by randall on 24 Jan 2014, 12:38; edited 5 times in total
randall 10 Jan 2014, 11:22
This is just a first try; I will optimize it later.
Can you tell me what bubbles you see in it? Thanks.
tthsqe 10 Jan 2014, 11:52
Suppose that the loop begins on clock cycle 0.
Recall that 2 instructions can be issued per cycle (except for a double add/sub). To the right of each instruction is the number of clock cycles during which it could be executing, packed optimistically.
Code:
    t = mul(y,y);         ; 5 cycles
    B = fmadd(x,x,a);     ; 5 cycles
    C = mul(x,y);         ; 5 cycles
    A = fmadd(x,x,t);     ; 5 cycles
    B = sub(B,t);         ; 3 cycles
    C = fmadd(2.0,C,b);   ; 5 cycles
    ; for next loop, y = C, x = B
The next iteration's y is C, which is the result of a mul followed by an fmadd (5 + 5 cycles). Since the "t" in the next iteration therefore cannot start executing until clock cycle A (hex, i.e. 10), you can get at best 10 cycles per iteration. However, since each iteration contains 6 issues, it could be as low as 3 cycles per iteration (not counting the iteration count calculation). What you have to do is put more independent calculations into the loop, i.e. instead of just doing 8 points, do 32 points per loop iteration. The CPUs are smart enough to fill in the bubbles, and you should get close to the peak performance for your processor.
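The idea of adding independent work can be sketched in plain scalar C. This is only an illustration, not the program's actual code: the names `iterate_one`, `iterate_many` and `LANES` are made up for the example, and each fmadd is written as a separate multiply and add (a compiler may or may not fuse them). In `iterate_one` every step waits on the x and y of the previous iteration. In `iterate_many` the inner loop carries several unrelated chains, so an out-of-order core has other work to issue while one chain is waiting.

```c
#include <assert.h>

#define LANES 4  /* hypothetical number of independent points per pass */

/* One point at a time: a single dependency chain through x and y. */
int iterate_one(float a, float b, int max_iter)
{
    assert(max_iter >= 0);
    float x = 0.0f, y = 0.0f;
    int n = 0;
    while (n < max_iter) {
        float t = y * y;        /* t = mul(y,y)       */
        float B = x * x + a;    /* B = fmadd(x,x,a)   */
        float C = x * y;        /* C = mul(x,y)       */
        float A = x * x + t;    /* A = fmadd(x,x,t)   */
        B = B - t;              /* B = sub(B,t)       */
        C = 2.0f * C + b;       /* C = fmadd(2.0,C,b) */
        if (!(A < 4.0f))
            break;              /* point has escaped  */
        x = B;
        y = C;
        n++;
    }
    return n;
}

/* LANES points per pass: the chains do not depend on each other. */
void iterate_many(const float a[LANES], const float b[LANES],
                  int max_iter, int count[LANES])
{
    assert(max_iter >= 0);
    float x[LANES] = {0.0f}, y[LANES] = {0.0f};
    int live[LANES];
    for (int i = 0; i < LANES; i++) {
        count[i] = 0;
        live[i] = 1;
    }
    for (int n = 0; n < max_iter; n++) {
        for (int i = 0; i < LANES; i++) {
            float t = y[i] * y[i];
            float B = x[i] * x[i] + a[i];
            float C = x[i] * y[i];
            float A = x[i] * x[i] + t;
            B = B - t;
            C = 2.0f * C + b[i];
            live[i] &= (A < 4.0f);  /* stays 0 once the point escapes */
            count[i] += live[i];
            x[i] = B;
            y[i] = C;
        }
    }
}
```

Both functions return the same counts for the same points; only the amount of independent work per pass differs. In the real program each "lane" here would itself be a full ymm register of 8 points.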
HaHaAnonymous 10 Jan 2014, 14:54
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 18:33; edited 1 time in total
tthsqe 10 Jan 2014, 21:37
HaHa, I think there are programs on Intel's website that can emulate the behavior of unsupported instructions.
randall, the use of fmadd reduced the length of the critical path from 11 to 10 (as I just showed). Also, since you have 256-bit integer instructions, why not use them in the loop?
Code:
    vpxor ymm8,ymm8,ymm8          ; this will hold integers
    mov eax,256
    .loop:
    ...
    vcmpltps ymm7,ymm6,[.c_4_0]   ; all-ones mask in lanes where ymm6 < [.c_4_0]
    vpsubd ymm8,ymm8,ymm7         ; subtracting -1 adds 1 to those lanes
    sub eax,1
    jnz .loop
    vcvtdq2ps ymm0,ymm8           ; convert the integer counts to floats
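For readers who want to see what the mask trick computes, here is a scalar C model of the loop above. It is a sketch under assumptions: the function name `count_iterations` and the `LANES` constant are invented for the example, `ymm6` is assumed to hold x² + y², and `[.c_4_0]` is assumed to be the constant 4.0. A true comparison yields -1 (all bits set), so subtracting the mask adds 1 to every lane that is still inside the radius.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8  /* a 256-bit register holds 8 single-precision lanes */

/* Fixed number of passes, no per-lane branch, as in the suggested loop. */
void count_iterations(const float a[LANES], const float b[LANES],
                      int passes, float out[LANES])
{
    assert(passes >= 0);
    float x[LANES] = {0.0f}, y[LANES] = {0.0f};
    int32_t counter[LANES] = {0};                 /* vpxor ymm8,ymm8,ymm8 */

    for (int n = 0; n < passes; n++) {
        for (int i = 0; i < LANES; i++) {
            float mag2 = x[i] * x[i] + y[i] * y[i];
            int32_t mask = (mag2 < 4.0f) ? -1 : 0; /* vcmpltps */
            counter[i] -= mask;                    /* vpsubd   */
            float nx = x[i] * x[i] - y[i] * y[i] + a[i];
            float ny = 2.0f * x[i] * y[i] + b[i];
            x[i] = nx;
            y[i] = ny;
        }
    }
    for (int i = 0; i < LANES; i++) {
        out[i] = (float)counter[i];                /* vcvtdq2ps */
    }
}
```

Lanes that have escaped keep iterating until the values overflow to infinity or NaN. The comparison is then false, so their counters stop changing without any branch.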
HaHaAnonymous 11 Jan 2014, 00:52
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 18:36; edited 1 time in total
randall 12 Jan 2014, 13:43
HaHaAnonymous wrote:
Sure, see http://software.intel.com/en-us/articles/intel-software-development-emulator
HaHaAnonymous 12 Jan 2014, 15:23
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 18:36; edited 1 time in total
Melissa 12 Jan 2014, 20:18
Heh, this Intel tool is very slow; it seems that it emulates all instructions.
Anyway, the program is really fast even when emulated.
randall 23 Jan 2014, 15:58
I have uploaded an optimized version of the program (it is now two times faster: 20 ms to generate a 1024x1024 image on a Haswell i7).
dstyl 27 Jul 2015, 13:57
It even flies on my Core 2 Duo with a Penryn core, INSANE. I always used a fractal program written in C, and it took nearly 2x the time.
HaHaAnonymous 31 Jul 2015, 04:31
Quote:
It should not even run (it requires AVX2)... How is that possible!?
M68kael 02 Aug 2015, 08:45
I tried to assemble the Mandelbrot program:
    $ fasm mandelbrot2.asm
    flat assembler version 1.71.39 (16384 kilobytes memory)
    3 passes, 1600 bytes.
    $ ./mandelbrot2
    Illegal instruction (core dumped)
    -----------------------------------------------------------------
    Some info:
    UBUNTU-14.04.2-desktop-amd64.iso
    $ more /proc/cpuinfo
    processor : 0
    vendor_id : GenuineIntel
    cpu family : 6
    model : 42
    model name : Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
    --------------------------------------------------------
    $ lscpu
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 4
    On-line CPU(s) list: 0-3
    Thread(s) per core: 2
Do I need a new distro, a new computer, or AVX2? Please help me get going; I am not so eager to switch back to Windows.
Btw. I have installed sde-external-7.21.0-2015-04-01-lin.tar.bz2. Can I use that, and how?
randall 06 Aug 2015, 07:38
You need AVX2 (your CPU supports AVX but not AVX2).
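The same check can be made mechanically on the `flags` line of `/proc/cpuinfo`. The helper below is a minimal sketch, and `has_cpu_flag` is a name invented for this example. It looks for the flag as a whole whitespace-separated token, so that `avx` is not mistaken for `avx2` or the other way round.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* True when `name` occurs in `flags` as a complete whitespace-separated token. */
bool has_cpu_flag(const char *flags, const char *name)
{
    assert(flags != NULL && name != NULL);
    size_t len = strlen(name);
    if (len == 0) {
        return false;
    }
    const char *p = flags;
    while ((p = strstr(p, name)) != NULL) {
        bool starts = (p == flags) || isspace((unsigned char)p[-1]);
        bool ends = (p[len] == '\0') || isspace((unsigned char)p[len]);
        if (starts && ends) {
            return true;
        }
        p++;  /* partial match only, keep scanning */
    }
    return false;
}
```

Called with the flags listed in the post above and the name `avx2`, it returns false, while `avx` returns true, which matches the "Illegal instruction" crash.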
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.