randall

Joined: 03 Dec 2011
Posts: 152
Location: Poland

# Mandelbrot [AVX2]

A simple Mandelbrot generation program using my new code framework. Requires an AVX2-capable CPU.

I have uploaded a second version of the program. The main loop has been rewritten to process 16 pixels per iteration, and eight threads are used. This version is twice as fast as the previous one: it takes 20 ms (minimum time) to generate a 1024x1024 image on a Haswell i7.


Last edited by randall on 24 Jan 2014, 12:38; edited 5 times in total
10 Jan 2014, 09:23
tthsqe

Joined: 20 May 2009
Posts: 701
If you are going to use AVX2, why not use fmadd?
I have a feeling this might reduce the latency of your iteration.
Compute the three numbers
 Code:
    A = x^2+y^2;
    B = x^2-y^2+a;
    C = 2*x*y+b;

as
 Code:
    t = mul(y,y);
    B = fmadd(x,x,a);
    C = mul(x,y);
    A = fmadd(x,x,t);
    B = sub(B,t);
    C = fmadd(2.0,C,b);

I am not sure whether it could also worsen performance, since your loop has some bubbles in it.
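To make the rewrite above easier to check, here is a minimal scalar sketch in C that evaluates both forms and compares them. The names `step_t`, `step_plain`, `step_fma`, `fmadd` and `check_forms_agree` are made up for illustration, and `fmadd` is modeled as an ordinary `p*q + r` (a hardware FMA rounds once instead of twice), so this only demonstrates that the two orderings compute the same three numbers, not anything about speed.

```c
#include <assert.h>

/* Results of one step for z = x + i*y and c = a + i*b (hypothetical names). */
typedef struct { float A, B, C; } step_t;

/* Stand-in for fmadd: p*q + r. A hardware FMA rounds only once;
   this plain C version rounds twice, which is enough for a comparison. */
static float fmadd(float p, float q, float r) { return p * q + r; }

/* Straightforward form: A = x^2+y^2, B = x^2-y^2+a, C = 2*x*y+b. */
static step_t step_plain(float x, float y, float a, float b) {
    step_t r;
    r.A = x * x + y * y;
    r.B = x * x - y * y + a;
    r.C = 2.0f * x * y + b;
    return r;
}

/* Reordered form from the post: one mul feeds two later operations. */
static step_t step_fma(float x, float y, float a, float b) {
    step_t r;
    float t = y * y;            /* t = mul(y,y)        */
    r.B = fmadd(x, x, a);       /* B = fmadd(x,x,a)    */
    r.C = x * y;                /* C = mul(x,y)        */
    r.A = fmadd(x, x, t);       /* A = fmadd(x,x,t)    */
    r.B = r.B - t;              /* B = sub(B,t)        */
    r.C = fmadd(2.0f, r.C, b);  /* C = fmadd(2.0,C,b)  */
    return r;
}

static float abs_diff(float u, float v) { return u > v ? u - v : v - u; }

/* Asserts that both forms agree up to a small rounding tolerance. */
static void check_forms_agree(float x, float y, float a, float b) {
    step_t p = step_plain(x, y, a, b);
    step_t f = step_fma(x, y, a, b);
    assert(abs_diff(p.A, f.A) < 1e-5f);
    assert(abs_diff(p.B, f.B) < 1e-5f);
    assert(abs_diff(p.C, f.C) < 1e-5f);
}
```

Calling `check_forms_agree(0.5f, 0.25f, -0.75f, 0.125f)` exercises both versions on one point; the tolerance is there because the two orderings round intermediate results differently.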
10 Jan 2014, 11:05
randall

Joined: 03 Dec 2011
Posts: 152
Location: Poland
This is just a first try; I will optimize it later.
Can you tell me what bubbles you see in it? Thanks.
10 Jan 2014, 11:22
tthsqe

Joined: 20 May 2009
Posts: 701
Suppose that the loop begins on clock cycle 0.
Recall that 2 instructions can be issued per cycle (except for a double add/sub).
To the right I have optimistically packed the clock cycles during which each instruction could be executing:
 Code:
                               0 1 2 3 4 5 6 7 8 9 A B C D E F
    t = mul(y,y);              X X X X X
    B = fmadd(x,x,a);            X X X X X
    C = mul(x,y);              X X X X X
    A = fmadd(x,x,t);                    X X X X X
    B = sub(B,t);                          X X X
    C = fmadd(2.0,C,b);                  X X X X X
    ; for next loop, y = C, x = B

Since the "t" of the next iteration cannot start executing until clock cycle A,
you can get at best 10 cycles per iteration. However, since each iteration contains 6 issues,
it could be as low as 3 cycles per iteration (not counting the iteration count calculation).

What you have to do is put more independent calculations into the loop,
i.e. instead of doing just 8 points, do 32 points per loop iteration.
CPUs are smart enough to fill in the bubbles, and you should get close to the peak performance of your processor.
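As a rough illustration of "more independent calculations per loop iteration", here is a scalar C sketch in which each array slot stands in for one register's worth of points. `LANES` and `iterate_lanes` are hypothetical names, the value 4 is only an assumption (e.g. 4 ymm registers of 8 floats would give the 32 points mentioned above), and whether the chains actually overlap depends on the compiler and CPU.

```c
#include <assert.h>

#define LANES 4  /* independent points per loop iteration (assumed value) */

/* Iterates LANES independent points c = a[l] + i*b[l] together.
   count[l] receives the number of iterations that began with |z|^2 < 4. */
static void iterate_lanes(const float a[LANES], const float b[LANES],
                          int count[LANES], int max_iter) {
    assert(max_iter >= 0);
    float x[LANES] = {0}, y[LANES] = {0};
    for (int l = 0; l < LANES; l++) count[l] = 0;

    for (int it = 0; it < max_iter; it++) {
        /* No lane reads another lane's x or y, so the dependency chain of
           one lane can fill the latency gaps (bubbles) of the others. */
        for (int l = 0; l < LANES; l++) {
            float t  = y[l] * y[l];
            float xx = x[l] * x[l];
            float xy = x[l] * y[l];
            count[l] += (xx + t < 4.0f);   /* still inside the radius-2 disc? */
            x[l] = xx - t + a[l];
            y[l] = 2.0f * xy + b[l];
        }
    }
}
```

With `max_iter = 16`, a point such as c = 0 stays bounded and ends with a count of 16, while c = 2 + 2i leaves the disc after the first step and ends with a count of 1.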
10 Jan 2014, 11:52
HaHaAnonymous

Joined: 02 Dec 2012
Posts: 1171
Location: Unknown
Stupid post removed.

Last edited by HaHaAnonymous on 28 Feb 2015, 18:33; edited 1 time in total
10 Jan 2014, 14:54
tthsqe

Joined: 20 May 2009
Posts: 701
HaHa, I think there are programs on Intel's website that can emulate the behavior of unsupported instructions.
randall, the use of fmadd reduced the length of the critical path from 11 to 10 (as I just showed). Also, since you have 256-bit integer instructions, why not use them in the loop?
 Code:
    vpxor ymm8,ymm8,ymm8          ; this will hold integers
    mov eax,256
  .loop:
    ...
    vcmpltps ymm7,ymm6,[.c_4_0]   ; lanes that pass the compare become all ones (-1)
    vpsubd ymm8,ymm8,ymm7         ; subtracting -1 adds 1 to those lanes
    sub eax,1
    jnz .loop
    vcvtdq2ps ymm0,ymm8
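The counting trick in that loop relies on the compare producing an all-ones mask, which is -1 when read as a 32-bit integer, so subtracting the mask increments the counter. A small scalar emulation in C, under the assumption that `[.c_4_0]` holds 4.0 in every lane; `cmp_lt_mask` and `sub_mask` are made-up helper names, not real intrinsics:

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8  /* eight 32-bit lanes, as in one ymm register */

/* Emulates vcmpltps: a lane becomes all ones (-1) if v < limit, else 0. */
static void cmp_lt_mask(const float v[LANES], float limit, int32_t mask[LANES]) {
    for (int i = 0; i < LANES; i++)
        mask[i] = (v[i] < limit) ? -1 : 0;
}

/* Emulates vpsubd counter,counter,mask: subtracting -1 adds 1,
   subtracting 0 leaves the lane unchanged. */
static void sub_mask(int32_t counter[LANES], const int32_t mask[LANES]) {
    for (int i = 0; i < LANES; i++) {
        assert(mask[i] == 0 || mask[i] == -1);  /* must be a compare mask */
        counter[i] -= mask[i];
    }
}
```

Applying `cmp_lt_mask` followed by `sub_mask` once per iteration leaves each lane holding the number of iterations in which its value was below the limit, which is what the final `vcvtdq2ps` converts to floats.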
10 Jan 2014, 21:37
HaHaAnonymous

Joined: 02 Dec 2012
Posts: 1171
Location: Unknown
Stupid post removed.

Last edited by HaHaAnonymous on 28 Feb 2015, 18:36; edited 1 time in total
11 Jan 2014, 00:52
randall

Joined: 03 Dec 2011
Posts: 152
Location: Poland

HaHaAnonymous wrote:
 Quote: I think there are programs on Intel's website that can emulate the behavior of unsupported instructions.

And are they available to Linux?

Sure, see http://software.intel.com/en-us/articles/intel-software-development-emulator
12 Jan 2014, 13:43
HaHaAnonymous

Joined: 02 Dec 2012
Posts: 1171
Location: Unknown
Stupid post removed.

Last edited by HaHaAnonymous on 28 Feb 2015, 18:36; edited 1 time in total
12 Jan 2014, 15:23
Melissa

Joined: 12 Apr 2012
Posts: 28
Heh, this Intel tool is very slow; it seems that it emulates all instructions.
Anyway, the program is really fast even when emulated.
12 Jan 2014, 20:18
randall

Joined: 03 Dec 2011
Posts: 152
Location: Poland
I have uploaded an optimized version of the program (it is now twice as fast: 20 ms to generate a 1024x1024 image on a Haswell i7).
23 Jan 2014, 15:58
dstyl

Joined: 23 Jul 2015
Posts: 48
It even flies on my Core 2 Duo with a Penryn core, INSANE. I always used a fractal program written in C, and it took nearly 2x the time.
27 Jul 2015, 13:57
HaHaAnonymous

Joined: 02 Dec 2012
Posts: 1171
Location: Unknown
 Quote: It even flies on my Core 2 Duo with a Penryn core, INSANE.

It should not even run (it requires AVX2)... How is that possible!?
31 Jul 2015, 04:31
M68kael

Joined: 15 Jul 2015
Posts: 5
Tried to assemble the mandelbrot:

$ fasm mandelbrot2.asm
flat assembler version 1.71.39 (16384 kilobytes memory)
3 passes, 1600 bytes.
$ ./mandelbrot2
Illegal instruction (core dumped)

-----------------------------------------------------------------

Some info:

UBUNTU-14.04.2-desktop-amd64.iso

$ more /proc/cpuinfo

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
--------------------------------------------------------

$ lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3

Do I need a new distro, a new computer, or AVX2?

Btw.

I have installed
sde-external-7.21.0-2015-04-01-lin.tar.bz2

Can I use that, and how?
02 Aug 2015, 08:45
randall

Joined: 03 Dec 2011
Posts: 152
Location: Poland
You need AVX2 (your CPU supports AVX but not AVX2).
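For anyone who wants to check this from a `flags` line like the one posted above, a small C sketch follows. `has_cpu_flag` is a made-up helper, not part of the program; it matches whole space-separated tokens, because a plain substring search for "avx" would also hit "avx2".

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if `name` appears as a whole whitespace-separated token
   in `flags` (the text after "flags :" in /proc/cpuinfo), else 0. */
static int has_cpu_flag(const char *flags, const char *name) {
    assert(flags != NULL && name != NULL);
    size_t len = strlen(name);
    if (len == 0) return 0;

    const char *p = flags;
    while ((p = strstr(p, name)) != NULL) {
        /* The match must start at the beginning or right after whitespace... */
        int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == '\n';
        /* ...and must be followed by whitespace or the end of the string. */
        char end = p[len];
        int ends = (end == '\0') || end == ' ' || end == '\t' || end == '\n';
        if (starts && ends) return 1;
        p += 1;  /* keep searching after this partial match */
    }
    return 0;
}
```

On the flags posted above, `has_cpu_flag(flags, "avx")` is 1 and `has_cpu_flag(flags, "avx2")` is 0, which matches the "Illegal instruction" crash.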

_________________
https://github.com/michal-z
06 Aug 2015, 07:38