flat assembler
Message board for the users of flat assembler.
randall



Joined: 03 Dec 2011
Posts: 152
Location: Poland
Mandelbrot [AVX2]
A simple Mandelbrot generation program using my new code framework. It requires an AVX2-capable CPU.

I have uploaded a second version of the program. The main loop has been rewritten to process 16 pixels per iteration, and eight threads are used. This version is two times faster than the previous one: it takes 20 ms (minimum time) to generate a 1024x1024 image on a Haswell i7.


Attachments:
mandelbrot2.png (269.33 KB, viewed 1176 times)
mandelbrot2.asm (8.59 KB, downloaded 95 times)
mandelbrot.asm (7.75 KB, downloaded 82 times)



Last edited by randall on 24 Jan 2014, 12:38; edited 5 times in total
Post 10 Jan 2014, 09:23
tthsqe



Joined: 20 May 2009
Posts: 666
If you are going to use AVX2, why not use fmadd?
I have a feeling this might reduce the latency of your iteration.
Compute the three numbers

Code:
A = x^2+y^2;
B = x^2-y^2+a;
C = 2*x*y+b;


as

Code:
t = mul(y,y);
B = fmadd(x,x,a);
C = mul(x,y);
A = fmadd(x,x,t);
B = sub(B,t);
C = fmadd(2.0,C,b);


Not sure if it could also worsen performance, since your loop has some bubbles in it.
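
A quick way to check that the rewritten sequence computes the same three values is a scalar model. The sketch below is Python rather than assembly, and its fmadd is a plain multiply followed by an add (two roundings), so it only checks the algebra, not the rounding behaviour of a real fused instruction. The function names direct and with_fmadd are made up for illustration.

```python
def mul(p, q):
    return p * q

def sub(p, q):
    return p - q

def fmadd(p, q, r):
    # Unfused stand-in for a fused multiply-add: p*q + r with two roundings.
    return p * q + r

def direct(x, y, a, b):
    # The three numbers as first written in the post.
    A = x * x + y * y
    B = x * x - y * y + a
    C = 2 * x * y + b
    return A, B, C

def with_fmadd(x, y, a, b):
    # The same three numbers, following the fmadd sequence step by step.
    t = mul(y, y)
    B = fmadd(x, x, a)
    C = mul(x, y)
    A = fmadd(x, x, t)
    B = sub(B, t)
    C = fmadd(2.0, C, b)
    return A, B, C

# Inputs chosen to be exactly representable in binary floating point.
print(with_fmadd(0.5, 0.25, -0.75, 0.125))  # (0.3125, -0.5625, 0.375)
```

For arbitrary inputs the two versions can differ in the last bits, because the operations are grouped differently.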
Post 10 Jan 2014, 11:05
randall



Joined: 03 Dec 2011
Posts: 152
Location: Poland
This is just a first try; I will optimize it later.
Can you tell me what bubbles you see in it? Thanks.
Post 10 Jan 2014, 11:22
tthsqe



Joined: 20 May 2009
Posts: 666
Suppose that the loop begins on clock cycle 0.
Recall that 2 instructions can be issued per cycle (except for a double add/sub).
To the right I have optimistically packed the clock cycles during which each instruction could be executing.

Code:
                           0 1 2 3 4 5 6 7 8 9 A B C D E F
t = mul(y,y);              X X X X X
B = fmadd(x,x,a);            X X X X X
C = mul(x,y);              X X X X X 
A = fmadd(x,x,t);                    X X X X X
B = sub(B,t);                          X X X
C = fmadd(2.0,C,b);                  X X X X X
; for next loop, y = C, x = B 


Since the "t" of the next iteration cannot start executing until clock cycle A,
you get at best 10 cycles per iteration. However, since each iteration contains 6 issues,
it could be as low as 3 cycles per iteration (not counting the iteration count calculation).

What you have to do is put more independent calculations into the loop,
i.e. instead of just doing 8 points, do 32 points per loop iteration.
CPUs are smart enough to fill in the bubbles, and you should get close to the peak performance of your processor.
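
The numbers above can be reproduced with a small dependency model. This is only a sketch in Python: the latencies (5 cycles for mul and fmadd, 3 for sub) are read off the diagram rather than measured, start times ignore the two-per-cycle issue limit, and the names B1 and C1 are hypothetical labels for the intermediate values of B and C.

```python
import math

# Cycles per operation, as drawn in the diagram (an assumption, not a measurement).
LATENCY = {"mul": 5, "fmadd": 5, "sub": 3}

# (result, operation, inputs produced earlier in the same iteration)
ITERATION = [
    ("t",  "mul",   []),
    ("B1", "fmadd", []),
    ("C1", "mul",   []),
    ("A",  "fmadd", ["t"]),
    ("B",  "sub",   ["B1", "t"]),
    ("C",  "fmadd", ["C1"]),
]

def finish_times(ops, latency):
    # Each op starts as soon as all of its inputs are ready.
    done = {}
    for name, op, deps in ops:
        start = max((done[d] for d in deps), default=0)
        done[name] = start + latency[op]
    return done

done = finish_times(ITERATION, LATENCY)

# The next iteration needs x = B and y = C, so it waits for the later of the two.
latency_bound = max(done["B"], done["C"])

# 6 instructions at 2 issued per cycle.
issue_bound = math.ceil(len(ITERATION) / 2)

# Independent 8-point chains needed to cover the latency with useful work.
chains = math.ceil(latency_bound / issue_bound)

print(latency_bound, issue_bound, chains, chains * 8)  # 10 3 4 32
```

With these latencies, four independent chains of 8 points each are enough to hide the 10-cycle dependency, which matches the suggestion of 32 points per loop iteration.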
Post 10 Jan 2014, 11:52
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1173
Location: Unknown
Stupid post removed.


Last edited by HaHaAnonymous on 28 Feb 2015, 18:33; edited 1 time in total
Post 10 Jan 2014, 14:54
tthsqe



Joined: 20 May 2009
Posts: 666
HaHa, I think there are programs on Intel's website that can emulate the behavior of unsupported instructions.
randall, the use of fmadd reduces the length of the critical path from 11 to 10 (as I just showed). Also, since you have 256-bit integer instructions, why not use them in the loop?

Code:
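  vpxor ymm8,ymm8,ymm8   ; this will hold integers (zeroed per-lane counters)
  mov eax,256
    .loop:
...
  vcmpltps ymm7,ymm6,[.c_4_0]   ; lanes where the compare is true become all ones (-1)
  vpsubd ymm8,ymm8,ymm7         ; subtracting -1 adds 1 to those lanes
  sub eax,1
  jnz .loop
  vcvtdq2ps ymm0,ymm8    ; convert the integer counts to floats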
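
The counting trick in this loop is that a true compare yields an all-ones lane, which is -1 as a signed integer, so subtracting the mask increments exactly the lanes that are still below the limit. Below is a scalar Python sketch of that idea. It assumes the compared register holds x^2+y^2 and that the check happens before the update, neither of which is shown in the snippet; mask_lt, sub_lanes and escape_counts are illustrative names.

```python
def mask_lt(values, limit):
    # Model of a per-lane "less than" compare: -1 where true, 0 where false.
    return [-1 if v < limit else 0 for v in values]

def sub_lanes(a, b):
    # Model of a per-lane integer subtract (no 32-bit wraparound modelled).
    return [p - q for p, q in zip(a, b)]

def escape_counts(points, max_iter=256):
    # points is a list of (a, b) constants, one per lane.
    xs = [0.0] * len(points)
    ys = [0.0] * len(points)
    counts = [0] * len(points)
    for _ in range(max_iter):
        mag2 = [x * x + y * y for x, y in zip(xs, ys)]
        # Lanes still inside the radius get +1, escaped lanes get +0.
        counts = sub_lanes(counts, mask_lt(mag2, 4.0))
        nxt = [(x * x - y * y + a, 2.0 * x * y + b)
               for x, y, (a, b) in zip(xs, ys, points)]
        xs = [p[0] for p in nxt]
        ys = [p[1] for p in nxt]
    return counts

print(escape_counts([(0.0, 0.0), (2.0, 0.0), (1.0, 0.0), (-1.0, 0.0)]))
# [256, 1, 2, 256]
```

Escaped lanes eventually overflow to infinity or NaN in this model; the comparison is then false, so their counters simply stop changing and no branch is needed.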

Post 10 Jan 2014, 21:37
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1173
Location: Unknown
Stupid post removed.


Last edited by HaHaAnonymous on 28 Feb 2015, 18:36; edited 1 time in total
Post 11 Jan 2014, 00:52
randall



Joined: 03 Dec 2011
Posts: 152
Location: Poland

HaHaAnonymous wrote:

Quote:
I think there are programs on Intel's website that can emulate the behavior of unsupported instructions.

And are they available for Linux?

Sure, see http://software.intel.com/en-us/articles/intel-software-development-emulator
Post 12 Jan 2014, 13:43
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1173
Location: Unknown
Stupid post removed.


Last edited by HaHaAnonymous on 28 Feb 2015, 18:36; edited 1 time in total
Post 12 Jan 2014, 15:23
Melissa



Joined: 12 Apr 2012
Posts: 26
Heh, this Intel tool is very slow; it seems that it emulates all instructions.
Anyway, the program is really fast even when emulated ;)
Post 12 Jan 2014, 20:18
randall



Joined: 03 Dec 2011
Posts: 152
Location: Poland
I have uploaded an optimized version of the program (it is now two times faster: 20 ms to generate a 1024x1024 image on a Haswell i7).
Post 23 Jan 2014, 15:58
dstyl



Joined: 23 Jul 2015
Posts: 48
It even flies on my Core 2 Duo with a Penryn core, INSANE. The fractal program written in C that I always used took nearly 2x the time.
Post 27 Jul 2015, 13:57
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1173
Location: Unknown

Quote:

It even flies on my Core 2 Duo with a Penryn core, INSANE.


It should not even run (it requires AVX2)... How is that possible!?
Post 31 Jul 2015, 04:31
M68kael



Joined: 15 Jul 2015
Posts: 5
I tried to assemble the Mandelbrot program:

$ fasm mandelbrot2.asm
flat assembler version 1.71.39 (16384 kilobytes memory)
3 passes, 1600 bytes.
$ ./mandelbrot2
Illegal instruction (core dumped)

-----------------------------------------------------------------

Some info:

UBUNTU-14.04.2-desktop-amd64.iso

$ more /proc/cpuinfo

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
--------------------------------------------------------

$ lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2


Do I need a new distro, a new computer, or AVX2?

Please help me get going; I am not so eager to switch back to Windows. ;)

Btw.

I have installed
sde-external-7.21.0-2015-04-01-lin.tar.bz2

Can I use that, and how?
Post 02 Aug 2015, 08:45
randall



Joined: 03 Dec 2011
Posts: 152
Location: Poland
You need AVX2 (your CPU supports AVX but not AVX2).
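
The flags line of /proc/cpuinfo is enough to tell the two apart. Here is a small Python sketch, assuming the usual "flags : ..." layout shown in the listing above; cpu_flags is an illustrative helper name, and the sample text is a shortened copy of that listing. Matching whole tokens matters, because a plain substring search for "avx" would also match "avx2".

```python
def cpu_flags(cpuinfo_text):
    # Return the set of feature flags from the first "flags" line, if any.
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()

# Shortened sample based on the i7-2640M listing in this thread.
SAMPLE = (
    "model name : Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz\n"
    "flags : fpu sse sse2 ssse3 sse4_1 sse4_2 popcnt aes xsave avx lahf_lm\n"
)

flags = cpu_flags(SAMPLE)
print("avx" in flags, "avx2" in flags)  # True False
```

On a Linux machine the same function can be fed the contents of /proc/cpuinfo, for example cpu_flags(open("/proc/cpuinfo").read()).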

_________________
https://github.com/michal-z
Post 06 Aug 2015, 07:38
Copyright © 2004-2016, Tomasz Grysztar.