fun with AVX

tthsqe

Joined: 20 May 2009
Posts: 767

tthsqe 10 Apr 2011, 05:12

Here is the explorer for the Mandelbrot set.
Improvements:
[+] implemented FPU, SSE, 128-bit and 256-bit AVX, and FMA paths
[+] implemented a better AA method
[+] implemented smooth contour shading
[+] implemented black space optimizations
[+] stats are now on the clipboard upon exit
[+] 8 different palettes
[-] still no 32-bit version (see screenshot)
[-] image is no longer drawn upside down

Last edited by tthsqe on 16 May 2013, 19:31; edited 5 times in total

tthsqe 10 Apr 2011, 05:31

This is the new version.
The timer seems to be fixed now that the calls to QueryPerformanceCounter are spaced out. QueryPerformanceFrequency is also called every frame, in case it changes...?

Attachment: FractalExplorer64.zip (237.81 KB)

Last edited by tthsqe on 06 May 2011, 17:41; edited 6 times in total

Alphonso

Joined: 16 Jan 2007
Posts: 295

Alphonso 10 Apr 2011, 23:42

Nice.

Black Zoom, Depth: 1000

Code:

                 FPU     SSE     AVX128  AVX256
2500k @ 4.4GHz   18.5    58.9    63.8    115.0
2500k @ 2.3GHz   9.7     30.8    33.3    60.0

idle

Joined: 06 Jan 2011
Posts: 440
Location: Ukraine

idle 11 Apr 2011, 06:41

Are there any screenshots? (We are on Win32.)

tthsqe 11 Apr 2011, 08:15

Oh, good point. I forgot about Win32. New version on the way...

Madis731

Joined: 25 Sep 2003
Posts: 2138
Location: Estonia

Madis731 16 Apr 2011, 16:32

http://ark.intel.com/Compare.aspx?ids=52214,33909
I get different GHz numbers from Intel. I get that Alphonso did some clocking, but was the Wolfdale CPU also (under-) clocked?

My T9300 http://ark.intel.com/Product.aspx?id=33917 got:

Code:

                FPU     SSE
T9300 @ 2.5GHz  3.1     15.2

total blackness x1000

Alphonso 18 Apr 2011, 01:40

Just trying to show some correlation to tthsqe's clocks. Here are some results from a C2D P8400, which seems to scale better than my Sandy Bridge result.

Code:

                FPU     SSE
P8400 @ 3.0GHz  3.9     19.0
P8400 @ 2.0GHz  2.6     12.7

I wonder why there is a ~4% difference from your T9300 (T9300 SSE: 15.2/2.5*3 = 18.2). If it were really running at 2.4GHz (12 x 200MHz), then 15.2/2.4*3 = 19.
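
The clock-scaling arithmetic in this post can be checked with a short Python sketch. It assumes the benchmark score scales linearly with core clock, which is the assumption the post itself makes; the helper name `scale_score` is made up for illustration.

```python
def scale_score(score, measured_ghz, target_ghz):
    """Rescale a benchmark score linearly from one core clock to another."""
    return score / measured_ghz * target_ghz

t9300_sse = 15.2           # T9300 SSE result reported earlier in the thread
p8400_sse_at_3ghz = 19.0   # P8400 SSE result at 3.0GHz from the table above

# Case 1: assume the T9300 really ran at its nominal 2.5GHz.
at_nominal = scale_score(t9300_sse, 2.5, 3.0)

# Case 2: assume it actually ran at 12 x 200MHz = 2.4GHz.
at_2_4 = scale_score(t9300_sse, 2.4, 3.0)

# Relative shortfall of case 1 against the P8400, in percent.
gap_percent = (p8400_sse_at_3ghz - at_nominal) / p8400_sse_at_3ghz * 100

print(f"scaled from 2.5GHz: {at_nominal:.2f}")
print(f"scaled from 2.4GHz: {at_2_4:.2f}")
print(f"gap at 2.5GHz: {gap_percent:.1f}%")
```

Under the 2.4GHz assumption the scaled T9300 score lines up with the P8400 figure, which is what makes the lower real clock plausible.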

Madis731 18 Apr 2011, 08:36

Alphonso wrote:

I wonder why there is a ~4% difference from your T9300 (T9300 SSE: 15.2/2.5*3 = 18.2). If it were really running at 2.4GHz (12 x 200MHz), then 15.2/2.4*3 = 19.

This might be true, because this laptop (a Mitac T8222J) is 4+ years old and its last BIOS update was around 2007. CPU-Z did give me 'off' results (as I remember), and this Mandelbrot benchmark might just be the proof.

Another set of results, with turbo on, 2 cores/4 threads (nominal 3.2GHz):

Code:

                  FPU     SSE
i5-650 @ 3.33GHz  6.3     22.6

_________________
My updated idol Very Happy

http://www.agner.org/optimize/

bitRAKE

Joined: 21 Jul 2003
Posts: 4260
Location: vpcmpistri

bitRAKE 19 Apr 2011, 04:23

Code:

                      FPU     SSE
L5410 x2 @ 2.33GHz    12.04   59.45

There is a small rectangle in the lower right which isn't being updated.

Madis731 19 Apr 2011, 11:24

T9300@2.4GHz confirmed!

That small rectangle is 3x2 pixels in size; the left column (2 pixels) stays red and the others stay black (unless updated with other colours). That is in FPU mode. In SSE mode, the block gets larger.

I think it has got something to do with jnc / jnz (jnb / jne) differences.

tthsqe 20 Apr 2011, 04:56

Haha, I thought the bottom right corner would be easily overlooked; I never had a problem with it.
Actually, that behaviour was completely intended and is not a bug. The reason is that each thread handles many (6-16) unpredictably distributed points at once. When one of these points goes off-screen, the whole thread terminates instead of trying to draw off-screen, as that gives unpredictable/fatal results. The downside is that the points that were in mid-calculation get ignored.
But now that I have it drawing to memory, going a little over is not an issue, and this will be corrected when I post a 32-bit update.

tthsqe 06 May 2011, 00:28

Results from the new version:

Code:

       Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
ACTUAL CLOCK: 4000 MHz
           MAX    MIN
    FPU:  23.02  7.072
    SSE:  56.46  20.02
 AVX128:   58.1  26.38
 AVX256:  114.8  51.56
4FMA128:  -1.#IO  1.#IO
4FMA256:  -1.#IO  1.#IO
3FMA128:  -1.#IO  1.#IO
3FMA256:  -1.#IO  1.#IO

Code:

Intel(R) Core(TM)2 Quad  CPU   Q8200  @ 2.33GHz
ACTUAL CLOCK: 2333 MHz
           MAX    MIN
    FPU:   6.03   3.34
    SSE:   28.6   19.9
 AVX128:  -1.#IO  1.#IO
 AVX256:  -1.#IO  1.#IO
4FMA128:  -1.#IO  1.#IO
4FMA256:  -1.#IO  1.#IO
3FMA128:  -1.#IO  1.#IO
3FMA256:  -1.#IO  1.#IO

Madis731 06 May 2011, 06:59

Code:

Intel(R) Core(TM) i5 CPU         650  @ 3.20GHz
ACTUAL CLOCK: 3333 MHz
           MAX    MIN
    FPU:   1420  1.298 (MAX was about 18.2)
    SSE:  1.294e+005   12.9 (MAX was about 20 actually)
 AVX128:  -1.#IO  1.#IO
 AVX256:  -1.#IO  1.#IO
4FMA128:  -1.#IO  1.#IO
4FMA256:  -1.#IO  1.#IO
3FMA128:  -1.#IO  1.#IO
3FMA256:  -1.#IO  1.#IO

The results are less stable than with the previous version.

Kuemmel

Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany

Kuemmel 07 May 2011, 06:48

Great work, tthsqe! I only found the thread now; I hadn't checked the board for months... Time for me to upgrade to an AVX CPU, it seems to be worth it :)

tthsqe 07 May 2011, 17:12

What kind of AVX CPU are you thinking of? I think Bulldozer sounds exciting, but if I understand correctly, without fused multiply-add instructions the FX8000 series' peak FPU output is only half that of Sandy Bridge. Basically, we have:
each pair of cores has access to two 128-bit FPU units
each 128-bit FPU can issue one instruction per clock (whether it be mul, add/sub or fmadd)
This is in contrast to Intel's FPU, which can issue a 256-bit mul and a 256-bit add/sub per clock (peak 8 double-precision operations / clock).
Am I understanding correctly?
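
The peak-throughput comparison in this post can be worked through in a few lines of Python. This is only a sketch of the poster's stated understanding, not a statement about the real hardware: the issue rules are taken from the post, while the counts of 4 Bulldozer modules (8 cores) and 4 Sandy Bridge cores are assumptions added for the comparison, and the function names are made up.

```python
DOUBLE_BITS = 64

def lanes(width_bits):
    """Number of double-precision lanes in a vector of the given width."""
    return width_bits // DOUBLE_BITS

def sandy_bridge_ops_per_clock(cores):
    # Per the post: one 256-bit mul plus one 256-bit add/sub per clock per core.
    return cores * (lanes(256) + lanes(256))

def bulldozer_ops_per_clock(modules, use_fma):
    # Per the post: each pair of cores (module) shares two 128-bit FPU units,
    # and each unit issues one instruction per clock.
    # An fmadd counts as two operations per lane, a plain mul or add as one.
    ops_per_lane = 2 if use_fma else 1
    units_per_module = 2
    return modules * units_per_module * lanes(128) * ops_per_lane

# Assumed chip sizes: 4 Sandy Bridge cores versus 4 Bulldozer modules.
print("Sandy Bridge, per core:   ", sandy_bridge_ops_per_clock(1))
print("Sandy Bridge, 4 cores:    ", sandy_bridge_ops_per_clock(4))
print("Bulldozer, 4 mod, no FMA: ", bulldozer_ops_per_clock(4, use_fma=False))
print("Bulldozer, 4 mod, FMA:    ", bulldozer_ops_per_clock(4, use_fma=True))
```

Under these assumptions the non-FMA Bulldozer figure comes out at half the Sandy Bridge figure, and the two only match when every issued instruction is an fmadd.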

Kuemmel 08 May 2011, 21:23

Hm, for the moment I was just thinking about getting a cheap dual-core Sandy Bridge notebook to play around with... but maybe I will really wait until Bulldozer comes out.

I'm really confused about the design of Bulldozer, in particular the FPU cores including AVX and FMA. According to a quick Google search and the latest manuals from AMD, there should be FMA:
http://support.amd.com/us/Processor_TechDocs/47414.pdf

But the instruction latencies in that guide are not very good overall, though maybe I'm not judging them right. Just another reason to get my old benchmark ready for a real comparison :)

Intel seems to be adding FMA only when they go to 22nm.

Madis731 09 May 2011, 15:43

...and "3D transistors" :D

http://www.anandtech.com/print/4313

Kuemmel 16 May 2011, 17:43

Hi tthsqe,

I just looked a bit at your code. Unfortunately I couldn't test the AVX path myself. I see you use 64-bit Windows, so you have double the registers compared to my benchmark; there should be some room for optimization by using more registers to reduce dependencies. Could you try, in your AVX quadratic code, instead of

Code:

vmulpd ym15,ym1,ym10 
vmulpd ym14,ym0,ym0 
vmulpd ym1,ym1,ym1 
vmulpd ym15,ym15,ym0 
vsubpd ym0,ym14,ym1 
vaddpd ym14,ym14,ym1 
vaddpd ym1,ym15,m32[y0]
vaddpd ym0,ym0,m32[x0]
vmovapd m32[t],ym14

this one:

Code:

vmulpd ym15,ym1,ym10 
vmulpd ym14,ym0,ym0 
vmulpd ym1,ym1,ym1 
vmulpd ym15,ym15,ym0 
vsubpd ym13,ym14,ym1 
vaddpd ym12,ym14,ym1 
vaddpd ym0,ym13,m32[x0]
vaddpd ym1,ym15,m32[y0]
vmovapd m32[t],ym12

Hope I didn't f**k it up. I just used two of the spare registers, ym12/ym13, that you mentioned in your ReadMe to try to get rid of the direct dependency. But maybe it doesn't help much; I remember a lot of trial and error with that kind of optimization, as what the core does sometimes doesn't seem very logical...
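
To check that the two instruction sequences above compute the same thing, here is a scalar Python sketch that mirrors them register by register. It assumes that ym10 holds the constant 2.0 and that x0/y0 hold the point c (neither is stated in the post), and it uses one float per register instead of four packed doubles. The function and variable names are made up.

```python
def step_original(x, y, x0, y0, two=2.0):
    """Mirror of the first listing: ym0/ym1 are overwritten in place."""
    r15 = y * two        # vmulpd ym15,ym1,ym10  (assumes ym10 = 2.0)
    r14 = x * x          # vmulpd ym14,ym0,ym0
    r1 = y * y           # vmulpd ym1,ym1,ym1
    r15 = r15 * x        # vmulpd ym15,ym15,ym0  -> 2*x*y
    r0 = r14 - r1        # vsubpd ym0,ym14,ym1   -> x*x - y*y
    r14 = r14 + r1       # vaddpd ym14,ym14,ym1  -> x*x + y*y
    r1 = r15 + y0        # vaddpd ym1,ym15,m32[y0]
    r0 = r0 + x0         # vaddpd ym0,ym0,m32[x0]
    return r0, r1, r14   # new x, new y, t = |z|^2 before the step

def step_renamed(x, y, x0, y0, two=2.0):
    """Mirror of the second listing: results go through spare ym12/ym13."""
    r15 = y * two
    r14 = x * x
    r1 = y * y
    r15 = r15 * x
    r13 = r14 - r1       # vsubpd ym13,ym14,ym1
    r12 = r14 + r1       # vaddpd ym12,ym14,ym1
    r0 = r13 + x0        # vaddpd ym0,ym13,m32[x0]
    r1 = r15 + y0        # vaddpd ym1,ym15,m32[y0]
    return r0, r1, r12

x, y, x0, y0 = 0.3, -0.4, -0.75, 0.1
reference = complex(x, y) ** 2 + complex(x0, y0)   # z^2 + c
a = step_original(x, y, x0, y0)
b = step_renamed(x, y, x0, y0)
print(a, b, reference)
```

Both versions perform the same arithmetic in the same order, so they differ only in which register names carry the intermediate values.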

tthsqe 17 May 2011, 06:35

Max performance decreased very slightly (but noticeably), by about 0.3%.
My guess is that it puts more pressure on the renamer... same dependencies, just more register names.

The code that really needs fixing is the cubic SSE implementation; it is horrible right now.

Kuemmel 17 May 2011, 19:56

...okay, I tried my luck with the cubic SSE. I have hopefully found a way (I haven't run it yet...) to get rid of two "movaps". Here's the version. I hope there's no mistake; I added formula comments so I don't get too confused.

I'll leave the trial and error of reordering the instructions to you. It looks like it's heavily needed, but of course it may not be beneficial, like before...

Code:

movaps xm14,xm0         ; d14 = x
mulpd  xm14,xm0         ; d14 = x*x
movaps xm15,xm1         ; d15 = y
mulpd  xm15,xm1         ; d15 = y*y

movaps xm12,xm10        ; d12 = 3
mulpd  xm12,xm14        ; d12 = 3*x*x
movaps xm13,xm14        ; d13 = x*x
addpd  xm13,xm15        ; d13 = x*x + y*y
movapd m16[t],xm13      ; save d13

subpd  xm12,xm15        ; d12 = 3*x*x - y*y
mulpd  xm1,xm12         ; y_new = y * (- y*y + 3*x*x) = - y*y*y + 3*x*x*y
addpd  xm1,m16[y0]

mulpd  xm15,xm10        ; d15   = 3*y*y
subpd  xm14,xm15        ; d14   = x*x - 3*y*y
mulpd  xm0,xm14         ; x_new = x * (  x*x - 3*y*y) =   x*x*x - 3*x*y*y
addpd  xm0,m16[x0]
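
The formula comments in the listing above can be sanity-checked against a plain z^3 + c computation. The Python sketch below follows the same order of operations with one float per register; it assumes, as the comments suggest, that xm10 holds 3.0 and that x0/y0 hold the point c. The function name `cubic_step` is made up.

```python
def cubic_step(x, y, x0, y0, three=3.0):
    """Scalar mirror of the SSE listing for one z -> z^3 + c step."""
    d14 = x * x                 # x*x
    d15 = y * y                 # y*y
    d12 = three * d14           # 3*x*x  (assumes xm10 = 3.0)
    t = d14 + d15               # x*x + y*y, stored to t
    d12 = d12 - d15             # 3*x*x - y*y
    y_new = y * d12 + y0        # y*(3*x*x - y*y) + y0
    d15 = d15 * three           # 3*y*y
    d14 = d14 - d15             # x*x - 3*y*y
    x_new = x * d14 + x0        # x*(x*x - 3*y*y) + x0
    return x_new, y_new, t

x, y, x0, y0 = 0.3, -0.4, -0.75, 0.1
x_new, y_new, t = cubic_step(x, y, x0, y0)
reference = complex(x, y) ** 3 + complex(x0, y0)
print(x_new, y_new, t, reference)
```

The two results agree up to floating-point rounding, since the real and imaginary parts of z^3 are x^3 - 3xy^2 and 3x^2y - y^3.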
