flat assembler
Message board for the users of flat assembler.

Index > Projects and Ideas > fun with AVX

tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
Here is the explorer for the Mandelbrot set.
improvements:
[+] implemented FPU, SSE, 128- and 256-bit AVX, and FMA paths
[+] implemented better AA method
[+] implemented smooth contour shading
[+] implemented black space optimizations
[+] stats are now copied to the clipboard upon exit
[+] 8 different palettes
[-] still no 32 bit version (see screenshot)
[-] image is no longer drawn upside down


Last edited by tthsqe on 16 May 2013, 19:31; edited 5 times in total
Post 10 Apr 2011, 05:12
tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
This is the new version.
Timer seems to be fixed now that the calls to QueryPerformanceCounter are spaced out. QueryPerformanceFrequency is also called every frame in case it changes ....???


Attachment: FractalExplorer64.zip (237.81 KB, downloaded 617 times)



Last edited by tthsqe on 06 May 2011, 17:41; edited 6 times in total
Post 10 Apr 2011, 05:31
Alphonso



Joined: 16 Jan 2007
Posts: 294
Alphonso
Nice.

Black Zoom, Depth: 1000
Code:
                 FPU     SSE     AVX128  AVX256
2500k @ 4.4GHz   18.5    58.9    63.8    115.0
2500k @ 2.3GHz   9.7     30.8    33.3    60.0
Post 10 Apr 2011, 23:42
idle



Joined: 06 Jan 2011
Posts: 359
Location: Ukraine
idle
Are there any screenshots? (We have win32.)
Post 11 Apr 2011, 06:41
tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
Oh, good point. I forgot about win32. New version on the way...
Post 11 Apr 2011, 08:15
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Madis731
http://ark.intel.com/Compare.aspx?ids=52214,33909
I get different GHz numbers from Intel. I get that Alphonso did some clocking, but was the Wolfdale CPU also (under-) clocked?

My T9300 http://ark.intel.com/Product.aspx?id=33917 got:
Code:
                FPU     SSE
T9300 @ 2.5GHz  3.1     15.2
    

total blackness x1000
Post 16 Apr 2011, 16:32
Alphonso



Joined: 16 Jan 2007
Posts: 294
Alphonso
Just trying to show some correlation to tthsqe's clocks. Here are some results from a C2D P8400, which seems to scale better than my Sandy Bridge result.
Code:
                FPU     SSE
P8400 @ 3.0GHz  3.9     19.0
P8400 @ 2.0GHz  2.6     12.7    


I wonder why there is a ~4% difference with your T9300 (T9300 SSE 15.2/2.5*3=18.2). If it were really running at 2.4GHz (12x 200MHz), then 15.2/2.4*3=19.
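
The arithmetic above can be written out as a small sketch. It assumes the score scales linearly with core clock, and the function name `scale_score` is made up for illustration:

```python
def scale_score(score, measured_ghz, target_ghz):
    """Linearly rescale a benchmark score from one clock speed to another."""
    return score / measured_ghz * target_ghz

t9300_sse = 15.2          # Madis731's T9300 SSE result
p8400_sse_at_3ghz = 19.0  # P8400 SSE result at 3.0GHz

# Assumption 1: the T9300 really runs at its nominal 2.5GHz.
at_nominal = scale_score(t9300_sse, 2.5, 3.0)
# Assumption 2: it actually runs at 12 x 200MHz = 2.4GHz.
at_12x200 = scale_score(t9300_sse, 2.4, 3.0)

# Relative shortfall against the P8400 if the nominal clock is assumed.
gap = 1 - at_nominal / p8400_sse_at_3ghz

print(f"scaled from 2.5GHz: {at_nominal:.2f}")
print(f"scaled from 2.4GHz: {at_12x200:.2f}")
print(f"gap at 2.5GHz: {gap:.1%}")
```

If the 2.4GHz assumption holds, the two Core 2 results line up almost exactly, which is the point of the comparison.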
Post 18 Apr 2011, 01:40
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Madis731
Alphonso wrote:
I wonder why there is a ~4% difference with your T9300 (T9300 SSE 15.2/2.5*3=18.2). If it were really running at 2.4GHz (12x 200MHz), then 15.2/2.4*3=19.

This might be true, because this laptop (a Mitac T8222J) is 4+ years old and its last BIOS update was around 2007. CPU-Z did give me 'off' results (as I remember), and this Mandelbrot benchmark might just be the proof.

Another set of results, turbo was on 2-core/4-thread (nominal 3.2GHz):
Code:
                  FPU     SSE
i5-650 @ 3.33GHz  6.3     22.6 
    

_________________
My updated idol :D http://www.agner.org/optimize/
Post 18 Apr 2011, 08:36
bitRAKE



Joined: 21 Jul 2003
Posts: 2915
Location: [RSP+8*5]
bitRAKE
Code:
                        FPU     SSE
L5410 x2 @ 2.33GHz      12.04   59.45

There is a small rectangle in the lower right which isn't being updated.
Post 19 Apr 2011, 04:23
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Madis731
T9300@2.4GHz confirmed!

That small rectangle is 3x2 in size and left column (2 pixels) stays red, others stay black (unless updated with other colours). That is in FPU mode. When in SSE, the block gets larger.

I think it has got something to do with jnc / jnz (jnb / jne) differences.
Post 19 Apr 2011, 11:24
tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
Haha, I thought the bottom right corner would be easily overlooked; I never had a problem with it.
Actually, that behaviour there was completely intended and is not a bug. The reason is that each thread handles many (6-16) unpredictably-distributed points at once. When one of these points goes off-screen the whole thread terminates instead of trying to draw off-screen as this gives unpredictable/fatal results. The downside is that the points that were in mid-calculation get ignored.
But now that I have it drawing to memory, going a little over is not an issue, and this will be corrected when I post a 32-bit update.
Post 20 Apr 2011, 04:56
tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
Results from the new version:
Code:
       Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
ACTUAL CLOCK: 4000 MHz
           MAX    MIN
    FPU:  23.02  7.072
    SSE:  56.46  20.02
 AVX128:   58.1  26.38
 AVX256:  114.8  51.56
4FMA128:  -1.#IO  1.#IO
4FMA256:  -1.#IO  1.#IO
3FMA128:  -1.#IO  1.#IO
3FMA256:  -1.#IO  1.#IO    


Code:
Intel(R) Core(TM)2 Quad  CPU   Q8200  @ 2.33GHz
ACTUAL CLOCK: 2333 MHz
           MAX    MIN
    FPU:   6.03   3.34
    SSE:   28.6   19.9
 AVX128:  -1.#IO  1.#IO
 AVX256:  -1.#IO  1.#IO
4FMA128:  -1.#IO  1.#IO
4FMA256:  -1.#IO  1.#IO
3FMA128:  -1.#IO  1.#IO
3FMA256:  -1.#IO  1.#IO    
Post 06 May 2011, 00:28
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Madis731
Code:
Intel(R) Core(TM) i5 CPU         650  @ 3.20GHz
ACTUAL CLOCK: 3333 MHz
           MAX    MIN
    FPU:   1420  1.298 (MAX was about 18.2)
    SSE:  1.294e+005   12.9 (MAX was about 20 actually)
 AVX128:  -1.#IO  1.#IO
 AVX256:  -1.#IO  1.#IO
4FMA128:  -1.#IO  1.#IO
4FMA256:  -1.#IO  1.#IO
3FMA128:  -1.#IO  1.#IO
3FMA256:  -1.#IO  1.#IO
    

The results are less stable than with the previous version.
Post 06 May 2011, 06:59
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
Great work, tthsqe! I only found the thread now; I hadn't checked the board for months... Time for me to upgrade to an AVX CPU, seems to be worth it :)
Post 07 May 2011, 06:48
tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
What kind of AVX CPU are you thinking of? I think Bulldozer sounds exciting, but if I understand correctly, without fused multiply-add instructions, the FX8000 series' peak FPU output is only half that of Sandy Bridge. Basically, we have:
each pair of cores has access to two 128-bit FPU units
each 128-bit FPU can issue one instruction per clock (whether it be mul, add/sub or fmadd)
This is in contrast to Intel's FPU, which can issue a 256-bit mul and a 256-bit add/sub per clock (peak 8 double precision operations / clock).
Am I understanding correctly?
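
A rough sketch of that counting, taking the description above at face value. The helper name is hypothetical, and the chip-level totals assume four Bulldozer modules (eight cores) against a quad-core Sandy Bridge; neither figure has been checked against vendor documentation here:

```python
def peak_dp_ops_per_clock(units, width_bits, ops_per_instruction):
    """Peak double-precision operations per clock for a group of FP units."""
    doubles_per_register = width_bits // 64
    return units * doubles_per_register * ops_per_instruction

# Sandy Bridge core as described: one 256-bit mul plus one 256-bit add/sub per clock.
sandy_core = peak_dp_ops_per_clock(units=2, width_bits=256, ops_per_instruction=1)

# Bulldozer module (a pair of cores) as described: two 128-bit units, one instruction each.
bd_module_plain = peak_dp_ops_per_clock(units=2, width_bits=128, ops_per_instruction=1)
# A fused multiply-add counts as two operations per element.
bd_module_fma = peak_dp_ops_per_clock(units=2, width_bits=128, ops_per_instruction=2)

# Assumed chip sizes: 4 Sandy Bridge cores, 4 Bulldozer modules.
sandy_chip = 4 * sandy_core
bd_chip_plain = 4 * bd_module_plain
bd_chip_fma = 4 * bd_module_fma

print("Sandy Bridge, per core:", sandy_core)
print("Bulldozer module, no FMA:", bd_module_plain, "with FMA:", bd_module_fma)
print("Chip totals:", sandy_chip, bd_chip_plain, bd_chip_fma)
```

Under these assumptions the non-FMA Bulldozer total comes out at half the Sandy Bridge total, and the FMA total matches it.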
Post 07 May 2011, 17:12
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
Hm, for the moment I was just thinking about getting some cheap dual-core Sandy Bridge notebook to play around with... but maybe I'll really wait until Bulldozer comes out.

I'm really confused about the design of Bulldozer, especially the FPU cores including AVX and FMA. According to a quick Google search and the latest manuals from AMD, there should be FMA ->
http://support.amd.com/us/Processor_TechDocs/47414.pdf

But the instruction latencies in that guide are not very good overall; maybe I'm not judging it right though. Just another reason to get my old benchmark ready for a real comparison :) Intel seems to add FMA only when they go to 22nm.
Post 08 May 2011, 21:23
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Madis731
...and "3D transistors" :D
http://www.anandtech.com/print/4313

Post 09 May 2011, 15:43
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
Hi tthsqe,

I just looked a bit at your code. Unfortunately I couldn't test the AVX path myself. I see you use 64-bit Windows, so you have double the registers compared to my benchmark. There should be some room for optimization by using more registers to reduce dependencies. Could you try, in your AVX quadratic code, instead of
Code:
vmulpd ym15,ym1,ym10 
vmulpd ym14,ym0,ym0 
vmulpd ym1,ym1,ym1 
vmulpd ym15,ym15,ym0 
vsubpd ym0,ym14,ym1 
vaddpd ym14,ym14,ym1 
vaddpd ym1,ym15,m32[y0]
vaddpd ym0,ym0,m32[x0]
vmovapd m32[t],ym14
    

this one:
Code:
vmulpd ym15,ym1,ym10 
vmulpd ym14,ym0,ym0 
vmulpd ym1,ym1,ym1 
vmulpd ym15,ym15,ym0 
vsubpd ym13,ym14,ym1 
vaddpd ym12,ym14,ym1 
vaddpd ym0,ym13,m32[x0]
vaddpd ym1,ym15,m32[y0]
vmovapd m32[t],ym12
    

Hope I didn't f**k it up. I just used two of the spare registers (ym12/ym13) you mentioned in your ReadMe to try to get rid of the direct dependency. But maybe it doesn't help much; I remember a lot of trial and error with that kind of optimization, as what the core does sometimes doesn't seem too logical...
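
For anyone who wants to check the rewrite without an AVX machine, here is a scalar Python model of the two sequences. It is only a sketch: it assumes ym0/ym1 hold x/y, that ym10 holds the constant 2.0 (not stated in this thread), and that the value stored to t is x*x + y*y. The function names are made up.

```python
def step_original(x, y, x0, y0, two=2.0):
    """Scalar model of the original sequence (ym0 = x, ym1 = y)."""
    r15 = y * two      # vmulpd ym15,ym1,ym10
    r14 = x * x        # vmulpd ym14,ym0,ym0
    y = y * y          # vmulpd ym1,ym1,ym1
    r15 = r15 * x      # vmulpd ym15,ym15,ym0
    x = r14 - y        # vsubpd ym0,ym14,ym1
    r14 = r14 + y      # vaddpd ym14,ym14,ym1
    y = r15 + y0       # vaddpd ym1,ym15,m32[y0]
    x = x + x0         # vaddpd ym0,ym0,m32[x0]
    return x, y, r14   # r14 is what gets stored to t

def step_proposed(x, y, x0, y0, two=2.0):
    """Scalar model of the variant that uses the spare ym12/ym13."""
    r15 = y * two
    r14 = x * x
    yy = y * y
    r15 = r15 * x
    r13 = r14 - yy     # vsubpd ym13,ym14,ym1
    r12 = r14 + yy     # vaddpd ym12,ym14,ym1
    x = r13 + x0       # vaddpd ym0,ym13,m32[x0]
    y = r15 + y0       # vaddpd ym1,ym15,m32[y0]
    return x, y, r12   # r12 is what gets stored to t

x, y, x0, y0 = 0.3, -0.5, -0.75, 0.1
original = step_original(x, y, x0, y0)
proposed = step_proposed(x, y, x0, y0)
reference = complex(x, y) ** 2 + complex(x0, y0)

print("same result:", original == proposed)
print("matches z*z + c:", abs(complex(original[0], original[1]) - reference) < 1e-12)
```

Both sequences do the same arithmetic in the same order, so only the register allocation differs.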
Post 16 May 2011, 17:43
tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
Max performance decreased very slightly (but noticeably), by about 0.3%.
My guess is that it puts more pressure on the renamer.... same dependencies, just more register names

The code that really needs fixing is the cubic SSE implementation - it is horrible right now.
Post 17 May 2011, 06:35
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
...okay, I tried my luck with the cubic SSE. I have hopefully found a way (I haven't run it yet...) to get rid of two "movaps". Here's the version. I hope there's no mistake; I added formula comments so I don't get too confused.

I leave the trial and error of reordering the instructions to you. It looks like it's badly needed, but of course it may not be beneficial, like before...
Code:
movaps xm14,xm0         ; d14 = x
mulpd  xm14,xm0         ; d14 = x*x
movaps xm15,xm1         ; d15 = y
mulpd  xm15,xm1         ; d15 = y*y

movaps xm12,xm10        ; d12 = 3
mulpd  xm12,xm14        ; d12 = 3*x*x
movaps xm13,xm14        ; d13 = x*x
addpd  xm13,xm15        ; d13 = x*x + y*y
movapd m16[t],xm13      ; save d13

subpd  xm12,xm15        ; d12 = 3*x*x - y*y
mulpd  xm1,xm12         ; y_new = y * (- y*y + 3*x*x) = - y*y*y + 3*x*x*y
addpd  xm1,m16[y0]      ; y_new = y_new + y0

mulpd  xm15,xm10        ; d15   = 3*y*y
subpd  xm14,xm15        ; d14   = x*x - 3*y*y
mulpd  xm0,xm14         ; x_new = x * (  x*x - 3*y*y) =   x*x*x - 3*x*y*y
addpd  xm0,m16[x0]      ; x_new = x_new + x0
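
Since the code above was not run, here is a scalar Python check of the formula comments against z*z*z + c. It is a sketch only: it assumes xm0/xm1 hold x/y and xm10 holds 3.0, as the comments say, and the function name is made up.

```python
def cubic_step(x, y, x0, y0, three=3.0):
    """Scalar model of the cubic step, following the register comments."""
    d14 = x * x            # x*x
    d15 = y * y            # y*y
    d12 = three * d14      # 3*x*x
    d13 = d14 + d15        # x*x + y*y, the value saved to t
    d12 = d12 - d15        # 3*x*x - y*y
    y_new = y * d12 + y0   # 3*x*x*y - y*y*y + y0
    d15 = d15 * three      # 3*y*y
    d14 = d14 - d15        # x*x - 3*y*y
    x_new = x * d14 + x0   # x*x*x - 3*x*y*y + x0
    return x_new, y_new, d13

x, y, x0, y0 = 0.3, -0.5, -0.75, 0.1
z = complex(x, y)
expected = z * z * z + complex(x0, y0)
x_new, y_new, t = cubic_step(x, y, x0, y0)

print("real part matches:", abs(x_new - expected.real) < 1e-12)
print("imag part matches:", abs(y_new - expected.imag) < 1e-12)
print("t is |z|^2:", abs(t - abs(z) ** 2) < 1e-12)
```

This only checks the algebra. It says nothing about how the instruction order performs on real hardware.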
    
Post 17 May 2011, 19:56

Copyright © 1999-2020, Tomasz Grysztar. Also on YouTube, Twitter.
