flat assembler
Message board for the users of flat assembler.
fun with AVX
tthsqe 10 Apr 2011, 05:31
This is the new version.
The timer seems to be fixed now that the calls to QueryPerformanceCounter are spaced out. QueryPerformanceFrequency is also called every frame, in case it changes...?
Last edited by tthsqe on 06 May 2011, 17:41; edited 6 times in total
Alphonso 10 Apr 2011, 23:42
Nice.
Black Zoom, Depth: 1000
Code:
                    FPU    SSE    AVX128  AVX256
    2500k @ 4.4GHz  18.5   58.9   63.8    115.0
    2500k @ 2.3GHz   9.7   30.8   33.3     60.0
idle 11 Apr 2011, 06:41
Are there any screenshots? (We have win32.)
tthsqe 11 Apr 2011, 08:15
Oh, good point. I forgot about win32. New version on the way...
Madis731 16 Apr 2011, 16:32
http://ark.intel.com/Compare.aspx?ids=52214,33909
I get different GHz numbers from Intel. I get that Alphonso did some clocking, but was the Wolfdale CPU also (under-)clocked? My T9300 http://ark.intel.com/Product.aspx?id=33917 got:
Code:
                    FPU    SSE
    T9300 @ 2.5GHz  3.1    15.2
total blackness x1000
Alphonso 18 Apr 2011, 01:40
Just trying to show some correlation to tthsqe's clocks. Here are some results from a C2D P8400, which seems to scale better than my Sandy Bridge result.
Code:
                    FPU    SSE
    P8400 @ 3.0GHz  3.9    19.0
    P8400 @ 2.0GHz  2.6    12.7
I wonder why there is a ~4% difference with your T9300 (T9300 SSE 15.2/2.5*3=18.2). If it were really running at 2.4GHz (12x 200MHz) then 15.2/2.4*3=19.
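The scaling check above can be reproduced in a few lines of Python. This is only a sketch: the helper name `scale_score` is made up for illustration, and it assumes the benchmark score scales linearly with core clock, which is the same assumption the estimate in the post relies on.

```python
def scale_score(score, measured_ghz, target_ghz):
    """Rescale a benchmark score linearly from one clock speed to another."""
    return score / measured_ghz * target_ghz

# T9300 SSE result from the thread, projected to 3.0 GHz.
at_nominal = scale_score(15.2, 2.5, 3.0)  # if the CPU really ran at 2.5 GHz
at_actual = scale_score(15.2, 2.4, 3.0)   # if it ran at 2.4 GHz (12 x 200 MHz)
p8400_sse = 19.0                          # P8400 SSE result measured at 3.0 GHz

# Relative gap between the P8400 and the T9300 projected from 2.5 GHz.
gap_percent = (p8400_sse / at_nominal - 1) * 100

print(round(at_nominal, 2), round(at_actual, 2), round(gap_percent, 1))
```

Under the linear-scaling assumption, the 2.4 GHz projection lands on the P8400 figure, which is what makes the lower real clock a plausible explanation.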
Madis731 18 Apr 2011, 08:36
Alphonso wrote:
I wonder why the difference with your T9300 ~4%. (T9300 SSE 15.2/2.5*3=18.2). If it were really running at 2.4GHz (12x 200MHz) then 15.2/2.4*3=19.

This might be true, because this laptop (a Mitac T8222J) is 4+ years old and its last BIOS update was around 2007. CPU-Z did give me 'off' results (as I remember), and this Mandelbrot might just be the proof. Another set of results, with turbo on, 2-core/4-thread (nominal 3.2GHz):
Code:
                      FPU    SSE
    i5-650 @ 3.33GHz  6.3    22.6
bitRAKE 19 Apr 2011, 04:23
Code:
                        FPU     SSE
    L5410 x2 @ 2.33GHz  12.04   59.45
There is a small rectangle in the lower right which isn't being updated.
Madis731 19 Apr 2011, 11:24
T9300@2.4GHz confirmed!
That small rectangle is 3x2 in size; the left column (2 pixels) stays red and the others stay black (unless updated with other colours). That is in FPU mode. In SSE mode, the block gets larger. I think it has something to do with jnc / jnz (jnb / jne) differences.
tthsqe 20 Apr 2011, 04:56
Haha, I thought the bottom right corner would be easily overlooked; I never had a problem with it.
Actually, that behaviour there was completely intended and is not a bug. The reason is that each thread handles many (6-16) unpredictably-distributed points at once. When one of these points goes off-screen, the whole thread terminates instead of trying to draw off-screen, as this gives unpredictable/fatal results. The downside is that the points that were in mid-calculation get ignored. But now that I have it drawing to memory, going a little over is not an issue, and this will be corrected when I post a 32-bit update.
tthsqe 06 May 2011, 00:28
Results from new version:
Code:
    Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
    ACTUAL CLOCK: 4000 MHz
              MAX      MIN
    FPU:      23.02    7.072
    SSE:      56.46    20.02
    AVX128:   58.1     26.38
    AVX256:   114.8    51.56
    4FMA128:  -1.#IO   1.#IO
    4FMA256:  -1.#IO   1.#IO
    3FMA128:  -1.#IO   1.#IO
    3FMA256:  -1.#IO   1.#IO
Code:
    Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz
    ACTUAL CLOCK: 2333 MHz
              MAX      MIN
    FPU:      6.03     3.34
    SSE:      28.6     19.9
    AVX128:   -1.#IO   1.#IO
    AVX256:   -1.#IO   1.#IO
    4FMA128:  -1.#IO   1.#IO
    4FMA256:  -1.#IO   1.#IO
    3FMA128:  -1.#IO   1.#IO
    3FMA256:  -1.#IO   1.#IO
Madis731 06 May 2011, 06:59
Code:
    Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz
    ACTUAL CLOCK: 3333 MHz
              MAX          MIN
    FPU:      1420         1.298   (MAX was about 18.2)
    SSE:      1.294e+005   12.9    (MAX was about 20 actually)
    AVX128:   -1.#IO       1.#IO
    AVX256:   -1.#IO       1.#IO
    4FMA128:  -1.#IO       1.#IO
    4FMA256:  -1.#IO       1.#IO
    3FMA128:  -1.#IO       1.#IO
    3FMA256:  -1.#IO       1.#IO
The results are less stable than with the previous version.
Kuemmel 07 May 2011, 06:48
Great work, tthsqe! I only found the thread now, I hadn't checked the board for months... Time for me to upgrade to an AVX CPU, it seems to be worth it.
tthsqe 07 May 2011, 17:12
What kind of AVX CPU are you thinking of? I think Bulldozer sounds exciting, but if I understand correctly, without fused multiply-add instructions, the FX8000 series' peak FPU output is only half that of Sandy Bridge. Basically, we have:
- each pair of cores has access to two 128-bit FPU units
- each 128-bit FPU can issue one instruction per clock (whether it be mul, add/sub or fmadd)
This is in contrast to Intel's FPU, which can issue a 256-bit mul and a 256-bit add/sub per clock (peak 8 double-precision operations / clock). Am I understanding correctly?
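The peak-throughput comparison in the post above can be written out as a small Python calculation. It is a sketch of the post's own model, not a statement about the real hardware: the function name is hypothetical, the per-unit issue rates are taken from the post as stated, and the chip layouts (4 Sandy Bridge cores versus 4 Bulldozer modules, i.e. 8 cores) are assumptions added for the chip-level totals.

```python
def peak_dp_ops_per_clock(units, unit_bits, ops_per_lane=1):
    """Peak double-precision operations per clock for a group of FP units."""
    lanes = unit_bits // 64  # number of doubles in one vector register
    return units * lanes * ops_per_lane

# Sandy Bridge core: one 256-bit mul plus one 256-bit add/sub per clock.
sandy_core = peak_dp_ops_per_clock(units=2, unit_bits=256)

# Bulldozer module (a pair of cores): two 128-bit units, one instruction each.
bd_module_no_fma = peak_dp_ops_per_clock(units=2, unit_bits=128)
# With fmadd, each lane performs a multiply and an add per instruction.
bd_module_fma = peak_dp_ops_per_clock(units=2, unit_bits=128, ops_per_lane=2)

# Assumed layouts: 4 Sandy Bridge cores versus 4 Bulldozer modules (8 cores).
sandy_chip = 4 * sandy_core
bd_chip_no_fma = 4 * bd_module_no_fma
bd_chip_fma = 4 * bd_module_fma

print("per core / per module:", sandy_core, bd_module_no_fma, bd_module_fma)
print("per chip:", sandy_chip, bd_chip_no_fma, bd_chip_fma)
```

In this model the "half of Sandy Bridge" figure applies only to code without fused multiply-add; with fmadd the two chip-level peaks come out equal.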
Kuemmel 08 May 2011, 21:23
Hm, for the moment I was just thinking about getting a cheap dual-core Sandy Bridge notebook to play around with... but maybe I'll really wait until Bulldozer comes out.
I'm really confused about the Bulldozer design, especially the FPU cores including AVX and FMA. According to a quick Google search and the latest manuals from AMD, there should be FMA -> http://support.amd.com/us/Processor_TechDocs/47414.pdf But the instruction latencies in that guide are not very good overall, though maybe I'm not judging it right. Just another reason to get my old benchmark ready for a real comparison. Intel seems to be adding FMA only when they go to 22nm.
Madis731 09 May 2011, 15:43
...and "3D transistors"
http://www.anandtech.com/print/4313
Kuemmel 16 May 2011, 17:43
Hi tthsqe,
I just looked a bit at your code. Unfortunately I couldn't test the AVX path myself. I see you use 64-bit Windows, so you have twice as many registers as my benchmark, so there should be some room for optimization by using more registers to reduce dependencies. Could you try this on your AVX quadratic code? Instead of
Code:
    vmulpd ym15,ym1,ym10
    vmulpd ym14,ym0,ym0
    vmulpd ym1,ym1,ym1
    vmulpd ym15,ym15,ym0
    vsubpd ym0,ym14,ym1
    vaddpd ym14,ym14,ym1
    vaddpd ym1,ym15,m32[y0]
    vaddpd ym0,ym0,m32[x0]
    vmovapd m32[t],ym14
this one:
Code:
    vmulpd ym15,ym1,ym10
    vmulpd ym14,ym0,ym0
    vmulpd ym1,ym1,ym1
    vmulpd ym15,ym15,ym0
    vsubpd ym13,ym14,ym1
    vaddpd ym12,ym14,ym1
    vaddpd ym0,ym13,m32[x0]
    vaddpd ym1,ym15,m32[y0]
    vmovapd m32[t],ym12
Hope I didn't f**k it up. I just used two of the spare registers (ym12/ym13) that you mentioned in your ReadMe to try to get rid of the direct dependency. But maybe it doesn't help too much; I remember a lot of trial and error with that kind of optimization, as what the core does sometimes doesn't seem too logical...
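One way to check that the two sequences above compute the same thing is to replay them on a single lane in Python. This is a sketch under assumptions: the function names are invented, plain floats stand in for one lane of each 256-bit register, and `ym10` is assumed to hold the constant 2.0 (the thread does not say what it holds in this kernel).

```python
def step_original(ym0, ym1, x0, y0, ym10=2.0):
    """Replay of the first sequence; ym0 = x, ym1 = y on entry."""
    ym15 = ym1 * ym10    # vmulpd ym15,ym1,ym10   -> 2*y (ym10 assumed 2.0)
    ym14 = ym0 * ym0     # vmulpd ym14,ym0,ym0    -> x*x
    ym1 = ym1 * ym1      # vmulpd ym1,ym1,ym1     -> y*y
    ym15 = ym15 * ym0    # vmulpd ym15,ym15,ym0   -> 2*x*y
    ym0 = ym14 - ym1     # vsubpd ym0,ym14,ym1    -> x*x - y*y
    ym14 = ym14 + ym1    # vaddpd ym14,ym14,ym1   -> x*x + y*y
    ym1 = ym15 + y0      # vaddpd ym1,ym15,m32[y0]
    ym0 = ym0 + x0       # vaddpd ym0,ym0,m32[x0]
    return ym0, ym1, ym14  # new x, new y, value stored to [t]

def step_variant(ym0, ym1, x0, y0, ym10=2.0):
    """Replay of the second sequence, which uses ym12/ym13 as temporaries."""
    ym15 = ym1 * ym10
    ym14 = ym0 * ym0
    ym1 = ym1 * ym1
    ym15 = ym15 * ym0
    ym13 = ym14 - ym1    # vsubpd ym13,ym14,ym1
    ym12 = ym14 + ym1    # vaddpd ym12,ym14,ym1
    ym0 = ym13 + x0      # vaddpd ym0,ym13,m32[x0]
    ym1 = ym15 + y0      # vaddpd ym1,ym15,m32[y0]
    return ym0, ym1, ym12

x, y, x0, y0 = 0.3, -0.4, -0.75, 0.1
a = step_original(x, y, x0, y0)
b = step_variant(x, y, x0, y0)
z = complex(x, y) ** 2 + complex(x0, y0)  # reference: one z -> z*z + c step
print(a, b, z)
```

Both replays perform the same arithmetic in the same order, so they should agree exactly. Whether the extra register names help or hurt speed is a separate question that only a measurement on real hardware can answer.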
tthsqe 17 May 2011, 06:35
Max performance decreased very slightly (but noticeably), by about 0.3%.
My guess is that it puts more pressure on the renamer... same dependencies, just more register names. The code that really needs fixing is the cubic SSE implementation - it is horrible right now.
Kuemmel 17 May 2011, 19:56
...okay, I tried my luck with the cubic SSE. I hopefully found a way (I haven't run it yet...) to get rid of two "movaps". Here's the version. I hope there's no mistake; I added formula comments so I don't get too confused.
I leave the trial and error of reordering the instructions to you. It looks like it's badly needed, but of course it may not be beneficial, like before...
Code:
    movaps xm14,xm0       ; d14 = x
    mulpd  xm14,xm0       ; d14 = x*x
    movaps xm15,xm1       ; d15 = y
    mulpd  xm15,xm1       ; d15 = y*y
    movaps xm12,xm10      ; d12 = 3
    mulpd  xm12,xm14      ; d12 = 3*x*x
    movaps xm13,xm14      ; d13 = x*x
    addpd  xm13,xm15      ; d13 = x*x + y*y
    movapd m16[t],xm13    ; save d13
    subpd  xm12,xm15      ; d12 = 3*x*x - y*y
    mulpd  xm1,xm12       ; y_new = y * (- y*y + 3*x*x) = - y*y*y + 3*x*x*y
    addpd  xm1,m16[y0]
    mulpd  xm15,xm10      ; d15 = 3*y*y
    subpd  xm14,xm15      ; d14 = x*x - 3*y*y
    mulpd  xm0,xm14       ; x_new = x * ( x*x - 3*y*y) = x*x*x - 3*x*y*y
    addpd  xm0,m16[x0]
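The formula comments in the cubic code above can be checked against complex arithmetic with a short Python replay. This is only a sketch: the function name `cubic_step` is hypothetical, plain floats stand in for one lane of each SSE register, and the constant 3.0 in `xm10` is taken from the "d12 = 3" comment in the post.

```python
def cubic_step(x, y, x0, y0, three=3.0):
    """Replay one z -> z**3 + c step using the register plan from the post."""
    d14 = x * x              # d14 = x*x
    d15 = y * y              # d15 = y*y
    d12 = three * d14        # d12 = 3*x*x
    t = d14 + d15            # d13 = x*x + y*y, stored to [t]
    d12 = d12 - d15          # d12 = 3*x*x - y*y
    y_new = y * d12 + y0     # y*(3*x*x - y*y) + y0
    d15 = d15 * three        # d15 = 3*y*y
    d14 = d14 - d15          # d14 = x*x - 3*y*y
    x_new = x * d14 + x0     # x*(x*x - 3*y*y) + x0
    return x_new, y_new, t

x, y, x0, y0 = 0.3, -0.4, -0.75, 0.1
x_new, y_new, t = cubic_step(x, y, x0, y0)
ref = complex(x, y) ** 3 + complex(x0, y0)  # reference from complex arithmetic
print(x_new, y_new, t, ref)
```

The replay only confirms the algebra; it says nothing about how well the instruction order schedules on a particular core.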
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.