flat assembler
Message board for the users of flat assembler.
Projects and Ideas > fun with AVX
Kuemmel 18 May 2011, 19:49
Hm, I remember that MOVAPS can sometimes be faster than MOVAPD (maybe only on some types of Intel or AMD CPU). I think it's trial and error again... I remember there might be something about it in the AMD or Intel optimization manuals...
Agner also has something on that in chapter 13.2 here: http://www.agner.org/optimize/optimizing_assembly.pdf Maybe even MOVDQA can be beneficial sometimes...
tthsqe 23 May 2011, 00:45
The new cubic code is about 9% faster. As for the actual arrangement of the instructions in the main loop, I think nothing will be faster than the Intel C compiler (sorry, will test), although I think the compiler has no chance of vectorizing as well as we (humans) can.
The next step is to decrease the size of the instructions used in the main loop. vmovaps and vmovapd are the same length, but we can also use something like
Code:
    ...         ...
    rbp-4*1     local var -1
    rbp+4*0     local var 0
    rbp+4*1     local var 1
    ...         ...
    rsp-4*1     local var n-1
    rsp+4*0     local var n
    rsp+4*1     local var n+1
    ..          ..
instead of
Code:
    rbp+4*0     local var 0
    rbp+4*1     local var 1
    ...         ...
    rbp+4*n     local var n
to avoid those darn 4-byte displacements. But right now, I have interrupted development to finish implementing the instruction set extensions in fdbg. It should be done very soon!!
Edit: It looks like it would be shorter if the lower eight registers were used instead of the upper ones for calculation space. Check it out:
Code:
    0x000000029A: vmovapd ym0,ym1 ;C5FD28C1
    0x000000029E: vmovapd xm0,xm1 ;C5F928C1
    0x00000002A2: vmovapd ym0,ym15 ;C4C17D28C7
    0x00000002A7: vmovapd xm0,xm15 ;C4C17928C7
    0x00000002AC: vmovapd ym15,ym1 ;C57D28F9
    0x00000002B0: vmovapd xm15,xm1 ;C57928F9
    0x00000002B4: vmovapd ym15,ym15 ;C4417D28FF
    0x00000002B9: vmovapd xm15,xm15 ;C4417928FF
    0x00000002BE: vmovapd ym0,m32[r0] ;C5FD280424
    0x00000002C3: vmovapd xm0,m16[r0] ;C5F9280424
    0x00000002C8: vmovapd ym15,m32[r0] ;C57D283C24
    0x00000002CD: vmovapd xm15,m16[r0] ;C579283C24
    0x00000002D2: vmovapd ym0,m32[r15] ;C4C17D2807
    0x00000002D7: vmovapd xm0,m16[r15] ;C4C1792807
    0x00000002DC: vmovapd ym15,m32[r15] ;C4417D283F
    0x00000002E1: vmovapd xm15,m16[r15] ;C44179283F
    0x00000002E6: vmovaps ym0,ym1 ;C5FC28C1
    0x00000002EA: vmovaps xm0,xm1 ;C5F828C1
    0x00000002EE: vmovaps ym0,ym15 ;C4C17C28C7
    0x00000002F3: vmovaps xm0,xm15 ;C4C17828C7
    0x00000002F8: vmovaps ym15,ym1 ;C57C28F9
    0x00000002FC: vmovaps xm15,xm1 ;C57828F9
    0x0000000300: vmovaps ym15,ym15 ;C4417C28FF
    0x0000000305: vmovaps xm15,xm15 ;C4417828FF
    0x000000030A: vmovaps ym0,m32[r0] ;C5FC280424
    0x000000030F: vmovaps xm0,m16[r0] ;C5F8280424
    0x0000000314: vmovaps ym15,m32[r0] ;C57C283C24
    0x0000000319: vmovaps xm15,m16[r0] ;C578283C24
    0x000000031E: vmovaps ym0,m32[r15] ;C4C17C2807
    0x0000000323: vmovaps xm0,m16[r15] ;C4C1782807
    0x0000000328: vmovaps ym15,m32[r15] ;C4417C283F
    0x000000032D: vmovaps xm15,m16[r15] ;C44178283F
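The byte counts in the listing above follow from two encoding rules, which the small Python model below tries to capture. It is only a sketch: the function names are made up for illustration, and it covers only the load form (opcode 28) of vmovaps/vmovapd with no index register scaling. The first rule is that the 2-byte VEX prefix (C5) cannot express an extended register in the ModRM.rm, base, or index position, so those cases need the 3-byte prefix (C4). The second is that a displacement costs 0, 1, or 4 bytes depending on whether it fits in a signed byte.

```python
def vex_prefix_len(rm_reg, index_reg=None):
    """VEX prefix size for a 0F-map instruction with VEX.W = 0.

    The 2-byte form only carries the R bit (ModRM.reg extension), so a
    register numbered 8..15 in the rm/base or index slot forces 3 bytes.
    """
    needs_b = rm_reg >= 8
    needs_x = index_reg is not None and index_reg >= 8
    return 3 if (needs_b or needs_x) else 2


def disp_len(base, disp):
    """Displacement size in bytes for [base + disp]."""
    # rbp/r13 (low three bits = 101) cannot be encoded without a displacement.
    if disp == 0 and (base & 7) != 5:
        return 0
    return 1 if -128 <= disp <= 127 else 4


def reg_reg_len(dst, src):
    """Length of 'vmovap{s,d} dst, src' (load form: src sits in ModRM.rm)."""
    # dst goes in ModRM.reg, which both prefix forms can extend,
    # so it never changes the length.
    return vex_prefix_len(src) + 2  # prefix + opcode + ModRM


def load_len(dst, base, disp=0):
    """Length of 'vmovap{s,d} dst, [base + disp]'."""
    # rsp/r12 (low three bits = 100) as a base always needs a SIB byte.
    sib = 1 if (base & 7) == 4 else 0
    return vex_prefix_len(base) + 2 + sib + disp_len(base, disp)


# Register numbers: 4 = rsp, 5 = rbp, 15 = r15 / ymm15.
print(reg_reg_len(0, 15), reg_reg_len(15, 1))   # source high vs. destination high
print(load_len(0, 5, 4 * 31), load_len(0, 5, 4 * 32))  # disp8 vs. disp32
```

Under this model, only a high register in the source or base position costs the extra byte, which matches the 4-byte and 5-byte rows of the listing (the `m32[r0]` rows carry ModRM 04 plus SIB 24, which is why they are 5 bytes). It also shows why pointing rbp at the middle of the locals helps: dword locals -32 through 31 fit in a 1-byte displacement, against only 0 through 31 when rbp points at the first one.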
Kuemmel 23 May 2011, 22:12
...interesting thing with the length of the MOVAPS/MOVAPD. If I read your code correctly, you could replace your 4 calculation registers with 4 of the 8 iteration data registers. That should be a bit shorter in the end. But does shorter really always mean faster?
I managed to assemble your code myself and tried a lot of reordering... really strange, not much helps; I guess the CPU does it internally by itself... just moving one of the addpd in the lower part looks beneficial (according to your GIPS counter it seems a bit faster), like this:
Code:
    subpd xm12,xm15
    mulpd xm1,xm12
    mulpd xm15,xm10
    subpd xm14,xm15
    mulpd xm0,xm14
    addpd xm1,m16[y0]
    addpd xm0,m16[x0]
tthsqe 24 May 2011, 02:53
I calculated the efficiency on a 2600K as:
(max gflops obtained) / (peak gflops assuming a packed mul and packed add every cycle)
To get the max gflops, you will have to:
1. set AA to 1x
2. zoom in on the black portion
3. since the black space calculation is actually optimized, you will have to trigger a full redraw (e.g. by altering the calculation depth)
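As a rough illustration of the ratio above, here is a small Python sketch. The function names are hypothetical, and the hardware figures are assumptions that the post does not state: 4 cores, the 3.4 GHz base clock (turbo would raise the peak), and 4 doubles per 256-bit register.

```python
def peak_gflops(cores, ghz, lanes):
    """Peak GFLOPS assuming one packed mul and one packed add per cycle."""
    # Each lane contributes 2 flops per cycle (one multiply, one add).
    return cores * ghz * lanes * 2


def efficiency(measured_gflops, cores, ghz, lanes):
    """Fraction of the theoretical peak that was actually reached."""
    return measured_gflops / peak_gflops(cores, ghz, lanes)


# Assumed figures for a 2600K: 4 cores, 3.4 GHz, 4 double-precision lanes.
PEAK_2600K = peak_gflops(cores=4, ghz=3.4, lanes=4)
print(round(PEAK_2600K, 1))
```

The measured value has to come from the program's own counter under the three conditions listed above; the sketch deliberately does not supply one.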
tthsqe 25 Jun 2011, 08:29
A rough multiprecision version
This one goes up to 5 dwords (48 decimal digits)
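The "48 decimal digits" figure can be checked with a quick calculation. This sketch assumes all 5 × 32 = 160 bits count toward precision; the function name is made up for illustration.

```python
import math


def guaranteed_decimal_digits(dwords, bits_per_dword=32):
    """Decimal digits that a value of the given bit width can always hold."""
    bits = dwords * bits_per_dword
    # Each bit is worth log10(2), about 0.301, decimal digits.
    return math.floor(bits * math.log10(2))


DIGITS_5_DWORDS = guaranteed_decimal_digits(5)
print(DIGITS_5_DWORDS)
```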
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.