flat assembler
Message board for the users of flat assembler.

fun with AVX

Madis731 18 May 2011, 05:46
I know that movaps is shorter, but it is recommended to use the same data type throughout sequential code. I know there's a performance penalty when moving between INT <=> FLOAT, but I don't know if it matters between SINGLE <=> DOUBLE.
Kuemmel 18 May 2011, 19:49
Hm, I remember that MOVAPS can sometimes be faster than MOVAPD (maybe only on some types of Intel or AMD CPUs). I think it's trial and error again... I remember there might be something about it in the AMD or Intel optimization manuals...

Agner also has something on that in chapter 13.2 here:
http://www.agner.org/optimize/optimizing_assembly.pdf

Maybe even MOVDQA can be beneficial sometimes...
tthsqe 23 May 2011, 00:45
The new cubic code is about 9% faster. As for the actual arrangement of the instructions in the main loop, I think nothing will be faster than the Intel C compiler (sorry, will test), although I think the compiler has no chance of vectorizing as well as we (humans) can.
The next step is to decrease the size of the instructions used in the main loop. vmovaps and vmovapd are the same length, but we can also use something like
Code:
...
...
rbp-4*1          local var -1
rbp+4*0         local var 0
rbp+4*1         local var 1
...
...
rsp-4*1          local var n-1
rsp+4*0         local var n
rsp+4*1         local var n+1
..
..    

instead of
Code:
rbp+4*0         local var 0
rbp+4*1         local var 1
...
...
rbp+4*n         local var n    

to avoid those darn 4-byte displacements.
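
A rough sketch of why the centered layout helps, written in Python rather than fasm so it can be run as a quick check. It is not part of the original code, and `disp_bytes` and `slots_with_disp8` are hypothetical helper names. It relies on the x86-64 rule that an offset fitting in a signed byte (-128..127) is encoded in 1 byte, while anything larger takes 4 bytes.

```python
def disp_bytes(offset, base="rbp"):
    """Bytes of displacement needed to encode [base + offset]."""
    # rbp and r13 as a base cannot be encoded without a displacement,
    # so even offset 0 costs one byte for them.
    if offset == 0 and base not in ("rbp", "r13"):
        return 0
    if -128 <= offset <= 127:
        return 1      # fits in a signed byte (disp8)
    return 4          # needs the full 4-byte form (disp32)


def slots_with_disp8(first_offset, slot_size=4, count=1024):
    """Count consecutive slots, starting at first_offset, that avoid a 4-byte displacement."""
    n = 0
    for i in range(count):
        if disp_bytes(first_offset + i * slot_size) > 1:
            break
        n += 1
    return n


if __name__ == "__main__":
    print("locals only above rbp:  ", slots_with_disp8(0))
    print("locals on both sides:   ", slots_with_disp8(-128))
```

With 4-byte slots, counting only upward from the base register reaches 32 locals with a 1-byte offset, while using both sides of it reaches 64.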

But right now, I have interrupted development to finish implementing the instruction set extensions in fdbg. It should be done very soon!!


Edit: It looks like it would be shorter if the lower eight registers were used instead of the upper ones for calculation space. Check it out:
Code:
0x000000029A:         vmovapd  ym0,ym1                                        ;C5FD28C1
0x000000029E:         vmovapd  xm0,xm1                                        ;C5F928C1
0x00000002A2:         vmovapd  ym0,ym15                                       ;C4C17D28C7
0x00000002A7:         vmovapd  xm0,xm15                                       ;C4C17928C7
0x00000002AC:         vmovapd  ym15,ym1                                       ;C57D28F9
0x00000002B0:         vmovapd  xm15,xm1                                       ;C57928F9
0x00000002B4:         vmovapd  ym15,ym15                                      ;C4417D28FF
0x00000002B9:         vmovapd  xm15,xm15                                      ;C4417928FF
0x00000002BE:         vmovapd  ym0,m32[r0]                                    ;C5FD280424
0x00000002C3:         vmovapd  xm0,m16[r0]                                    ;C5F9280424
0x00000002C8:         vmovapd  ym15,m32[r0]                                   ;C57D283C24
0x00000002CD:         vmovapd  xm15,m16[r0]                                   ;C579283C24
0x00000002D2:         vmovapd  ym0,m32[r15]                                   ;C4C17D2807
0x00000002D7:         vmovapd  xm0,m16[r15]                                   ;C4C1792807
0x00000002DC:         vmovapd  ym15,m32[r15]                                  ;C4417D283F
0x00000002E1:         vmovapd  xm15,m16[r15]                                  ;C44179283F
0x00000002E6:         vmovaps  ym0,ym1                                        ;C5FC28C1
0x00000002EA:         vmovaps  xm0,xm1                                        ;C5F828C1
0x00000002EE:         vmovaps  ym0,ym15                                       ;C4C17C28C7
0x00000002F3:         vmovaps  xm0,xm15                                       ;C4C17828C7
0x00000002F8:         vmovaps  ym15,ym1                                       ;C57C28F9
0x00000002FC:         vmovaps  xm15,xm1                                       ;C57828F9
0x0000000300:         vmovaps  ym15,ym15                                      ;C4417C28FF
0x0000000305:         vmovaps  xm15,xm15                                      ;C4417828FF
0x000000030A:         vmovaps  ym0,m32[r0]                                    ;C5FC280424
0x000000030F:         vmovaps  xm0,m16[r0]                                    ;C5F8280424
0x0000000314:         vmovaps  ym15,m32[r0]                                   ;C57C283C24
0x0000000319:         vmovaps  xm15,m16[r0]                                   ;C578283C24
0x000000031E:         vmovaps  ym0,m32[r15]                                   ;C4C17C2807
0x0000000323:         vmovaps  xm0,m16[r15]                                   ;C4C1782807
0x0000000328:         vmovaps  ym15,m32[r15]                                  ;C4417C283F
0x000000032D:         vmovaps  xm15,m16[r15]                                  ;C44178283F    
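
The pattern in the listing follows from the VEX prefix layout: the 2-byte form (C5) can only extend the ModRM.reg field, so a 3-byte prefix (C4) is needed whenever the ModRM.rm operand (or the base register of a memory operand) is one of registers 8..15. Below is a minimal Python sketch that reproduces the register-to-register lines above. `encode_vmovap_rr` is a hypothetical name, and it only covers the load form (opcode 0F 28) of vmovapd/vmovaps.

```python
def encode_vmovap_rr(dst, src, ymm=True, double=True):
    """Encode 'vmovapd/vmovaps dst, src' (register to register, opcode 0F 28 /r)."""
    if not (0 <= dst <= 15 and 0 <= src <= 15):
        raise ValueError("register numbers must be 0..15")
    r_inv = 0 if dst >= 8 else 1      # inverted extension bit for ModRM.reg
    b_inv = 0 if src >= 8 else 1      # inverted extension bit for ModRM.rm
    l_bit = 1 if ymm else 0           # vector length: 1 = 256-bit, 0 = 128-bit
    pp = 0b01 if double else 0b00     # implied 66 prefix for the pd variant
    tail = (0b1111 << 3) | (l_bit << 2) | pp    # vvvv = 1111 (unused), L, pp
    modrm = 0xC0 | ((dst & 7) << 3) | (src & 7)
    if b_inv:
        # The rm register is 0..7, so the short C5 prefix is enough.
        prefix = bytes([0xC5, (r_inv << 7) | tail])
    else:
        # The rm register is 8..15, which only the C4 prefix can express.
        # Byte 1: inverted R, inverted X (unused, so 1), inverted B, map 00001 (0F).
        prefix = bytes([0xC4, (r_inv << 7) | (1 << 6) | (b_inv << 5) | 0b00001, tail])
    return prefix + bytes([0x28, modrm])


if __name__ == "__main__":
    for dst, src in [(0, 1), (0, 15), (15, 1), (15, 15)]:
        print(dst, src, encode_vmovap_rr(dst, src).hex().upper())
```

So only the source (rm) side costs the extra byte here; a destination in 8..15 still fits the short prefix, which matches the ym15,ym1 lines in the listing.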
Kuemmel 23 May 2011, 22:12
...interesting thing with the length of the MOVAPS/APD. If I read your code correctly, you could replace your 4 calculation registers with 4 of the 8 iteration data registers. That should be a bit shorter in the end. But does shorter really always mean faster?

I managed to assemble your code myself and tried a lot of reordering... really strange, not much helps; I guess the CPU does the reordering internally by itself. Just moving one of the addpd instructions in the lower part looks beneficial (according to your GIPS counter it seems a bit faster), like this:
Code:
subpd  xm12,xm15       
mulpd  xm1,xm12             
mulpd  xm15,xm10    
subpd  xm14,xm15       
mulpd  xm0,xm14       
addpd  xm1,m16[y0]
addpd  xm0,m16[x0]
    
How did you calculate the efficiency in the README table anyway?
tthsqe 24 May 2011, 02:53
I calculated the efficiency on a 2600K as:
(max gflops obtained) / (peak gflops assuming a packed mul and packed add every cycle)

To get the max gflops, you will have to:
1. set AA to 1x
2. zoom in on the black portion
3. since the black space calculation is actually optimized, you will have to trigger a full redraw (e.g. by altering the calculation depth)
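
The efficiency ratio above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the inputs in the example (3.4 GHz clock, 4 cores, 4 doubles per 256-bit operation) are assumptions about the 2600K setup, not figures taken from the README. The measured value is a placeholder, not a real result.

```python
def peak_gflops(clock_ghz, cores, lanes, ops_per_cycle=2):
    """Theoretical peak, assuming one packed mul and one packed add per cycle per core."""
    return clock_ghz * cores * lanes * ops_per_cycle


def efficiency(measured_gflops, peak):
    """Fraction of the theoretical peak that was actually reached."""
    return measured_gflops / peak


if __name__ == "__main__":
    # Assumed machine parameters; adjust to the CPU actually being measured.
    peak = peak_gflops(clock_ghz=3.4, cores=4, lanes=4)
    measured = 50.0   # placeholder value, replace with the program's reported max
    print(f"peak = {peak:.1f} GFLOPS, efficiency = {efficiency(measured, peak):.1%}")
```

Turbo clocks would change the denominator, so the result depends on which clock speed is assumed for the peak.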
tthsqe 25 Jun 2011, 08:29
A rough multiprecision version
This one goes up to 5 dwords (48 decimal digits)
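
As a quick check of the digit count, here is a small Python sketch (`decimal_digits` is a hypothetical helper; it counts all bits of the number and ignores how the program splits them between sign, integer and fraction parts):

```python
import math


def decimal_digits(dwords, bits_per_dword=32):
    """Whole decimal digits representable in the given number of dwords."""
    bits = dwords * bits_per_dword
    return math.floor(bits * math.log10(2))


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, "dwords ->", decimal_digits(n), "decimal digits")
```

5 dwords are 160 bits, and 160 * log10(2) is about 48.2, which gives the 48 digits quoted above.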


Attachment: screen.jpg (distance between pixels is about 2e-41)
Attachment: MultiQuadratic_Signed.zip


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.