Fast SIMD lookup (four DWORDs looked up from DWORD location)

Index > Main > Fast SIMD lookup (four DWORDs looked up from DWORD location)

Author

Thread

Madis731

Joined: 25 Sep 2003
Posts: 2138
Location: Estonia

Madis731 01 Aug 2008, 22:29

This is one of those threads that makes people think Smile

This time I didn't give up until I reached the absolute minimum. AND it uses only 4 GPRs and 2 XMMs. I won't post you the "wrong" code, but lets just say that I started with PSHUFD and it wasn't helping. From 32 clocks (20uops) to 17 clocks (14uops) down to...
you'll see if you read the code Very Happy

Why does anyone need it? Well, let the reader decide, but I think one hint is my other started topic - Perlin Noise (where lookup is accidentaly needed).

Code:

;In: XMM0 holds four DWORD values
;Purpose: This algo will find a new value for every DWORD from a LUT named UserData
;Out: XMM0 holds four LUT DWORD values
;#######################
;All measurements taken on Q6600 Core 2 with Agner's test program
;64-bit code only! (32-bit would take twice as long without movq r64,xmm)
        movq    rax,xmm0                      
        movhlps xmm1,xmm0                     
        movq    rbx,xmm1                      
        mov     ecx,eax                       
        mov     edx,ebx                       
        shr     rax,32                        
        shr     rbx,32                        
        mov     eax,[UserData+rax*4]          
        mov     ebx,[UserData+rbx*4]          
        mov     ecx,[UserData+rcx*4]          
        mov     edx,[UserData+rdx*4]          
        shl     rax,32                        
        shl     rbx,32                        
        add     rax,rcx                       
        add     rbx,rdx                       
        movq    xmm1,rbx                      
        movq    xmm0,rax                      
        movlhps xmm0,xmm1; //7 clocks (18uops) Core 2
; Think about 0.38888 CPI!!!

This time r22 nor revolution will beat me... I hope

_________________
My updated idol Very Happy

http://www.agner.org/optimize/

01 Aug 2008, 22:29

r22

Joined: 27 Dec 2004
Posts: 805

r22 02 Aug 2008, 02:22

I have 2 tweak suggestions.

Code:

        movq    rax,xmm0                      
        movhlps xmm1,xmm0                     
        movq    rbx,xmm1

Code:

        movhlps xmm1,xmm0                     
        movq    rax,xmm0                      
        movq    rbx,xmm1

Avoids the WRITE then READ on XMM1

Code:

        shr     rax,32                        
        shr     rbx,32                        
        mov     eax,[UserData+rax*4]          
        mov     ebx,[UserData+rbx*4]          
        mov     ecx,[UserData+rcx*4]          
        mov     edx,[UserData+rdx*4]

Code:

        mov   r11,UserData
        shr     rax,32                        
        shr     rbx,32                        
        mov     eax,[r11+rax*4]          
        mov     ebx,[r11+rbx*4]          
        mov     ecx,[r11+rcx*4]          
        mov     edx,[r11+rdx*4]

Saves 5 bytes, might allow the LUT mov's to execute quicker.

EDIT possibly 3 suggestions

Code:

        add     rax,rcx                       
        add     rbx,rdx

Code:

        or     rax,rcx                       
        add     rbx,rdx 
;;     add rax,rcx
;;     or rbx,rdx

Might allow for better uop issuing

02 Aug 2008, 02:22

asmfan

Joined: 11 Aug 2006
Posts: 392
Location: Russian

asmfan 02 Aug 2008, 08:19

What about using pshufd /shufps/? Different execution units - int vs float.

02 Aug 2008, 08:19

Madis731

Joined: 25 Sep 2003
Posts: 2138
Location: Estonia

Madis731 03 Aug 2008, 14:53

All sugestions taken into account.

EDIT: 7.8 Register read stalls in Agner Fog's manual tells about the problem using mov reg1,[reg2+reg1*4] method. These 4 are all different registers that get modified only until the instruction has finished.

Code:

        movhlps xmm1,xmm0
        nop
        movq    rax,xmm0
        nop
        nop
        movq    rbx,xmm1 ;This is definitely at least the 2nd clock if not 3rd
        mov     ecx,eax
        mov     edx,ebx
        nop
        nop
        shr     rax,32
        nop
        nop
        shr     rbx,32
        mov     eax,[UserData+rax*4]
        mov     ebx,[UserData+rbx*4]
        mov     ecx,[UserData+rcx*4]
        mov     edx,[UserData+rdx*4]
        shl     rax,32
        shl     rbx,32
        add     rax,rcx
        add     rbx,rdx
        movq    xmm1,rbx
        movq    xmm0,rax
        movlhps xmm0,xmm1; //7 clocks (25uops)

And here we are - a perfect sample of Core 2 capable of producting 0.28CPI or 3.57IPC Smile

These nops are good if you want to add some functions.
Btw this is NOT the only way to place nops. You can easily change the order of the first two lines etc.

EDIT: I just discovered this. Pretty amazing. Too bad, that its linux, 1) My brain is not compatible with it (I haven't taught myself linux) 2) I can't afford the time installing linux just for that program Razz

EDIT2: I just checked this code on my Core 2 45nm (T9300) and it said 6 clocks Smile

I will investigate more.

EDIT3: Okay - that was too unreal. I filled it with nops and it said 6clk/25uops (remember 4instr./clk is the max). What happened, was last movlhps not retiring properly and I managed to get to the bottom of this by adding MOVDQA xmm2,xmm0 to the end. This instruction made sure that the previous one must finish writing to xmm0. 7 clocks that is then.

_________________
My updated idol Very Happy

http://www.agner.org/optimize/

03 Aug 2008, 14:53

bitRAKE

Joined: 21 Jul 2003
Posts: 4527
Location: vpcmpistri

bitRAKE 04 Aug 2008, 04:49

Madis731 wrote:

EDIT: I just discovered this. Pretty amazing. Too bad, that its linux, 1) My brain is not compatible with it (I haven't taught myself linux) 2) I can't afford the time installing linux just for that program

VERY interesting indeed! Found this in the FAQ:

Quote:

NOTE: Intel also offers a Math Kernel Library and several other libraries written in hand-coded assembly language. These libraries currently do not work, since they use a variety of exotic SSE3/MMX instructions PTLsim does not yet support. We are presently working on fixing this.

...might be an interesting project to work on or move to Windows.

_________________
¯\(°_o)/¯ AI may [not] have aided with the above reply.

04 Aug 2008, 04:49

revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20998
Location: In your JS exploiting you and your system

revolution 14 Aug 2008, 21:40

Madis731 wrote:

This time r22 nor revolution will beat me... I hope

Hehe, you are paranoid. Well currently I don't have a 64bit capable CPU (or OS) so your record is safe for the meantime. But anyway good job, I like to see people really making the best of something.

14 Aug 2008, 21:40

r22

Joined: 27 Dec 2004
Posts: 805

r22 15 Aug 2008, 12:36

Code:

        mov     eax,[UserData+rax*4]          
        mov     ebx,[UserData+rbx*4]          
        mov     ecx,[UserData+rcx*4]          
        mov     edx,[UserData+rdx*4]          
        shl     rax,32                        
        shl     rbx,32

Code:

        mov     eax,[UserData+rax*4]          
        mov     ebx,[UserData+rbx*4]          
        shl     rax,32                        
        shl     rbx,32
        mov     ecx,[UserData+rcx*4]          
        mov     edx,[UserData+rdx*4]

Rationale: Pretty sure I saw RtlMoveMemory in NtDll breakup mem to reg MOVs with offset ADDs in between. (And it seemed well optimized) Worth a test with the SHLs or maybe just one of the SHLs. I take a brute-force approach to opcode level optimization :p

15 Aug 2008, 12:36

Madis731

Joined: 25 Sep 2003
Posts: 2138
Location: Estonia

Madis731 15 Aug 2008, 18:08

Please elaborate:

Code:

        mov     eax,[UserData+rax*4]
        mov     ebx,[UserData+rbx*4]
        shl     rax,32 ;Do you expect port0/1/5 usage here because
        shl     rbx,32 ;[mem] moves saturated port4?
        mov     ecx,[UserData+rcx*4]
        mov     edx,[UserData+rdx*4]

15 Aug 2008, 18:08

< Last Thread | Next Thread >

Forum Rules:

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum