flat assembler
Message board for the users of flat assembler.

Index > Main > Fast SIMD lookup (four DWORDs looked up from DWORD location)

Author
Thread Post new topic Reply to topic
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
Madis731
This is one of those threads that makes people think Smile
This time I didn't give up until I reached the absolute minimum. AND it uses only 4 GPRs and 2 XMMs. I won't post you the "wrong" code, but lets just say that I started with PSHUFD and it wasn't helping. From 32 clocks (20uops) to 17 clocks (14uops) down to...
you'll see if you read the code Very Happy
Why does anyone need it? Well, let the reader decide, but I think one hint is my other started topic - Perlin Noise (where lookup is accidentaly needed).
Code:
;In: XMM0 holds four DWORD values
;Purpose: This algo will find a new value for every DWORD from a LUT named UserData
;Out: XMM0 holds four LUT DWORD values
;#######################
;All measurements taken on Q6600 Core 2 with Agner's test program
;64-bit code only! (32-bit would take twice as long without movq r64,xmm)
        movq    rax,xmm0                      
        movhlps xmm1,xmm0                     
        movq    rbx,xmm1                      
        mov     ecx,eax                       
        mov     edx,ebx                       
        shr     rax,32                        
        shr     rbx,32                        
        mov     eax,[UserData+rax*4]          
        mov     ebx,[UserData+rbx*4]          
        mov     ecx,[UserData+rcx*4]          
        mov     edx,[UserData+rdx*4]          
        shl     rax,32                        
        shl     rbx,32                        
        add     rax,rcx                       
        add     rbx,rdx                       
        movq    xmm1,rbx                      
        movq    xmm0,rax                      
        movlhps xmm0,xmm1; //7 clocks (18uops) Core 2
; Think about 0.38888 CPI!!! Very Happy
    

This time r22 nor revolution will beat me... I hope

_________________
My updated idol Very Happy http://www.agner.org/optimize/
Post 01 Aug 2008, 22:29
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
I have 2 tweak suggestions.

Code:
        movq    rax,xmm0                      
        movhlps xmm1,xmm0                     
        movq    rbx,xmm1 
    

To
Code:
        movhlps xmm1,xmm0                     
        movq    rax,xmm0                      
        movq    rbx,xmm1 
    

Avoids the WRITE then READ on XMM1

Code:
        shr     rax,32                        
        shr     rbx,32                        
        mov     eax,[UserData+rax*4]          
        mov     ebx,[UserData+rbx*4]          
        mov     ecx,[UserData+rcx*4]          
        mov     edx,[UserData+rdx*4]
    

To
Code:
        mov   r11,UserData
        shr     rax,32                        
        shr     rbx,32                        
        mov     eax,[r11+rax*4]          
        mov     ebx,[r11+rbx*4]          
        mov     ecx,[r11+rcx*4]          
        mov     edx,[r11+rdx*4]
    

Saves 5 bytes, might allow the LUT mov's to execute quicker.


EDIT possibly 3 suggestions
Code:
        add     rax,rcx                       
        add     rbx,rdx 
    

to
Code:
        or     rax,rcx                       
        add     rbx,rdx 
;;     add rax,rcx
;;     or rbx,rdx
    

Might allow for better uop issuing
Post 02 Aug 2008, 02:22
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
asmfan
What about using pshufd /shufps/? Different execution units - int vs float.
Post 02 Aug 2008, 08:19
View user's profile Send private message Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
Madis731
All sugestions taken into account.

    -Changing MOVHLPS's position doesn't help because they seem to execute in different clock cycles anyway (explaining code follows)
    -"Caching" the UserData didn't help because it added an uop and it was hard to place it anywhere not to lose one clock. Constants seem to please CPU more.
    -ADD/SUB/AND/OR/XOR can go to all execution units (according to Agner Fog) and it didn't change uops nor clocks. I reset it to both ADD because of clearer code. SHL to ROL didn't make any difference.
    -PSHUFD I already explained, but SHUFPS is a similar instruction (though miraculously a bit better with uops).

EDIT: 7.8 Register read stalls in Agner Fog's manual tells about the problem using mov reg1,[reg2+reg1*4] method. These 4 are all different registers that get modified only until the instruction has finished.
Code:
        movhlps xmm1,xmm0
        nop
        movq    rax,xmm0
        nop
        nop
        movq    rbx,xmm1 ;This is definitely at least the 2nd clock if not 3rd
        mov     ecx,eax
        mov     edx,ebx
        nop
        nop
        shr     rax,32
        nop
        nop
        shr     rbx,32
        mov     eax,[UserData+rax*4]
        mov     ebx,[UserData+rbx*4]
        mov     ecx,[UserData+rcx*4]
        mov     edx,[UserData+rdx*4]
        shl     rax,32
        shl     rbx,32
        add     rax,rcx
        add     rbx,rdx
        movq    xmm1,rbx
        movq    xmm0,rax
        movlhps xmm0,xmm1; //7 clocks (25uops)
    

And here we are - a perfect sample of Core 2 capable of producting 0.28CPI or 3.57IPC Smile These nops are good if you want to add some functions.
Btw this is NOT the only way to place nops. You can easily change the order of the first two lines etc.

EDIT: I just discovered this. Pretty amazing. Too bad, that its linux, 1) My brain is not compatible with it (I haven't taught myself linux) 2) I can't afford the time installing linux just for that program Razz

EDIT2: I just checked this code on my Core 2 45nm (T9300) and it said 6 clocks Smile I will investigate more.

EDIT3: Okay - that was too unreal. I filled it with nops and it said 6clk/25uops (remember 4instr./clk is the max). What happened, was last movlhps not retiring properly and I managed to get to the bottom of this by adding MOVDQA xmm2,xmm0 to the end. This instruction made sure that the previous one must finish writing to xmm0. 7 clocks that is then.

_________________
My updated idol Very Happy http://www.agner.org/optimize/
Post 03 Aug 2008, 14:53
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 3050
Location: vpcmipstrm
bitRAKE
Madis731 wrote:
EDIT: I just discovered this. Pretty amazing. Too bad, that its linux, 1) My brain is not compatible with it (I haven't taught myself linux) 2) I can't afford the time installing linux just for that program Razz
VERY interesting indeed! Found this in the FAQ:
Quote:
NOTE: Intel also offers a Math Kernel Library and several other libraries written in hand-coded assembly language. These libraries currently do not work, since they use a variety of exotic SSE3/MMX instructions PTLsim does not yet support. We are presently working on fixing this.
...might be an interesting project to work on or move to Windows.

_________________
¯\(°_o)/¯ unlicense.org
Post 04 Aug 2008, 04:49
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17671
Location: In your JS exploiting you and your system
revolution
Madis731 wrote:
This time r22 nor revolution will beat me... I hope
Hehe, you are paranoid. Well currently I don't have a 64bit capable CPU (or OS) so your record is safe for the meantime. But anyway good job, I like to see people really making the best of something.
Post 14 Aug 2008, 21:40
View user's profile Send private message Visit poster's website Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
Code:
        mov     eax,[UserData+rax*4]          
        mov     ebx,[UserData+rbx*4]          
        mov     ecx,[UserData+rcx*4]          
        mov     edx,[UserData+rdx*4]          
        shl     rax,32                        
        shl     rbx,32
    

To
Code:
        mov     eax,[UserData+rax*4]          
        mov     ebx,[UserData+rbx*4]          
        shl     rax,32                        
        shl     rbx,32
        mov     ecx,[UserData+rcx*4]          
        mov     edx,[UserData+rdx*4]          
    


Rationale: Pretty sure I saw RtlMoveMemory in NtDll breakup mem to reg MOVs with offset ADDs in between. (And it seemed well optimized) Worth a test with the SHLs or maybe just one of the SHLs. I take a brute-force approach to opcode level optimization :p
Post 15 Aug 2008, 12:36
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
Madis731
Please elaborate:
Code:
        mov     eax,[UserData+rax*4]
        mov     ebx,[UserData+rbx*4]
        shl     rax,32 ;Do you expect port0/1/5 usage here because
        shl     rbx,32 ;[mem] moves saturated port4?
        mov     ecx,[UserData+rcx*4]
        mov     edx,[UserData+rdx*4]
    
Post 15 Aug 2008, 18:08
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.