flat assembler
Message board for the users of flat assembler.
Fast SIMD lookup (four DWORDs looked up from DWORD location)
r22 02 Aug 2008, 02:22
I have 2 tweak suggestions.
Code:
    movq rax,xmm0
    movhlps xmm1,xmm0
    movq rbx,xmm1
To
Code:
    movhlps xmm1,xmm0
    movq rax,xmm0
    movq rbx,xmm1
Avoids the WRITE then READ on XMM1

Code:
    shr rax,32
    shr rbx,32
    mov eax,[UserData+rax*4]
    mov ebx,[UserData+rbx*4]
    mov ecx,[UserData+rcx*4]
    mov edx,[UserData+rdx*4]
To
Code:
    mov r11,UserData
    shr rax,32
    shr rbx,32
    mov eax,[r11+rax*4]
    mov ebx,[r11+rbx*4]
    mov ecx,[r11+rcx*4]
    mov edx,[r11+rdx*4]
Saves 5 bytes, might allow the LUT mov's to execute quicker.

EDIT possibly 3 suggestions
Code:
    add rax,rcx
    add rbx,rdx
to
Code:
    or rax,rcx
    add rbx,rdx
    ;; add rax,rcx
    ;; or rbx,rdx
Might allow for better uop issuing
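The ADD-to-OR swap in the third suggestion should not change the result. After `shl rax,32` the low 32 bits of rax are zero, and a 32-bit load such as `mov ecx,[...]` zero-extends into rcx, so the two operands share no set bits and no carry can occur. A minimal C sketch of that reasoning follows; the function names are made up for illustration and are not part of the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Models "shl rax,32" followed by "add rax,rcx".
   hi stands for the dword loaded into eax, lo for the dword loaded into ecx. */
uint64_t combine_add(uint32_t hi, uint32_t lo) {
    return ((uint64_t)hi << 32) + (uint64_t)lo;
}

/* Models "shl rax,32" followed by "or rax,rcx". */
uint64_t combine_or(uint32_t hi, uint32_t lo) {
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}

/* Aborts if the two forms ever disagree for the given pair. */
void check_forms_agree(uint32_t hi, uint32_t lo) {
    assert(combine_add(hi, lo) == combine_or(hi, lo));
}
```

This only shows that the two forms compute the same value. Whether one of them issues better on a given CPU still has to be measured.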
asmfan 02 Aug 2008, 08:19
What about using pshufd /shufps/? Different execution units - int vs float.
Madis731 03 Aug 2008, 14:53
All suggestions taken into account.
- Changing MOVHLPS's position doesn't help, because they seem to execute in different clock cycles anyway (explanatory code follows).
- "Caching" UserData didn't help, because it added a uop and it was hard to place it anywhere without losing one clock. Constants seem to please the CPU more.
- ADD/SUB/AND/OR/XOR can go to all execution units (according to Agner Fog), and it changed neither uops nor clocks. I reset both to ADD because the code is clearer. SHL to ROL didn't make any difference.
- PSHUFD I already explained, but SHUFPS is a similar instruction (though miraculously a bit better with uops).

EDIT: Section 7.8, "Register read stalls", in Agner Fog's manual describes the problem with the mov reg1,[reg2+reg1*4] method. These 4 are all different registers that get modified only until the instruction has finished.

Code:
    movhlps xmm1,xmm0
    nop
    movq rax,xmm0
    nop
    nop
    movq rbx,xmm1 ;This is definitely at least the 2nd clock if not 3rd
    mov ecx,eax
    mov edx,ebx
    nop
    nop
    shr rax,32
    nop
    nop
    shr rbx,32
    mov eax,[UserData+rax*4]
    mov ebx,[UserData+rbx*4]
    mov ecx,[UserData+rcx*4]
    mov edx,[UserData+rdx*4]
    shl rax,32
    shl rbx,32
    add rax,rcx
    add rbx,rdx
    movq xmm1,rbx
    movq xmm0,rax
    movlhps xmm0,xmm1 ;7 clocks (25 uops)

And here we are - a perfect sample of Core 2 capable of producing 0.28 CPI or 3.57 IPC. These nops are good if you want to add some functions. Btw this is NOT the only way to place nops. You can easily change the order of the first two lines etc.

EDIT: I just discovered this. Pretty amazing. Too bad that it's Linux:
1) My brain is not compatible with it (I haven't taught myself Linux)
2) I can't afford the time installing Linux just for that program

EDIT2: I just checked this code on my Core 2 45nm (T9300) and it said 6 clocks. I will investigate more.

EDIT3: Okay - that was too unreal. I filled it with nops and it said 6clk/25uops (remember 4 instr./clk is the max). What happened was that the last movlhps was not retiring properly, and I got to the bottom of this by adding MOVDQA xmm2,xmm0 to the end. This instruction made sure that the previous one must finish writing to xmm0. 7 clocks that is then.
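The figures in this post can be checked with a few lines of arithmetic. The sketch below takes the post's numbers at face value: 25 instructions in the listing (one uop each, as the 25uops count implies) and a peak of 4 instructions per clock. The helper names are hypothetical.

```c
#include <assert.h>

/* Instructions per clock for a measured block. */
double ipc(int instructions, int clocks) {
    assert(instructions > 0 && clocks > 0);
    return (double)instructions / (double)clocks;
}

/* Clocks per instruction, the reciprocal of ipc(). */
double cpi(int instructions, int clocks) {
    assert(instructions > 0 && clocks > 0);
    return (double)clocks / (double)instructions;
}

/* Fewest whole clocks in which `instructions` can pass through a pipeline
   that handles at most `width` instructions per clock (ceiling division). */
int min_clocks(int instructions, int width) {
    assert(instructions > 0 && width > 0);
    return (instructions + width - 1) / width;
}
```

With these, `cpi(25, 7)` is 0.28 and `ipc(25, 7)` is about 3.57, matching the post. `min_clocks(25, 4)` is 7, and `ipc(25, 6)` comes out above 4, which is why the 6-clock reading looked too good to be true.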
bitRAKE 04 Aug 2008, 04:49
Madis731 wrote:
    EDIT: I just discovered this. Pretty amazing. Too bad that it's Linux:
    1) My brain is not compatible with it (I haven't taught myself Linux)
    2) I can't afford the time installing Linux just for that program

Quote:
    NOTE: Intel also offers a Math Kernel Library and several other libraries written in hand-coded assembly language. These libraries currently do not work, since they use a variety of exotic SSE3/MMX instructions PTLsim does not yet support. We are presently working on fixing this.

_________________
¯\(°_o)/¯
“languages are not safe - uses can be” Bjarne Stroustrup
revolution 14 Aug 2008, 21:40
Madis731 wrote:
    This time r22 nor revolution will beat me... I hope
r22 15 Aug 2008, 12:36
Code:
    mov eax,[UserData+rax*4]
    mov ebx,[UserData+rbx*4]
    mov ecx,[UserData+rcx*4]
    mov edx,[UserData+rdx*4]
    shl rax,32
    shl rbx,32
To
Code:
    mov eax,[UserData+rax*4]
    mov ebx,[UserData+rbx*4]
    shl rax,32
    shl rbx,32
    mov ecx,[UserData+rcx*4]
    mov edx,[UserData+rdx*4]
Rationale: Pretty sure I saw RtlMoveMemory in NtDll break up mem-to-reg MOVs with offset ADDs in between (and it seemed well optimized). Worth a test with the SHLs, or maybe just one of the SHLs. I take a brute-force approach to opcode-level optimization :p
Madis731 15 Aug 2008, 18:08
Please elaborate:
Code:
    mov eax,[UserData+rax*4]
    mov ebx,[UserData+rbx*4]
    shl rax,32 ;Do you expect port0/1/5 usage here because
    shl rbx,32 ;[mem] moves saturated port4?
    mov ecx,[UserData+rcx*4]
    mov edx,[UserData+rdx*4]
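When trying reordered variants like the ones in this thread, a scalar model of what the sequence computes is handy for checking that a faster version still gives the same answer. The C sketch below is such a model under stated assumptions: `table` stands in for UserData, the four indices are the dwords of xmm0 from lowest to highest, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: each of the four dword indices is replaced by table[index].
   The caller must make sure every index is inside the table. */
void lookup4_ref(const uint32_t *table, const uint32_t in[4], uint32_t out[4]) {
    for (int i = 0; i < 4; i++) {
        out[i] = table[in[i]];
    }
}

/* One 64-bit half, following the register steps used in the thread. */
uint64_t lookup_half(const uint32_t *table, uint64_t half) {
    uint32_t lo_idx = (uint32_t)half;          /* mov ecx,eax */
    uint32_t hi_idx = (uint32_t)(half >> 32);  /* shr rax,32 */
    uint64_t hi = table[hi_idx];               /* mov eax,[UserData+rax*4] */
    uint64_t lo = table[lo_idx];               /* mov ecx,[UserData+rcx*4] */
    return (hi << 32) + lo;                    /* shl rax,32 / add rax,rcx */
}

/* Aborts if the register-style path and the reference disagree. */
void check_lookup4(const uint32_t *table, const uint32_t in[4]) {
    uint32_t ref[4];
    lookup4_ref(table, in, ref);
    uint64_t low  = lookup_half(table, ((uint64_t)in[1] << 32) | in[0]);
    uint64_t high = lookup_half(table, ((uint64_t)in[3] << 32) | in[2]);
    assert((uint32_t)low == ref[0] && (uint32_t)(low >> 32) == ref[1]);
    assert((uint32_t)high == ref[2] && (uint32_t)(high >> 32) == ref[3]);
}
```

The model says nothing about ports or clocks. It only pins down the expected output, so a test harness can compare it against whatever the assembly routine leaves in xmm0.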
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.