flat assembler
Message board for the users of flat assembler.
r22 02 Aug 2008, 02:22
I have 2 tweak suggestions.
Code:
movq rax,xmm0
movhlps xmm1,xmm0
movq rbx,xmm1
To
Code:
movhlps xmm1,xmm0
movq rax,xmm0
movq rbx,xmm1
Avoids the WRITE then READ on XMM1.

Code:
shr rax,32
shr rbx,32
mov eax,[UserData+rax*4]
mov ebx,[UserData+rbx*4]
mov ecx,[UserData+rcx*4]
mov edx,[UserData+rdx*4]
To
Code:
mov r11,UserData
shr rax,32
shr rbx,32
mov eax,[r11+rax*4]
mov ebx,[r11+rbx*4]
mov ecx,[r11+rcx*4]
mov edx,[r11+rdx*4]
Saves 5 bytes, might allow the LUT mov's to execute quicker.

EDIT possibly 3 suggestions
Code:
add rax,rcx
add rbx,rdx
to
Code:
or rax,rcx
add rbx,rdx
;; add rax,rcx
;; or rbx,rdx
Might allow for better uop issuing.
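A quick way to check that swapping ADD for OR is safe here: assuming, as in the sequence under discussion, that RAX has just been shifted left by 32 and RCX holds a zero-extended 32-bit load, the two operands never share a set bit, so no carry can occur and both instructions give the same result. The Python sketch below models the 64-bit registers as masked integers. The function names are made up for illustration, and the model says nothing about uop issue, only about the value produced.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def pack_add(hi, lo):
    """Model `shl rax,32` followed by `add rax,rcx` (rax=hi, rcx=lo)."""
    rax = ((hi & MASK32) << 32) & MASK64  # shl rax,32: low 32 bits become zero
    rcx = lo & MASK32                     # a 32-bit load zero-extends into rcx
    return (rax + rcx) & MASK64


def pack_or(hi, lo):
    """Same as pack_add, but with `or rax,rcx` instead of `add rax,rcx`."""
    rax = ((hi & MASK32) << 32) & MASK64
    rcx = lo & MASK32
    return (rax | rcx) & MASK64


# Hypothetical sample values, including the all-ones edge case.
samples = [(0, 0), (1, 2), (0xFFFFFFFF, 0xFFFFFFFF), (0x12345678, 0x9ABCDEF0)]
for hi, lo in samples:
    assert pack_add(hi, lo) == pack_or(hi, lo)
```

Whether the mixed ADD/OR version actually issues better is a separate question that only a measurement on the target CPU can answer.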
asmfan 02 Aug 2008, 08:19
What about using pshufd /shufps/? Different execution units - int vs float.
Madis731 03 Aug 2008, 14:53
All suggestions taken into account.
- Changing MOVHLPS's position doesn't help, because the instructions seem to execute in different clock cycles anyway (explanatory code follows).
- "Caching" UserData in a register didn't help, because it added a uop and it was hard to place it anywhere without losing a clock. Constants seem to please the CPU more.
- ADD/SUB/AND/OR/XOR can go to all execution units (according to Agner Fog), and the change affected neither uops nor clocks. I reset both to ADD for clearer code. Changing SHL to ROL didn't make any difference.
- PSHUFD I already explained, but SHUFPS is a similar instruction (though miraculously a bit better with uops).

EDIT: Section 7.8, "Register read stalls", in Agner Fog's manual describes the problem with the mov reg1,[reg2+reg1*4] method. These 4 are all different registers, each modified only when its instruction has finished.
Code:
movhlps xmm1,xmm0
nop
movq rax,xmm0
nop
nop
movq rbx,xmm1 ;This is definitely at least the 2nd clock if not 3rd
mov ecx,eax
mov edx,ebx
nop
nop
shr rax,32
nop
nop
shr rbx,32
mov eax,[UserData+rax*4]
mov ebx,[UserData+rbx*4]
mov ecx,[UserData+rcx*4]
mov edx,[UserData+rdx*4]
shl rax,32
shl rbx,32
add rax,rcx
add rbx,rdx
movq xmm1,rbx
movq xmm0,rax
movlhps xmm0,xmm1; //7 clocks (25uops)
And here we are - a perfect sample of Core 2 being capable of producing 0.28 CPI or 3.57 IPC.
Btw this is NOT the only way to place nops. You can easily change the order of the first two lines etc.

EDIT: I just discovered this. Pretty amazing. Too bad that it's Linux: 1) My brain is not compatible with it (I haven't taught myself Linux) 2) I can't afford the time installing Linux just for that program

EDIT2: I just checked this code on my Core 2 45nm (T9300) and it said 6 clocks.

EDIT3: Okay - that was too unreal. I filled it with nops and it said 6clk/25uops (remember 4instr./clk is the max). What happened was that the last movlhps was not retiring properly, and I got to the bottom of this by adding MOVDQA xmm2,xmm0 to the end. This instruction made sure that the previous one must finish writing to xmm0. So it is 7 clocks after all.
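For readers who want to check what the sequence above computes, here is a register-level model in Python. It assumes that UserData is a table of dwords and that xmm0 holds four 32-bit indices into it, which is what the `*4` scale and the 32-bit loads suggest. The function name, the list representation of xmm0 and the sample table are invented for illustration. The nops are omitted because they do not affect the result, and nothing here models timing.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def lut_gather4(xmm0, table):
    """xmm0: four 32-bit indices, element 0 being the lowest dword."""
    # movhlps xmm1,xmm0 / movq rax,xmm0 / movq rbx,xmm1
    rax = (xmm0[0] & MASK32) | ((xmm0[1] & MASK32) << 32)
    rbx = (xmm0[2] & MASK32) | ((xmm0[3] & MASK32) << 32)
    # mov ecx,eax / mov edx,ebx (32-bit moves zero-extend)
    rcx = rax & MASK32
    rdx = rbx & MASK32
    # shr rax,32 / shr rbx,32
    rax >>= 32
    rbx >>= 32
    # mov e?x,[UserData+r?x*4]: four independent table lookups
    rax = table[rax] & MASK32
    rbx = table[rbx] & MASK32
    rcx = table[rcx] & MASK32
    rdx = table[rdx] & MASK32
    # shl rax,32 / shl rbx,32 / add rax,rcx / add rbx,rdx
    rax = (((rax << 32) & MASK64) + rcx) & MASK64
    rbx = (((rbx << 32) & MASK64) + rdx) & MASK64
    # movq xmm1,rbx / movq xmm0,rax / movlhps xmm0,xmm1
    return [rax & MASK32, rax >> 32, rbx & MASK32, rbx >> 32]


table = [10, 20, 30, 40]                   # hypothetical UserData contents
print(lut_gather4([3, 0, 2, 1], table))    # [40, 10, 30, 20]

# Throughput arithmetic from the post: 25 instructions (nops included) in 7 clocks.
clocks, instructions = 7, 25
print(f"CPI = {clocks / instructions:.2f}, IPC = {instructions / clocks:.2f}")
```

In other words, the whole sequence is a four-element gather: each dword of xmm0 is replaced by the table entry it indexes.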
bitRAKE 04 Aug 2008, 04:49
Madis731 wrote:
EDIT: I just discovered this. Pretty amazing. Too bad that it's Linux: 1) My brain is not compatible with it (I haven't taught myself Linux) 2) I can't afford the time installing Linux just for that program

Quote:
NOTE: Intel also offers a Math Kernel Library and several other libraries written in hand-coded assembly language. These libraries currently do not work, since they use a variety of exotic SSE3/MMX instructions PTLsim does not yet support. We are presently working on fixing this.
revolution 14 Aug 2008, 21:40
Madis731 wrote: This time r22 nor revolution will beat me... I hope |
r22 15 Aug 2008, 12:36
Code:
mov eax,[UserData+rax*4]
mov ebx,[UserData+rbx*4]
mov ecx,[UserData+rcx*4]
mov edx,[UserData+rdx*4]
shl rax,32
shl rbx,32
To
Code:
mov eax,[UserData+rax*4]
mov ebx,[UserData+rbx*4]
shl rax,32
shl rbx,32
mov ecx,[UserData+rcx*4]
mov edx,[UserData+rdx*4]
Rationale: Pretty sure I saw RtlMoveMemory in NtDll break up mem-to-reg MOVs with offset ADDs in between. (And it seemed well optimized.) Worth a test with the SHLs, or maybe just one of the SHLs.
I take a brute-force approach to opcode level optimization :p
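Since each of the six instructions reads and writes only its own register, moving the two SHLs between the loads cannot change the result. Only the scheduling can differ. The Python sketch below checks the equivalence with a tiny interpreter. The op names, the register dictionary and the sample table are invented for illustration, and the model knows nothing about ports or latencies, so it cannot say which order is faster.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def run(program, regs, table):
    """Execute a list of (op, register) pairs on a copy of regs."""
    regs = dict(regs)
    for op, reg in program:
        if op == "load":        # mov e?x,[UserData+r?x*4]
            regs[reg] = table[regs[reg]] & MASK32
        elif op == "shl32":     # shl r?x,32
            regs[reg] = (regs[reg] << 32) & MASK64
        else:
            raise ValueError(f"unknown op: {op}")
    return regs


original = [("load", "rax"), ("load", "rbx"), ("load", "rcx"), ("load", "rdx"),
            ("shl32", "rax"), ("shl32", "rbx")]
reordered = [("load", "rax"), ("load", "rbx"), ("shl32", "rax"), ("shl32", "rbx"),
             ("load", "rcx"), ("load", "rdx")]

table = [10, 20, 30, 40]                            # hypothetical UserData contents
start = {"rax": 0, "rbx": 1, "rcx": 3, "rdx": 2}    # hypothetical indices

assert run(original, start, table) == run(reordered, start, table)
```

With correctness settled, the brute-force part is simply timing both orders on the actual CPU.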
Madis731 15 Aug 2008, 18:08
Please elaborate:
Code:
mov eax,[UserData+rax*4]
mov ebx,[UserData+rbx*4]
shl rax,32 ;Do you expect port0/1/5 usage here because
shl rbx,32 ;[mem] moves saturated port4?
mov ecx,[UserData+rcx*4]
mov edx,[UserData+rdx*4]
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.