flat assembler
Message board for the users of flat assembler.
MazeGen 16 Dec 2005, 16:23
You're right... I have never used SSE...
Anyway, what does bit 12 (PM, Precision Mask) in the MXCSR register actually mean? Which precision does this bit influence, and how?
LocoDelAssembly 16 Dec 2005, 16:52
Hmm, wait guys, I'm finding some pages that say the FPU/MMX state is saved, BUT not for drivers. Check the links below:
http://en.wikipedia.org/wiki/Talk:AMD64
http://www.freepascal.org/wiki/index.php/Win64/AMD64_API#About_x87_and_mmx_registers <- This one says the FP registers are preserved across context switches but destroyed on function calls.
r22 16 Dec 2005, 18:34
That memcpy link is for 32-bit and doesn't fare well in 64-bit land.
MMX movq is slower than a 64-bit mov reg64,mem64. I guess AMD left the 64-bit memcpy to the gcc and MS lib coders. As for the FPU in 64-bit, you can always roll the dice and run FPU opcodes; just make sure you save the output and don't leave it on the FPU stack or in the FPU registers. I've never needed real10 precision though.

Back on topic: I think RtlInitUnicodeString was it for ntdll. In kernel32.dll (64-bit) that function is called 260 times; the second most called function is AllocateHeap, which I've looked at (and can't seem to find any major lack of optimization in). A lot of the other string functions in ntdll use slow scas and movsb/w/d/q instructions, but they aren't used often enough to make any difference. My next 64-bit kernel optimization step will be to decompile drivers and see if they also use RtlInitUnicodeString, and if so patch NTOSKRNL32.EXE with my faster version (this way drivers will also get a speed boost).
Feryno 19 Dec 2005, 06:25
Off the topic of patching DLLs, but maybe it helps:
There is space in the context structure to store the FPU/mm regs when switching context. I am going to try it on my AMD64 at home to see whether it is really used or is only wasted space in the context structure... (It's not wasted time for me; I am going to implement FPU/mm support in my project.) E.g. the Microsoft team forgot to implement setting bit 1 of the flags, and I can't be sure they didn't forget the same with the FPU/mm regs... so I have to try it. You have to specify the flag CONTEXT_MMX_REGISTERS to obtain FPU/mm/xmm in the Context structure.
winnt.h:
    ......
    typedef struct DECLSPEC_ALIGN(16) _CONTEXT {
    ....
        //
        // Program counter.
        //

        DWORD64 Rip;

        //
        // Floating point state.
        //

        union {
            XMM_SAVE_AREA32 FltSave;
            struct {
                M128A Header[2];
                M128A Legacy[8];    // <- Here it is, isn't it?
                M128A Xmm0;
                M128A Xmm1;
                M128A Xmm2;
                M128A Xmm3;
                M128A Xmm4;
                M128A Xmm5;
                M128A Xmm6;
                M128A Xmm7;
                M128A Xmm8;
                M128A Xmm9;
                M128A Xmm10;
                M128A Xmm11;
                M128A Xmm12;
                M128A Xmm13;
                M128A Xmm14;
                M128A Xmm15;
            };
        };

        //
        // Vector registers.
        //
    ....
I never found an API that preserves the fpu/mm regs (and I traced kernel APIs a lot, a lot, a lot when I was starting out with asm on win64...)
r22 22 Jan 2006, 05:18
I tried patching NTOSKRNL.EXE, no luck. Disabling all the protections and whatnot doesn't help (the protection must be hard-coded into the OS).
If anyone has the bandwidth to upload the NTDLL.DLL file that comes with WinXP SP2 32-bit for me, I'll patch it with the modified RtlInitUnicodeString function, and you can follow the instructions a page back on how to get it running on your system.
decard 22 Jan 2006, 08:44
r22 23 Jan 2006, 06:38
I got some mixed results in the 32-bit version.
It seems the REP prefix + scasw method is faster for long unicode strings and slower for short ones. So my optimization, which runs 2x faster in the 64-bit version, is only 25% faster in the 32-bit version AND only for short unicode strings. The unrolled loop is slower than rep + scasw for larger strings (in the 32-bit version). While realistically the kernel won't be passing giant unicode strings to the RtlInitUnicodeString API, it still makes me reluctant to patch it, since the patched version isn't faster in ALL cases.

Here's the source code that I'm using to benchmark.
RtlInitUnicodeString1 is the ntdll.dll API (yoinked from a debugger).
RtlInitUnicodeString2 is the "optimized" version (ported from 64-bit to 32-bit).
Code:
    format PE console
    entry start
    include 'c:\FASM\INCLUDE\win32a.inc'

    section '.code' code readable executable

    start:
            call    MakeSeed                ;if anything needs to be random
            call    [GetCurrentProcess]     ;returns -1
            push    100h                    ;realtime
            push    eax
            call    [SetPriorityClass]
            call    [GetCurrentThread]      ;;returns -2
            push    15
            push    eax
            call    [SetThreadPriority]
            push    _running
            call    [printf]
            add     esp,4                   ;printf is cdecl: the caller removes the arguments
            push    7FFFh                   ;;loop counter
            call    BenchIt
            push    0
            push    0
            push    0
            push    0
            call    [MessageBox]
            push    0
            call    [ExitProcess]

    BenchIt:
            push    ebp
            mov     ebp,esp
            push    ebx
            push    esi
            mov     ebx,[ebp+8]             ;;loop count
            call    [GetTickCount]
            mov     esi,eax
    align 8
    .lp1:
            push    TestString3
            push    TestBuffer
            call    RtlInitUnicodeString1
            dec     ebx
            jnz     .lp1
            call    [GetTickCount]
            sub     eax,esi
            push    eax
            push    _func1
            call    [printf]
            add     esp,8                   ;cdecl cleanup, so pop esi/ebx below restore the saved values
            ;;;;;;;;;;;;;;;
            push    dword[TestBuffer]
            push    _fmth
            call    [printf]
            add     esp,8
            ;;;;;;;;;;;;;;;;;;;
            mov     ebx,[ebp+8]
            call    [GetTickCount]
            mov     esi,eax
    align 8
    .lp2:
            push    TestString3
            push    TestBuffer
            call    RtlInitUnicodeString2
            dec     ebx
            jnz     .lp2
            call    [GetTickCount]
            sub     eax,esi
            push    eax
            push    _func2
            call    [printf]
            add     esp,8
            ;;;;;;;;;;;;;;;
            push    dword[TestBuffer]
            push    _fmth
            call    [printf]
            add     esp,8
            ;;;;;;;;;;;;;;;;;;;
            pop     esi
            pop     ebx
            mov     esp,ebp
            pop     ebp
            ret     4

    align 16
    RtlInitUnicodeString1:
            push    edi
            mov     edi, [esp+12]
            mov     edx, [esp+8]
            mov     dword [edx], 0
            mov     [edx+4], edi
            or      edi, edi
            jz      loc_7C90130E
            or      ecx, 0FFFFFFFFh
            xor     eax, eax
            repne scasw
            not     ecx
            shl     ecx, 1
            cmp     ecx, 0FFFEh
            jbe     loc_7C901305
            mov     ecx, 0FFFEh
    loc_7C901305:                           ; CODE XREF: RtlInitUnicodeString+28j
            mov     [edx+2], cx
            dec     ecx
            dec     ecx
            mov     [edx], cx
    loc_7C90130E:                           ; CODE XREF: RtlInitUnicodeString+14j
            pop     edi
            retn    8

    align 16
    RtlInitUnicodeString2:
            push    ebx
            mov     edx,[esp+12]            ;;wide char str
            mov     ecx,[esp+8]             ;;buff
            mov     dword[ecx],00h
            mov     [ecx+4],edx
            test    edx,edx
            jz      .end
            mov     ebx,edx
    .lp:
            movzx   eax,word[edx]
            add     edx,2
            test    eax,eax
            jz      .okok
            movzx   eax,word[edx]
            add     edx,2
            test    eax,eax
            jz      .okok
            jmp     .lp
    .okok:
            sub     edx,2
            mov     eax,0fffch
            sub     edx,ebx
            cmp     edx,0fffeh
            cmovnb  edx,eax
            mov     word[ecx],dx
            add     dx,2
            mov     word[ecx+2],dx
    .end:
            pop     ebx
            ret     8

    RandomFile:
            push    0
            push    0
            push    2                       ;always create
            push    0
            push    2                       ;shared write access
            push    40000000h               ;generic write
            push    fName
            call    [CreateFile]
            mov     esi,eax
            mov     ebx,250000              ;size
    .tst:
            call    Random32
            mov     dword[dwf],eax
            push    0
            push    wrr
            push    4
            push    dwf
            push    esi
            call    [WriteFile]
            dec     ebx
            jnz     .tst
    .ending:
            push    0
            call    [ExitProcess]

    Random32:
            push    ebx
            mov     eax,[RandomSeed]
            mov     ebx,[RandomSeed+4]
            mov     ecx,[RandomSeed+8]
            mov     edx,[RandomSeed+12]
            shld    ebx,eax,1
            adc     eax,0
            ror     eax,3
            bswap   eax
            shld    edx,ecx,1
            adc     ecx,0
            bswap   ecx
            ror     ecx,7
            mov     [RandomSeed],eax
            mov     [RandomSeed+4],ebx
            mov     [RandomSeed+8],ecx
            mov     [RandomSeed+12],edx
            add     eax,ecx
            pop     ebx
            ret     0

    SetSeed:
    .seed equ esp+4 ;,+8,+12,+16
            movdqu  xmm0,[.seed]
            movntdq dqword[RandomSeed],xmm0
            ret     16

    MakeSeed:
            rdtsc
            mov     edx,eax
            call    [GetTickCount]
            mov     ecx,eax
            mul     edx
            mov     [RandomSeed],eax
            xor     edx,ecx
            mov     [RandomSeed+4],edx
            bswap   ecx
            xor     eax,ecx
            mov     [RandomSeed+8],eax
            not     edx
            bswap   edx
            mul     edx
            mov     [RandomSeed+12],eax
            ret     0

    section '.data' data readable writeable

    _fmth       db '%x',13,10,0
    _running    db 'Benchmark Started',13,10,0
    _func1      db 'Function1 time: %lu',13,10,0
    _func2      db 'Function2 time: %lu',13,10,0
    fmt         db 'Lea: %lu Sub: %lu',0
    buffer      rb 0ffh
    fName       db 'ENTr22.r22',0
    dwf         dd 0
    wrr         dd 0
    TestString  dw 10 dup(9100h), 0h
    TestString2 dw 11 dup(9200h), 0h
    TestString3 dw 32767 dup(9300h), 0h
    TestString4 dw 32766 dup(9400h), 0h
    TestString5 dw 32765 dup(9500h), 0h
    TestString6 dw 65535 dup(9800h), 0h
    TestSrc     dd 4096 dup(61626364h)
    align 16
    TestBuffer  rd 4096
    align 16
    RandomSeed  dd 1318699, 1015727, 1235239, 412943

    section '.idata' import data readable writeable

    library kernel32,'KERNEL32.DLL',\
            msvcrt,'MSVCRT.DLL',\
            user32,'USER32.DLL'

    include "c:\FASM\INCLUDE\apia\kernel32.inc"
    include "c:\FASM\INCLUDE\apia\user32.inc"

    import msvcrt,\
           printf,'printf'

    section '.reloc' fixups data discardable
-sorry about the "c:\FASM\INCLUDE"

Just goes to show that crazy speed optimizations for 64-bit don't always hold up back in 32-bit land. If someone can come up with a version that is faster in ALL situations AND is the same size or smaller in bytes, when assembled, than the original function, I'll patch it.
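For anyone who wants a reference to check the two routines against, here is a minimal C sketch of what both assembly versions appear to compute, judging only from the listings above: Length is the string size in bytes without the terminator, MaximumLength is Length plus 2, and MaximumLength is clamped to 0FFFEh. The names `ustr_model` and `init_ustr_model` are made up for this illustration; this is not the Windows definition or implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the structure the assembly fills in:
   a 16-bit Length, a 16-bit MaximumLength, then the buffer pointer. */
typedef struct {
    uint16_t length;          /* bytes, terminator not included */
    uint16_t maximum_length;  /* bytes, terminator included     */
    const uint16_t *buffer;
} ustr_model;

/* Reference behaviour, written for clarity rather than speed. */
static void init_ustr_model(ustr_model *dst, const uint16_t *src)
{
    assert(dst != NULL);
    dst->length = 0;
    dst->maximum_length = 0;
    dst->buffer = src;
    if (src == NULL)
        return;                       /* both lengths stay zero */

    size_t chars = 0;
    while (src[chars] != 0)           /* scan for the 16-bit terminator */
        chars++;

    size_t max_bytes = chars * 2 + 2; /* string bytes plus terminator */
    if (max_bytes > 0xFFFE)
        max_bytes = 0xFFFE;           /* same clamp as the cmp/jbe in the listing */

    dst->maximum_length = (uint16_t)max_bytes;
    dst->length = (uint16_t)(max_bytes - 2);
}
```

For the 10-character TestString this gives Length 14h and MaximumLength 16h. Read back as one little-endian dword, as the benchmark's %x line does, that is 160014.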
Madis731 23 Jan 2006, 07:53
On a 3GHz P4 HT I got the following results:
Code:
    C:\Documents and Settings\Madisk\Desktop\FASM>test
    Benchmark Started
    Function1 time: 2359
    fffefffc
    Function2 time: 1532
    fffefffc
Other specs: http://valid.x86-secret.com/show_oc?id=66171

EDIT: How about this:
Code:
    RtlInitUnicodeString2:
            push    ebx
            mov     edx,[esp+12]            ;;wide char str
            mov     ecx,[esp+8]             ;;buff
            mov     dword[ecx],00h
            mov     [ecx+4],edx
            test    edx,edx
            jz      .end
            push    edx                     ; Remember edx
            sub     edx,4
    .lp:
            add     edx,4
            mov     eax, [edx]
            mov     ebx, eax
            not     eax
            sub     ebx, 00010001h          ; You may try mmx,xmm here
            and     eax, 80008000h          ; 2x, 4x longer respectively
            and     eax, ebx
            jz      .lp
    .okok:
            rol     eax,2                   ; Make it correct, because we
            add     edx,eax                 ; checked by 4-bytes before
            pop     ebx                     ; Bring eDx back as eBx
            mov     ax,0fffch
            sub     edx,ebx
            cmp     dx,0fffeh
            cmovnc  edx,eax
            mov     [ecx],dx
            add     edx,2
            mov     [ecx+2],dx
    .end:
            pop     ebx
            ret     8
    display ($-RtlInitUnicodeString2)/10+30h,($-RtlInitUnicodeString2) mod 10+30h,13,10
Results:
Code:
    Benchmark Started (TestSrc)
    Function1 time: 563
    4010400e
    Function2 time: 422
    4010400e
    Benchmark Started (Test6)
    Function1 time: 4656
    fffefffc
    Function2 time: 4031
    fffefffc
    Benchmark Started (Test5)
    Function1 time: 2359
    fffcfffa
    Function2 time: 2016
    fffcfffa
    Benchmark Started (Test4)
    Function1 time: 2359
    fffefffc
    Function2 time: 1422
    fffefffc
    Benchmark Started (Test3)
    Function1 time: 2359
    fffefffc
    Function2 time: 1422
    fffefffc
    Benchmark Started (Test2)
    Function1 time: 0
    180016
    Function2 time: 0
    180016
    Benchmark Started (Test)
    Function1 time: 0
    160014
    Function2 time: 0
    160014
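The not / sub 00010001h / and 80008000h sequence in the loop is the usual "does this dword contain a zero 16-bit lane" test. A small C sketch of the same test follows; the function names are invented for the illustration. One difference is deliberate: the position is decoded with explicit branches rather than a rotate, because the flag for the upper half can also be raised by a borrow when the lower half is zero, so the sketch trusts the upper flag only if the lower one is clear.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if at least one 16-bit half of v is zero.
   Same arithmetic as the loop body: (v - 00010001h) & ~v & 80008000h. */
static uint32_t zero_word_mask(uint32_t v)
{
    assert(sizeof(v) == 4);
    return (v - 0x00010001u) & ~v & 0x80008000u;
}

/* Byte offset (0 or 2) of the first zero 16-bit word in a little-endian
   dword, or -1 if neither half is zero. */
static int first_zero_word_offset(uint32_t v)
{
    uint32_t m = zero_word_mask(v);
    if (m == 0)
        return -1;
    if (m & 0x00008000u)
        return 0;   /* low half is the terminator; ignore the high flag */
    return 2;       /* only the high half is zero */
}
```

The branchy decode costs a few more instructions than a rotate, so treat it as a correctness reference for testing a faster version, not as a drop-in replacement.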
r22 24 Jan 2006, 03:16
I stayed away from using dword reads because I couldn't be sure about the memory the wide char string is in. If the wide char str ptr is not 4-byte aligned, a dword read could run past the end of the page. But only testing will tell whether we can get away with it or not.
Edit: After testing it on my comp (AMD X2 3800+), it tied ntdll's original function on every benchmark. I have a feeling my 64-bit system can't reliably benchmark 32-bit code because of architecture optimizations. Realistically ANY normal read/compare loop should be faster than the REP SCASW instructions (and it is in 64-bit mode, but in 32-bit it's coming out a draw). It's also 2 bytes too big to be hard-patched as a replacement :[
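To make the alignment worry concrete, here is a small C sketch that reports whether a 4-byte load at a given address would touch two pages. It assumes 4 KiB pages, and the macro and function names are invented for the example. A 4-byte-aligned address never straddles a page, while a string pointer that is only 2-byte aligned can sit at page offset 4094, where a dword read reaches 2 bytes into the next page.

```c
#include <assert.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

/* True if a 4-byte load starting at addr touches two different pages. */
static int dword_read_crosses_page(uintptr_t addr)
{
    assert(ASSUMED_PAGE_SIZE >= 4);
    return (addr & (ASSUMED_PAGE_SIZE - 1)) > ASSUMED_PAGE_SIZE - 4;
}

/* True if addr is 4-byte aligned, in which case the load cannot cross. */
static int is_dword_aligned(uintptr_t addr)
{
    return (addr & 3u) == 0;
}
```

One possible use is a scan that reads single words until the pointer is dword aligned and only then switches to 4-byte reads, at the cost of a few extra bytes of code.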
Madis731 24 Jan 2006, 08:46
Is 80 bytes ok?
Code:
    use32
    RtlInitUnicodeString2:
            push    ebx
            mov     edx,[esp+12]            ;;wide char str
            mov     ecx,[esp+8]             ;;buff
            xor     eax,eax
            mov     dword[ecx],eax
            mov     [ecx+4],edx
            test    edx,edx
            jz      .end
            push    edx                     ; Remember edx
            sub     edx,4
    .lp:
            add     edx,4
            mov     eax, [edx]
            mov     ebx, eax
            not     eax
            sub     ebx, 00010001h          ; You may try mmx,xmm here
            and     eax, 80008000h          ; 2x, 4x longer respectively
            and     eax, ebx
            jz      .lp
    .okok:
            rol     eax,2
            add     edx,eax
            pop     ebx                     ; Bring eDx back as eBx
            mov     ax,0fffch
            sub     edx,ebx
            cmp     dx,0fffeh
            cmovnc  edx,eax
            mov     [ecx],dx
            inc     edx
            inc     edx
            mov     [ecx+2],dx
    .end:
            pop     ebx
            ret     8
    display ($-RtlInitUnicodeString2)/10+30h,($-RtlInitUnicodeString2) mod 10+30h,13,10
..or..
Code:
    use32
    RtlInitUnicodeString2:
            push    ebx
            mov     edx,[esp+12]            ;;wide char str
            mov     ecx,[esp+8]             ;;buff
            xor     eax,eax
            mov     dword[ecx],eax
            mov     [ecx+4],edx
            test    edx,edx
            jz      .end
            push    edx                     ; Remember edx
    .lp:
            add     edx,4
            mov     eax, [edx-4]
            mov     ebx, eax
            not     eax
            sub     ebx, 00010001h          ; You may try mmx,xmm here
            and     eax, 80008000h          ; 2x, 4x longer respectively
            and     eax, ebx
            jz      .lp
    .okok:
            rol     eax,2
            lea     edx,[edx+eax-4]
            pop     ebx                     ; Bring eDx back as eBx
            mov     ax,0fffch
            sub     edx,ebx
            cmp     dx,0fffeh
            cmovnc  edx,eax
            mov     [ecx],dx
            inc     edx
            inc     edx
            mov     [ecx+2],dx
    .end:
            pop     ebx
            ret     8
    display ($-RtlInitUnicodeString2)/10+30h,($-RtlInitUnicodeString2) mod 10+30h,13,10
And there is a 79-byte variant that I didn't test. I hope it will work in all cases:
Code:
    use32
    RtlInitUnicodeString2:
            push    ebx
            mov     edx,[esp+12]            ;;wide char str
            mov     ecx,[esp+8]             ;;buff
            xor     eax,eax
            mov     dword[ecx],eax
            mov     [ecx+4],edx
            test    edx,edx
            jz      .end
            push    edx                     ; Remember edx
    .lp:
            add     edx,4
            mov     eax, [edx-4]
            mov     ebx, eax
            not     eax
            sub     ebx, 00010001h          ; You may try mmx,xmm here
            and     eax, 80008000h          ; 2x, 4x longer respectively
            and     eax, ebx
            jz      .lp
    .okok:
            rol     eax,1
            lea     edx,[edx+eax*2-4]
            pop     ebx                     ; Bring eDx back as eBx
            mov     ax,0fffch
            sub     edx,ebx
            cmp     dx,0fffeh
            cmovnc  edx,eax
            mov     [ecx],dx
            inc     edx
            inc     edx
            mov     [ecx+2],dx
    .end:
            pop     ebx
            ret     8
    display ($-RtlInitUnicodeString2)/10+30h,($-RtlInitUnicodeString2) mod 10+30h,13,10
r22 25 Jan 2006, 00:28
All great benchmark-wise, but the original is too small (58 bytes) for any of them to be patched over it. We need to come up with an optimization of the code below that assembles to the same size or smaller.
Code:
    .text:7C9012D6                 push    edi
    .text:7C9012D7                 mov     edi, [esp+arg_4]
    .text:7C9012DB                 mov     edx, [esp+arg_0]
    .text:7C9012DF                 mov     dword ptr [edx], 0
    .text:7C9012E5                 mov     [edx+4], edi
    .text:7C9012E8                 or      edi, edi
    .text:7C9012EA                 jz      short loc_7C90130E
    .text:7C9012EC                 or      ecx, 0FFFFFFFFh
    .text:7C9012EF                 xor     eax, eax
    .text:7C9012F1                 repne scasw
    .text:7C9012F4                 not     ecx
    .text:7C9012F6                 shl     ecx, 1
    .text:7C9012F8                 cmp     ecx, 0FFFEh
    .text:7C9012FE                 jbe     short loc_7C901305
    .text:7C901300                 mov     ecx, 0FFFEh
    .text:7C901305
    .text:7C901305 loc_7C901305:                           ; CODE XREF: RtlInitUnicodeString+28j
    .text:7C901305                 mov     [edx+2], cx
    .text:7C901309                 dec     ecx
    .text:7C90130A                 dec     ecx
    .text:7C90130B                 mov     [edx], cx
    .text:7C90130E
    .text:7C90130E loc_7C90130E:                           ; CODE XREF: RtlInitUnicodeString+14j
    .text:7C90130E                 pop     edi
    .text:7C90130F                 retn    8
1310h - 12D6h = 3Ah = 58 bytes long
r22 07 Feb 2006, 22:19
Quick update:
Win32's kernel doesn't have the same issue as Win64's, so patching it wouldn't improve speed to the same extent. The tested instructions on HOW TO PATCH and the working patched DLL for Win XP64 are in this thread (a page or so back), but I won't be releasing any more patches for Win XP 32 or 64. Can't wait until the final release of Vista 64; that should be fun to decompile.
okasvi 07 Feb 2006, 22:25
r22 wrote:
Can't wait until a final release of Vista 64, that should be fun to decompile.
While you are at it, you might want to fix some bugs & possible security holes in it too?
_________________
When We Ride On Our Enemies
support reverse smileys |:
asmfan 07 Mar 2008, 20:47
2+ years later.
Any progress on improving Vista?) Any new optimized functions? BTW, very interesting thread.
r22 07 Mar 2008, 21:11
Unfortunately, I skipped the Vista upgrade path.
And sadly, no longer being a student means I don't have as much free time to scroll through decompiled DLLs and look for code to optimize. I may look into optimizing 64-bit Linux, since recompiling the kernel isn't that big of an issue.
f0dder 08 Mar 2008, 12:55
The patching method should be extended to load a support DLL, instead of overwriting the original code (and thus being severely limited in what you can do)...
Too bad you won't have any success with kernel patching, because of PatchGuard.
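The post doesn't say how the redirect into a support DLL would be done. One common approach, assumed here purely for illustration, is to overwrite the start of the original function with a 5-byte near jump (opcode E9 followed by a 32-bit displacement) to a replacement that lives elsewhere, which removes the size limit on the replacement. The sketch only builds the patch bytes for 32-bit code; `encode_jmp_rel32` is an invented name, and loading the DLL and making the code page writable are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Builds a 5-byte x86 near jump: E9 followed by a little-endian rel32.
   from = address where the patch bytes will be placed,
   to   = address of the replacement function.
   The displacement is relative to the end of the 5-byte instruction. */
static void encode_jmp_rel32(uint8_t patch[5], uint32_t from, uint32_t to)
{
    assert(patch != NULL);
    uint32_t rel = to - (from + 5u);  /* wraps modulo 2^32, like 32-bit address arithmetic */
    patch[0] = 0xE9;
    patch[1] = (uint8_t)(rel);
    patch[2] = (uint8_t)(rel >> 8);
    patch[3] = (uint8_t)(rel >> 16);
    patch[4] = (uint8_t)(rel >> 24);
}
```

A jump from 1000h to 2000h encodes as E9 FB 0F 00 00. The 58-byte original discussed earlier in the thread has room to spare for such a stub.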
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.