flat assembler
Message board for the users of flat assembler.

Patching system dlls

LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 16 Dec 2005, 16:02
The problem with SSE1/2/3 is that it doesn't have REAL10 ("long double" in C); it only has float and double (single and double precision, respectively).

Has anyone tried writing a C program for Win64 that uses "long double" and disassembling it to check whether the program really uses "long double"?
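A quick first check, before reaching for a disassembler, is to ask the compiler how wide it makes `long double`. The sketch below is only an illustration and the function names are made up for it. A result of 64 significand bits would mean the x87 80-bit REAL10 format; 53 would mean the compiler treats long double as plain double, which is what Microsoft's compiler is documented to do.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* Number of significand bits the compiler gives long double.
   64 -> x87 80-bit extended format (REAL10); 53 -> same as double. */
int long_double_mantissa_bits(void)
{
    return LDBL_MANT_DIG;
}

/* Nonzero when long double carries more precision than double. */
int long_double_is_wider_than_double(void)
{
    return LDBL_MANT_DIG > DBL_MANT_DIG;
}

/* Print a one-line report; call it from main(). */
void report_long_double(void)
{
    /* The C standard guarantees long double is at least as precise as double. */
    assert(LDBL_MANT_DIG >= DBL_MANT_DIG);
    printf("sizeof(long double) = %u bytes, %d significand bits (%s)\n",
           (unsigned)sizeof(long double),
           long_double_mantissa_bits(),
           long_double_is_wider_than_double() ? "wider than double"
                                              : "same as double");
}
```

Even when the type is wide, only a disassembly shows whether the generated code really uses x87 instructions for it.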
MazeGen



Joined: 06 Oct 2003
Posts: 977
Location: Czechoslovakia
MazeGen 16 Dec 2005, 16:23
You're right... I have never used SSE...

Anyway, what does bit 12, PM (Precision Mask), in the MXCSR register actually mean? Which precision does this bit influence, and how?
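For reference, in the MXCSR layout the "precision" in PM refers to the precision (inexact result) exception, not to the width of the computation: bit 12 masks that exception and bit 5 (PE) is its sticky status flag. The sketch below only decodes an MXCSR value passed in as an integer; the helper names are invented for this example, and 1F80h is the documented power-on default with all six exceptions masked.

```c
#include <assert.h>
#include <stdint.h>

#define MXCSR_PE_BIT   5u        /* Precision Exception status flag (sticky) */
#define MXCSR_PM_BIT   12u       /* Precision Mask: 1 = exception is masked  */
#define MXCSR_DEFAULT  0x1F80u   /* power-on value: bits 7..12 (all masks) set */

/* Nonzero when the inexact-result exception is masked (no trap is raised). */
int mxcsr_precision_masked(uint32_t mxcsr)
{
    return (int)((mxcsr >> MXCSR_PM_BIT) & 1u);
}

/* Nonzero when some earlier SSE operation produced an inexact result. */
int mxcsr_precision_flag(uint32_t mxcsr)
{
    return (int)((mxcsr >> MXCSR_PE_BIT) & 1u);
}
```

So with PM set, a rounded result just sets PE and execution continues; with PM clear, the same operation raises a SIMD floating-point exception.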
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 16 Dec 2005, 16:52
mmm wait guys, I'm finding some pages that say the FPU/MMX state is saved, BUT not for drivers. Check the links below:

http://en.wikipedia.org/wiki/Talk:AMD64

http://www.freepascal.org/wiki/index.php/Win64/AMD64_API#About_x87_and_mmx_registers <- It says here that the FP registers are preserved across context switches but destroyed by function calls
r22



Joined: 27 Dec 2004
Posts: 805
r22 16 Dec 2005, 18:34
That memcpy link is for 32bit and doesn't fare well in 64bit land.
MMX movq is slower than 64bit mov reg64,mem64.
I guess AMD left the 64bit memcpy to gcc and ms lib coders.

As for the FPU in 64bit, you can always roll the dice and run FPU opcodes; just make sure you save the output and don't leave it on the FPU stack or in the FPU regs. I've never needed real10 precision though.

Back On Topic:
I think RtlInitUnicodeString was it for ntdll; in kernel32.dll (64bit) that function is called 260 times. The 2nd most called function is AllocateHeap, which I've looked at (and can't seem to find any major lack of optimization in). A lot of the other string functions in ntdll use slow scas and movsb/w/d/q instructions, but they aren't used often enough to make any difference.

My next 64bit kernel optimization step will be to decompile drivers and see if they also use RtlInitUnicodeString and, if so, patch NTOSKRNL.EXE with my faster version (this way drivers will also get a speed boost).
Feryno



Joined: 23 Mar 2005
Posts: 514
Location: Czech republic, Slovak republic
Feryno 19 Dec 2005, 06:25
Off the topic of patching DLLs, but maybe it helps -

There is space to store the FPU/mm regs when switching context.
I am going to try it on my AMD64 at home to see whether it really works or is only wasted space in the context structure... (It's not wasted time for me, I am going to implement FPU/mm support in my project.) E.g. the Microsoft team forgot to implement setting bit 1 of the flags, and I can't be sure they didn't forget the same for the FPU/mm regs... so I have to try it.

You have to specify the flag CONTEXT_MMX_REGISTERS to obtain FPU/mm/xmm in the Context structure.

winnt.h:
......
typedef struct DECLSPEC_ALIGN(16) _CONTEXT {
....
//
// Program counter.
//

DWORD64 Rip;

//
// Floating point state.
//

union {
XMM_SAVE_AREA32 FltSave;
struct {
M128A Header[2];
M128A Legacy[8]; // <- Here it is, isn't it?
M128A Xmm0;
M128A Xmm1;
M128A Xmm2;
M128A Xmm3;
M128A Xmm4;
M128A Xmm5;
M128A Xmm6;
M128A Xmm7;
M128A Xmm8;
M128A Xmm9;
M128A Xmm10;
M128A Xmm11;
M128A Xmm12;
M128A Xmm13;
M128A Xmm14;
M128A Xmm15;
};
};

//
// Vector registers.
//
....
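The anonymous struct in that union follows the FXSAVE memory image: a 32-byte header, then eight 16-byte slots for ST0-ST7 (whose low 64 bits double as MM0-MM7), then the XMM registers. The sketch below rebuilds just that part with stand-in types so the offsets can be checked on any compiler; `M128A_sketch`, `FloatSaveSketch` and the helper functions are names made up for this example, not winnt.h definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for M128A: one 16-byte slot. */
typedef struct {
    uint64_t Low;
    int64_t  High;
} M128A_sketch;

/* Stand-in for the named part of the floating point state. */
typedef struct {
    M128A_sketch Header[2];   /* bytes   0..31 : control/status words, MXCSR */
    M128A_sketch Legacy[8];   /* bytes  32..159: ST0-ST7 / MM0-MM7           */
    M128A_sketch Xmm[16];     /* bytes 160..415: XMM0-XMM15                  */
} FloatSaveSketch;

/* Byte offset of x87/MMX register n inside the save area. */
size_t legacy_slot_offset(unsigned n)
{
    assert(n < 8);
    return offsetof(FloatSaveSketch, Legacy) + n * sizeof(M128A_sketch);
}

/* MMn aliases the 64-bit significand of STn, i.e. the low half of the slot. */
uint64_t mmx_value_from_slot(const M128A_sketch *slot)
{
    return slot->Low;
}
```

If the context really is filled in, reading `Legacy[n].Low` after an MMX store should give back the MMn value, which is one way to run the experiment described above.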

I never found that the API preserves fpu/mm regs (and I traced kernel APIs in my early win64 asm days much, much, much...)


Attachment: winddk.zip (2.37 KB), win64 context structure from winnt.h DDK file

r22



Joined: 27 Dec 2004
Posts: 805
r22 22 Jan 2006, 05:18
I tried patching NTOSKRNL.EXE, no luck. Disabling all the protections and whatnot doesn't help (the protection must be hard-coded into the OS).

If anyone has the bandwidth to upload me the NTDLL.DLL file that comes with winXP sp2 32bit I'll patch it with the modified RtlInitUnicodeString function and you can follow the instructions a page back on how to get it running in your system.
decard



Joined: 11 Sep 2003
Posts: 1092
Location: Poland
decard 22 Jan 2006, 08:44
r22



Joined: 27 Dec 2004
Posts: 805
r22 23 Jan 2006, 06:38
I got some mixed results in the 32bit version.
It seems the REP prefix + scasw method is faster for long unicode strings and slower for smaller ones.

So my optimization that works 2x faster in the 64bit version only works 25% faster in the 32bit version AND only for small unicode strings. The loop unroll works slower than the rep + scasw for larger strings (in the 32bit version).

While realistically the kernel won't be passing giant unicode strings to the RtlInitUnicodeString api, it still makes me reluctant to patch it since the patched version isn't faster in ALL cases.

Here's the source code that I'm using to benchmark.
RtlInitUnicodeString1 is the ntdll.dll api (yoinked from a debugger)
RtlInitUnicodeString2 is the "optimized" version (port from 64to32bit)

Code:
format PE console
entry start

include 'c:\FASM\INCLUDE\win32a.inc'

section '.code' code readable executable
start:
        call    MakeSeed ;if anything needs to be random
        call    [GetCurrentProcess] ;returns -1
        push    100h ;realtime
        push    eax
        call    [SetPriorityClass]
        call    [GetCurrentThread];;returns -2
        push    15
        push    eax
        call    [SetThreadPriority]
        push    _running
        call    [printf]
        add     esp,4 ;printf is cdecl, the caller removes the argument
        push    7FFFh ;;loop counter
        call    BenchIt
        push    0
        push    0
        push    0
        push    0
        call    [MessageBox]
        push    0
        call    [ExitProcess]

BenchIt:
        push    ebp
        mov     ebp,esp
        push    ebx
        push    esi
        mov     ebx,[ebp+8];;loop count
        call    [GetTickCount]
        mov     esi,eax
align 8
    .lp1:
        push    TestString3
        push    TestBuffer
        call    RtlInitUnicodeString1
        dec     ebx
        jnz     .lp1
        call    [GetTickCount]
        sub     eax,esi
        push    eax
        push    _func1
        call    [printf]
        add     esp,8 ;printf is cdecl, the caller removes the arguments
        ;;;;;;;;;;;;;;;
        push    dword[TestBuffer]
        push    _fmth
        call    [printf]
        add     esp,8
        ;;;;;;;;;;;;;;;;;;;
        mov     ebx,[ebp+8]
        call    [GetTickCount]
        mov     esi,eax
align 8
    .lp2:
        push    TestString3
        push    TestBuffer
        call    RtlInitUnicodeString2
        dec     ebx
        jnz     .lp2
        call    [GetTickCount]
        sub     eax,esi
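        push    eax
        push    _func2
        call    [printf]
        add     esp,8 ;printf is cdecl, the caller removes the arguments
        ;;;;;;;;;;;;;;;
        push    dword[TestBuffer]
        push    _fmth
        call    [printf]
        add     esp,8 ;keeps esp right for the pop esi / pop ebx below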
        ;;;;;;;;;;;;;;;;;;;
        pop     esi
        pop     ebx
        mov     esp,ebp
        pop     ebp
        ret     4

align 16
RtlInitUnicodeString1:
        push    edi
        mov     edi, [esp+12]
        mov     edx, [esp+8]
        mov     dword [edx], 0
        mov     [edx+4], edi
        or      edi, edi
        jz      loc_7C90130E
        or      ecx, 0FFFFFFFFh
        xor     eax, eax
        repne scasw
        not     ecx
        shl     ecx, 1
        cmp     ecx, 0FFFEh
        jbe     loc_7C901305
        mov     ecx, 0FFFEh
 loc_7C901305:                           ; CODE XREF: RtlInitUnicodeString+28j
        mov     [edx+2], cx
        dec     ecx
        dec     ecx
        mov     [edx], cx
 loc_7C90130E:                           ; CODE XREF: RtlInitUnicodeString+14j
        pop     edi
        retn    8

align 16
RtlInitUnicodeString2:
        push    ebx
        mov     edx,[esp+12];;wide char str
        mov     ecx,[esp+8];;buff
        mov     dword[ecx],00h
        mov     [ecx+4],edx
        test    edx,edx
        jz      .end
        mov     ebx,edx
   .lp:
        movzx   eax,word[edx]
        add     edx,2
        test    eax,eax
        jz      .okok
        movzx   eax,word[edx]
        add     edx,2
        test    eax,eax
        jz      .okok
        jmp     .lp
   .okok:
        sub     edx,2
        mov     eax,0fffch
        sub     edx,ebx
        cmp     edx,0fffeh
        cmovnb  edx,eax
        mov     word[ecx],dx
        add     dx,2
        mov     word[ecx+2],dx
   .end:
        pop     ebx
        ret     8

RandomFile:
  push 0
  push 0
  push 2 ;always create
  push 0
  push 2 ;shared write access
  push 40000000h ;generic write
  push fName
  call [CreateFile]
  mov esi,eax
  mov ebx,250000 ;size
.tst:
  call Random32
  mov dword[dwf],eax
  push 0
  push wrr
  push 4
  push dwf
  push esi
  call [WriteFile]
  dec ebx
  jnz .tst

  .ending:
     push 0
     call [ExitProcess]


Random32:
     push ebx
     mov eax,[RandomSeed]
     mov ebx,[RandomSeed+4]
     mov ecx,[RandomSeed+8]
     mov edx,[RandomSeed+12]
     shld ebx,eax,1
     adc eax,0
     ror eax,3
     bswap eax
     shld edx,ecx,1
     adc ecx,0
     bswap ecx
     ror ecx,7
     mov [RandomSeed],eax
     mov [RandomSeed+4],ebx
     mov [RandomSeed+8],ecx
     mov [RandomSeed+12],edx
   add eax,ecx
     pop ebx
     ret 0

SetSeed:
.seed equ esp+4 ;,+8,+12,+16
     movdqu xmm0,[.seed]
     movntdq dqword[RandomSeed],xmm0
     ret 16

MakeSeed:
     rdtsc
     mov edx,eax
     call [GetTickCount]
     mov ecx,eax
     mul edx
     mov [RandomSeed],eax
     xor edx,ecx
     mov [RandomSeed+4],edx
     bswap ecx
     xor eax,ecx
     mov [RandomSeed+8],eax
     not edx
     bswap edx
     mul edx
     mov [RandomSeed+12],eax
     ret 0



section '.data' data readable writeable
_fmth db '%x',13,10,0
_running db 'Benchmark Started',13,10,0
_func1 db 'Function1 time: %lu',13,10,0
_func2 db 'Function2 time: %lu',13,10,0
fmt db 'Lea: %lu  Sub: %lu',0
buffer rb 0ffh
fName db 'ENTr22.r22',0
dwf dd 0
wrr dd 0
   TestString dw 10 dup(9100h), 0h
   TestString2 dw 11 dup(9200h), 0h
   TestString3 dw 32767 dup(9300h), 0h
   TestString4 dw 32766 dup(9400h), 0h
   TestString5 dw 32765 dup(9500h), 0h
   TestString6 dw 65535 dup(9800h), 0h

   TestSrc dd 4096 dup(61626364h)
align 16
   TestBuffer rd 4096

align 16
RandomSeed dd 1318699, 1015727, 1235239, 412943


section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL',\
          msvcrt,'MSVCRT.DLL',\
          user32,'USER32.DLL'
      include  "c:\FASM\INCLUDE\apia\kernel32.inc"
      include  "c:\FASM\INCLUDE\apia\user32.inc"


  import msvcrt,\
         printf,'printf'

section '.reloc' fixups data discardable
    

-sorry about the "c:\FASM\INCLUDE"

Just goes to show crazy speed optimizations for 64bit don't always hold up in 32bit land.

If someone can come up with a version that works faster in ALL situations AND is the same size or smaller in bytes, when assembled, as the original function I'll patch it.
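For anyone trying a replacement, here is a C sketch of the behaviour both listings have to reproduce, read off the ntdll disassembly above: the size including the terminator is capped at 0FFFEh and stored as MaximumLength, and Length is two bytes less. The struct and function names are invented for this example and only mirror the fields the assembly writes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the fields written by the routine (illustrative type). */
typedef struct {
    uint16_t Length;          /* bytes, terminator not counted */
    uint16_t MaximumLength;   /* bytes, terminator counted     */
    const uint16_t *Buffer;   /* the source string itself      */
} UnicodeStringSketch;

void init_unicode_string_sketch(UnicodeStringSketch *dst, const uint16_t *src)
{
    dst->Length = 0;
    dst->MaximumLength = 0;
    dst->Buffer = src;
    if (src == NULL)
        return;                          /* same as the jz to the exit label */

    const uint16_t *p = src;
    while (*p++ != 0) { }                /* what repne scasw does */

    uint32_t bytes = (uint32_t)(p - src) * 2u;   /* includes the terminator */
    if (bytes > 0xFFFEu)
        bytes = 0xFFFEu;                 /* cap so both fields fit in 16 bits */
    dst->MaximumLength = (uint16_t)bytes;
    dst->Length = (uint16_t)(bytes - 2u);
}

/* The dword the benchmark prints with %x: MaximumLength in the high word. */
uint32_t packed_lengths(const UnicodeStringSketch *s)
{
    return ((uint32_t)s->MaximumLength << 16) | s->Length;
}
```

Comparing `packed_lengths` against the hex value each benchmark run prints is a cheap way to confirm that a candidate still agrees with the original for a given test string.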
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 23 Jan 2006, 07:53
On a 3GHz P4 HT I got the following results:
Code:
C:\Documents and Settings\Madisk\Desktop\FASM>test
Benchmark Started
Function1 time: 2359
fffefffc
Function2 time: 1532
fffefffc
    

Other specs:
http://valid.x86-secret.com/show_oc?id=66171

EDIT:
How about this:
Code:
    RtlInitUnicodeString2:
            push    ebx
            mov     edx,[esp+12];;wide char str
            mov     ecx,[esp+8];;buff
            mov     dword[ecx],00h
            mov     [ecx+4],edx
            test    edx,edx
            jz      .end

            push    edx                      ; Remember edx
            sub     edx,4
       .lp:
            add     edx,4
            mov     eax, [edx]
            mov     ebx, eax
            not     eax
            sub     ebx, 00010001h           ; You may try mmx,xmm here
            and     eax, 80008000h           ; 2x, 4x longer respectively
            and     eax, ebx
            jz      .lp
       .okok:
            rol     eax,2                    ; Make it correct, because we
            add     edx,eax                  ; checked by 4-bytes before
            pop     ebx                      ; Bring eDx back as eBx
            mov     ax,0fffch
            sub     edx,ebx
            cmp     dx,0fffeh
            cmovnc  edx,eax
            mov     [ecx],dx
            add     edx,2
            mov     [ecx+2],dx
       .end:
            pop     ebx
            ret     8
display ($-RtlInitUnicodeString2)/10+30h,($-RtlInitUnicodeString2) mod 10+30h,13,10
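The dword test inside the loop above is the familiar "is any lane zero" bit trick applied to two 16-bit words at once. The C sketch below restates it with made-up function names. One property worth knowing: the flag for the low word is always exact, but the flag for the high word can also be set when the low word is zero and the high word is 0 or 1 (the borrow leaks across), so the sketch looks at the low flag first. A rotate-based adjustment like the one in the listing appears to add 2 in that both-flags case, which would be worth testing with a string that is followed by a zero word.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 15 set: low word is zero. Bit 31 set: high word is zero,
   or low word is zero and high word is 0 or 1 (borrow from the low lane). */
uint32_t zero_word_flags(uint32_t x)
{
    return (x - 0x00010001u) & ~x & 0x80008000u;
}

/* Byte offset (0 or 2) of the first zero 16-bit word in a little-endian
   dword, or -1 when neither word is zero. */
int first_zero_word_offset(uint32_t x)
{
    uint32_t f = zero_word_flags(x);
    if (f & 0x00008000u)
        return 0;    /* low flag is exact */
    if (f & 0x80000000u)
        return 2;    /* exact here because the low word is known nonzero */
    return -1;
}
```

Checking the low lane first costs a branch or a couple of extra instructions in assembly, so it works against the size budget; whether that trade is acceptable depends on how the patch is applied.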
    

Results:
Code:
Benchmark Started (TestSrc)
Function1 time: 563
4010400e
Function2 time: 422
4010400e

Benchmark Started (Test6)
Function1 time: 4656
fffefffc
Function2 time: 4031
fffefffc

Benchmark Started (Test5)
Function1 time: 2359
fffcfffa
Function2 time: 2016
fffcfffa

Benchmark Started (Test4)
Function1 time: 2359
fffefffc
Function2 time: 1422
fffefffc

Benchmark Started (Test3)
Function1 time: 2359
fffefffc
Function2 time: 1422
fffefffc

Benchmark Started (Test2)
Function1 time: 0
180016
Function2 time: 0
180016

Benchmark Started (Test)
Function1 time: 0
160014
Function2 time: 0
160014
    
r22



Joined: 27 Dec 2004
Posts: 805
r22 24 Jan 2006, 03:16
I stayed away from using dword reads because I couldn't be sure about the memory the wide char string is in. If the wide char str ptr is not 4-byte aligned, it could read outside the page. But only testing will tell if we can get away with it or not.
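The alignment worry can be made concrete with a little address arithmetic. The sketch below assumes 4096-byte pages and uses invented helper names: a 4-byte read touches the next page only when it starts in the last three bytes of a page, which cannot happen for a 4-byte aligned address but can for a string that is only 2-byte aligned.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096u   /* assumed page size */

/* Nonzero when a 4-byte read at addr spills into the following page. */
int dword_read_crosses_page(uintptr_t addr)
{
    return (addr & (PAGE_SIZE_SKETCH - 1u)) > PAGE_SIZE_SKETCH - 4u;
}

/* Nonzero when addr is 4-byte aligned; such reads never cross a page. */
int is_dword_aligned(uintptr_t addr)
{
    return (addr & 3u) == 0;
}
```

For example, a terminator stored in the last word of a page (offset FFEh) is read safely as a word, but a dword read starting there also touches the first two bytes of the next page, which may not be mapped.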

Edit:
After testing it on my comp (AMD X2 3800+), it tied ntdll's original function on every benchmark. I have a feeling my 64bit system can't reliably benchmark 32bit code because of architecture optimizations. Realistically ANY normal read-compare loop should be faster than the REP SCASW instructions (and it is in 64bit mode, but in 32bit it's coming to a draw).

It's 2 bytes too big to be hard patched as a replacement :[
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 24 Jan 2006, 08:46
Is 80 bytes ok?
Code:
use32
    RtlInitUnicodeString2:
            push    ebx
            mov     edx,[esp+12];;wide char str
            mov     ecx,[esp+8];;buff
            xor     eax,eax
            mov     dword[ecx],eax
            mov     [ecx+4],edx
            test    edx,edx
            jz      .end

            push    edx                      ; Remember edx
            sub     edx,4
       .lp:
            add     edx,4
            mov     eax, [edx]
            mov     ebx, eax
            not     eax
            sub     ebx, 00010001h           ; You may try mmx,xmm here
            and     eax, 80008000h           ; 2x, 4x longer respectively
            and     eax, ebx
            jz      .lp
       .okok:
            rol     eax,2
            add     edx,eax
            pop     ebx                      ; Bring eDx back as eBx
            mov     ax,0fffch
            sub     edx,ebx
            cmp     dx,0fffeh
            cmovnc  edx,eax
            mov     [ecx],dx
            inc     edx
            inc     edx
            mov     [ecx+2],dx
       .end:
            pop     ebx
            ret     8
display ($-RtlInitUnicodeString2)/10+30h,($-RtlInitUnicodeString2) mod 10+30h,13,10
    


..or..

Code:
use32
    RtlInitUnicodeString2:
            push    ebx
            mov     edx,[esp+12];;wide char str
            mov     ecx,[esp+8];;buff
            xor     eax,eax
            mov     dword[ecx],eax
            mov     [ecx+4],edx
            test    edx,edx
            jz      .end

            push    edx                      ; Remember edx
       .lp:
            add     edx,4
            mov     eax, [edx-4]
            mov     ebx, eax
            not     eax
            sub     ebx, 00010001h           ; You may try mmx,xmm here
            and     eax, 80008000h           ; 2x, 4x longer respectively
            and     eax, ebx
            jz      .lp
       .okok:
            rol     eax,2
            lea     edx,[edx+eax-4]
            pop     ebx                      ; Bring eDx back as eBx
            mov     ax,0fffch
            sub     edx,ebx
            cmp     dx,0fffeh
            cmovnc  edx,eax
            mov     [ecx],dx
            inc     edx
            inc     edx
            mov     [ecx+2],dx
       .end:
            pop     ebx
            ret     8
display ($-RtlInitUnicodeString2)/10+30h,($-RtlInitUnicodeString2) mod 10+30h,13,10
    


And there is a 79 byte variant that I didn't test. I hope it will work in all cases:
Code:
use32
    RtlInitUnicodeString2:
            push    ebx
            mov     edx,[esp+12];;wide char str
            mov     ecx,[esp+8];;buff
            xor     eax,eax
            mov     dword[ecx],eax
            mov     [ecx+4],edx
            test    edx,edx
            jz      .end

            push    edx                      ; Remember edx
       .lp:
            add     edx,4
            mov     eax, [edx-4]
            mov     ebx, eax
            not     eax
            sub     ebx, 00010001h           ; You may try mmx,xmm here
            and     eax, 80008000h           ; 2x, 4x longer respectively
            and     eax, ebx
            jz      .lp
       .okok:
            rol     eax,1
            lea     edx,[edx+eax*2-4]
            pop     ebx                      ; Bring eDx back as eBx
            mov     ax,0fffch
            sub     edx,ebx
            cmp     dx,0fffeh
            cmovnc  edx,eax
            mov     [ecx],dx
            inc     edx
            inc     edx
            mov     [ecx+2],dx
       .end:
            pop     ebx
            ret     8
display ($-RtlInitUnicodeString2)/10+30h,($-RtlInitUnicodeString2) mod 10+30h,13,10
    
r22



Joined: 27 Dec 2004
Posts: 805
r22 25 Jan 2006, 00:28
All great benchmark-wise, but the original is too small (58 bytes) for any of these to be patched over it. We need to come up with an optimization for the code below that assembles to the same size or smaller.

Code:
.text:7C9012D6                 push    edi
.text:7C9012D7                 mov     edi, [esp+arg_4]
.text:7C9012DB                 mov     edx, [esp+arg_0]
.text:7C9012DF                 mov     dword ptr [edx], 0
.text:7C9012E5                 mov     [edx+4], edi
.text:7C9012E8                 or      edi, edi
.text:7C9012EA                 jz      short loc_7C90130E
.text:7C9012EC                 or      ecx, 0FFFFFFFFh
.text:7C9012EF                 xor     eax, eax
.text:7C9012F1                 repne scasw
.text:7C9012F4                 not     ecx
.text:7C9012F6                 shl     ecx, 1
.text:7C9012F8                 cmp     ecx, 0FFFEh
.text:7C9012FE                 jbe     short loc_7C901305
.text:7C901300                 mov     ecx, 0FFFEh
.text:7C901305
.text:7C901305 loc_7C901305:                           ; CODE XREF: RtlInitUnicodeString+28j
.text:7C901305                 mov     [edx+2], cx
.text:7C901309                 dec     ecx
.text:7C90130A                 dec     ecx
.text:7C90130B                 mov     [edx], cx
.text:7C90130E
.text:7C90130E loc_7C90130E:                           ; CODE XREF: RtlInitUnicodeString+14j
.text:7C90130E                 pop     edi
.text:7C90130F                 retn    8
    

1310h - 12D6h = 3Ah = 58 bytes long
r22



Joined: 27 Dec 2004
Posts: 805
r22 07 Feb 2006, 22:19
Quick Update:

Win32's kernel doesn't have the same issue as the Win64's so patching it wouldn't improve speed to the same extent.

The tested instructions on HOW TO PATCH and the working patched dll for WIN xp64 are on this thread (a page or so back), but I won't be releasing any more patches for win xp 32 or 64.

Can't wait until a final release of Vista 64, that should be fun to decompile.
okasvi



Joined: 18 Aug 2005
Posts: 382
Location: Finland
okasvi 07 Feb 2006, 22:25
r22 wrote:
Can't wait until a final release of Vista 64, that should be fun to decompile.


While you are at it, you might want to fix some bugs & possible security holes in it too? :twisted:

_________________
When We Ride On Our Enemies
support reverse smileys |:
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
asmfan 07 Mar 2008, 20:47
2+ years after.
Any progress on improving Vista?) Any new optimized functions? btw very interesting thread.
r22



Joined: 27 Dec 2004
Posts: 805
r22 07 Mar 2008, 21:11
Unfortunately, I skipped the Vista upgrade path :P

And sadly, no longer being a student means I don't have as much free time to scroll through decompiled dll's and look for code to optimize.

I may look into optimizing Linux 64bit, since recompiling the kernel isn't that big of an issue.
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 08 Mar 2008, 12:55
The patching method should be extended to load a support DLL, instead of overwriting the original code (and thus being severely limited in what you can do)...
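The post does not say how the support DLL would be reached. One common approach, shown here purely as an assumption, is to overwrite only the first five bytes of the original function with a near `jmp` (opcode E9h plus a 32-bit displacement) to the replacement, so the replacement itself can be any size. The sketch below just computes those five bytes; the helper names are made up, and loading the DLL and making the page writable are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Encode "jmp rel32" placed at address `from` so that it lands on `to`.
   The displacement is relative to the end of the 5-byte instruction. */
void make_jmp_rel32(uint8_t out[5], uint32_t from, uint32_t to)
{
    uint32_t rel = to - (from + 5u);
    out[0] = 0xE9;                        /* near jmp opcode */
    out[1] = (uint8_t)(rel & 0xFFu);      /* little-endian displacement */
    out[2] = (uint8_t)((rel >> 8) & 0xFFu);
    out[3] = (uint8_t)((rel >> 16) & 0xFFu);
    out[4] = (uint8_t)((rel >> 24) & 0xFFu);
}

/* Decode the target of a "jmp rel32" found at address `from`. */
uint32_t jmp_rel32_target(const uint8_t code[5], uint32_t from)
{
    assert(code[0] == 0xE9);
    uint32_t rel = (uint32_t)code[1]
                 | ((uint32_t)code[2] << 8)
                 | ((uint32_t)code[3] << 16)
                 | ((uint32_t)code[4] << 24);
    return from + 5u + rel;
}
```

With the 32bit function from this thread at 7C9012D6h and a hypothetical replacement at 10001000h, the five bytes fit easily inside the original 58, which removes the size limit discussed earlier for user-mode DLLs.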

Too bad you won't have any success with kernel patching, because of patchguard.

Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
