flat assembler
Message board for the users of flat assembler.

Index > Windows > Patching system dlls

LocoDelAssembly 06 Dec 2005, 12:29
gunblade, I know he is patching NTDLL.DLL. I'm just proposing to optimize it in a legal way so he can publish his work with us :D

Distributing a patcher for a system DLL is illegal (I suppose), but distributing a new system DLL that relies on the original for the functions it does not reimplement may be legal.

I will try something along these lines, but right now I can't use my Athlon64: my WD-2000JS hard disk has died :( with all my un-backed-up data on it :( :(

r22 07 Dec 2005, 06:14
I disabled chkdsk and got rid of the dllcache folder.
But I can't find a method for disabling SFC on win64;
the only things I've found are how to disable it on WinXP and XP SP2 (which use the older 32-bit sfc_os DLL).

Another annoying fact about win64: you can't install the recovery console to the HD to make it bootable, so every time I mess around I have to wait for the CD recovery console to copy files and initialize.

I need to learn how MS updates their DLL files while avoiding all this mess.

gunblade 07 Dec 2005, 12:25
Well, you could download a service pack and examine the differences between the old and new DLLs, although there might not be a service pack for win64 yet. Another thing: have you checked that the code actually works for all input? Maybe Windows uses the new function for something during bootup, and if there's a bug it crashes, hence the BSOD. Not doubting your coding correctness, it's just a possibility :)

Matrix 07 Dec 2005, 16:17
You must be really bored if you are willing to patch Windows DLLs ;>>

It's a never-ending job.

Torrey 08 Dec 2005, 04:40
I haven't worked with a 64-bit system yet, but is there 64-bit code within ntdll.dll ? I'm also not certain of this, but could it be possible that the OS boots in 32-bit protected mode and uses the ntdll.dll before switching over to 64-bit? This would cause dll problems since those opcodes aren't available then.

[edit]

Quote:
WOW64 Implementation Details
The WOW64 emulator runs in user mode, provides an interface between the 32-bit version of Ntdll.dll and the kernel of the processor, and it intercepts kernel calls. The emulator consists of the following DLLs:



- Wow64.dll provides the core emulation infrastructure and the thunks for the Ntoskrnl.exe entry-point functions.
- Wow64Win.dll provides thunks for the Win32k.sys entry-point functions.
- Wow64Cpu.dll provides x86 instruction emulation on Itanium processors. It executes mode-switch instructions on the processor. This DLL is not necessary for x64 processors because they execute x86-32 instructions at full clock speed.

Along with the 64-bit version of Ntdll.dll, these are the only 64-bit binaries that can be loaded into a 32-bit process.

At startup, Wow64.dll loads the x86 version of Ntdll.dll and runs its initialization code, which loads all necessary 32-bit DLLs. Almost all 32-bit DLLs are unmodified copies of 32-bit Windows binaries. However, some of these DLLs are written to behave differently on WOW64 than they do on 32-bit Windows, usually because they share memory with 64-bit system components. All user mode address space above the 32-bit limits (2 GB for most applications, 4 GB for applications marked with the IMAGE_FILE_LARGE_ADDRESS_AWARE flag in the image header) is reserved by the system.

Instead of using the x86 system-service call sequence, 32-bit binaries that make system calls are rebuilt to use a custom calling sequence. This new sequence is inexpensive for WOW64 to intercept because it remains entirely in user mode. When the new calling sequence is detected, the WOW64 CPU transitions back to native 64-bit mode and calls into Wow64.dll. Thunking is done in user mode to reduce the impact on the 64-bit kernel, and to reduce the risk of a bug in the thunk that causes a kernel-mode crash, data corruption, or a security hole. The thunks extract arguments from the 32-bit stack, extend them to 64 bits, then make the native system call.


By reading the above it sounds like there are two different versions of ntdll.dll on a 64-bit system.

gunblade 08 Dec 2005, 12:59
You may be on to something Torrey,
r22, see if you can convert your code to 32-bit, just for the time being, and put that into the DLL. Then see if that causes the same problem as before (remember to fix the CRC and such).
Or find a way to print out what mode the cpu is running in.

lilljocke 08 Dec 2005, 17:15
Hi, I like your idea; actually I have thought about it myself. I don't know how to remove Windows File Protection, but I know a simple DOS program that can do it (see the attachment). I hope that program can help you: you can debug it and see what it does. I wonder if I could help you with something, because I am interested in this conversion thing. I know MASM and FASM.


Attachment: Replacer.zip (8.9 KB)
Description: Windows System File Replacer

r22 09 Dec 2005, 02:54
Windows/system32 holds all the 64bit versions of the system dlls
while windows/wow64 holds all the 32bit versions
I'm aware of the ambiguity in the names.

So yes, there are two ntdll.dll files, one in each directory, with essentially the same functions but different code: 64-bit in system32/ntdll.dll and 32-bit in wow64/ntdll.dll.

Right now I'm going to edit 1 byte in the 64-bit ntdll.dll and hex edit the correct CRC into the header.
If I can get this to work with my current replacement method, then I'll have to go back to testing the integrity of my replacement function's code, although my initial tests showed my RtlInitUnicodeString performing exactly the same as ntdll's (only 2.2x faster) in every situation I could come up with.
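
For anyone who wants to recompute the header checksum instead of hex editing it by hand, here is a minimal Python sketch of the usual PE checksum calculation: a 16-bit sum with end-around carry over the whole file, with the CheckSum field treated as zero, plus the file length. The name `pe_checksum` is made up for illustration, and this only approximates what the system's own checksum routine does; it is not the tool used in this thread, so compare its result against an untouched DLL before trusting it.

```python
import struct

def pe_checksum(data: bytes) -> int:
    """Approximate the PE optional-header CheckSum for a file image."""
    # e_lfanew (offset of the "PE\0\0" signature) lives at 0x3C in the DOS header.
    e_lfanew = struct.unpack_from("<I", data, 0x3C)[0]
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("not a PE image")
    # Signature (4) + COFF header (20) + 64 bytes into the optional header.
    field = e_lfanew + 4 + 20 + 64
    buf = bytearray(data)
    buf[field:field + 4] = b"\0\0\0\0"   # the old checksum must not count
    if len(buf) % 2:
        buf.append(0)                    # pad to a whole 16-bit word
    total = 0
    for (word,) in struct.iter_unpack("<H", buf):
        total += word
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    total = (total & 0xFFFF) + (total >> 16)
    return total + len(data)
```

The returned value would be written, little-endian, into the 4 bytes at `e_lfanew + 88`.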

r22 09 Dec 2005, 03:26
*edit for the above post windows/wow64/ I meant windows/SysWow64 directory.

OK, I successfully nop-nop'd a mov rax,rax in ntdll, fixed the checksum in the header, and replaced the original with the patched file. After a restart the patched file is still in place. So there must be an issue either with how I hard-patched the whole function OR with my function in general.

Working NTDLL.DLL RtlInitUnicodeString
Code:
.text:0000000078EF4120                 nop
.text:0000000078EF4121                 nop
.text:0000000078EF4122                 push    rdi
.text:0000000078EF4123                 test    rdx, rdx
.text:0000000078EF4126                 mov     r8, rcx
.text:0000000078EF4129                 mov     word ptr [rcx+2], 0
.text:0000000078EF412F                 mov     word ptr [rcx], 0
.text:0000000078EF4134                 mov     [rcx+8], rdx
.text:0000000078EF4138                 jz      short loc_78EF4172
.text:0000000078EF413A                 xor     eax, eax
.text:0000000078EF413C                 mov     rcx, 0FFFFFFFFFFFFFFFFh
.text:0000000078EF4146                 mov     rdi, rdx
.text:0000000078EF4149                 repne scasw
.text:0000000078EF414C                 mov     eax, 0FFFCh
.text:0000000078EF4151                 not     rcx
.text:0000000078EF4154                 dec     rcx
.text:0000000078EF4157                 add     rcx, rcx
.text:0000000078EF415A                 cmp     rcx, 0FFFEh
.text:0000000078EF4161                 cmovnb  rcx, rax
.text:0000000078EF4165                 mov     [r8], cx
.text:0000000078EF4169                 add     cx, 2
.text:0000000078EF416D                 mov     [r8+2], cx
.text:0000000078EF4172
.text:0000000078EF4172 loc_78EF4172:                           ; CODE XREF: RtlInitUnicodeString+18j
.text:0000000078EF4172                 pop     rdi
.text:0000000078EF4173                 retn
    


My ntdll.dll RtlInitUnicodeString (this one freezes at bootup)
Code:
.text:0000000078EF4120                 xor     r9d, r9d
.text:0000000078EF4123                 test    rdx, rdx
.text:0000000078EF4126                 mov     [rcx], r9d
.text:0000000078EF4129                 mov     [rcx+8], rdx
.text:0000000078EF412D                 jz      short locret_78EF4174
.text:0000000078EF412F                 mov     r8, rdx
.text:0000000078EF4132
.text:0000000078EF4132 loc_78EF4132:                           ; CODE XREF: RtlInitUnicodeString+28j
.text:0000000078EF4132                 movzx   eax, word ptr [rdx]
.text:0000000078EF4135                 add     rdx, 2
.text:0000000078EF4139                 test    eax, eax
.text:0000000078EF413B                 jz      short loc_78EF4150
.text:0000000078EF413D                 movzx   eax, word ptr [rdx]
.text:0000000078EF4140                 add     rdx, 2
.text:0000000078EF4144                 test    eax, eax
.text:0000000078EF4146                 jz      short loc_78EF4150
.text:0000000078EF4148                 jmp     short loc_78EF4132
.text:0000000078EF4148 ; ---------------------------------------------------------------------------
.text:0000000078EF414A                 align 10h
.text:0000000078EF4150
.text:0000000078EF4150 loc_78EF4150:                           ; CODE XREF: RtlInitUnicodeString+1Bj
.text:0000000078EF4150                                         ; RtlInitUnicodeString+26j
.text:0000000078EF4150                 sub     rdx, 2
.text:0000000078EF4154                 mov     eax, 0FFFCh
.text:0000000078EF4159                 sub     rdx, r8
.text:0000000078EF415C                 cmp     rdx, 0FFFEh
.text:0000000078EF4163                 cmovnb  rdx, rax
.text:0000000078EF4167                 mov     [ecx], dx
.text:0000000078EF416B                 add     dx, 2
.text:0000000078EF416F                 mov     [ecx+2], dx
.text:0000000078EF4174
.text:0000000078EF4174 locret_78EF4174:                        ; CODE XREF: RtlInitUnicodeString+Dj
.text:0000000078EF4174                 retn
    


From testing both these functions, they should perform the same task in ALL circumstances. Maybe using R9D in my function causes problems somehow, but I thought R9D didn't have to be saved.
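
As a cross-check for both listings, the behaviour they implement can be written down as a small Python reference model. This is a sketch read off the disassembly above, not Microsoft's source; the function name and the dictionary layout are illustrative only.

```python
def init_unicode_string(source):
    """Model of the Length / MaximumLength values both listings compute."""
    if source is None:
        # Null source: both 16-bit fields stay zero, Buffer is the null pointer.
        return {"Length": 0, "MaximumLength": 0, "Buffer": None}
    # The scan stops at the first zero code unit, like a C wide string.
    text = source.split("\0", 1)[0]
    nbytes = len(text.encode("utf-16-le"))   # characters * 2
    if nbytes >= 0xFFFE:
        nbytes = 0xFFFC                      # the cmovnb clamp in the listing
    return {"Length": nbytes, "MaximumLength": nbytes + 2, "Buffer": source}
```

Feeding the same inputs to a replacement routine and comparing the two 16-bit fields against this model would be one way to test more edge cases, such as strings right at the 0xFFFE-byte boundary.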

Now that I got one replacement to FINALLY work, I'll just incrementally change the bytes until I can get my faster function (then functions) working on my system.

You can all be sure that I'll report any insane gains in system performance :D

OMG, plain as day, did you catch it? MOV [ECX], DX instead of MOV [RCX], DX.
LOL, no wonder my system froze up so hard trying to run that code. How did this code work flawlessly in my tests?!

r22 09 Dec 2005, 05:55
Apparently the same error checking built into win64 that replaces movdqa with movdqu to make it work does the same with mov [ecx],blah, turning it into mov [rcx],blah. So while my buggy function worked fine when I was testing it, during the boot process it would fail miserably.
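
Another possible explanation, offered as an assumption and not something verified in this thread: in 64-bit mode `mov [ecx], dx` is a legal instruction (it carries the 67h address-size prefix), and the CPU forms the address from the low 32 bits of RCX, zero-extended. If the test program kept its UNICODE_STRING below 4 GB, the store would land in the right place anyway; with a pointer above 4 GB it would not. A small Python model of that truncation (the function names and example addresses are illustrative):

```python
def effective_address(base, address_size=64):
    """Address the CPU would use for [base] with the given address size."""
    return base & ((1 << address_size) - 1)

def same_store_target(ptr):
    """True if [ecx] and [rcx] hit the same byte for this pointer value."""
    return effective_address(ptr, 32) == effective_address(ptr, 64)

# A pointer below 4 GB is unaffected by the truncation...
low_ptr = 0x0012FF40
# ...while a pointer above 4 GB loses its upper bits.
high_ptr = 0x000007FFFFFD8000
```

Under this reading, `same_store_target(low_ptr)` holds and `same_store_target(high_ptr)` does not, which would fit code that passes user-mode tests yet fails during boot.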

===================================
HOW TO PATCH WIN XP 64BIT SYSTEM DLLS
===================================
1. Disable CHKDSK
chkntfs /x c:
(replace c: with whatever drive your win64 is installed on)
2. Disable SFC (well, reduce it to a warning message)
Rename your dllcache folder in windows/system32/ to bakdllcache, or delete it at your own risk.
Start | Run | regedt32
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon
Set value SFCDisable (REG_DWORD) to 0xffffff9d
Set value SfcQuota (REG_DWORD) to 0
3. Shut down and restart for good measure.
4. Go to your windows/system32/ folder and rename the DLL you're replacing to dllname.dll.bak (so if you kill your box you can fix it with the recovery console). Right after the rename, PASTE your modified DLL into the directory.
5. SFC will now give you a warning popup message. Click Cancel, then click Yes.
6. Shut down and restart, and hopefully your Windows 64 will load. If not, boot the recovery console, delete the modified file, rename the .bak file to the original name, and restart.
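
The rename-then-paste part of step 4, and the undo from step 6, can be sketched as a pair of small Python helpers. This is only an illustration of the file shuffling, assuming the SFC and CHKDSK steps above are already done; `swap_in` and `roll_back` are hypothetical names, and nothing here deals with file protection or permissions.

```python
import os
import shutil

def swap_in(target, modified):
    """Rename target to target.bak, then copy the modified file into place."""
    backup = target + ".bak"
    if os.path.exists(backup):
        # Refuse to clobber an earlier backup of the original file.
        raise FileExistsError(backup)
    os.rename(target, backup)
    try:
        shutil.copy2(modified, target)
    except OSError:
        os.rename(backup, target)   # put the original back if the copy fails
        raise
    return backup

def roll_back(target):
    """Restore target from target.bak, replacing the modified copy."""
    os.replace(target + ".bak", target)
```

Trying it on a scratch copy of the DLL in a temporary folder first is the safer way to check the helper itself.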

======================================
Why modify your precious win64 kernel ?????????
======================================
JUST FROM REPLACING RTLINITUNICODESTRING WITH A FUNCTION THAT RUNS 2X FASTER, MY SYSTEM NOW BOOTS UP 10% FASTER AND 64BIT APPLICATIONS LOAD 10% FASTER.

The 10% is an estimate (on the low side, since I have no empirical way of testing).

Why RtlInitUnicodeString????????????????????????
Well, when you have free time like me, you decompile windows/system32/kernel32.dll and realize that all the suffix A (ASCII) functions call the suffix W (wide char) functions, and all the suffix W functions call RtlInitUnicodeString. So by replacing this function with a faster one you're essentially optimizing ~50% of the functions in your kernel.

The file has a correct checksum, and if you run a difference comparison between this file and your ntdll.dll (windows/system32/) you'll see that only the 3 bytes of the header checksum and the RtlInitUnicodeString function are changed.
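
A difference comparison like that needs very little code. Here is a minimal Python sketch that reports the byte ranges where two equally sized files differ; `diff_ranges` is an illustrative name, and any ordinary binary diff tool would do the same job.

```python
def diff_ranges(a: bytes, b: bytes):
    """Return [(start, end), ...] half-open offset ranges where a and b differ."""
    if len(a) != len(b):
        raise ValueError("files differ in size")
    ranges = []
    start = None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            if start is None:
                start = i                  # a run of differences begins
        elif start is not None:
            ranges.append((start, i))      # the run ended at the previous byte
            start = None
    if start is not None:
        ranges.append((start, len(a)))     # run reaches the end of the file
    return ranges
```

Reading both DLLs with `open(path, "rb").read()` and passing them in should list one small range inside the header and one range covering the patched function.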


Attachment: ntdll.dll.x86.64.zip (527.02 KB)
Description: This works on my winxp64 and gives me a noticeable boost in speed. Use at your own risk; M$ might freak out, blah blah blah.

Torrey 09 Dec 2005, 06:21
Excellent work, r22! Once I get a new system (if it's 64-bit) I'll give this a test. I doubt Microsoft would freak out unless the patch achieved huge popularity. There are at least two MS employees who check out the FASM boards occasionally, so there is a chance (most likely a small one) that the code could be used.

decard 09 Dec 2005, 17:12
What about 32-bit version? Are you interested in patching it too?

r22 09 Dec 2005, 19:01
NTDLL's memory functions (memmove, memfill) are poorly optimized: they don't even use SSE, and they have a ludicrous amount of branching.

RtlMoveMemory, RtlCopyMemory, memcpy, memmove
are all the same function in ntdll.dll (64bit)
Code:
.text:0000000078EF2680 ; Exported entry 500. RtlCopyMemory
.text:0000000078EF2680 ; Exported entry 755. RtlMoveMemory
.text:0000000078EF2680 ; Exported entry 1328. memcpy
.text:0000000078EF2680 ; Exported entry 1329. memmove
.text:0000000078EF2680
.text:0000000078EF2680 ; =============== S U B R O U T I N E =======================================
.text:0000000078EF2680
.text:0000000078EF2680
.text:0000000078EF2680 ; void *__cdecl memmove(void *,const void *,size_t)
.text:0000000078EF2680                 public memmove
.text:0000000078EF2680 memmove         proc near               ; CODE XREF: RtlDosSearchPath_U-6BFDp
.text:0000000078EF2680                                         ; sub_78ED3BF0-1258Dp ...
.text:0000000078EF2680                 mov     r11, rcx        ; RtlCopyMemory
.text:0000000078EF2680                                         ; RtlMoveMemory
.text:0000000078EF2680                                         ; memcpy
.text:0000000078EF2683                 sub     rdx, rcx
.text:0000000078EF2686                 jbe     loc_78EF2828
.text:0000000078EF268C                 cmp     r8, 8
.text:0000000078EF2690                 jb      short loc_78EF26F4
.text:0000000078EF2692                 test    cl, 7
.text:0000000078EF2695                 jz      short loc_78EF26CE
.text:0000000078EF2697                 test    cl, 1
.text:0000000078EF269A                 jz      short loc_78EF26A8
.text:0000000078EF269C                 mov     al, [rdx+rcx]
.text:0000000078EF269F                 dec     r8
.text:0000000078EF26A2                 mov     [rcx], al
.text:0000000078EF26A4                 add     rcx, 1
.text:0000000078EF26A8
.text:0000000078EF26A8 loc_78EF26A8:                           ; CODE XREF: memmove+1Aj
.text:0000000078EF26A8                 test    cl, 2
.text:0000000078EF26AB                 jz      short loc_78EF26BC
.text:0000000078EF26AD                 mov     ax, [rdx+rcx]
.text:0000000078EF26B1                 sub     r8, 2
.text:0000000078EF26B5                 mov     [rcx], ax
.text:0000000078EF26B8                 add     rcx, 2
.text:0000000078EF26BC
.text:0000000078EF26BC loc_78EF26BC:                           ; CODE XREF: memmove+2Bj
.text:0000000078EF26BC                 test    cl, 4
.text:0000000078EF26BF                 jz      short loc_78EF26CE
.text:0000000078EF26C1                 mov     eax, [rdx+rcx]
.text:0000000078EF26C4                 sub     r8, 4
.text:0000000078EF26C8                 mov     [rcx], eax
.text:0000000078EF26CA                 add     rcx, 4
.text:0000000078EF26CE
.text:0000000078EF26CE loc_78EF26CE:                           ; CODE XREF: memmove+15j
.text:0000000078EF26CE                                         ; memmove+3Fj ...
.text:0000000078EF26CE                 mov     r9, r8
.text:0000000078EF26D1                 shr     r9, 5
.text:0000000078EF26D5                 jnz     short loc_78EF2727
.text:0000000078EF26D7
.text:0000000078EF26D7 loc_78EF26D7:                           ; CODE XREF: memmove+E0j
.text:0000000078EF26D7                 mov     r9, r8
.text:0000000078EF26DA                 shr     r9, 3
.text:0000000078EF26DE                 jz      short loc_78EF26F4
.text:0000000078EF26E0
.text:0000000078EF26E0 loc_78EF26E0:                           ; CODE XREF: memmove+6Ej
.text:0000000078EF26E0                 mov     rax, [rdx+rcx]
.text:0000000078EF26E4                 mov     [rcx], rax
.text:0000000078EF26E7                 add     rcx, 8
.text:0000000078EF26EB                 dec     r9
.text:0000000078EF26EE                 jnz     short loc_78EF26E0
.text:0000000078EF26F0                 and     r8, 7
.text:0000000078EF26F4
.text:0000000078EF26F4 loc_78EF26F4:                           ; CODE XREF: memmove+10j
.text:0000000078EF26F4                                         ; memmove+5Ej
.text:0000000078EF26F4                 test    r8, r8
.text:0000000078EF26F7                 jnz     short loc_78EF2700
.text:0000000078EF26F9                 mov     rax, r11
.text:0000000078EF26FC                 retn
.text:0000000078EF26FC ; ---------------------------------------------------------------------------
.text:0000000078EF26FD                 align 20h
.text:0000000078EF2700
.text:0000000078EF2700 loc_78EF2700:                           ; CODE XREF: memmove+77j
.text:0000000078EF2700                                         ; memmove+8Bj
.text:0000000078EF2700                 mov     al, [rdx+rcx]
.text:0000000078EF2703                 mov     [rcx], al
.text:0000000078EF2705                 inc     rcx
.text:0000000078EF2708                 dec     r8
.text:0000000078EF270B                 jnz     short loc_78EF2700
.text:0000000078EF270D                 mov     rax, r11
.text:0000000078EF2710                 retn
.text:0000000078EF2710 ; ---------------------------------------------------------------------------
.text:0000000078EF2711                 dq 9066666690666666h, 6690666690666666h
.text:0000000078EF2721                 db 2 dup(66h), 90h, 2 dup(66h), 90h
.text:0000000078EF2727 ; ---------------------------------------------------------------------------
.text:0000000078EF2727
.text:0000000078EF2727 loc_78EF2727:                           ; CODE XREF: memmove+55j
.text:0000000078EF2727                 cmp     r9, 800h
.text:0000000078EF272E                 jnb     short loc_78EF2772
.text:0000000078EF2730
.text:0000000078EF2730 loc_78EF2730:                           ; CODE XREF: memmove+DAj
.text:0000000078EF2730                                         ; memmove+F9j
.text:0000000078EF2730                 mov     rax, [rdx+rcx]
.text:0000000078EF2734                 mov     r10, [rdx+rcx+8]
.text:0000000078EF2739                 add     rcx, 20h
.text:0000000078EF273D                 mov     [rcx-20h], rax
.text:0000000078EF2741                 mov     [rcx-18h], r10
.text:0000000078EF2745                 mov     rax, [rdx+rcx-10h]
.text:0000000078EF274A                 mov     r10, [rdx+rcx-8]
.text:0000000078EF274F                 dec     r9
.text:0000000078EF2752                 mov     [rcx-10h], rax
.text:0000000078EF2756                 mov     [rcx-8], r10
.text:0000000078EF275A                 jnz     short loc_78EF2730
.text:0000000078EF275C                 and     r8, 1Fh
.text:0000000078EF2760                 jmp     loc_78EF26D7
.text:0000000078EF2760 ; ---------------------------------------------------------------------------
.text:0000000078EF2765                 dq 9066666690666666h
.text:0000000078EF276D                 db 2 dup(66h), 90h, 66h, 90h
.text:0000000078EF2772 ; ---------------------------------------------------------------------------
.text:0000000078EF2772
.text:0000000078EF2772 loc_78EF2772:                           ; CODE XREF: memmove+AEj
.text:0000000078EF2772                 cmp     rdx, 1000h
.text:0000000078EF2779                 jb      short loc_78EF2730
.text:0000000078EF277B
.text:0000000078EF277B loc_78EF277B:                           ; CODE XREF: memmove+184j
.text:0000000078EF277B                 mov     eax, 20h
.text:0000000078EF2780
.text:0000000078EF2780 loc_78EF2780:                           ; CODE XREF: memmove+112j
.text:0000000078EF2780                 prefetchnta byte ptr [rdx+rcx]
.text:0000000078EF2784                 prefetchnta byte ptr [rdx+rcx+40h]
.text:0000000078EF2789                 add     rcx, 80h
.text:0000000078EF2790                 dec     eax
.text:0000000078EF2792                 jnz     short loc_78EF2780
.text:0000000078EF2794                 sub     rcx, 1000h
.text:0000000078EF279B                 mov     eax, 40h
.text:0000000078EF27A0
.text:0000000078EF27A0 loc_78EF27A0:                           ; CODE XREF: memmove+174j
.text:0000000078EF27A0                 mov     r9, [rdx+rcx]
.text:0000000078EF27A4                 mov     r10, [rdx+rcx+8]
.text:0000000078EF27A9                 movnti  [rcx], r9
.text:0000000078EF27AD                 movnti  [rcx+8], r10
.text:0000000078EF27B2                 mov     r9, [rdx+rcx+10h]
.text:0000000078EF27B7                 mov     r10, [rdx+rcx+18h]
.text:0000000078EF27BC                 movnti  [rcx+10h], r9
.text:0000000078EF27C1                 movnti  [rcx+18h], r10
.text:0000000078EF27C6                 mov     r9, [rdx+rcx+20h]
.text:0000000078EF27CB                 mov     r10, [rdx+rcx+28h]
.text:0000000078EF27D0                 add     rcx, 40h
.text:0000000078EF27D4                 movnti  [rcx-20h], r9
.text:0000000078EF27D9                 movnti  [rcx-18h], r10
.text:0000000078EF27DE                 mov     r9, [rdx+rcx-10h]
.text:0000000078EF27E3                 mov     r10, [rdx+rcx-8]
.text:0000000078EF27E8                 dec     eax
.text:0000000078EF27EA                 movnti  [rcx-10h], r9
.text:0000000078EF27EF                 movnti  [rcx-8], r10
.text:0000000078EF27F4                 jnz     short loc_78EF27A0
.text:0000000078EF27F6                 sub     r8, 1000h
.text:0000000078EF27FD                 cmp     r8, 1000h
.text:0000000078EF2804                 jnb     loc_78EF277B
.text:0000000078EF280A                 lock or byte ptr [rsp+0], 0
.text:0000000078EF280F                 jmp     loc_78EF26CE
.text:0000000078EF280F ; ---------------------------------------------------------------------------
.text:0000000078EF2814                 dq 2 dup(9066666690666666h)
.text:0000000078EF2824                 align 8
.text:0000000078EF2828
.text:0000000078EF2828 loc_78EF2828:                           ; CODE XREF: memmove+6j
.text:0000000078EF2828                 jz      short loc_78EF2899
.text:0000000078EF282A                 add     rcx, r8
.text:0000000078EF282D                 cmp     r8, 8
.text:0000000078EF2831                 jb      short loc_78EF2894
.text:0000000078EF2833                 test    cl, 7
.text:0000000078EF2836                 jz      short loc_78EF286E
.text:0000000078EF2838                 test    cl, 1
.text:0000000078EF283B                 jz      short loc_78EF2848
.text:0000000078EF283D                 dec     rcx
.text:0000000078EF2840                 mov     al, [rdx+rcx]
.text:0000000078EF2843                 dec     r8
.text:0000000078EF2846                 mov     [rcx], al
.text:0000000078EF2848
.text:0000000078EF2848 loc_78EF2848:                           ; CODE XREF: memmove+1BBj
.text:0000000078EF2848                 test    cl, 2
.text:0000000078EF284B                 jz      short loc_78EF285C
.text:0000000078EF284D                 sub     rcx, 2
.text:0000000078EF2851                 mov     ax, [rdx+rcx]
.text:0000000078EF2855                 sub     r8, 2
.text:0000000078EF2859                 mov     [rcx], ax
.text:0000000078EF285C
.text:0000000078EF285C loc_78EF285C:                           ; CODE XREF: memmove+1CBj
.text:0000000078EF285C                 test    cl, 4
.text:0000000078EF285F                 jz      short loc_78EF286E
.text:0000000078EF2861                 sub     rcx, 4
.text:0000000078EF2865                 mov     eax, [rdx+rcx]
.text:0000000078EF2868                 sub     r8, 4
.text:0000000078EF286C                 mov     [rcx], eax
.text:0000000078EF286E
.text:0000000078EF286E loc_78EF286E:                           ; CODE XREF: memmove+1B6j
.text:0000000078EF286E                                         ; memmove+1DFj ...
.text:0000000078EF286E                 mov     r9, r8
.text:0000000078EF2871                 shr     r9, 5
.text:0000000078EF2875                 jnz     short loc_78EF28C7
.text:0000000078EF2877
.text:0000000078EF2877 loc_78EF2877:                           ; CODE XREF: memmove+27Fj
.text:0000000078EF2877                 mov     r9, r8
.text:0000000078EF287A                 shr     r9, 3
.text:0000000078EF287E                 jz      short loc_78EF2894
.text:0000000078EF2880
.text:0000000078EF2880 loc_78EF2880:                           ; CODE XREF: memmove+20Ej
.text:0000000078EF2880                 sub     rcx, 8
.text:0000000078EF2884                 mov     rax, [rdx+rcx]
.text:0000000078EF2888                 dec     r9
.text:0000000078EF288B                 mov     [rcx], rax
.text:0000000078EF288E                 jnz     short loc_78EF2880
.text:0000000078EF2890                 and     r8, 7
.text:0000000078EF2894
.text:0000000078EF2894 loc_78EF2894:                           ; CODE XREF: memmove+1B1j
.text:0000000078EF2894                                         ; memmove+1FEj
.text:0000000078EF2894                 test    r8, r8
.text:0000000078EF2897                 jnz     short loc_78EF28A0
.text:0000000078EF2899
.text:0000000078EF2899 loc_78EF2899:                           ; CODE XREF: memmove:loc_78EF2828j
.text:0000000078EF2899                 mov     rax, r11
.text:0000000078EF289C                 retn
.text:0000000078EF289C ; ---------------------------------------------------------------------------
.text:0000000078EF289D                 align 20h
.text:0000000078EF28A0
.text:0000000078EF28A0 loc_78EF28A0:                           ; CODE XREF: memmove+217j
.text:0000000078EF28A0                                         ; memmove+22Bj
.text:0000000078EF28A0                 dec     rcx
.text:0000000078EF28A3                 mov     al, [rdx+rcx]
.text:0000000078EF28A6                 dec     r8
.text:0000000078EF28A9                 mov     [rcx], al
.text:0000000078EF28AB                 jnz     short loc_78EF28A0
.text:0000000078EF28AD                 mov     rax, r11
.text:0000000078EF28B0                 retn
.text:0000000078EF28B0 ; ---------------------------------------------------------------------------
.text:0000000078EF28B1                 dq 9066666690666666h, 6690666690666666h
.text:0000000078EF28C1                 db 2 dup(66h), 90h, 2 dup(66h), 90h
.text:0000000078EF28C7 ; ---------------------------------------------------------------------------
.text:0000000078EF28C7
.text:0000000078EF28C7 loc_78EF28C7:                           ; CODE XREF: memmove+1F5j
.text:0000000078EF28C7                 cmp     r9, 800h
.text:0000000078EF28CE                 jnb     short loc_78EF2912
.text:0000000078EF28D0
.text:0000000078EF28D0 loc_78EF28D0:                           ; CODE XREF: memmove+279j
.text:0000000078EF28D0                                         ; memmove+299j
.text:0000000078EF28D0                 mov     rax, [rdx+rcx-8]
.text:0000000078EF28D5                 mov     r10, [rdx+rcx-10h]
.text:0000000078EF28DA                 sub     rcx, 20h
.text:0000000078EF28DE                 mov     [rcx+18h], rax
.text:0000000078EF28E2                 mov     [rcx+10h], r10
.text:0000000078EF28E6                 mov     rax, [rdx+rcx+8]
.text:0000000078EF28EB                 mov     r10, [rdx+rcx]
.text:0000000078EF28EF                 dec     r9
.text:0000000078EF28F2                 mov     [rcx+8], rax
.text:0000000078EF28F6                 mov     [rcx], r10
.text:0000000078EF28F9                 jnz     short loc_78EF28D0
.text:0000000078EF28FB                 and     r8, 1Fh
.text:0000000078EF28FF                 jmp     loc_78EF2877
.text:0000000078EF28FF ; ---------------------------------------------------------------------------
.text:0000000078EF2904                 dq 9066666690666666h
.text:0000000078EF290C                 db 3 dup(66h), 90h, 66h, 90h
.text:0000000078EF2912 ; ---------------------------------------------------------------------------
.text:0000000078EF2912
.text:0000000078EF2912 loc_78EF2912:                           ; CODE XREF: memmove+24Ej
.text:0000000078EF2912                 cmp     rdx, 0FFFFFFFFFFFFF000h
.text:0000000078EF2919                 ja      short loc_78EF28D0
.text:0000000078EF291B
.text:0000000078EF291B loc_78EF291B:                           ; CODE XREF: memmove+324j
.text:0000000078EF291B                 mov     eax, 20h
.text:0000000078EF2920
.text:0000000078EF2920 loc_78EF2920:                           ; CODE XREF: memmove+2B2j
.text:0000000078EF2920                 sub     rcx, 80h
.text:0000000078EF2927                 prefetchnta byte ptr [rdx+rcx]
.text:0000000078EF292B                 prefetchnta byte ptr [rdx+rcx+40h]
.text:0000000078EF2930                 dec     eax
.text:0000000078EF2932                 jnz     short loc_78EF2920
.text:0000000078EF2934                 add     rcx, 1000h
.text:0000000078EF293B                 mov     eax, 40h
.text:0000000078EF2940
.text:0000000078EF2940 loc_78EF2940:                           ; CODE XREF: memmove+314j
.text:0000000078EF2940                 mov     r9, [rdx+rcx-8]
.text:0000000078EF2945                 mov     r10, [rdx+rcx-10h]
.text:0000000078EF294A                 movnti  [rcx-8], r9
.text:0000000078EF294F                 movnti  [rcx-10h], r10
.text:0000000078EF2954                 mov     r9, [rdx+rcx-18h]
.text:0000000078EF2959                 mov     r10, [rdx+rcx-20h]
.text:0000000078EF295E                 movnti  [rcx-18h], r9
.text:0000000078EF2963                 movnti  [rcx-20h], r10
.text:0000000078EF2968                 mov     r9, [rdx+rcx-28h]
.text:0000000078EF296D                 mov     r10, [rdx+rcx-30h]
.text:0000000078EF2972                 sub     rcx, 40h
.text:0000000078EF2976                 movnti  [rcx+18h], r9
.text:0000000078EF297B                 movnti  [rcx+10h], r10
.text:0000000078EF2980                 mov     r9, [rdx+rcx+8]
.text:0000000078EF2985                 mov     r10, [rdx+rcx]
.text:0000000078EF2989                 dec     eax
.text:0000000078EF298B                 movnti  [rcx+8], r9
.text:0000000078EF2990                 movnti  [rcx], r10
.text:0000000078EF2994                 jnz     short loc_78EF2940
.text:0000000078EF2996                 sub     r8, 1000h
.text:0000000078EF299D                 cmp     r8, 1000h
.text:0000000078EF29A4                 jnb     loc_78EF291B
.text:0000000078EF29AA                 lock or byte ptr [rsp+0], 0
.text:0000000078EF29AF                 jmp     loc_78EF286E
    


It takes a few minutes to follow because they tried to optimize by using different loops for different size ranges. It isn't bad code, but neglecting to use SSE in this case seems foolish. M$ even unrolled the loops for the large sizes. The interesting part is the prefetching: PREFETCHNTA walks 4096 bytes of the source at a time, and that block is then copied with MOVNTI stores.
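For anyone who doesn't want to trace the listing, here is a rough C sketch of that large-block pattern: prefetch 4096 source bytes, then copy them with non-temporal stores. It is not a translation of the ntdll code. It copies forward where the listing copies backward, assumes the buffers do not overlap, uses SFENCE where the listing uses a locked OR, and `copy_nt_blocks` is a made-up name. On non-x86-64 builds it simply falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2: _mm_stream_si64, plus _mm_prefetch */
#define HAVE_NT_STORES 1
#endif

enum { NT_BLOCK = 4096, NT_LINE = 64 };

/* Hypothetical helper: copy n bytes between NON-overlapping buffers. */
static void *copy_nt_blocks(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(n == 0 || (d != NULL && s != NULL));
#ifdef HAVE_NT_STORES
    while (n >= NT_BLOCK) {
        /* Pass 1: one PREFETCHNTA per 64-byte line of the next block. */
        for (size_t off = 0; off < NT_BLOCK; off += NT_LINE)
            _mm_prefetch((const char *)(s + off), _MM_HINT_NTA);
        /* Pass 2: copy the block 8 bytes at a time with MOVNTI,
           which has no alignment requirement. */
        for (size_t off = 0; off < NT_BLOCK; off += 8) {
            long long q;
            memcpy(&q, s + off, sizeof q);   /* unaligned-safe load */
            _mm_stream_si64((long long *)(d + off), q);
        }
        s += NT_BLOCK;
        d += NT_BLOCK;
        n -= NT_BLOCK;
    }
    _mm_sfence();        /* order the streamed stores before returning */
#endif
    memcpy(d, s, n);     /* remainder, or everything without NT stores */
    return dst;
}
```

Whether the non-temporal path pays off depends on the size and on whether the destination is read again soon, so treat the 4096-byte threshold as something to measure, not a given.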

My problem is how to rework this function.
>Should I use a byte copy loop until the source is 16-byte aligned, then use MOVDQA loads from the source and MOVDQU stores to the dest, and finish with another byte copy loop for the leftover size modulo 16?
>Or should I just use MOVDQU for both and then a byte copy loop for the size modulo 16 at the end?
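To make the first option concrete, here is a minimal C sketch using SSE2 intrinsics, where `_mm_load_si128` corresponds to MOVDQA and `_mm_storeu_si128` to MOVDQU. The name `copy_src_aligned` is made up, the copy runs forward, and the buffers are assumed not to overlap, so this is only the inner idea and not a memmove.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Hypothetical helper: forward copy, buffers must not overlap. */
static void *copy_src_aligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(n == 0 || (d != NULL && s != NULL));

    /* Head: byte copy until the SOURCE pointer is 16-byte aligned. */
    while (n > 0 && ((uintptr_t)s & 15) != 0) {
        *d++ = *s++;
        n--;
    }
#ifdef HAVE_SSE2
    /* Body: aligned load (MOVDQA), unaligned store (MOVDQU). */
    while (n >= 16) {
        __m128i v = _mm_load_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
#endif
    /* Tail: size modulo 16 (or everything if SSE2 is unavailable). */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

The second option is the same code with the head loop removed and `_mm_loadu_si128` in place of the aligned load.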

It's a difficult rework because you don't know how the function is going to be used. It has to stay fast for 10000 calls that each move only a few bytes as well as for calls that move a lot of memory.

Maybe a jump table for sizes 0-15, then just use SSE until the remaining size gets small enough to hand off to the routines in the jump table.
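A sketch of what that dispatch could look like in C, where a dense `switch` is usually compiled to a jump table. Both function names are made up, the 16-byte memcpy stands in for the SSE loop, and the buffers are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy 0..15 bytes with one dispatch and no loop.
   Each case falls through to the next lower one. */
static void copy_tail_0_15(unsigned char *d, const unsigned char *s, size_t n)
{
    assert(n < 16);
    switch (n) {
    case 15: d[14] = s[14]; /* fall through */
    case 14: d[13] = s[13]; /* fall through */
    case 13: d[12] = s[12]; /* fall through */
    case 12: d[11] = s[11]; /* fall through */
    case 11: d[10] = s[10]; /* fall through */
    case 10: d[9] = s[9];   /* fall through */
    case 9:  d[8] = s[8];   /* fall through */
    case 8:  d[7] = s[7];   /* fall through */
    case 7:  d[6] = s[6];   /* fall through */
    case 6:  d[5] = s[5];   /* fall through */
    case 5:  d[4] = s[4];   /* fall through */
    case 4:  d[3] = s[3];   /* fall through */
    case 3:  d[2] = s[2];   /* fall through */
    case 2:  d[1] = s[1];   /* fall through */
    case 1:  d[0] = s[0];   /* fall through */
    case 0:  break;
    }
}

/* Hypothetical driver: 16-byte chunks first, then the table for the rest. */
static void *copy_then_dispatch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= 16) {
        memcpy(d, s, 16);   /* stand-in for one SSE load/store pair */
        d += 16;
        s += 16;
        n -= 16;
    }
    copy_tail_0_15(d, s, n);
    return dst;
}
```

Small calls then pay for one size compare and one indirect jump, which is the case the 10000-tiny-calls worry is about.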

This is going to take a lot of benchmarking.
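A very small timing harness along these lines would cover both ends of the range. It is only a sketch: `time_moves` and `move_fn` are made-up names, `clock()` is coarse, and real measurements would need repeated runs, warm and cold caches, and a spread of alignments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void *(*move_fn)(void *, const void *, size_t);

/* Hypothetical harness: CPU seconds spent on `calls` moves of `size` bytes. */
static double time_moves(move_fn fn, size_t size, long calls)
{
    static unsigned char src[1 << 20], dst[1 << 20];
    assert(size <= sizeof src);
    memset(src, 0xA5, size);

    clock_t t0 = clock();
    for (long i = 0; i < calls; i++)
        fn(dst, src, size);
    clock_t t1 = clock();

    /* Read the result so the copies cannot be discarded as dead code. */
    volatile unsigned char sink = dst[0];
    (void)sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

For example, `time_moves(memmove, 7, 10000000)` against `time_moves(memmove, 1 << 20, 1000)` compares many tiny moves with a few large ones; a candidate replacement with the same signature can be passed in place of `memmove`.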
Post 09 Dec 2005, 19:01
Matrix



Joined: 04 Sep 2004
Posts: 1166
Location: Overflow
Matrix 09 Dec 2005, 21:55
r22

>Should I use a byte copy loop until I get the Source 16byte aligned and then use a MOVDQA from the source and a MOVDQU to the dest.

I vote for this idea, and of course do prefetching. Reading whole cache lines might be best, but maybe prefetchnta could do it.
Alignment difficulties could occur when, let's say, the source buffer is aligned and the destination is not; unfortunately that can happen, but it is rarely the case.

I do not have a 64-bit computer to test it ;<

btw there is a fast memory move thread in Main ;)
Post 09 Dec 2005, 21:55
r22



Joined: 27 Dec 2004
Posts: 805
r22 10 Dec 2005, 07:19
Here's my first crack at a memmove function.
It's not tested yet.

Code:
RtlMoveMemory2: ;;;rcx = dest rdx = src r8 = size
                ; Copies from the top of the buffers down, so overlap is only
                ; handled when dest is above src; dest below src still needs
                ; a forward loop.
                cmp    rdx,rcx
                jz     .end
                test   r8,r8
                jz     .end
                cmp    r8,1000h         ; .SSE always moves a whole 4096-byte block
                jae    .SSE
       .copy:                           ; 16 bytes per pass through rax/r9
                sub    r8,16
                js     .clean
                mov    rax,[rdx+r8]
                mov    r9,[rdx+r8+8]
                mov    [rcx+r8],rax
                mov    [rcx+r8+8],r9
                jmp    .copy
       .clean:                          ; 0..15 bytes left
                add    r8,16
                jz     .end
                dec    r8
       .bcopy:
                mov    al,[rdx+r8]
                dec    r8               ; MOV leaves these flags intact
                mov    [rcx+r8+1],al
                jns    .bcopy           ; stop after byte 0 (r8 = -1), not before it
                jmp    .end
       .SSE:                            ; r8 >= 4096 here
                mov    r9,r8
                mov    eax,20h
                sub    r9,4096
       .pref:                           ; prefetch the top 4096 source bytes
                prefetchnta byte [rdx+r9]
                prefetchnta byte [rdx+r9+40h]
                add    r9,80h
                dec    eax
                jnz    .pref
                mov    eax,20h
       .SSECopy:                        ; 128 bytes per pass; xmm6 and up are
                                        ; callee-saved on Win64, so use xmm0-xmm3
                movdqu  xmm0,[rdx+r8-16]
                movdqu  xmm1,[rdx+r8-32]
                movdqu  xmm2,[rdx+r8-48]
                movdqu  xmm3,[rdx+r8-64]
                movdqu  [rcx+r8-16],xmm0
                movdqu  [rcx+r8-32],xmm1
                movdqu  [rcx+r8-48],xmm2
                movdqu  [rcx+r8-64],xmm3
                movdqu  xmm0,[rdx+r8-80]
                movdqu  xmm1,[rdx+r8-96]
                movdqu  xmm2,[rdx+r8-112]
                movdqu  xmm3,[rdx+r8-128]
                movdqu  [rcx+r8-80],xmm0
                movdqu  [rcx+r8-96],xmm1
                movdqu  [rcx+r8-112],xmm2
                movdqu  [rcx+r8-128],xmm3
                sub     r8,128
                dec     eax
                jnz     .SSECopy
                cmp     r8,1000h
                jae     .SSE
                jmp     .copy
       .end:
                mov    rax,rcx
                ret 0
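The loops above all index downward from the end of the buffers, which is the backward half of a memmove. A full replacement also has to copy forward when the destination overlaps the source from below. A minimal C sketch of that direction test, with `move_bytes` as a made-up name and plain byte loops standing in for the optimized paths:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine: overlap-safe byte move. */
static void *move_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(n == 0 || (d != NULL && s != NULL));

    /* If dst starts inside [src, src+n), a forward copy would overwrite
       source bytes before reading them, so copy from the top down.
       When dst is below src the unsigned subtraction wraps to a huge
       value, so the same compare selects the forward loop. */
    if ((uintptr_t)d - (uintptr_t)s < n) {
        while (n--)
            d[n] = s[n];                 /* backward */
    } else {
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];                 /* forward */
    }
    return dst;
}
```

A routine like this is also handy as a slow reference to check an optimized version against, for every combination of offset, size and overlap.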
    
Post 10 Dec 2005, 07:19
Matrix



Joined: 04 Sep 2004
Posts: 1166
Location: Overflow
Matrix 10 Dec 2005, 15:18
r22,
neat & tidy code 8)

hope it's the best,

btw I've got to buy a computer...
Post 10 Dec 2005, 15:18
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 10 Dec 2005, 18:44
Aren't the 64 and 32-bit variants of Intel CPUs equally priced? You can even buy a 64-bit Celeron for about $100 to test this code ;)
Post 10 Dec 2005, 18:44
r22



Joined: 27 Dec 2004
Posts: 805
r22 10 Dec 2005, 23:00
It seems that for unaligned data, MOVDQU isn't the way to go in 64-bit land.

MOVDQU xmmx,mem128 latency 9
LDDQU xmmx,mem128 latency 7
MOV reg64,mem64 latency 3

Well, now I know why ntdll's memmove doesn't use SSE.
Post 10 Dec 2005, 23:00
shism2



Joined: 14 Sep 2005
Posts: 248
shism2 11 Dec 2005, 03:16
How about making an optimized version for 32-bit Windows XP? :)
Post 11 Dec 2005, 03:16
r22



Joined: 27 Dec 2004
Posts: 805
r22 11 Dec 2005, 04:50
Trying to beat Microsoft's compiler on this memmove function will take a little longer than expected. I have to admit this function was well coded, but there's always room for improvement.

I'm not that familiar with prefetch and the non-temporal stuff.

Once I'm done I'll think about patching the 32-bit ntdll.dll, pretty much the same two functions: RtlInitUnicodeString and RtlMoveMemory.

But what I really want to do is patch NTOSKRNL.EXE, because those functions are used by drivers, so I can expect a really big performance boost. From initial decompiles, it contains an almost exact copy of the ntdll functions.
Post 11 Dec 2005, 04:50
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
