flat assembler
Message board for the users of flat assembler.
Windows > Patching system dlls
r22 07 Dec 2005, 06:14
Disabled chkdsk, got rid of the dllcache folder.
But there's no method for disabling SFC on win64; the only things I've found are how to disable it for WinXP and XP SP2 (which use the older 32-bit DLL, sfc_os). Another annoying fact about win64: you can't install the recovery console to the HD to make it bootable, so every time I mess around I have to wait for the CD recovery console to copy files and initialize. I need to learn how MS updates their DLL files while avoiding all this mess.
gunblade 07 Dec 2005, 12:25
Well, you could download a service pack and examine the differences between the old and new DLLs, although there might not be a service pack for win64 yet. Another thing: have you checked that the code actually works fine for all input and such? Maybe Windows uses the new function for something during bootup, and if there's a bug, it crashes, hence the BSOD. Not doubting your coding correctness, it's just a possibility.
Matrix 07 Dec 2005, 16:17
You must be really bored if you are willing to patch Windows DLLs ;>>
It's a never-ending job.
Torrey 08 Dec 2005, 04:40
I haven't worked with a 64-bit system yet, but is there 64-bit code within ntdll.dll? I'm also not certain of this, but could it be possible that the OS boots in 32-bit protected mode and uses ntdll.dll before switching over to 64-bit? This would cause DLL problems, since those opcodes aren't available then.
[edit] Quote: WOW64 Implementation Details
By reading the above, it sounds like there are two different versions of ntdll.dll on a 64-bit system.
gunblade 08 Dec 2005, 12:59
You may be on to something, Torrey.
r22, see if you can convert your code to 32-bit, just for the time being, and put that into the DLL. Then see if that causes the same problem as before (remember to check the CRC and such). Or find a way to print out what mode the CPU is running in.
lilljocke 08 Dec 2005, 17:15
Hi, I like your idea; actually, I have thought about it myself. I don't know how to remove Windows File Protection, but I know a simple DOS program that can do it; see the attachment. I hope that program can help you: you can debug it and see what it does. I wonder if I could help you with something, because I am interested in this conversion thing. I know MASM and FASM.
r22 09 Dec 2005, 02:54
Windows/system32 holds all the 64-bit versions of the system DLLs, while windows/wow64 holds all the 32-bit versions; I'm aware of the ambiguity in the names. So yes, there are two ntdll.dll files, one in each directory, with essentially the same functions but different code: 64-bit in system32/ntdll.dll and 32-bit in wow64/ntdll.dll.
Right now I'm going to edit 1 byte in the 64-bit ntdll.dll and hex edit the correct CRC into the header. If I can get this to work with my current replacement method, then I'll have to go back to testing the integrity of my replacement function's code, although my initial tests showed my RtlInitUnicodeString function performing exactly the same as ntdll's (only 2.2x faster) in every situation I could come up with.
r22 09 Dec 2005, 03:26
*Edit for the above post: by windows/wow64/ I meant the windows/SysWow64 directory.
OK, I successfully nop nop'd a mov Rax,Rax in the ntdll, fixed the checksum in the header, swapped it in for the original, restarted, and it is still there. So there must be an issue either with how I hard patched the whole function OR with my function in general.
Working NTDLL.DLL RtlInitUnicodeString
Code:
.text:0000000078EF4120 nop
.text:0000000078EF4121 nop
.text:0000000078EF4122 push rdi
.text:0000000078EF4123 test rdx, rdx
.text:0000000078EF4126 mov r8, rcx
.text:0000000078EF4129 mov word ptr [rcx+2], 0
.text:0000000078EF412F mov word ptr [rcx], 0
.text:0000000078EF4134 mov [rcx+8], rdx
.text:0000000078EF4138 jz short loc_78EF4172
.text:0000000078EF413A xor eax, eax
.text:0000000078EF413C mov rcx, 0FFFFFFFFFFFFFFFFh
.text:0000000078EF4146 mov rdi, rdx
.text:0000000078EF4149 repne scasw
.text:0000000078EF414C mov eax, 0FFFCh
.text:0000000078EF4151 not rcx
.text:0000000078EF4154 dec rcx
.text:0000000078EF4157 add rcx, rcx
.text:0000000078EF415A cmp rcx, 0FFFEh
.text:0000000078EF4161 cmovnb rcx, rax
.text:0000000078EF4165 mov [r8], cx
.text:0000000078EF4169 add cx, 2
.text:0000000078EF416D mov [r8+2], cx
.text:0000000078EF4172
.text:0000000078EF4172 loc_78EF4172: ; CODE XREF: RtlInitUnicodeString+18j
.text:0000000078EF4172 pop rdi
.text:0000000078EF4173 retn
My ntdll.dll RtlInitUnicodeString (this one freezes at bootup)
Code:
.text:0000000078EF4120 xor r9d, r9d
.text:0000000078EF4123 test rdx, rdx
.text:0000000078EF4126 mov [rcx], r9d
.text:0000000078EF4129 mov [rcx+8], rdx
.text:0000000078EF412D jz short locret_78EF4174
.text:0000000078EF412F mov r8, rdx
.text:0000000078EF4132
.text:0000000078EF4132 loc_78EF4132: ; CODE XREF: RtlInitUnicodeString+28j
.text:0000000078EF4132 movzx eax, word ptr [rdx]
.text:0000000078EF4135 add rdx, 2
.text:0000000078EF4139 test eax, eax
.text:0000000078EF413B jz short loc_78EF4150
.text:0000000078EF413D movzx eax, word ptr [rdx]
.text:0000000078EF4140 add rdx, 2
.text:0000000078EF4144 test eax, eax
.text:0000000078EF4146 jz short loc_78EF4150
.text:0000000078EF4148 jmp short loc_78EF4132
.text:0000000078EF4148 ; ---------------------------------------------------------------------------
.text:0000000078EF414A align 10h
.text:0000000078EF4150
.text:0000000078EF4150 loc_78EF4150: ; CODE XREF: RtlInitUnicodeString+1Bj
.text:0000000078EF4150 ; RtlInitUnicodeString+26j
.text:0000000078EF4150 sub rdx, 2
.text:0000000078EF4154 mov eax, 0FFFCh
.text:0000000078EF4159 sub rdx, r8
.text:0000000078EF415C cmp rdx, 0FFFEh
.text:0000000078EF4163 cmovnb rdx, rax
.text:0000000078EF4167 mov [ecx], dx
.text:0000000078EF416B add dx, 2
.text:0000000078EF416F mov [ecx+2], dx
.text:0000000078EF4174
.text:0000000078EF4174 locret_78EF4174: ; CODE XREF: RtlInitUnicodeString+Dj
.text:0000000078EF4174 retn
From testing both these functions, they should perform the same task in ALL circumstances. Maybe using R9D in my function causes problems somehow, but I thought R9D didn't have to be saved. Now that I got one replacement to FINALLY work, I'll just incrementally change the bytes until I can get my faster function (then functions) working on my system. You can all be sure that I'll report any insane gains in system performance.
OMG, plain as day, did you catch it? MOV [ECX], DX instead of RCX. LOL, no wonder my system froze up so hard trying to run that code. How did this code work flawlessly in my tests?!
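For comparing the two listings, it can help to have a plain C model of what both are meant to compute. The sketch below follows the disassembly step by step: zero both length fields, store the buffer pointer, scan for the 16-bit terminator, clamp at 0FFFCh, then write the length and the length plus 2. The names `ustr_model` and `init_unicode_string_model` are hypothetical stand-ins, not the real ntdll types or export.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the 64-bit layout the listings write to:
   length at +0, maximum length at +2, buffer pointer at +8. */
typedef struct {
    uint16_t length;          /* bytes, not counting the terminator */
    uint16_t maximum_length;  /* length + 2 */
    const uint16_t *buffer;
} ustr_model;

/* Mirrors the listings; a model for testing, not the real export. */
void init_unicode_string_model(ustr_model *dst, const uint16_t *src)
{
    assert(dst != NULL);
    dst->length = 0;
    dst->maximum_length = 0;
    dst->buffer = src;
    if (src == NULL)
        return;                   /* jz taken: lengths stay zero */

    size_t bytes = 0;
    while (src[bytes / 2] != 0)   /* scan for the 16-bit terminator */
        bytes += 2;

    if (bytes >= 0xFFFE)          /* cmp ..., 0FFFEh / cmovnb ..., 0FFFCh */
        bytes = 0xFFFC;

    dst->length = (uint16_t)bytes;
    dst->maximum_length = (uint16_t)(bytes + 2);
}
```

A replacement routine can then be checked against this model for empty strings, a NULL source, and strings long enough to hit the clamp.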
r22 09 Dec 2005, 05:55
Apparently the same error checking built into win64 that replaces movdqa with movdqu to make it work does the same with mov [ecx],blah, turning it into mov [rcx],blah. So while my faulty function worked fine when I was testing it, during the bootup process it would fail miserably.
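Another possible reading, offered here as an assumption rather than something verified on r22's machine: in 64-bit mode an operand written as [ecx] uses only the low 32 bits of RCX as the address, zero-extended, so it lands on the same byte as [rcx] whenever the pointer lies below 4 GB. The sketch models that arithmetic in C; the function names are hypothetical.

```c
#include <stdint.h>

/* Effective address of a 64-bit-mode operand written as [ecx]:
   the low 32 bits of RCX, zero-extended to 64 bits. */
uint64_t effective_address_ecx(uint64_t rcx)
{
    return (uint64_t)(uint32_t)rcx;
}

/* Returns 1 when [ecx] and [rcx] would reference the same byte. */
int same_target(uint64_t rcx)
{
    return effective_address_ecx(rcx) == rcx;
}
```

For a made-up low address such as 0x0012F000 the two forms agree, while for a made-up address above 4 GB such as 0x000007FF00001000 the [ecx] form points somewhere else entirely. That would be consistent with a test harness passing and a boot-time caller failing.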
===================================
HOW TO PATCH WIN XP 64BIT SYSTEM DLLS
===================================
1. Disable CHKDSK
   chkntfs /x c:
   Replace c: with whatever drive your win64 is installed on.
2. Disable SFC (well, reduce it to a warning message)
   Rename your dllcache folder in windows/system32/ to bakdllcache, or delete it at your own risk.
   Start | Run | regedt32
   HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon
   Set key SFCDisable REG_DWORD to 0xffffff9d
   Set key SfcQuota REG_DWORD to 0
3. Shut down and restart for good measure.
4. Go to your windows/system32/ folder and rename the DLL you're replacing to dllname.dll.bak (so if you kill your box you can fix it with the recovery console). Right after the rename, PASTE your modified DLL into the directory.
5. SFC will now give you a warning popup message. Click Cancel, then click Yes.
6. Shut down and restart, and hopefully your Windows 64 will load. If not, boot the recovery console, delete the modified file, rename the .bak file to the original name, and restart.
======================================
Why modify your precious win64 kernel ?????????
======================================
JUST FROM REPLACING RTLINITUNICODESTRING WITH A FUNCTION THAT RUNS 2X FASTER, MY SYSTEM NOW BOOTS UP 10% FASTER AND 64BIT APPLICATIONS LOAD 10% FASTER.
The 10% is an estimate (on the low side, since I have no empirical way of testing).
Why RtlInitUnicodeString????????????????????????
Well, when you have free time like me, you decompile windows/system32/kernel32.dll and realize that all the suffix A (ASCII) functions call the suffix W (wide char) functions, and all the suffix W functions call RtlInitUnicodeString. So by replacing this function with a faster one you're essentially optimizing ~50% of the functions in your kernel.
The attached file has a correct checksum, and if you run a difference comparison on this file and your ntdll.dll (windows/system32/) you'll see that only the 3 bytes of the header checksum and the RtlInitUnicodeString function are changed.
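Since every patched DLL in this thread also gets its header checksum fixed up by hand, here is a minimal C sketch of the PE checksum as it is commonly described: a 16-bit sum with end-around carry over the whole file, with the 4-byte CheckSum field treated as zero, plus the file length. The function name and parameters are hypothetical, and the result should be compared against imagehlp's CheckSumMappedFile before it is trusted on a real system DLL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* checksum_offset is the file offset of the CheckSum field, which is
   e_lfanew + 0x58 for both PE32 and PE32+ images. This sketch assumes
   that offset is even. */
uint32_t pe_checksum_sketch(const uint8_t *file, size_t len, size_t checksum_offset)
{
    assert(file != NULL && checksum_offset % 2 == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2) {
        if (i == checksum_offset || i == checksum_offset + 2)
            continue;                           /* skip the stored checksum */
        sum += (uint32_t)file[i] | ((uint32_t)file[i + 1] << 8);
        sum = (sum & 0xFFFF) + (sum >> 16);     /* fold the carry back in */
    }
    if (len & 1) {                              /* odd trailing byte */
        sum += file[len - 1];
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    return sum + (uint32_t)len;
}
```

Because the stored field is skipped, the function gives the same answer before and after the new value is written into the header, so it can be run on the patched file directly.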
Torrey 09 Dec 2005, 06:21
Excellent work, r22! Once I get a new system (if it's 64-bit) I'll give this a test. I doubt Microsoft would freak out unless the patch achieved huge popularity. There are at least two MS employees who check out the FASM boards occasionally, so there is a chance the code could be used (most likely a small chance).
decard 09 Dec 2005, 17:12
What about the 32-bit version? Are you interested in patching it too?
r22 09 Dec 2005, 19:01
NTDLL's memory functions memmove and memfill are terribly optimized: they don't even use SSE, and they have a ludicrous amount of branching.
RtlMoveMemory, RtlCopyMemory, memcpy and memmove are all the same function in ntdll.dll (64-bit).
Code:
.text:0000000078EF2680 ; Exported entry 500. RtlCopyMemory
.text:0000000078EF2680 ; Exported entry 755. RtlMoveMemory
.text:0000000078EF2680 ; Exported entry 1328. memcpy
.text:0000000078EF2680 ; Exported entry 1329. memmove
.text:0000000078EF2680
.text:0000000078EF2680 ; =============== S U B R O U T I N E =======================================
.text:0000000078EF2680
.text:0000000078EF2680
.text:0000000078EF2680 ; void *__cdecl memmove(void *,const void *,size_t)
.text:0000000078EF2680 public memmove
.text:0000000078EF2680 memmove proc near ; CODE XREF: RtlDosSearchPath_U-6BFDp
.text:0000000078EF2680 ; sub_78ED3BF0-1258Dp ...
.text:0000000078EF2680 mov r11, rcx ; RtlCopyMemory
.text:0000000078EF2680 ; RtlMoveMemory
.text:0000000078EF2680 ; memcpy
.text:0000000078EF2683 sub rdx, rcx
.text:0000000078EF2686 jbe loc_78EF2828
.text:0000000078EF268C cmp r8, 8
.text:0000000078EF2690 jb short loc_78EF26F4
.text:0000000078EF2692 test cl, 7
.text:0000000078EF2695 jz short loc_78EF26CE
.text:0000000078EF2697 test cl, 1
.text:0000000078EF269A jz short loc_78EF26A8
.text:0000000078EF269C mov al, [rdx+rcx]
.text:0000000078EF269F dec r8
.text:0000000078EF26A2 mov [rcx], al
.text:0000000078EF26A4 add rcx, 1
.text:0000000078EF26A8
.text:0000000078EF26A8 loc_78EF26A8: ; CODE XREF: memmove+1Aj
.text:0000000078EF26A8 test cl, 2
.text:0000000078EF26AB jz short loc_78EF26BC
.text:0000000078EF26AD mov ax, [rdx+rcx]
.text:0000000078EF26B1 sub r8, 2
.text:0000000078EF26B5 mov [rcx], ax
.text:0000000078EF26B8 add rcx, 2
.text:0000000078EF26BC
.text:0000000078EF26BC loc_78EF26BC: ; CODE XREF: memmove+2Bj
.text:0000000078EF26BC test cl, 4
.text:0000000078EF26BF jz short loc_78EF26CE
.text:0000000078EF26C1 mov eax, [rdx+rcx]
.text:0000000078EF26C4 sub r8, 4
.text:0000000078EF26C8 mov [rcx], eax
.text:0000000078EF26CA add rcx, 4
.text:0000000078EF26CE
.text:0000000078EF26CE loc_78EF26CE: ; CODE XREF: memmove+15j
.text:0000000078EF26CE ; memmove+3Fj ...
.text:0000000078EF26CE mov r9, r8
.text:0000000078EF26D1 shr r9, 5
.text:0000000078EF26D5 jnz short loc_78EF2727
.text:0000000078EF26D7
.text:0000000078EF26D7 loc_78EF26D7: ; CODE XREF: memmove+E0j
.text:0000000078EF26D7 mov r9, r8
.text:0000000078EF26DA shr r9, 3
.text:0000000078EF26DE jz short loc_78EF26F4
.text:0000000078EF26E0
.text:0000000078EF26E0 loc_78EF26E0: ; CODE XREF: memmove+6Ej
.text:0000000078EF26E0 mov rax, [rdx+rcx]
.text:0000000078EF26E4 mov [rcx], rax
.text:0000000078EF26E7 add rcx, 8
.text:0000000078EF26EB dec r9
.text:0000000078EF26EE jnz short loc_78EF26E0
.text:0000000078EF26F0 and r8, 7
.text:0000000078EF26F4
.text:0000000078EF26F4 loc_78EF26F4: ; CODE XREF: memmove+10j
.text:0000000078EF26F4 ; memmove+5Ej
.text:0000000078EF26F4 test r8, r8
.text:0000000078EF26F7 jnz short loc_78EF2700
.text:0000000078EF26F9 mov rax, r11
.text:0000000078EF26FC retn
.text:0000000078EF26FC ; ---------------------------------------------------------------------------
.text:0000000078EF26FD align 20h
.text:0000000078EF2700
.text:0000000078EF2700 loc_78EF2700: ; CODE XREF: memmove+77j
.text:0000000078EF2700 ; memmove+8Bj
.text:0000000078EF2700 mov al, [rdx+rcx]
.text:0000000078EF2703 mov [rcx], al
.text:0000000078EF2705 inc rcx
.text:0000000078EF2708 dec r8
.text:0000000078EF270B jnz short loc_78EF2700
.text:0000000078EF270D mov rax, r11
.text:0000000078EF2710 retn
.text:0000000078EF2710 ; ---------------------------------------------------------------------------
.text:0000000078EF2711 dq 9066666690666666h, 6690666690666666h
.text:0000000078EF2721 db 2 dup(66h), 90h, 2 dup(66h), 90h
.text:0000000078EF2727 ; ---------------------------------------------------------------------------
.text:0000000078EF2727
.text:0000000078EF2727 loc_78EF2727: ; CODE XREF: memmove+55j
.text:0000000078EF2727 cmp r9, 800h
.text:0000000078EF272E jnb short loc_78EF2772
.text:0000000078EF2730
.text:0000000078EF2730 loc_78EF2730: ; CODE XREF: memmove+DAj
.text:0000000078EF2730 ; memmove+F9j
.text:0000000078EF2730 mov rax, [rdx+rcx]
.text:0000000078EF2734 mov r10, [rdx+rcx+8]
.text:0000000078EF2739 add rcx, 20h
.text:0000000078EF273D mov [rcx-20h], rax
.text:0000000078EF2741 mov [rcx-18h], r10
.text:0000000078EF2745 mov rax, [rdx+rcx-10h]
.text:0000000078EF274A mov r10, [rdx+rcx-8]
.text:0000000078EF274F dec r9
.text:0000000078EF2752 mov [rcx-10h], rax
.text:0000000078EF2756 mov [rcx-8], r10
.text:0000000078EF275A jnz short loc_78EF2730
.text:0000000078EF275C and r8, 1Fh
.text:0000000078EF2760 jmp loc_78EF26D7
.text:0000000078EF2760 ; ---------------------------------------------------------------------------
.text:0000000078EF2765 dq 9066666690666666h
.text:0000000078EF276D db 2 dup(66h), 90h, 66h, 90h
.text:0000000078EF2772 ; ---------------------------------------------------------------------------
.text:0000000078EF2772
.text:0000000078EF2772 loc_78EF2772: ; CODE XREF: memmove+AEj
.text:0000000078EF2772 cmp rdx, 1000h
.text:0000000078EF2779 jb short loc_78EF2730
.text:0000000078EF277B
.text:0000000078EF277B loc_78EF277B: ; CODE XREF: memmove+184j
.text:0000000078EF277B mov eax, 20h
.text:0000000078EF2780
.text:0000000078EF2780 loc_78EF2780: ; CODE XREF: memmove+112j
.text:0000000078EF2780 prefetchnta byte ptr [rdx+rcx]
.text:0000000078EF2784 prefetchnta byte ptr [rdx+rcx+40h]
.text:0000000078EF2789 add rcx, 80h
.text:0000000078EF2790 dec eax
.text:0000000078EF2792 jnz short loc_78EF2780
.text:0000000078EF2794 sub rcx, 1000h
.text:0000000078EF279B mov eax, 40h
.text:0000000078EF27A0
.text:0000000078EF27A0 loc_78EF27A0: ; CODE XREF: memmove+174j
.text:0000000078EF27A0 mov r9, [rdx+rcx]
.text:0000000078EF27A4 mov r10, [rdx+rcx+8]
.text:0000000078EF27A9 movnti [rcx], r9
.text:0000000078EF27AD movnti [rcx+8], r10
.text:0000000078EF27B2 mov r9, [rdx+rcx+10h]
.text:0000000078EF27B7 mov r10, [rdx+rcx+18h]
.text:0000000078EF27BC movnti [rcx+10h], r9
.text:0000000078EF27C1 movnti [rcx+18h], r10
.text:0000000078EF27C6 mov r9, [rdx+rcx+20h]
.text:0000000078EF27CB mov r10, [rdx+rcx+28h]
.text:0000000078EF27D0 add rcx, 40h
.text:0000000078EF27D4 movnti [rcx-20h], r9
.text:0000000078EF27D9 movnti [rcx-18h], r10
.text:0000000078EF27DE mov r9, [rdx+rcx-10h]
.text:0000000078EF27E3 mov r10, [rdx+rcx-8]
.text:0000000078EF27E8 dec eax
.text:0000000078EF27EA movnti [rcx-10h], r9
.text:0000000078EF27EF movnti [rcx-8], r10
.text:0000000078EF27F4 jnz short loc_78EF27A0
.text:0000000078EF27F6 sub r8, 1000h
.text:0000000078EF27FD cmp r8, 1000h
.text:0000000078EF2804 jnb loc_78EF277B
.text:0000000078EF280A lock or byte ptr [rsp+0], 0
.text:0000000078EF280F jmp loc_78EF26CE
.text:0000000078EF280F ; ---------------------------------------------------------------------------
.text:0000000078EF2814 dq 2 dup(9066666690666666h)
.text:0000000078EF2824 align 8
.text:0000000078EF2828
.text:0000000078EF2828 loc_78EF2828: ; CODE XREF: memmove+6j
.text:0000000078EF2828 jz short loc_78EF2899
.text:0000000078EF282A add rcx, r8
.text:0000000078EF282D cmp r8, 8
.text:0000000078EF2831 jb short loc_78EF2894
.text:0000000078EF2833 test cl, 7
.text:0000000078EF2836 jz short loc_78EF286E
.text:0000000078EF2838 test cl, 1
.text:0000000078EF283B jz short loc_78EF2848
.text:0000000078EF283D dec rcx
.text:0000000078EF2840 mov al, [rdx+rcx]
.text:0000000078EF2843 dec r8
.text:0000000078EF2846 mov [rcx], al
.text:0000000078EF2848
.text:0000000078EF2848 loc_78EF2848: ; CODE XREF: memmove+1BBj
.text:0000000078EF2848 test cl, 2
.text:0000000078EF284B jz short loc_78EF285C
.text:0000000078EF284D sub rcx, 2
.text:0000000078EF2851 mov ax, [rdx+rcx]
.text:0000000078EF2855 sub r8, 2
.text:0000000078EF2859 mov [rcx], ax
.text:0000000078EF285C
.text:0000000078EF285C loc_78EF285C: ; CODE XREF: memmove+1CBj
.text:0000000078EF285C test cl, 4
.text:0000000078EF285F jz short loc_78EF286E
.text:0000000078EF2861 sub rcx, 4
.text:0000000078EF2865 mov eax, [rdx+rcx]
.text:0000000078EF2868 sub r8, 4
.text:0000000078EF286C mov [rcx], eax
.text:0000000078EF286E
.text:0000000078EF286E loc_78EF286E: ; CODE XREF: memmove+1B6j
.text:0000000078EF286E ; memmove+1DFj ...
.text:0000000078EF286E mov r9, r8
.text:0000000078EF2871 shr r9, 5
.text:0000000078EF2875 jnz short loc_78EF28C7
.text:0000000078EF2877
.text:0000000078EF2877 loc_78EF2877: ; CODE XREF: memmove+27Fj
.text:0000000078EF2877 mov r9, r8
.text:0000000078EF287A shr r9, 3
.text:0000000078EF287E jz short loc_78EF2894
.text:0000000078EF2880
.text:0000000078EF2880 loc_78EF2880: ; CODE XREF: memmove+20Ej
.text:0000000078EF2880 sub rcx, 8
.text:0000000078EF2884 mov rax, [rdx+rcx]
.text:0000000078EF2888 dec r9
.text:0000000078EF288B mov [rcx], rax
.text:0000000078EF288E jnz short loc_78EF2880
.text:0000000078EF2890 and r8, 7
.text:0000000078EF2894
.text:0000000078EF2894 loc_78EF2894: ; CODE XREF: memmove+1B1j
.text:0000000078EF2894 ; memmove+1FEj
.text:0000000078EF2894 test r8, r8
.text:0000000078EF2897 jnz short loc_78EF28A0
.text:0000000078EF2899
.text:0000000078EF2899 loc_78EF2899: ; CODE XREF: memmove:loc_78EF2828j
.text:0000000078EF2899 mov rax, r11
.text:0000000078EF289C retn
.text:0000000078EF289C ; ---------------------------------------------------------------------------
.text:0000000078EF289D align 20h
.text:0000000078EF28A0
.text:0000000078EF28A0 loc_78EF28A0: ; CODE XREF: memmove+217j
.text:0000000078EF28A0 ; memmove+22Bj
.text:0000000078EF28A0 dec rcx
.text:0000000078EF28A3 mov al, [rdx+rcx]
.text:0000000078EF28A6 dec r8
.text:0000000078EF28A9 mov [rcx], al
.text:0000000078EF28AB jnz short loc_78EF28A0
.text:0000000078EF28AD mov rax, r11
.text:0000000078EF28B0 retn
.text:0000000078EF28B0 ; ---------------------------------------------------------------------------
.text:0000000078EF28B1 dq 9066666690666666h, 6690666690666666h
.text:0000000078EF28C1 db 2 dup(66h), 90h, 2 dup(66h), 90h
.text:0000000078EF28C7 ; ---------------------------------------------------------------------------
.text:0000000078EF28C7
.text:0000000078EF28C7 loc_78EF28C7: ; CODE XREF: memmove+1F5j
.text:0000000078EF28C7 cmp r9, 800h
.text:0000000078EF28CE jnb short loc_78EF2912
.text:0000000078EF28D0
.text:0000000078EF28D0 loc_78EF28D0: ; CODE XREF: memmove+279j
.text:0000000078EF28D0 ; memmove+299j
.text:0000000078EF28D0 mov rax, [rdx+rcx-8]
.text:0000000078EF28D5 mov r10, [rdx+rcx-10h]
.text:0000000078EF28DA sub rcx, 20h
.text:0000000078EF28DE mov [rcx+18h], rax
.text:0000000078EF28E2 mov [rcx+10h], r10
.text:0000000078EF28E6 mov rax, [rdx+rcx+8]
.text:0000000078EF28EB mov r10, [rdx+rcx]
.text:0000000078EF28EF dec r9
.text:0000000078EF28F2 mov [rcx+8], rax
.text:0000000078EF28F6 mov [rcx], r10
.text:0000000078EF28F9 jnz short loc_78EF28D0
.text:0000000078EF28FB and r8, 1Fh
.text:0000000078EF28FF jmp loc_78EF2877
.text:0000000078EF28FF ; ---------------------------------------------------------------------------
.text:0000000078EF2904 dq 9066666690666666h
.text:0000000078EF290C db 3 dup(66h), 90h, 66h, 90h
.text:0000000078EF2912 ; ---------------------------------------------------------------------------
.text:0000000078EF2912
.text:0000000078EF2912 loc_78EF2912: ; CODE XREF: memmove+24Ej
.text:0000000078EF2912 cmp rdx, 0FFFFFFFFFFFFF000h
.text:0000000078EF2919 ja short loc_78EF28D0
.text:0000000078EF291B
.text:0000000078EF291B loc_78EF291B: ; CODE XREF: memmove+324j
.text:0000000078EF291B mov eax, 20h
.text:0000000078EF2920
.text:0000000078EF2920 loc_78EF2920: ; CODE XREF: memmove+2B2j
.text:0000000078EF2920 sub rcx, 80h
.text:0000000078EF2927 prefetchnta byte ptr [rdx+rcx]
.text:0000000078EF292B prefetchnta byte ptr [rdx+rcx+40h]
.text:0000000078EF2930 dec eax
.text:0000000078EF2932 jnz short loc_78EF2920
.text:0000000078EF2934 add rcx, 1000h
.text:0000000078EF293B mov eax, 40h
.text:0000000078EF2940
.text:0000000078EF2940 loc_78EF2940: ; CODE XREF: memmove+314j
.text:0000000078EF2940 mov r9, [rdx+rcx-8]
.text:0000000078EF2945 mov r10, [rdx+rcx-10h]
.text:0000000078EF294A movnti [rcx-8], r9
.text:0000000078EF294F movnti [rcx-10h], r10
.text:0000000078EF2954 mov r9, [rdx+rcx-18h]
.text:0000000078EF2959 mov r10, [rdx+rcx-20h]
.text:0000000078EF295E movnti [rcx-18h], r9
.text:0000000078EF2963 movnti [rcx-20h], r10
.text:0000000078EF2968 mov r9, [rdx+rcx-28h]
.text:0000000078EF296D mov r10, [rdx+rcx-30h]
.text:0000000078EF2972 sub rcx, 40h
.text:0000000078EF2976 movnti [rcx+18h], r9
.text:0000000078EF297B movnti [rcx+10h], r10
.text:0000000078EF2980 mov r9, [rdx+rcx+8]
.text:0000000078EF2985 mov r10, [rdx+rcx]
.text:0000000078EF2989 dec eax
.text:0000000078EF298B movnti [rcx+8], r9
.text:0000000078EF2990 movnti [rcx], r10
.text:0000000078EF2994 jnz short loc_78EF2940
.text:0000000078EF2996 sub r8, 1000h
.text:0000000078EF299D cmp r8, 1000h
.text:0000000078EF29A4 jnb loc_78EF291B
.text:0000000078EF29AA lock or byte ptr [rsp+0], 0
.text:0000000078EF29AF jmp loc_78EF286E
It takes a few minutes to follow, because they tried to optimize by using different loops for different size ranges. The code isn't bad code, but neglecting to use SSE in this case seems foolish. M$ even unrolled the loops for the large sizes. Note the interesting use of prefetch, 4096 bytes at a time.
My problem is how to rework this function.
>Should I use a byte copy loop until I get the Source 16byte aligned and then use a MOVDQA from the source and a MOVDQU to the dest. Then at the end another byte copy loop to handle the size modulo 16 bytes left over.
>Or should I just use MOVDQU for both and then use a byte copy loop for the size modulo 16 at the end.
It's a difficult rework because you don't know how the function is going to be used. You have to make sure 10000 calls that each move only a few bytes are optimized as well as the calls that move a lot of memory. Maybe a jump table for sizes [0-15], then just use SSE until the size gets small enough to use the routines in the jump table. This is going to take a lot of benchmarking.
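Before committing to assembly, the first option (byte loop until the source is 16-byte aligned, 16-byte blocks, then a byte loop for the size modulo 16) can be modelled in C together with the direction choice the listing makes at its first branch. This is only a sketch under stated assumptions: the 16-byte `memcpy` calls stand in for MOVDQA/MOVDQU, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Low-to-high copy: head bytes until the source is 16-byte aligned,
   16-byte blocks, then the size % 16 tail. */
static void copy_forward(uint8_t *d, const uint8_t *s, size_t n)
{
    while (n > 0 && ((uintptr_t)s & 15) != 0) {
        *d++ = *s++;
        n--;
    }
    while (n >= 16) {
        uint8_t block[16];
        memcpy(block, s, 16);   /* stands in for an aligned 16-byte load */
        memcpy(d, block, 16);   /* stands in for an unaligned 16-byte store */
        d += 16;
        s += 16;
        n -= 16;
    }
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}

/* High-to-low copy, used when the destination lies above the source. */
static void copy_backward(uint8_t *d, const uint8_t *s, size_t n)
{
    while (n >= 16) {
        uint8_t block[16];
        n -= 16;
        memcpy(block, s + n, 16);
        memcpy(d + n, block, 16);
    }
    while (n > 0) {
        n--;
        d[n] = s[n];
    }
}

/* Picks the direction the way the listing does: forward when src is above dst. */
void *memmove_model(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    if (dst == src || n == 0)
        return dst;
    if ((uintptr_t)dst < (uintptr_t)src)
        copy_forward(dst, src, n);
    else
        copy_backward(dst, src, n);
    return dst;
}
```

Each 16-byte block is loaded in full before it is stored, which is what keeps overlapping ranges intact in both directions. The second option from the post would simply drop the head loop.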
Matrix 09 Dec 2005, 21:55
r22
>Should I use a byte copy loop until I get the Source 16byte aligned and then use a MOVDQA from the source and a MOVDQU to the dest.
I vote for this idea, and of course do prefetching. Reading whole cache lines might be the best, but maybe prefetchnta could do it. Alignment difficulties could occur when, let's say, the source buffer is aligned and the destination is not, unfortunately, but this is rarely the case. I do not have a 64-bit comp to test it ;<
BTW, there is a fast memory move thread in Main.
r22 10 Dec 2005, 07:19
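Here's my first crack at a memmove function.
It's not tested yet.
Code:
RtlMoveMemory2:
        ;;; rcx = dest, rdx = src, r8 = size
        ;;; Copies from the highest address down, so overlapping buffers
        ;;; are only handled correctly when dest > src.
        cmp     rdx,rcx
        jz      .end
        test    r8,r8
        jz      .end
        cmp     r8,1000h        ; .SSE moves 4096 bytes per pass, so enter only above that
        ja      .SSE
.copy:                          ; 16 bytes per pass through two GPRs
        sub     r8,16
        js      .clean
        mov     rax,[rdx+r8]
        mov     r9,[rdx+r8+8]
        mov     [rcx+r8],rax
        mov     [rcx+r8+8],r9
        jmp     .copy
.clean:
        add     r8,16           ; r8 = leftover bytes (0..15)
        jz      .end
        dec     r8              ; r8 = index of the last leftover byte
.bcopy:
        mov     al,[rdx+r8]
        dec     r8
        mov     [rcx+r8+1],al   ; mov leaves the flags from dec intact
        js      .end            ; stop once index 0 has been stored
        jmp     .bcopy
.SSE:
        mov     r9,r8
        mov     eax,20h
        sub     r9,4096         ; r9 = offset of the 4096-byte block about to be copied
.pref:
        prefetchnta byte [rdx+r9]
        prefetchnta byte [rdx+r9+40h]
        add     r9,80h
        dec     eax
        jnz     .pref
        mov     eax,20h         ; 32 passes x 128 bytes = 4096 bytes
.SSECopy:
        ;;; xmm6 and xmm7 are callee-saved in the Windows x64 calling
        ;;; convention; save and restore them (or use only xmm0-xmm5)
        ;;; before patching this into a system DLL.
        movdqu  xmm0,[rdx+r8-16]
        movdqu  xmm1,[rdx+r8-32]
        movdqu  xmm2,[rdx+r8-48]
        movdqu  xmm3,[rdx+r8-64]
        movdqu  xmm4,[rdx+r8-80]
        movdqu  xmm5,[rdx+r8-96]
        movdqu  xmm6,[rdx+r8-112]
        movdqu  xmm7,[rdx+r8-128]
        movdqu  [rcx+r8-16],xmm0
        movdqu  [rcx+r8-32],xmm1
        movdqu  [rcx+r8-48],xmm2
        movdqu  [rcx+r8-64],xmm3
        movdqu  [rcx+r8-80],xmm4
        movdqu  [rcx+r8-96],xmm5
        movdqu  [rcx+r8-112],xmm6
        movdqu  [rcx+r8-128],xmm7
        sub     r8,128
        dec     eax
        jnz     .SSECopy
        cmp     r8,4112
        jnb     .SSE
        jmp     .copy
.end:
        mov     rax,rcx         ; return dest
        ret     0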
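Since the routine is untested, a small harness that compares a candidate against the C library's memmove over many overlapping offsets and sizes is a cheap first check. The sketch below is an assumption-laden model, not r22's test setup: `count_mismatches` and `backward_only_copy` are hypothetical names, and the latter is a byte-wise stand-in for any copy that always walks from the last byte down to the first, as the posted routine does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void *(*move_fn)(void *dst, const void *src, size_t n);

/* Hypothetical byte-wise model of a copy that only ever runs high-to-low. */
void *backward_only_copy(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    while (n > 0) {
        n--;
        d[n] = s[n];
    }
    return dst;
}

/* Counts the (dst offset, src offset, size) cases where the candidate
   and memmove leave the buffer in different states. */
size_t count_mismatches(move_fn candidate, size_t max_size)
{
    enum { OFFSETS = 32, BUF = 128 };
    assert(candidate != NULL && max_size <= BUF - OFFSETS);
    size_t bad = 0;
    for (size_t d = 0; d < OFFSETS; d++) {
        for (size_t s = 0; s < OFFSETS; s++) {
            for (size_t n = 0; n <= max_size; n++) {
                uint8_t want[BUF], got[BUF];
                for (size_t i = 0; i < BUF; i++)
                    want[i] = got[i] = (uint8_t)(i * 7 + 1);  /* distinct bytes */
                memmove(want + d, want + s, n);
                candidate(got + d, got + s, n);
                if (memcmp(want, got, BUF) != 0)
                    bad++;
            }
        }
    }
    return bad;
}
```

Passing memmove itself should report zero mismatches, while the backward-only model reports failures for overlapping cases where the destination lies below the source. An assembly routine exported with the same signature could be dropped in as the candidate.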
Matrix 10 Dec 2005, 15:18
r22,
neat & tidy code, hope it's the best. BTW, I gotta buy a comp...
Madis731 10 Dec 2005, 18:44
Aren't the 64 and 32-bit variants of Intel CPUs equally priced? You can even buy a 64-bit Celeron for about $100 to test this code
r22 10 Dec 2005, 23:00
It seems for unaligned data using MOVDQU isn't the way to go in 64-bit land.
MOVDQU xmmx,mem128   latency 9
LDDQU  xmmx,mem128   latency 7
MOV    reg64,mem64   latency 3
Well, now I know why ntdll's memmove doesn't use SSE.
shism2 11 Dec 2005, 03:16
How about making an optimized version for 32-bit Windows XP?
r22 11 Dec 2005, 04:50
Trying to beat Microsoft's compiler for this memmove function will take a little longer than expected. I have to admit this function was well coded, but there's always room for improvement.
I'm not that familiar with prefetch and the non-temporal stuff. Once I'm done, I'll think about patching the 32-bit ntdll.dll, pretty much the same two functions: RtlInitUnicodeString and RtlMoveMemory. But what I really want to do is patch NTOSKRNL.EXE, because those functions are used by drivers, so I can expect a really big performance boost. From initial decompiles, it has almost an exact copy of the ntdll functions in it.
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.