flat assembler
Message board for the users of flat assembler.
x64 stack alignment, prologue/epilogue method for procedures
Feryno 28 May 2007, 05:39
The hardest part of win x64 asm coding is keeping the stack aligned at dqword (align 16).
1. First, we must align the stack at the exe entry point. Why is rsp misaligned at start? Look at kernel32.dll:
Code:
    BaseProcessStart:
    0000000078D59630 48894C2408           mov [rsp+08],rcx              ; put address of exe entrypoint into the stack
    0000000078D59635 4883EC28             sub rsp,28                    ; reserve 4 qwords of stack + stack align 16
    0000000078D59639 41B908000000         mov r9d,00000008
    0000000078D5963F 4C8D442430           lea r8,[rsp+30]
    0000000078D59644 418D5101             lea edx,[r9+01]
    0000000078D59648 48B9FEFFFFFFFFFFFFFF mov rcx,FFFFFFFFFFFFFFFE      ; -2=hThread of current process
    0000000078D59652 FF151080FEFF         call qword [0000000078D41668] ; []=0000000078EF1330=ntdll.NtSetInformationThread
    0000000078D59658 FF542430             call qword [rsp+30]           ; call the entrypoint
    0000000078D5965C 8BC8                 mov ecx,eax                   ; here we return from executable
    0000000078D5965E E8BDDB0000           call 0000000078D67220         ; KERNEL32.ExitThread
    0000000078D59663 CC                   int3
rsp is aligned 16 in kernel32.dll, and call qword [rsp+30] calls the exe entry point, so rsp is 1 qword off 16-byte alignment at the exe entry point.
Knowing the above behaviour of kernel32.dll, we can make the smallest possible win64 executable:
Code:
    start:
        xor eax,eax ; return value = 0
        ret
If the return value doesn't matter to you, you can omit zeroing eax and make an executable with only 1 instruction, a lone ret.
But back to alignment... I personally like to do this step at the exe entry point:
Code:
    start:
        sub rsp,8*(4+11)
This perfectly aligns the stack at 16. As a benefit it leaves 4 qwords of stack space for API and 11 qwords for us. This is the smallest possible instruction, it has only 4 bytes: 48 83 EC 78. If you use a bigger number (one that no longer fits a signed byte), the instruction has 7 bytes.
If you don't plan to call any API in the procedure start, then perhaps the smallest possible solution is e.g.
Code:
    start:
        push rax
        call main
        pop rax
        xor eax,eax
        ret
    main:
But then the task is to align the stack in procedure main.
2. Aligning stack in procedures. This is my preferred way.
It has the disadvantage that you can't use push/pop instructions between proc_prologue_done and proc_epilogue. But do you really need pushes/pops when you have 15 registers ??? And if you really need push/pop then you can use mov [rsp+...],reg64 instead !!! This way has 1 small benefit: you can access the stack using the RSP register, you don't need RBP to do it, so you have 1 extra free register (RBP) !!!
Code:
    proc:
    ; proc_prologue
        push rcx rdx rbx rsi rdi r8 r9 r10 r11
        a=1 ; return address
        b=9 ; number of pushed registers
        d=(sizeof.LV_ITEM64+7)/8
        e=4 ; number of qwords reserved for API
        c=(a+b+d+e) and 1 ; alignment at dqword
        sub rsp,8*(c+d+e)
    ; proc_prologue_done
        virtual at rsp+8*e
        lvi_ccc LV_ITEM64
        end virtual
    ; the stack looks now like:
    ; a <- the top, contains return address from procedure
    ; b <- pushed registers
    ; c <- it is 1 qword or none depending on a,b,d,e and is used to align 16
    ; d <- LV_ITEM64 structure
    ; e <- 4 qwords reserved for API
    ;   <- current RSP
    ; instructions of your procedure...
    ; if you need to obtain RCX pushed at proc_prologue, use
        mov rcx,[rsp+8*(8+e+d+c)]
    ; if you need to obtain R11 pushed at proc_prologue, use
        mov rcx,[rsp+8*(e+d+c+0)]
    ; proc_epilogue
        add rsp,8*(c+d+e)
        pop r11 r10 r9 r8 rdi rsi rbx rdx rcx
        ret
That's all ! This way isn't easy, so I thought about how to check the stack again, because we are all human and we make mistakes ! So I had 4 ideas, and combining 2 or 3 of them may greatly reduce the risk of stack misalignment:
a) check the source code manually again
b) let the program be single stepped (utility fta16.exe) - usable only for small executables, it can be too slow for big files. Advantage - it scans everything thoroughly, so no hidden misalignment can stay undiscovered !!! Because Vista dlls are much bigger than xp64 dlls, I strongly recommend using XP64 and not Vista64 for this. Checking a simple MessageBoxA takes about 1 minute under XP64 !!!
c) disassemble the program and check the disassembled output manually or with fxa16.exe (run fdisasm.exe your_prog.exe, then fxa16.exe your_prog.d64) - note, you can use it for checking a DLL too, but rename the dll to exe first (fdisasm checks for the exe extension of the input file)
d) use a testing instruction inside the procedure to cause an exception on misalignment, e.g. movdqa dqword [rsp],xmm0
You can catch the exceptions with a debugger. Some clever boy may think of macros so that movdqa [],xmm is emitted only in the development stage and not in the final (ready to release) build (e.g. by simply switching testing_mode=1 / testing_mode=0 ...)
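The a/b/c/d/e bookkeeping from the prologue can be checked with a few lines of arithmetic. The following is only a sketch in Python, not part of the original post: the function names and the structure sizes (72, 80 and 1 bytes) are made up for illustration, and the real value of sizeof.LV_ITEM64 is not assumed anywhere.

```python
# Model of the prologue bookkeeping, counted in qwords.
# Function names and the sample sizes below are hypothetical.

def prologue_qwords(pushed_regs, local_bytes, api_qwords=4):
    a = 1                        # return address
    b = pushed_regs              # registers pushed in the prologue
    d = (local_bytes + 7) // 8   # local structure, rounded up to qwords
    e = api_qwords               # qwords reserved for API calls
    c = (a + b + d + e) & 1      # one padding qword if the total is odd
    return a, b, c, d, e

def rsp_after_prologue(rsp_before_call, pushed_regs, local_bytes):
    # rsp_before_call is the caller's rsp just before the call instruction
    a, b, c, d, e = prologue_qwords(pushed_regs, local_bytes)
    return rsp_before_call - 8 * (a + b + c + d + e)

caller_rsp = 0x12FF00            # any 16-aligned value will do
for size in (72, 80, 1):
    assert rsp_after_prologue(caller_rsp, 9, size) % 16 == 0

# Entry point case: the loader's call leaves rsp 8 bytes off alignment,
# and sub rsp,8*(4+11) brings it back to a multiple of 16.
entry_rsp = caller_rsp - 8
assert (entry_rsp - 8 * (4 + 11)) % 16 == 0
```

Whatever the structure size, c comes out as 0 or 1 so that the total number of qwords below the caller's aligned rsp is even.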
MazeGen 29 May 2007, 15:39
Another useful way of checking 64-bit alignment of all memory accesses is to set the AC bit (number 18) in the RFlags register. Any misaligned access then causes an Alignment Check Fault (number 17).
Note that this method is useless in win32 since CR0.AM is always cleared. This bit masks RFlags.AC flag. More about these flags see below. According to my test under XP x64, this exception can be caught only if the application is run in a debugger (can't be if using single-step). Sample code:
Code:
    format PE64 GUI

    section '.code' code readable executable
        ; it is expected that CR0.AM is set here
        pushfq
        or qword [rsp], 1 SHL 18 ; set AC bit
        popfq
        mov [aligned_dq+1], rax  ; Alignment Check Fault
        ret                      ; ExitProcess

    section '.data' data readable writeable
        aligned_dq dq ?
                   db ?
FDBG.x86asm.net wrote:
Intel info about CR0.AM and RFlags.AC: Quote: Alignment Mask (bit 18 of CR0) — Enables automatic alignment checking when set; Quote:
Credits go to https://www.openrce.org/blog/view/359/Alignment_check
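The condition under which the sample code faults can be modelled roughly as follows. This is a simplified sketch, not taken from the post: it assumes every access is checked against its own size (natural alignment), the function name and the address 0x402000 are made up, and it relies on the fact that alignment checking only applies to code running at privilege level 3.

```python
AC_BIT = 1 << 18   # RFlags.AC, the bit set by "or qword [rsp], 1 SHL 18"
AM_BIT = 1 << 18   # CR0.AM sits at the same bit position in CR0

def alignment_fault(address, size, rflags, cr0, cpl=3):
    """Return True if the modelled access would raise an Alignment Check Fault."""
    checking_on = bool(rflags & AC_BIT) and bool(cr0 & AM_BIT) and cpl == 3
    return checking_on and address % size != 0

aligned_dq = 0x402000  # hypothetical address of an 8-byte aligned variable

# mov [aligned_dq+1], rax with both bits set: fault
assert alignment_fault(aligned_dq + 1, 8, AC_BIT, AM_BIT)
# same access with CR0.AM cleared: no fault
assert not alignment_fault(aligned_dq + 1, 8, AC_BIT, 0)
# aligned access never faults
assert not alignment_fault(aligned_dq, 8, AC_BIT, AM_BIT)
```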
Feryno 30 May 2007, 06:06
Wow MazeGen, that is a really nice and simple idea ! Thanks for it !
I started this topic because a PE32+ exe may run nicely under XP64 but crash when it is run in Vista x64 (the bigger the code, the higher the probability of a crash). The crashing is caused by stack misalignment in a sensitive API. Microsoft tried to speed up APIs, so APIs sometimes access 2 qwords of stack at once with e.g. movdqa [rsp+...],xmm0 and this causes an exception. The problem is that XP64 APIs don't use movdqa as frequently as APIs in Vista do, so a lot of bugs may stay quiet and hidden if you use XP64. I still haven't found any exception caused by misalignment of data, I found only stack misalignment bugs. I hope MS won't change the rules in such a way that misalignment of data causes API crashes too. But who knows whether it won't change in the future ? So I developed some not perfect but usable ways to test stack alignment and to reduce this unhappiness. Only reduce, not completely eliminate...
asmfan 30 May 2007, 08:53
Code:
    mov rax,rsp
    and rax,-16
    test rsp,1111b
    cmovnz rsp,rax
_________________
Any offers?
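The effect of the four instructions can be traced on plain integers. This is a small Python model written for this thread, not asmfan's code; the function names are made up. It suggests that the sequence ends up with the same rsp as a single and rsp,-16 would, because the conditional move only matters when the low 4 bits are nonzero.

```python
MASK64 = (1 << 64) - 1

def align_down_16(rsp):
    # what "and rsp,-16" would compute
    return rsp & (-16 & MASK64)

def asmfan_sequence(rsp):
    rax = rsp & (-16 & MASK64)   # mov rax,rsp / and rax,-16
    if rsp & 0b1111:             # test rsp,1111b
        rsp = rax                # cmovnz rsp,rax
    return rsp

for rsp in range(0x12FF00, 0x12FF40):
    assert asmfan_sequence(rsp) == align_down_16(rsp)
    assert asmfan_sequence(rsp) % 16 == 0
```

Either way the original rsp is lost, so something still has to remember it before the procedure returns.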
Feryno 30 May 2007, 09:18
to asmfan:
that is an elegant solution too, but the question is how to restore rsp before the RET instruction. I saw a clever solution by Jeremy Gordon, perhaps it was something like
Code:
    push rsp
    push qword [rsp]
    add rsp,8
    and spl,not 1111b
    ...
    instructions of procedure
    ...
    ret
But we are asm fans and we like to reduce code size, so the smallest solution is
Code:
    proc_prologue:
        push some_registers
        sub rsp,value
        ...
        instructions of procedure
        ...
    proc_epilogue:
        add rsp,value
        pop registers
        ret
There is only one "vulnerable" thing there - value. Perhaps it can be solved by a clever macro calculating a, b, c, d, e as described in the first post of the thread.
edit: this is Jeremy's trick:
Code:
    PUSH RSP     ;save current RSP position on the stack
    PUSH [RSP]   ;keep another copy of that on the stack
    OR SPL,8     ;adjust RSP to misalign the stack of 8 bytes and to point it to value of RSP to be restored by pop rsp
    ;
    ; parameters dealt with here
    ;
    SUB RSP,38h  ;adjust RSP to provide placeholders and align it at dqword
    CALL TheAPI
    ADD RSP,38h  ;get RSP back to correct place for next
    POP RSP
and the link: http://www.masm32.com/board/index.php?topic=4752.msg35524#msg35524
Last edited by Feryno on 03 Jul 2007, 12:52; edited 3 times in total
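Jeremy's trick can be stepped through on paper, or with a tiny simulation like the one below. This is a Python sketch written for illustration only: the stack is a dictionary, the function name is made up, and it assumes rsp is at least 8-byte aligned on entry. It relies on PUSH RSP storing the value rsp had before the push.

```python
def run_trick(rsp):
    """Simulate the sequence; return (rsp at CALL TheAPI, rsp after POP RSP)."""
    mem = {}
    original = rsp
    rsp -= 8                 # PUSH RSP: stores the pre-push value
    mem[rsp] = original
    value = mem[rsp]         # PUSH [RSP]: operand is read before rsp moves
    rsp -= 8
    mem[rsp] = value
    rsp |= 8                 # OR SPL,8: rsp now points at a copy of original
    rsp -= 0x38              # SUB RSP,38h
    at_call = rsp
    rsp += 0x38              # ADD RSP,38h
    rsp = mem[rsp]           # POP RSP
    return at_call, rsp

for start in (0x12FF00, 0x12FF08):   # aligned and misaligned entry
    at_call, restored = run_trick(start)
    assert at_call % 16 == 0
    assert restored == start
```

Both starting cases reach the call with rsp on a 16-byte boundary and leave with the original rsp, which is why two copies of rsp are pushed: whichever slot OR SPL,8 selects holds the saved value.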
vid 30 May 2007, 10:05
Feryno: please use [code] tags
they make your posts much more readable.
HyperVista 31 May 2007, 02:49
First, I find this thread extremely interesting and helpful. Thanks Feryno, MazeGen and asmfan.
I have leaf functions written in fasm that are linked into C code as .lib files. I'm currently porting my fasm leaf functions to x64. Do I need to worry about stack alignment in my x64 fasm leaf functions? The leaf functions are quite simplistic (I just do simple things like RDMSR, check and set bits in CR0 and CR4, etc.). They do require moving immediate values and MSRs/CRs into registers. And I do have to push and pop a number of registers before and after my function routines. I understand that my 64-bit C driver code will have to ensure stack alignment, but I'm not sure if my x64 fasm leaf functions need the prolog and epilog stack alignment routines you've discussed here. Thanks.
asmfan 31 May 2007, 06:38
Interesting MS article:
http://msdn2.microsoft.com/en-us/library/aa290049(VS.71).aspx
PS. what happened to links /no tag processing/?
MazeGen 31 May 2007, 10:28
Hi HyperVista,
MSDN is quite a good source of basic x64 information, one just needs to get familiar with it.
http://msdn2.microsoft.com/en-us/library/67fa79wz(VS.80).aspx
Quote: A leaf function is one that does not require a function table entry. It cannot call any functions, allocate space, or save any nonvolatile registers. It is allowed to leave the stack unaligned while it executes.
HyperVista 31 May 2007, 11:28
Thanks MazeGen and asmfan for the links. I've been poring over lots of on-line material but had not seen the quote you gave, MazeGen, about leaf functions being allowed to leave the stack unaligned. It doesn't get clearer than that! Thanks!
asmfan - I noticed too that the link processor is not working, but I think it's not working only with links that contain parentheses (). I can't say I've ever noticed a url with parentheses, and maybe this is something new and not processed well by this board's software. I'm just guessing, though. Thanks again guys!
Feryno 31 May 2007, 11:59
Hello, HyperVista,
My personal experience is that the stack must be aligned only at the time of calling an API. When I make a procedure which doesn't call any API, I don't worry about stack alignment - I don't know whether that is right or not, but I have never had any problem in procedures with misaligned rsp when they didn't call any API. If the procedure itself accesses the stack with e.g. movdqa, then the stack must be aligned (or misaligned but addressed with a shift of 8 bytes). If you produce your own DLL which doesn't use movdqa to access stack space, then you can call a function of the DLL without worrying - but only if the function itself doesn't call a sensitive Microsoft API (the safest assumption is that every MS API is sensitive - if not at present, then certainly in the future...). Not every Microsoft API crashes, e.g. under XP64 SP1 as well as SP2 MessageBoxA is OK with a misaligned stack, but Vista RTM crashes in a simple MessageBoxA with unaligned RSP. I think that in drivers, stack alignment is even more important than in ring3 code.
I started this thread only because I discovered a lot of silent mistakes in my project (fdbg) - it ran well on XP64 but the mistakes appeared in Vista, so I wondered whether my project contained more stack misalignments that are still silent and hidden in the current version of Vista. I let fdbg be single stepped with a small utility, but it was too slow. Then I disassembled fdbg.exe and analyzed the disassembled output with another small utility - it scanned the output for sub rsp, strings, counted the pushed registers preceding each sub rsp instruction, and finally calculated whether the stack is or isn't aligned properly.
This solution isn't ultimate and it can be easily cheated by inserting another instruction between the pushes and sub rsp in procedure prologues - the routine for checking stack alignment is extremely simple and without any artificial intelligence. So this utility only decreases the probability of occurrence of misalignment.
sample how to cheat the utility:
    push rbx
    mov rbx,rax ; <-here the simple checking utility stops and it thinks that there isn't any push preceding sub rsp,8*4
    sub rsp,8*4 ; <-here the routine starts check, it calculates 4 qwords for subtracting
    ...
    add rsp,8*4
    pop rbx
    ret
this procedure gives stack alignment OK result:
    push rbx    ; <- second step after getting 4 qwords, now the checking routine calculates 1 qword of pushed register
    sub rsp,8*4 ; <- start of calculating, checking routine calculates 4 qwords
    mov rbx,rax
    ...
    add rsp,8*4
    pop rbx
    ret
the final step is 1+4=5 and thus the result is OK, because 1 qword of return address from the procedure is at the top, so 5+1=6 qwords and stack alignment is kept.
But even the disasm output needn't be exact, e.g.
    jmp label1
    db 'string'
    label1:
I think that we can easily avoid stack misalignment by keeping the rule of not using push/pops/modifications of stack between procedure prologue and procedure epilogue. Aligning the stack in the prologue should be done by clever macros using the method described in the first post. My idea is using macros, something like:
    proc good_procedure2 uses rbx rdi, LVITEM
then the macro calculates the number of pushed registers, the size of the LVITEM structure etc., like the manual calculation in the first post. I know the manual method, but I'm not able to construct such an automated sophisticated macro... Especially for drivers it would be a great idea to use macros to avoid human mistakes and to leave the job to an automated macro. This is the time when I decided that macros are necessary in win64 and that they may reduce (or even ultimately avoid) the risk of human mistakes.
Even if somebody produces such macros, it is good to know the concept and the reasons for developing them, so I suppose this thread is useful, and perhaps at least a short story should be included in a commented part of the macro, ending with these 2 sentences: The worst thing about stack misalignment is that it is not easy to find. The misalignment is in the user's procedure, but the crash occurs in the code segment of a DLL when it accesses the stack with e.g. a movdqa xmm0,[rsp + ...] instruction. Or the crash doesn't even occur now, but may occur in the future...
P.S. the unaligned procedure of the program is found by analyzing the qwords on the stack when the exception occurs and checking which qword is the return address into the program's procedure - but the crashing subprocedure may be the N-th nested call from the program's procedure, and every subproc may subtract a lot of qwords from the stack...
bitRAKE 12 Aug 2008, 03:42
For size optimization ENTER/LEAVE instructions are definitely the way to go, imho.
Code:
    enter 8*4,0
    ...
    leave
    retn
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
asmfan 12 Aug 2008, 10:29
I must point out that enter must take an even number of qwords as its 1st param to keep the stack aligned by 16, because rbp is automatically pushed by enter:
((cnt_of_params + 1) and -2)*8 |
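The formula rounds the qword count up to the next even number before converting to bytes. A quick Python check of that arithmetic follows; the function names are invented here, and the model assumes rsp is 8 bytes off a 16-byte boundary on procedure entry, as it is right after a call from aligned code.

```python
def enter_bytes(cnt_of_params):
    # ((cnt_of_params + 1) and -2)*8
    return ((cnt_of_params + 1) & -2) * 8

def rsp_after_enter(rsp_at_entry, size):
    rsp = rsp_at_entry - 8   # enter pushes rbp first
    return rsp - size        # then reserves the requested bytes

entry_rsp = 0x12FEF8         # return address already on the stack
for n in range(9):
    assert rsp_after_enter(entry_rsp, enter_bytes(n)) % 16 == 0

print([enter_bytes(n) for n in range(6)])
```

For 4 qwords the result is 32 bytes, the same as the enter 8*4,0 shown earlier, while 5 qwords are rounded up to 48 bytes.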
bitRAKE 12 Aug 2008, 14:51
asmfan wrote: must point that enter must take even number as 1st param to align stack by 16 because rbp is automatically pushed by enter.
It might be advantageous to do something like:
Code:
    enter 0,13
    pop rbp
    pop rcx
    pop rdx
    pop r8
    pop r9
    push r9
    push r8
    push rdx
    push rcx
    call [CreateWindowEx]
    leave
Feryno 13 Aug 2008, 06:41
stack alignment is extremely necessary, especially in drivers written in FASM
it will be a big achievement if we find an easy and safe method to do it (and a fast one of course, using as few instructions as possible)
bitRAKE 13 Aug 2008, 21:42
Feryno, thanks for your work on fdbg - I'm using it daily and I know countless hours of frustration have been saved by yours. ENTER/LEAVE is the general solution I'll be using, but optimization for speed will require custom solutions as usual. The shorter addressing of RBP offsets and the liberation of RBP for use through API calls makes ENTER a better solution, imho. Of course, RBP/RSP would have to be restored in an alternate way when RBP is used. I also imagine pushing RSI/RDI prior to ENTER in many routines, or a custom calling convention internally. Oh where is PUSHAQ, AMD? Didn't they see the value in it?
Madis731 14 Aug 2008, 13:31
Yeah - I've missed PUSHAQ. They think that with 16 registers you don't need to push häh!
When it would have been okay to push 8 registers, now there are SIXTEEN of them and no other way than manually.
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.