flat assembler
Message board for the users of flat assembler.

x64 stack alignment, prologue/epilogue method for procedures

Feryno



Joined: 23 Mar 2005
Posts: 509
Location: Czech republic, Slovak republic
Feryno 28 May 2007, 05:39
The hardest part of win x64 asm coding is keeping the stack aligned at dqword (align 16).


1. First, we must align the stack at the exe entry point.

Why is rsp misaligned at start?
kernel32.dll:
Code:
0000000078D59630 48894C2408           BaseProcessStart: mov [rsp+08],rcx ; put address of exe entry point into the stack
0000000078D59635 4883EC28             sub rsp,28 ; 28h bytes = 4 qwords reserved for API + 1 qword to align stack 16
0000000078D59639 41B908000000         mov r9d,00000008
0000000078D5963F 4C8D442430           lea r8,[rsp+30]
0000000078D59644 418D5101             lea edx,[r9+01]
0000000078D59648 48B9FEFFFFFFFFFFFFFF mov rcx,FFFFFFFFFFFFFFFE ; -2 = pseudo-handle of the current thread
0000000078D59652 FF151080FEFF         call qword [0000000078D41668] ; []=0000000078EF1330=ntdll.NtSetInformationThread
0000000078D59658 FF542430             call qword [rsp+30] ; call the entry point
0000000078D5965C 8BC8                 mov ecx,eax ; here we return from executable
0000000078D5965E E8BDDB0000           call 0000000078D67220 ; KERNEL32.ExitThread
0000000078D59663 CC                   int3


rsp is aligned 16 inside kernel32.dll just before call qword [rsp+30]. That call pushes an 8-byte return address, so rsp is 1 qword off 16-byte alignment at the exe entry point.

Knowing the above behaviour of kernel32.dll, we can make the smallest possible win64 executable:

Code:
start:
        xor     eax,eax         ; return value = 0
        ret


If the return value doesn't matter to you, you can omit zeroing eax and make an executable with only 1 instruction: ret.

But back to alignment...

I personally like to do this step at the exe entry point:

Code:
start:
        sub     rsp,8*(4+11)


This aligns the stack 16 perfectly. As a benefit it leaves 4 qwords of stack space for API and 11 qwords for us.
This is the smallest possible encoding of the instruction: only 4 bytes, 48 83 EC 78. If you use a number bigger than 7Fh, the immediate no longer fits in a signed byte and the instruction has 7 bytes. If you don't plan to call any API in the procedure start, then perhaps the smallest possible solution is e.g.

Code:
start:
        push    rax
        call    main
        pop     rax
        xor     eax,eax
        ret

main:


But then the task is to align stack at procedure main.
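
To tie section 1 together, here is a minimal sketch of a complete program whose entry point reserves 5 qwords (4 for the API plus 1 to fix the alignment) before its first API call. The label names, the strings and the hand-written import table are only illustrative; treat it as one possible layout, not as the only way to do it.

Code:
format PE64 GUI
entry start

section '.text' code readable executable

start:
        sub     rsp,8*5         ; return address (8) + 40 = 48 bytes, a multiple of 16

        xor     r9d,r9d         ; uType = MB_OK
        lea     r8,[_caption]
        lea     rdx,[_message]
        xor     ecx,ecx         ; hWnd = NULL
        call    [MessageBoxA]   ; rsp is aligned 16 at the moment of the call

        xor     ecx,ecx         ; exit code 0
        call    [ExitProcess]

section '.data' data readable writeable

_caption db 'Win64',0
_message db 'rsp is aligned 16',0

section '.idata' import data readable writeable

        dd      0,0,0,RVA kernel_name,RVA kernel_table
        dd      0,0,0,RVA user_name,RVA user_table
        dd      0,0,0,0,0

kernel_table:
ExitProcess dq RVA _ExitProcess
        dq      0
user_table:
MessageBoxA dq RVA _MessageBoxA
        dq      0

kernel_name db 'KERNEL32.DLL',0
user_name db 'USER32.DLL',0

_ExitProcess dw 0
        db      'ExitProcess',0
_MessageBoxA dw 0
        db      'MessageBoxA',0

Any odd number of qwords works in the sub; 5 is simply the smallest one that still leaves the 4 qwords the API expects.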


2. Aligning the stack in procedures.

This is my preferred way. It has the disadvantage that you can't use push/pop instructions between proc_prologue_done and proc_epilogue. But do you really need pushes/pops when you have 15 registers? And if you really need push/pop, you can use mov [rsp+...],reg64 instead! This way has 1 small benefit: you can access the stack using the RSP register and you don't need RBP to do it, so you have 1 extra free register (RBP)!

Code:
proc:
; proc_prologue
        push    rcx rdx rbx rsi rdi r8 r9 r10 r11
a=1                             ; return address
b=9                             ; number of pushed registers
d=(sizeof.LV_ITEM64+7)/8        ; size of LV_ITEM64 in qwords, rounded up
e=4                             ; number of qwords reserved for API
c=(a+b+d+e) and 1               ; alignment at dqword
        sub     rsp,8*(c+d+e)
; proc_prologue_done

virtual at rsp+8*e
lvi_ccc LV_ITEM64
end virtual

; the stack looks now like:
; a <- the top, contains return address from procedure
; b <- pushed registers
; c <- it is 1 qword or none depending on a,b,d,e and is used to align 16
; d <- LV_ITEM64 structure
; e <- 4 qwords reserved for API
;   <- current RSP

; instructions of your procedure...
; if you need to obtain RCX pushed at proc_prologue, use mov rcx,[rsp+8*(8+e+d+c)]
; if you need to obtain R11 pushed at proc_prologue, use mov r11,[rsp+8*(e+d+c+0)]

; proc_epilogue
        add     rsp,8*(c+d+e)
        pop     r11 r10 r9 r8 rdi rsi rbx rdx rcx
        ret


That's all !
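
A worked instance with small numbers may help. This is only a sketch: example_proc and its 16-byte local buffer are made up for illustration, and use64 is there just so the fragment assembles on its own.

Code:
use64

example_proc:
; proc_prologue
        push    rbx rsi
a = 1                           ; return address
b = 2                           ; number of pushed registers
d = (16+7)/8                    ; a 16-byte local buffer = 2 qwords
e = 4                           ; qwords reserved for API
c = (a+b+d+e) and 1             ; 1+2+2+4 = 9 is odd, so c = 1
        sub     rsp,8*(c+d+e)   ; 8*7 = 56 bytes
; proc_prologue_done, rsp is a multiple of 16 here

        mov     qword [rsp+8*e],0       ; first qword of the local buffer
        mov     rsi,[rsp+8*(c+d+e)]     ; rsi as pushed in the prologue
        mov     rbx,[rsp+8*(c+d+e+1)]   ; rbx as pushed in the prologue

; proc_epilogue
        add     rsp,8*(c+d+e)
        pop     rsi rbx
        ret

The whole frame is a+b+c+d+e = 10 qwords = 80 bytes, a multiple of 16. With one more pushed register the sum a+b+d+e would already be even and c would become 0.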

This way isn't easy, so I thought about how to check the stack again, because we are all human and we make mistakes!
So I had 4 ideas, and combining 2 or 3 of them may greatly reduce the risk of stack misalignment:
a) check the source code manually again
b) let the program be single stepped (utility fta16.exe) - usable only for small executables, it can be too slow for big files. Advantage - it scans everything thoroughly, so no hidden misalignment can stay undiscovered!
Because Vista dlls are much bigger than XP64 dlls, I strongly recommend using XP64 rather than Vista64 for this. Checking a simple MessageBoxA takes about 1 minute under XP64!
c) disassemble the program and check the disassembled output manually or with fxa16.exe (fdisasm.exe your_prog.exe fxa16.exe your_prog.d64) - note, you can use it for checking a DLL too, but rename the dll to exe first (fdisasm checks the exe extension of the input file)
d) use a testing instruction inside the procedure that causes an exception on misalignment, e.g. "movdqa dqword [rsp],xmm0"; you can catch the exceptions with a debugger
Someone clever may think of macros so that movdqa [],xmm is emitted only in the development stage and not in the final (ready to release) build (e.g. by simply switching testing_mode=1 / testing_mode=0 ...)
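
Idea d) could be wrapped roughly like this. It is a sketch only: the macro name check_stack_align and the procedure some_proc are invented here, and testing_mode is used the way the paragraph above suggests.

Code:
use64

testing_mode = 1                ; 1 = development build, 0 = release build

macro check_stack_align
{
  if testing_mode = 1
        movdqa  dqword [rsp],xmm0       ; raises an exception unless rsp is a multiple of 16
  end if
}

some_proc:
        sub     rsp,8*(4+1)     ; 4 qwords for API + 1 qword to align
        check_stack_align       ; overwrites the 16 bytes at [rsp], unused at this point
        ; ... instructions of procedure ...
        add     rsp,8*(4+1)
        ret

With testing_mode = 0 the macro emits nothing, so the release build pays no cost. Note that the store clobbers the two lowest qwords of the area reserved for API, so the check belongs right after the prologue, before anything is kept there.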


Attachment: win64_stack_align.zip (42.41 KB)
Description: utilities for checking stack alignment, design for prologue/epilogue of procedure

MazeGen



Joined: 06 Oct 2003
Posts: 977
Location: Czechoslovakia
MazeGen 29 May 2007, 15:39
Another useful way of checking 64-bit alignment of all memory accesses is to set the AC bit (number 18) in the RFlags register. Any misaligned access then causes an Alignment Check Fault (number 17).

Note that this method is useless in win32 since CR0.AM is always cleared there. This bit masks the RFlags.AC flag. See below for more about these flags.

According to my test under XP x64, this exception can be caught only if the application is run in a debugger (it can't be caught when single-stepping). Sample code:
Code:
format PE64 GUI

section '.code' code readable executable

 ; it is expected that CR0.AM is set here

 pushfq
 or qword [rsp], 1 SHL 18           ; set AC bit
 popfq

 mov [aligned_dq+1], rax          ; Alignment Check Fault

 ret                                     ; ExitProcess

section '.data' data readable writeable

aligned_dq dq ?
db ?
    

FDBG.x86asm.net wrote:

Exception. ProcessId=00000638h ThreadId=0000061Ch Address=000000000040100Ah ExceptionCode=80000002h=EXCEPTION_DATATYPE_MISALIGNMENT first_chance

Intel info about CR0.AM and RFlags.AC:
Quote:
Alignment Mask (bit 18 of CR0) — Enables automatic alignment checking when set;
disables alignment checking when clear. Alignment checking is performed only when
the AM flag is set, the AC flag in the EFLAGS register is set, CPL is 3, and the
processor is operating in either protected or virtual-8086 mode.

Quote:

Alignment check (bit 18 ) — Set this flag and the AM flag in control register CR0 to
enable alignment checking of memory references; clear the AC flag and/or the AM flag
to disable alignment checking. An alignment-check exception is generated when reference
is made to an unaligned operand, such as a word at an odd byte address or a
doubleword at an address which is not an integral multiple of four. Alignment-check
exceptions are generated only in user mode (privilege level 3). Memory references that
default to privilege level 0, such as segment descriptor loads, do not generate this
exception even when caused by instructions executed in user-mode.
The alignment-check exception can be used to check alignment of data. This is useful
when exchanging data with processors which require all data to be aligned. The alignment-
check exception can also be used by interpreters to flag some pointers as special
by misaligning the pointer. This eliminates overhead of checking each pointer and only
handles the special pointer when used.

Credits go to https://www.openrce.org/blog/view/359/Alignment_check
MazeGen



Joined: 06 Oct 2003
Posts: 977
Location: Czechoslovakia
MazeGen 29 May 2007, 16:06
EliCZ just told me that some functions such as memcpy and memset are not AC-aware, so if you set the AC flag you can get this exception even from within these functions. This makes searching for your own unaligned accesses more difficult. The only way is to clear the AC flag just before you call these functions :S
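
A possible shape for that workaround, as a sketch only: the macro names set_ac and clear_ac are invented here, and the local routine not_ac_aware merely stands in for memcpy, memset and similar functions.

Code:
use64

; turn alignment checking on (has an effect only if CR0.AM is set)
macro set_ac
{
        pushfq
        bts     qword [rsp],18  ; RFlags.AC = 1
        popfq
}

; turn alignment checking off again
macro clear_ac
{
        pushfq
        btr     qword [rsp],18  ; RFlags.AC = 0
        popfq
}

        set_ac
        ; ... own code, checked for misaligned accesses ...
        clear_ac
        call    not_ac_aware    ; no Alignment Check Fault can come from inside
        set_ac
        ; ... own code, checked again ...
        clear_ac
        ret

not_ac_aware:
        ret

The flags modified by bts/btr themselves do not leak out, because popfq reloads the whole register from the edited stack copy.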
Feryno



Joined: 23 Mar 2005
Posts: 509
Location: Czech republic, Slovak republic
Feryno 30 May 2007, 06:06
Wow MazeGen, that is a really nice and simple idea! Thanks for it!

I started this topic because a PE32+ exe may run nicely under XP64 but crash when it is run in Vista x64 (the bigger the code, the higher the probability of a crash). The crash is caused by stack misalignment in sensitive APIs. Microsoft tried to speed up APIs, so APIs sometimes access 2 qwords of stack at once with e.g. movdqa [rsp+...],xmm0, and this causes an exception.
The problem is that XP64 APIs don't use movdqa as frequently as APIs in Vista do, so a lot of bugs may stay quiet and hidden if you use XP64.

I still haven't found any exception caused by misalignment of data, I have found only stack misalignment bugs. I hope MS won't change the rules in such a way that misalignment of data causes API crashes too. But who knows whether it will change in the future?

So I developed some not perfect but usable ways to test stack alignment and to reduce this unhappiness. Only reduce, not completely eliminate...
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
asmfan 30 May 2007, 08:53
Code:
        mov     rax,rsp
        and     rax,-16
        test    rsp,1111b
        cmovnz  rsp,rax
    

_________________
Any offers?
Feryno



Joined: 23 Mar 2005
Posts: 509
Location: Czech republic, Slovak republic
Feryno 30 May 2007, 09:18
to asmfan:
that is an elegant solution too, but the question is how to restore rsp before RET instruction

I saw a clever solution by Jeremy Gordon, perhaps it was something like
Code:
push rsp
push qword [rsp]
add rsp,8
and spl,not 1111b
... instructions of procedure
...
ret    


But we are asm fans and we like to reduce code size, so the smallest solution is
Code:
proc_prologue:
push some_registers
sub rsp,value
... instructions of procedure
...
proc_epilogue:
add rsp,value
pop registers
ret    


There is only one "vulnerable" thing - value. Perhaps it can be solved by a clever macro that calculates a, b, c, d, e as described in the first post of the thread.

edit:
this is Jeremy's trick:
Code:
PUSH RSP             ;save current RSP position on the stack
PUSH [RSP]           ;keep another copy of that on the stack
OR SPL,8             ;force RSP to be 8 bytes off dqword alignment; it now points at one of the two saved copies of RSP (restored later by POP RSP)
                     ;
                     ;  parameters dealt with here
                     ;
SUB RSP,38h          ;adjust RSP to provide placeholders and align it at dqword
CALL TheAPI
ADD RSP,38h          ;get RSP back to correct place for next
POP RSP    

and the link:
http://www.masm32.com/board/index.php?topic=4752.msg35524#msg35524
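
For comparison, a more conventional answer to the "how to restore rsp" question is to remember the incoming rsp in rbp. The sketch below is not from Jeremy's post; aligned_proc is a made-up name, and the price is that RBP is no longer a free register.

Code:
use64

aligned_proc:
        push    rbp
        mov     rbp,rsp         ; rbp remembers rsp, whatever its alignment was
        and     rsp,-16         ; round rsp down to a multiple of 16
        sub     rsp,8*4         ; 4 qwords reserved for API, alignment is kept
        ; ... instructions of procedure, API calls ...
        mov     rsp,rbp         ; undo both adjustments at once
        pop     rbp
        ret

Because the epilogue reloads rsp from rbp, it does not matter how many bytes the and instruction actually removed.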


Last edited by Feryno on 03 Jul 2007, 12:52; edited 3 times in total
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid 30 May 2007, 10:05
Feryno: please use [code] tags ;)
they make your posts much more readable.
HyperVista



Joined: 18 Apr 2005
Posts: 691
Location: Virginia, USA
HyperVista 31 May 2007, 02:49
First, I find this thread extremely interesting and helpful. Thanks Feryno, MazeGen and asmfan.

I have leaf functions written in fasm that are linked into C code as .lib files. I'm currently porting my fasm leaf functions to x64. Do I need to worry about stack alignment in my x64 fasm leaf functions? The leaf functions are quite simple (I just do things like RDMSR, check and set bits in CR0 and CR4, etc.). They do require moving immediate values and MSRs/CRs into registers. And I do have to push and pop a number of registers before and after my function routines. I understand that my 64-bit C driver code will have to ensure stack alignment, but I'm not sure if my x64 fasm leaf functions need the prolog and epilog stack alignment routines you've discussed here.

Thanks.
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
asmfan 31 May 2007, 06:38
Interesting MS article
http://msdn2.microsoft.com/en-us/library/aa290049(VS.71).aspx

PS. what happened to links /no tag processing/?

_________________
Any offers?
MazeGen



Joined: 06 Oct 2003
Posts: 977
Location: Czechoslovakia
MazeGen 31 May 2007, 10:28
Hi HyperVista,

MSDN is quite a good source of basic x64 information, one just needs to get familiar with it.

http://msdn2.microsoft.com/en-us/library/67fa79wz(VS.80).aspx

Quote:
A leaf function is one that does not require a function table entry. It cannot call any functions, allocate space, or save any nonvolatile registers. It is allowed to leave the stack unaligned while it executes.
HyperVista



Joined: 18 Apr 2005
Posts: 691
Location: Virginia, USA
HyperVista 31 May 2007, 11:28
Thanks MazeGen and asmfan for the links. I've been poring through lots of on-line material but had not seen the quote you gave, MazeGen, re: leaf functions can be left unaligned. It doesn't get more clear than that! Thanks!

asmfan - I noticed too that the link processor is not working, but I think it's not working only with links that contain parentheses (). I can't say I've ever noticed a url with parentheses and maybe this is something new and not processed well by this board's software. I'm just guessing, though.

Thanks again guys!
Feryno



Joined: 23 Mar 2005
Posts: 509
Location: Czech republic, Slovak republic
Feryno 31 May 2007, 11:59
Hello, HyperVista,
My personal experience is that the stack must be aligned only at the time of calling an API.
When I make a procedure which doesn't call any API, I don't worry about stack alignment - I don't know whether that is right or not, but I have never had any problem in rsp-misaligned procedures when they didn't call any API. If the procedure itself accesses the stack with e.g. movdqa, then the stack must be aligned (or misaligned but addressed with a shift of 8 bytes). If you produce your own DLL which doesn't use movdqa to access stack space, then you can call functions of the DLL without worrying - but only if the function itself doesn't call a sensitive Microsoft API (the best assumption is to suppose every MS API to be sensitive - if not at present, then certainly in the future...)
Not every Microsoft API crashes, e.g. under XP64 SP1 as well as SP2 MessageBoxA is OK with a misaligned stack, but Vista RTM crashes in a simple MessageBoxA with unaligned RSP.
I think that in drivers, stack alignment is even more important than in ring3 code.

I started this thread only because I discovered a lot of silent mistakes in my project (fdbg) - it ran well on XP64 but the mistakes appeared in Vista, so I wondered whether my project contained more stack misalignments still silent and hidden in the current version of Vista. I let fdbg be single stepped with a small utility, but it was too slow. Then I disassembled fdbg.exe and analyzed the disassembled output with another small utility - the utility scanned the disassembled output for the strings
sub rsp,
and then it counted the pushed registers preceding the sub rsp instruction and finally calculated whether the stack is or isn't aligned properly.
This solution isn't ultimate and it can easily be cheated by inserting another instruction between the pushes and sub rsp in procedure prologues - the routine that checks stack alignment is extremely simple and without any artificial intelligence. So this utility only decreases the probability of an occurrence of misalignment.
Sample of how to cheat the utility (the stack is in fact aligned here, but the utility sees only 4+1=5 qwords and reports a misalignment):
Code:
        push    rbx
        mov     rbx,rax         ; <- here the simple checking utility stops and thinks that there isn't any push preceding sub rsp,8*4
        sub     rsp,8*4         ; <- here the routine starts the check, it counts 4 qwords for subtracting
        ...
        add     rsp,8*4
        pop     rbx
        ret

This procedure gives a stack alignment OK result:
Code:
        push    rbx             ; <- second step after getting 4 qwords, now the checking routine counts 1 qword of pushed register
        sub     rsp,8*4         ; <- start of calculating, checking routine counts 4 qwords
        mov     rbx,rax
        ...
        add     rsp,8*4
        pop     rbx
        ret

The final sum is 1+4=5 and the result is OK because 1 qword of return address from the procedure is at the top, so 5+1=6 qwords, an even number, and stack alignment is kept.

But even the disasm output needn't be exact, e.g.
Code:
        jmp     label1
        db      'string'
label1:

I think that we can easily avoid stack misalignment by keeping the rule of not using push/pop or other modifications of the stack between procedure prologue and procedure epilogue. Aligning the stack in the prologue should be done by clever macros using the method described in the first post.
My idea is using macros, something like:
proc good_procedure2 uses rbx rdi, LVITEM
then the macro calculates the number of pushed registers, the size of the LVITEM structure etc., like the manual calculation in the first post.
I know the manual method, but I'm not able to construct such an automated sophisticated macro...
Especially for drivers it would be a great idea to use macros to avoid human mistakes and to leave the job to an automated macro. This is the point when I decided that macros are necessary in win64 and that they may reduce (or even completely avoid) the risk of human mistakes. Even if somebody produces such macros, it is good to know the concept and the reasons for developing them, so I suppose this thread is useful, and perhaps at least a short story should be included in the commented part of the macro, ending with these 2 sentences:
The worst thing about stack misalignment is the fact that it is not easy to find. The stack misalignment is in the user's procedure, but the crash occurs in the code of a DLL when it accesses the stack with e.g. a movdqa xmm0,[rsp + ...] instruction. Or the crash doesn't occur now at all, but may occur in the future...
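
A very small step in that direction could look like the sketch below. It is far from a full "proc ... uses ..." macro: frame_size is a hypothetical helper that only automates the a, b, c, d, e arithmetic of the first post, and the caller still writes the pushes and pops by hand. The 24-byte local area is an arbitrary example value.

Code:
use64

; frame_size NAME, PUSHED, LOCAL_BYTES  (hypothetical helper, not part of fasm)
; defines NAME as the byte count to use in "sub rsp" and "add rsp"
macro frame_size name, pushed, local_bytes
{
  local a, d, e, c
  a = 1                         ; return address
  d = (local_bytes+7)/8         ; local variables in qwords, rounded up
  e = 4                         ; qwords reserved for API
  c = (a+pushed+d+e) and 1      ; 1 padding qword if the sum is odd
  name = 8*(c+d+e)
}

good_procedure2:
        push    rbx rdi
        frame_size good_procedure2.frame, 2, 24
        sub     rsp,good_procedure2.frame       ; 8*(0+3+4) = 56
        ; ... instructions of procedure, no push/pop here ...
        add     rsp,good_procedure2.frame
        pop     rdi rbx
        ret

The number of pushed registers is still passed by hand, so that argument remains a place where a human mistake can slip in.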

P.S. the unaligned procedure of the program is found by analyzing the qwords in the stack when the exception occurs and checking which qword is the return address into the procedure - but the faulting subprocedure may be the N-th nested call from the program's procedure and every subproc may subtract a lot of qwords of stack...
bitRAKE



Joined: 21 Jul 2003
Posts: 4020
Location: vpcmpistri
bitRAKE 12 Aug 2008, 03:42
For size optimization ENTER/LEAVE instructions are definitely the way to go, imho.
Code:
        enter 8*4,0
...
        leave
        retn    
...yeah, it is most likely slower.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
asmfan 12 Aug 2008, 10:29
I must point out that enter must take an even number of qwords as its 1st param to align the stack by 16, because rbp is automatically pushed by enter.
((cnt_of_params + 1) and -2)*8
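
A worked instance of that formula, as a sketch: the value 5 for cnt_of_params and the names frame_bytes and some_proc are made up for illustration.

Code:
use64

cnt_of_params = 5                               ; hypothetical API with 5 parameters
frame_bytes = ((cnt_of_params + 1) and -2)*8    ; 5 is rounded up to 6 qwords = 48 bytes

some_proc:
        enter   frame_bytes,0   ; same effect as push rbp / mov rbp,rsp / sub rsp,48
        ; ... rsp is a multiple of 16 here ...
        leave
        ret

At entry rsp is 8 bytes off; the pushed rbp fixes that, and the even number of qwords subtracted afterwards keeps it fixed.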
bitRAKE



Joined: 21 Jul 2003
Posts: 4020
Location: vpcmpistri
bitRAKE 12 Aug 2008, 14:51
asmfan wrote:
must point that enter must take even number as 1st param to align stack by 16 because rbp is automatically pushed by enter.
((cnt_of_params + 1) and -2)*8
enter ((cnt_of_params + cnt_of_frames + 1)/2)*16, cnt_of_frames

It might be advantageous to do something like:
Code:
enter 0,13
pop rbp
pop rcx
pop rdx
pop r8
pop r9
push r9
push r8
push rdx
push rcx
call [CreateWindowEx]
leave    
...copying the prior frames' parameters to execute the same API multiple times.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Feryno



Joined: 23 Mar 2005
Posts: 509
Location: Czech republic, Slovak republic
Feryno 13 Aug 2008, 06:41
stack alignment is extremely necessary, especially in drivers written in FASM
it would be a big win if we found an easy and safe method to do that (and a fast one of course, using as few instructions as possible)
bitRAKE



Joined: 21 Jul 2003
Posts: 4020
Location: vpcmpistri
bitRAKE 13 Aug 2008, 21:42
Feryno, thanks for your work on fdbg - I'm using it daily and I know countless hours of frustration have been saved by your work. ;) ENTER/LEAVE is the general solution I'll be using, but optimization for speed will require custom solutions as usual. The shorter addressing of RBP offsets and the liberation of RBP for use through API calls makes ENTER a better solution, imho. Of course, RBP/RSP would have to be restored in an alternate way when RBP is used. I also imagine pushing RSI/RDI prior to ENTER in many routines, or a custom calling convention internally. Oh where is PUSHAQ, AMD? Didn't they see the value in it?

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 14 Aug 2008, 13:31
Yeah - I've missed PUSHAQ. They think that with 16 registers you don't need to push :P häh!

When it would have been okay to push 8 registers, now there are SIXTEEN of them and no way to do it other than manually.
Copyright © 1999-2024, Tomasz Grysztar.