[help] More on x64 16-byte stack alignment


alorent

Joined: 05 Dec 2005
Posts: 221

alorent 03 Feb 2010, 20:23

Hello guys,

Is it possible to make the fastcall macro set up 16-byte stack alignment automatically? I know we can keep track of the stack by hand, but that makes it easy to introduce errors while coding.

Here is an example:

Code:

format PE64 GUI 5.0
entry start

include 'win64a.inc'

section '.text' code readable executable

  start:

        db 0cch

        sub     rsp,8           ; Make stack dqword aligned

        invoke MessageBox,NULL,_test,NULL,MB_OK

        call    StartThread

        cmp     eax, 0
        jne     thread_ok

        invoke MessageBox,NULL,_error,NULL,MB_ICONERROR+MB_OK
        jmp     exit

  thread_ok:
        invoke MessageBox,NULL,_testok,NULL,MB_OK

  exit:
        invoke ExitProcess, 0

; thread start function

StartThread:
        invoke CreateThread, 0, 0, thread_code, 0, 0, 0
        ret

; thread body

thread_code:
        jmp     thread_code



section '.data' data readable writeable

  _title TCHAR 'Win64 program template',0
  _class TCHAR 'FASMWIN64',0
  _error TCHAR 'Startup failed.',0
  _test  TCHAR 'Test is going to start...',0
  _testok  TCHAR 'Thread created successfully',0

section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL',\
          user32,'USER32.DLL'

  include 'api\kernel32.inc'
  include 'api\user32.inc'

As you can see, "CreateThread" fails because the stack is no longer 16-byte aligned when it is reached through the "call" instruction.

Thanks!!


MazeGen

Joined: 06 Oct 2003
Posts: 977
Location: Czechoslovakia

MazeGen 03 Feb 2010, 20:48

GoAsm uses this ugly, but working code:

http://www.jorgon.freeserve.co.uk/GoasmHelp/64bits.htm#invokec

I don't know how to implement it using fasm's macro facility, though.


LocoDelAssembly
Your code has a bug

Joined: 06 May 2005
Posts: 4623
Location: Argentina

LocoDelAssembly 03 Feb 2010, 21:18

alorent, invoke already does proper alignment, but it has the precondition that RSP is 16-byte aligned before it is used. You forgot to add another "sub rsp,8" (or just "push rbp") at StartThread, so the precondition is broken there. (You probably know this, but since all the other calls have four parameters, it is not clear whether you are talking about a supposed inability of invoke to align RSP for any number of arguments, which is not a real problem, or about invoke's inability to work with an arbitrary RSP alignment.)
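A minimal sketch of that fix, assuming the rest of alorent's listing stays exactly as posted. The matching "add rsp,8" before "ret" is an addition for this sketch; it undoes the adjustment so that "ret" finds the return address.

Code:

StartThread:
        sub     rsp,8           ; "call" pushed a return address, so RSP is 8 (mod 16): realign
        invoke  CreateThread, 0, 0, thread_code, 0, 0, 0
        add     rsp,8           ; undo the adjustment before returning
        ret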

To implement what MazeGen posted, add this after "include 'win64a.inc'":

Code:

macro invoke name, [args]{
common
   PUSH RSP             ;save current RSP position on the stack
   PUSH qword [RSP]           ;keep another copy of that on the stack
   AND SPL,0F0h         ;adjust RSP to align the stack if not already there
   invoke name, args
   POP RSP              ;restore RSP to its original value
}

(All I've added is in lowercase)
Unless I'm missing something that should do it.


baldr

Joined: 19 Mar 2008
Posts: 1651

baldr 03 Feb 2010, 22:09

LocoDelAssembly,

Consider the following case:

Code:

;       addr: value
; rsp==10018: ?
   PUSH RSP             ;save current RSP position on the stack
; rsp==10010: 10018
;      10018: ?
   PUSH qword [RSP]           ;keep another copy of that on the stack
; rsp==10008: 10018
;      10010: 10018
;      10018: ?
   AND SPL,0F0h         ;adjust RSP to align the stack if not already there
; rsp==10000: ???
;      10008: 10018
;      10010: 10018
;      10018: ?
   invoke name, args
   POP RSP              ;restore RSP to its original value
; rsp==???

The second variant, with or spl, 8 (for an odd parameter count), works all right because that or moves the stack pointer toward greater addresses, if at all. To correct your code, insert add rsp, 8 before and spl, -16, for example.
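Applied to the macro above, that correction might look like the sketch below. The comments trace the same starting value as before (rsp==10018); the lowercase "add rsp,8" is the only change, and it has not been assembled or tested here.

Code:

macro invoke name, [args]{
common
   PUSH RSP             ; rsp==10010: 10018
   PUSH qword [RSP]     ; rsp==10008: 10018 (10010 still holds 10018)
   add rsp,8            ; rsp==10010: the remaining copy is now at [rsp]
   AND SPL,0F0h         ; rsp==10010: already aligned, nothing moves
   invoke name, args    ; the previous invoke definition leaves RSP at 10010
   POP RSP              ; rsp==10018, read from [10010]
}

If RSP starts out aligned instead (say 10020), the two pushes leave copies at 10018 and 10010, "add" moves RSP to 10018, "and" rounds it down to 10010, and POP RSP reads the copy stored there, so both cases restore the original value.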

The x86-64 calling conventions are quite confusing to me, to say the least. "The stack pointer must be aligned to 16 bytes, except for leaf functions, in any region of code that isn’t part of an epilog or prolog." (from the "Unwindability" clause of Overview of x64 Calling Conventions). I'm speechless. Always push/pop twice, in pairs?

Does anybody know a plausible explanation for this 16-byte alignment decision?


hopcode

Joined: 04 Mar 2008
Posts: 563
Location: Germany

hopcode 04 Feb 2010, 00:45

For me it works on Vista 64 SP2; CreateThread returns 90h.

Code:

;on start :              rsp 000000000006FF58
;on call StartThread rsp 000000000006FF50 ;<-----------
;in StartThread       rsp 000000000006FF48   
;on CreateThread    rsp 000000000006FF18

00000000778CC1BF nop
00000000778CC1C0 KERNEL32.CreateThread: sub rsp,48  ;<-------
00000000778CC1C4 mov rax,[rsp+78]
00000000778CC1C9 mov [rsp+30],rax
00000000778CC1CE mov eax,[rsp+70]
00000000778CC1D2 mov [rsp+28],eax
00000000778CC1D6 mov [rsp+20],r9
00000000778CC1DB mov r9,r8
00000000778CC1DE mov r8,rdx
00000000778CC1E1 mov rdx,rcx
00000000778CC1E4 or rcx,FFFFFFFFFFFFFFFF
00000000778CC1E8 call 00000000778CC200 ; KERNEL32.CreateRemoteThread
00000000778CC1ED add rsp,48
00000000778CC1F1 ret 

;after CreateThread rsp 000000000006FF18
;on    cmp eax,0    rsp 000000000006FF50 <-------


bitRAKE

Joined: 21 Jul 2003
Posts: 4249
Location: vpcmpistri

bitRAKE 04 Feb 2010, 03:57

My own code looks like this:

Code:

        entry $
        sub rsp,.local
        enter .frame,0
        virtual at rbp-.frame
                ; parameter space, 20 qwords maximum for byte offsets
                        rq 1    ; rcx / xmm0L
                        rq 1    ; rdx / xmm1L
                        rq 1    ; r8  / xmm2L
                        rq 1    ; r9  / xmm3L
                .4      rq 1
                .5      rq 1
                .6      rq 1
                .7      rq 1
                .8      rq 1
                .9      rq 1
                .A      rq 1
                .B      rq 1
                .C      rq 1
                .D      rq 1
                .E      rq 1
                .F      rq 1
                .G      rq 1
                .H      rq 1
                .I      rq 1
                .J      rq 1

                .frame = NOT 15 AND ($-$$+15)
                        rb $$+.frame-$ ; stack alignment

                ; RBP is never restored
                .RBP    rq 1

                .ofn    OPENFILENAME
                .file   rw .file..
                .file.. = 4096

                .gms    MEMORYSTATUS

                .pMemory        rq 1
                .iMemory        rq 1 ; bytes
                .pFile          rq 1
                .iFile          rq 1 ; bytes
                .pBlocks        rq 1
                .iBlocks        rq 1 ; count
                .pHash          rq 1
                .iHash          rq 1 ; bytes

                .local = NOT 15 AND ($-.RBP+7)
                        rb .RBP+8+.local-$ ; stack alignment

                .RET0   rq 1    ; to KERNEL32.BaseThreadInitThunk
                        rq 4
                .RET1   rq 1
        end virtual

I don't use the PROC macros because this better documents the interface, imho. ENTER quickly sets up the required shadow space:

Code:

      .WM_CLOSE:
        enter 8*4,0 ; required shadow space
        ; hWnd in RCX
        xor edx,edx
        call [EndDialog]
        leave
        mov eax,1
        retn

On entry, every procedure sees RSP at 8 (MOD 16), because the call has just pushed the return address. ENTER pushes RBP, making the stack aligned; as long as ENTER's first operand is 0 (MOD 16), the stack remains aligned. LEAVE cleans up regardless of how RSP was used in between. Seems like a concise solution.
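For reference, with a nesting level of 0, ENTER behaves like push rbp / mov rbp,rsp / sub rsp,size, and LEAVE like mov rsp,rbp / pop rbp, which makes the alignment arithmetic easy to follow. The sketch below is the same .WM_CLOSE handler with both instructions spelled out; the RSP values in the comments assume the handler was reached by a plain call from 16-byte-aligned code.

Code:

      .WM_CLOSE:                ; RSP = 8 (MOD 16): return address just pushed
        push rbp                ; RSP = 0 (MOD 16)
        mov rbp,rsp
        sub rsp,8*4             ; RSP = 0 (MOD 16), shadow space reserved
        ; hWnd in RCX
        xor edx,edx
        call [EndDialog]
        mov rsp,rbp             ; LEAVE, part 1: drop everything allocated since ENTER
        pop rbp                 ; LEAVE, part 2: RSP = 8 (MOD 16) again
        mov eax,1
        retn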


alorent

Joined: 05 Dec 2005
Posts: 221

alorent 04 Feb 2010, 08:19

baldr, thanks! You are right! It needs that extra "add rsp, 8" to work as expected!

Thank you all for your help!

Anyway, the 16-byte requirement on 64-bit seems strange, when the stack alignment should be the word size for the platform (8 bytes), just as it is 4 bytes on the 32-bit platform.


Madis731

Joined: 25 Sep 2003
Posts: 2138
Location: Estonia

Madis731 04 Feb 2010, 11:54

@alorent:
You can pass XMM registers and store XMM content at [rsp], and when a function is optimized with movdqa (the aligned move), it will crash every time that instruction executes with an RSP that is not 16-byte aligned (that is, whenever rsp <> (rsp and not 0xF)).
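A hypothetical callee showing the kind of code this refers to. The label and the choice of xmm6 are only for illustration and do not come from any real API.

Code:

SaveXmm6:                               ; hypothetical routine
        sub     rsp,18h                 ; entry RSP = 8 (mod 16), so RSP = 0 (mod 16) here
        movdqa  [rsp],xmm6              ; aligned store: requires a 16-byte aligned address
        ; ... work that clobbers xmm6 ...
        movdqa  xmm6,[rsp]              ; aligned load, same requirement
        add     rsp,18h
        ret

If the caller's RSP was off by 8 at the call, then [rsp] is off by 8 here as well, and the first movdqa raises a general-protection fault.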


hopcode

Joined: 04 Mar 2008
Posts: 563
Location: Germany

hopcode 07 Feb 2010, 02:06

Some considerations about the fastcall model:
(I hope I'm not going off-topic here.)
- Recent versions of the 64-bit MS OS are stack-aware. This is the reason
I couldn't reproduce the CreateThread error with the bad alignment.
I think MS is encouraging developers to port to 64-bit.

bitRAKE wrote:

Seems like a concise solution.

Yes, I like it. But if I'm not mistaken, it doesn't follow the MS standard.
Here is the MS standard:
http://msdn.microsoft.com/en-us/library/tawsa7cb%28VS.80%29.aspx
In a few words:
- One cannot use ENTER/LEAVE (the same applies to the epilog of the proc macro)
- In the epilog one must use add rsp,size or lea
- If you want MS and HLL compatibility, the stack structure must follow this scheme:
http://msdn.microsoft.com/en-us/library/ew5tede7%28VS.80%29.aspx
- What Madis731 said is very important (if one is about to write a proc macro)
- Important: the official, indirect answer to baldr

baldr wrote:

Always push/pop twice/in pairs?

M$ Prolog and Epilog wrote:

The associated unwind data must describe the action of the prolog and must provide the information necessary to undo the effect of the prolog code.

and

M$ Prolog and Epilog wrote:

The epilog code must follow a strict set of rules for the unwind code to reliably unwind through exceptions and interrupts. This reduces the amount of unwind data required, because no extra data is needed to describe each epilog. Instead, the unwind code can determine that an epilog is being executed by scanning forward through a code stream to identify an epilog.

Also, what follows now becomes plain:

Quote:

The stack will always be maintained 16-byte aligned, except within the prolog (for example, after the return address is pushed), and except where indicated in Function Types for a certain class of frame functions.

That is to say, from the end of the prolog onward, the stack should always be 16-byte aligned.
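As a rough illustration of the quoted rules, a frame function could be laid out as below. The procedure name is hypothetical, and no unwind data (.pdata/.xdata) is generated here, so this only shows the instruction pattern, not full conformance with the MS standard.

Code:

SampleProc:                     ; hypothetical name
        ; prolog: RSP = 8 (MOD 16) on entry
        push    rbx             ; save a nonvolatile register, RSP = 0 (MOD 16)
        sub     rsp,20h         ; shadow space for callees, RSP = 0 (MOD 16)
        ; body: RSP stays fixed and aligned
        xor     ecx,ecx
        call    [GetModuleHandle]
        mov     rbx,rax
        mov     rax,rbx
        ; epilog: only add rsp,constant / pop / ret
        add     rsp,20h
        pop     rbx
        ret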
Then, reading bitRAKE's post, I got an idea of how to port 32-bit code to 64-bit. It has visible limitations, of course, and I have not tested it. Here is the hello world example, the un-macroed way. Apart from alloca, it should follow the MS standard.

Code:

format PE64 GUI 5.0
entry start
include 'win64a.inc'
include "..\macro\mrk_macro.inc"
section '.data' data readable writeable
  _title TCHAR 'Win64 program template',0
  _class TCHAR 'FASMWIN64',0
  _error TCHAR 'Startup failed.',0
  wc WNDCLASSEX sizeof.WNDCLASSEX,0,WindowProc,0,0,NULL,NULL,NULL,COLOR_BTNFACE+1,NULL,_class,NULL
  msg MSG
section '.code' code readable executable

start:
 sub rsp,8
 sub rsp,20h                
 xor rcx,rcx                        
 call [GetModuleHandle]
 add rsp,20h
 mov [wc.hInstance],rax
      
 sub rsp,20h
 mov rdx,IDI_APPLICATION
 xor rcx,rcx
 call [LoadIcon]
 add rsp,20h
 mov [wc.hIcon],rax
 mov [wc.hIconSm],rax
 sub rsp,20h
 mov rdx,IDC_ARROW
 xor rcx,rcx
 call [LoadCursor]
 add rsp,20h
 mov [wc.hCursor],rax

 sub rsp,20h
 mov rcx,wc
 call [RegisterClassEx]
 add rsp,20h
 test rax,rax
 jz error
 xor r10,r10

 push 0
 push [wc.hInstance]
 push 0
 push 0
 push 192
 push 256
 push 128
 push 128

 sub rsp,20h                          ; shadow space is still required with more than 4 parameters: 4 params * 8 bytes

 mov r9,WS_VISIBLE+WS_DLGFRAME+WS_SYSMENU
 mov r8,_title
 mov rdx,_class
 xor rcx,rcx
 call [CreateWindowEx]

 add rsp,20h+40h       ; restore 4paras*8bytes + 8paras*8bytes

 test     rax,rax
 jz error

msg_loop:
 sub rsp,20h
 xor r9,r9
 xor r8,r8
 xor rdx,rdx
 mov rcx,msg
 call [GetMessage]
 add rsp,20h
 cmp     eax,1
 jb end_loop
 jne msg_loop

 sub rsp,20h
 mov rcx,msg
 call [TranslateMessage]
 add rsp,20h

 sub rsp,20h
 mov rcx,msg
 call [DispatchMessage]
 add rsp,20h
 jmp   msg_loop

error:
 sub rsp,20h
 mov r9,MB_ICONERROR+MB_OK
 xor r8,r8
 mov rdx,_error
 xor rcx,rcx
 call [MessageBox]
 add rsp,20h

end_loop:
 sub rsp,20h
 mov rcx,[msg.wParam]
 call [ExitProcess]
 add rsp,20h

WindowProc:
 ;----------- start prolog --------------
 ;----- save home register in the shadow space -------
 mov [rsp+8],rcx
 mov [rsp+16],rdx
 mov [rsp+24],r8
 mov [rsp+32],r9

 push rbp  ; this aligns the stack to 0 (MOD 16)

 reglist equ rbx rdi rsi r13
 count@regs = 4

 ;------ save non volatile register ------------------
 irps reg,reglist { forward push reg    }
 size@stack = count@regs * 8
      
 lea rbp,[rsp+(((count@regs+1)*8))]
 ;---- now rbp points to the top of the stack (the return address) ----
 ;---- now [rbp+8] = RCX = [rsp+8] at entry
 ;---- insert local variables here (possibly taking advantage of
 ;---- the 8 (MOD 16) misalignment left by an odd number of saved registers)

 ;---- now definitively align the stack to 0 (MOD 16)
 if (size@stack mod 16)
    sub rsp,8
 end if
;----------- end prolog --------------

 cmp  rdx,WM_DESTROY
 jz .wm_destroy
 cmp rdx,WM_CREATE
 jnz .defwndproc

.wm_create:
 ;----- useful to port code from 32 bit ---------
 ;----- wrapping the func 32 with ENTER /LEAVE -----
 enter 8*4,0         ;create shadow space to save caller useful info

 ;----- old 32 caller ----------------------------
 push 02020202h
 push 01010101h
 call .with_shadow_frame_func
 ;------------------------------------------------
 leave
 xor rax,rax
 inc rax
 jmp .epilog

;-------older 32 caller remains relatively untouched ----------
.with_shadow_frame_func:
 mov rax,[rsp+8]     ; 01010101h (use rsp, not esp: 64-bit addressing)
 mov rcx,[rsp+16]    ; 02020202h
 ret 16         ;NOTE: RET 0 /RET 16 or RET imm16 has the same effects
;--------------------------------------------------
.wm_destroy:
 sub rsp,20h         ; shadow space, so the callee cannot overwrite the saved registers
 xor rcx,rcx
 call [PostQuitMessage]
 add rsp,20h
 xor rax,rax
 jmp .epilog

.defwndproc:
 sub rsp,20h
 call [DefWindowProc]
 add rsp,20h

.epilog:
 lea rsp,[rbp-(((count@regs+1)*8))]
 match reglist,reglist {
 irps reg,reglist \{ reverse pop reg     \}
}
 pop rbp
 ret 0

section '.idata' import data readable writeable
  library kernel32,'KERNEL32.DLL',\
   user32,'USER32.DLL'

  include 'api\kernel32.inc'
  include 'api\user32.inc'

Cheers,
hopcode


bitRAKE

Joined: 21 Jul 2003
Posts: 4249
Location: vpcmpistri

bitRAKE 07 Feb 2010, 02:34

Bah, supporting the shadow space is required by the API, but the stack unwinding is optional. The documentation should be corrected as follows:

Quote:

Every function that [clip/] uses table-based exception handling must have a prolog whose address limits are described in the unwind data associated with the respective function table entry (see Exception Handling (x64)).


hopcode

Joined: 04 Mar 2008
Posts: 563
Location: Germany

hopcode 07 Feb 2010, 03:14

Quote:

Every function that [clip/] ...

I hope you are right. Anyway, this doesn't make life easier for developers who hand-coded their 32-bit SEH. If one uses an HLL compiler for a project written entirely in an HLL, I think porting to 64-bit makes much less difference than in the other case. But the other case leads me to think about the unofficial explanation for this 64-bit fastcall design...


bitRAKE

Joined: 21 Jul 2003
Posts: 4249
Location: vpcmpistri

bitRAKE 07 Feb 2010, 04:15

Don't take my word for it: look at WIN64 compatible projects compiled with non-MS compilers using AddVectoredExceptionHandler. The table-based exception handling (TEH) can work with VEH, just as SEH worked with VEH. IMHO, assembly should use VEH and then fall back to the VEH/(SEH/TEH) of the HLL or OS. (SEH always was a (cool) hack. :D)


hopcode

Joined: 04 Mar 2008
Posts: 563
Location: Germany

hopcode 09 Feb 2010, 11:29

bitRAKE wrote:

...IMHO, assembly should use VEH and then fall-back to the VEH/(SEH/TEH...

Anyway, it's not so simple:

Quote:

...In previous Windows Operation System, almost all system functions have standard stack frame. The first instruction of these function is "PUSH EBP" and its opcode is 55h, that is 1 byte. But in Windows XP SP2, the standard prolog code for SEH stack frame of the system function have been replaced with _SEH_prolog function...

This is an excerpt from exactly the example I needed (even if it is not up to date) to show how difficult it is, even for gurus, to match M$'s tough standards:
http://www.codeproject.com/KB/system/VEH.aspx
What I meant is this: porting to 64-bit is not simply a whole 64-bit rewrite in every case, even when starting from scratch.
(I am not off-topic yet) :D

Cheers,
hopcode


bitRAKE

Joined: 21 Jul 2003
Posts: 4249
Location: vpcmpistri

bitRAKE 09 Feb 2010, 19:08

Programmatically, single-stepping is inherently not portable. Matt Pietrek's code does not know what byte size instruction was skipped; and does nothing for larger instructions -- it is very specific to the implementation he is testing it on:

Code:

if ( pExceptionInfo->ExceptionRecord->ExceptionAddress
     != (PVOID)((DWORD_PTR)g_pfnLoadLibraryAddress+1) ) 
{

These assumptions of the API implementation are far from the documentation of the interface - this is obviously wrong today (demonstrated in the codeproject article). And it has nothing to do with 64-bit portability nor exception handling -- it was a quick hack to write a cool article (at the time).


hopcode

Joined: 04 Mar 2008
Posts: 563
Location: Germany

hopcode 11 Feb 2010, 00:40

Yes, as you prefer, he couldn't have known the new specific API implementation...,
but that is exactly my point with the example: new system, new rules.
Every demo app that expects an M$ 64-bit conformant prolog/epilog and doesn't find it there will obviously crash.
Same song, inverted refrain.

I hope you are right, but
what has changed is exactly that API-internal and stack
implementation, and that is really much, much more HLL... SEH/VEH... 64-bit related. That matters a lot when one writes a proc manually, thinking of only one context.

Until yesterday I didn't know the following code; I found it while browsing this material and excerpted it from a third party.
I would like to report this unwind/SEH-enabled code:

Code:

PROC_FRAME      sample
    db          0x48            ; emit a REX prefix to enable hot-patching
    push        rbp             ; save prospective frame pointer
    [pushreg    rbp]            ; create unwind data for this rbp register push
    sub         rsp,0x40        ; allocate stack space
    [allocstack 0x40]           ; create unwind data for this stack allocation
    lea         rbp,[rsp+0x20]  ; assign the frame pointer with a bias of 32
    [setframe   rbp,0x20]       ; create unwind data for a frame register in rbp
    movdqa      [rbp],xmm7      ; save a non-volatile XMM register
    [savexmm128 xmm7, 0x20]     ; create unwind data for an XMM register save
    mov         [rbp+0x18],rsi  ; save rsi
    [savereg    rsi,0x38]       ; create unwind data for a save of rsi
    mov         [rsp+0x10],rdi  ; save rdi
    [savereg    rdi, 0x10]      ; create unwind data for a save of rdi
[endprolog]

    ; We can change the stack pointer outside of the prologue because we
    ; have a frame pointer.  If we didn't have one this would be illegal.
    ; A frame pointer is needed because of this stack pointer modification.

    sub         rsp,0x60        ; we are free to modify the stack pointer
    mov         rax,0           ; we can unwind this access violation
    mov         rax,[rax]

    movdqa      xmm7,[rbp]      ; restore the registers that weren't saved
    mov         rsi,[rbp+0x18]  ; with a push; this is not part of the
    mov         rdi,[rbp-0x10]  ; official epilog

    lea         rsp,[rbp-0x20]  ; This is the official epilog
    pop         rbp
    ret
ENDPROC_FRAME

And yes, that is YASM, with its one-for-all proc macro facilities.
Now, related to my cases:
- I will not rewrite nor use the proc64 macro, for sure.
The proc64 macro is too opaque at the moment, even to an ex-macro-man like me. :D

- I will not use enter/leave to set up a frame in a 64-bit frame function, I promise. :D

enter/leave doesn't follow the M$ standard stack/API implementation, especially when used
in epilogs that must be unwound.

With a few more bytes in the prolog/epilog you give the system the possibility to
automatically unwind a crash in even the best-written, most perfect code.
If this is not 64-bit SEH related!

(But I do not claim that the proc64 macro needs a fix or an extended version
only because it is not fully related to M$ 64-bit or exception handling.)

Here is the link summarizing the toughness/restrictions I mentioned, chap. 14.2:
http://www.tortall.net/projects/yasm/manual/html/manual.html#win64-calling-convention

And they do not hope what I hope (in fact, the chapter title is 14.2 Structured Exception Handling):

Quote:

Most functions that make use of the stack in 64-bit versions of Windows
must support exception handling even if they make no internal use of such facilities.

It is always the same most/must logical masking operation...
:D

Cheers,
hopcode


bitRAKE

Joined: 21 Jul 2003
Posts: 4249
Location: vpcmpistri

bitRAKE 11 Feb 2010, 03:50

Thank you, that was a good read - I hadn't spent any time with the YASM manual before - very impressive work. Honestly, I can only speak of my experience over a very short period of time programming in Win64. The API interface seems clear to me, and the table based exception support is optional. I've stepped through some API functions and have a fair grasp of their assumptions at the instruction level.

Having automated tools to dissect/unwind a program is a good idea, but I'm not willing to support this thin veil of protection at this time. The use of symbolic abstraction within FASM should leave me open to future change if needed (i.e. overloading the ENTER/LEAVE instructions). It would be a worthy project to create macros for table-based exception support.

The future can have designs against our efforts, but we will adapt. It is luck if the future greets us without compromise. I will no doubt welcome luck and prepare for compromise. :D
