flat assembler
Message board for the users of flat assembler.
[help] More on x64 16-byte stack alignment
MazeGen 03 Feb 2010, 20:48
GoAsm uses this ugly, but working code:
http://www.jorgon.freeserve.co.uk/GoasmHelp/64bits.htm#invokec
I don't know how to implement it using fasm's macro facility, though.
LocoDelAssembly 03 Feb 2010, 21:18
alorent, invoke already does proper alignment, but it has the precondition that RSP is 16-byte aligned before it is used. You forgot to add another "sub rsp,8" (or just "push rbp") at StartThread, so the precondition is broken here. (You probably know this, but since all the other calls have four parameters, it is not clear whether you are talking about a nonexistent inability of invoke to align RSP for any number of args, or about invoke's inability to work with an arbitrary RSP alignment.)
To implement what MazeGen posted, add this after "include 'win32a.inc'":
Code:
macro invoke name, [args]{
common
  PUSH RSP ;save current RSP position on the stack
  PUSH qword [RSP] ;keep another copy of that on the stack
  AND SPL,0F0h ;adjust RSP to align the stack if not already there
  invoke name, args
  POP RSP ;restore RSP to its original value
}
Unless I'm missing something, that should do it.
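The precondition described above comes down to simple arithmetic on RSP. Here is a minimal C model of it, assuming the usual Win64 situation in which the caller was 16-byte aligned and its CALL pushed an 8-byte return address. The function names and the sample address are made up for illustration and are not part of fasm or its macros.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero when a stack pointer value is 16-byte aligned. */
int is_aligned16(uint64_t rsp) { return (rsp & 0xF) == 0; }

/* RSP as seen on entry to a function: the caller was aligned and the
   CALL instruction pushed an 8-byte return address. */
uint64_t rsp_on_entry(uint64_t aligned_caller_rsp) {
    return aligned_caller_rsp - 8;
}

/* Effect of one "sub rsp,8" (or one "push rbp") at the top of the function. */
uint64_t rsp_after_fixup(uint64_t entry_rsp) { return entry_rsp - 8; }
```

With a caller at 0x10020, the entry value 0x10018 is 8 modulo 16, and a single 8-byte adjustment brings it back to a multiple of 16 before invoke is used.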
baldr 03 Feb 2010, 22:09
LocoDelAssembly,
Consider the following case:
Code:
; addr: value
; rsp==10018: ?
  PUSH RSP ;save current RSP position on the stack
; rsp==10010: 10018
;      10018: ?
  PUSH qword [RSP] ;keep another copy of that on the stack
; rsp==10008: 10018
;      10010: 10018
;      10018: ?
  AND SPL,0F0h ;adjust RSP to align the stack if not already there
; rsp==10000: ???
;      10008: 10018
;      10010: 10018
;      10018: ?
  invoke name, args
  POP RSP ;restore RSP to its original value
; rsp==???
The x86-64 calling conventions are quite confusing to me, to say the least. "The stack pointer must be aligned to 16 bytes, except for leaf functions, in any region of code that isn't part of an epilog or prolog." (from the "Unwindability" clause of Overview of x64 Calling Conventions). I'm speechless. Always push/pop twice/in pairs? Does anybody know a plausible explanation for this 16-byte alignment decision?
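The trace above can be replayed with a toy stack. The sketch below is a C model, not fasm code: it assumes that the call between AND and POP leaves RSP where it found it, and the names (`align_sequence`, `restore_with_pop`, `restore_after_add8`) are invented for the example. It compares a plain "pop rsp" with skipping one qword first ("add rsp,8" followed by "pop rsp").

```c
#include <stdint.h>

/* Toy stack: qword slots covering addresses 0x10000 .. 0x10038 only. */
#define BASE 0x10000u
static uint64_t slots[8];
static const uint64_t POISON = 0xDEADBEEFu;  /* stands for "???" in the trace */

static uint64_t *slot(uint64_t addr) { return &slots[(addr - BASE) / 8]; }

/* PUSH RSP / PUSH qword [RSP] / AND SPL,0F0h, starting from rsp. */
uint64_t align_sequence(uint64_t rsp) {
    for (int i = 0; i < 8; i++) slots[i] = POISON;
    uint64_t saved = rsp;
    rsp -= 8; *slot(rsp) = saved;      /* push rsp stores the old value     */
    uint64_t copy = *slot(rsp);        /* operand is read before decrement  */
    rsp -= 8; *slot(rsp) = copy;       /* push qword [rsp]                  */
    return rsp & ~(uint64_t)0xF;       /* and spl,0F0h                      */
}

/* Value loaded by "pop rsp" executed at the aligned RSP. */
uint64_t restore_with_pop(uint64_t aligned_rsp) {
    return *slot(aligned_rsp);
}

/* Value loaded by "add rsp,8" followed by "pop rsp". */
uint64_t restore_after_add8(uint64_t aligned_rsp) {
    return *slot(aligned_rsp + 8);
}
```

Starting from 0x10018 the plain pop reads the untouched slot at 0x10000, while the slot at 0x10008 still holds 0x10018. Starting from 0x10010 both slots hold 0x10010, so reading one qword above the aligned RSP recovers the original value in either case.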
hopcode 04 Feb 2010, 00:45
For me it works on Vista 64 SP2, CreateThread retval=90h
Code:
;on start : rsp 000000000006FF58
;on call StartThread rsp 000000000006FF50 ;<-----------
;in StartThread rsp 000000000006FF48
;on CreateThread rsp 000000000006FF18
00000000778CC1BF nop
00000000778CC1C0 KERNEL32.CreateThread: sub rsp,48 ;<-------
00000000778CC1C4 mov rax,[rsp+78]
00000000778CC1C9 mov [rsp+30],rax
00000000778CC1CE mov eax,[rsp+70]
00000000778CC1D2 mov [rsp+28],eax
00000000778CC1D6 mov [rsp+20],r9
00000000778CC1DB mov r9,r8
00000000778CC1DE mov r8,rdx
00000000778CC1E1 mov rdx,rcx
00000000778CC1E4 or rcx,FFFFFFFFFFFFFFFF
00000000778CC1E8 call 00000000778CC200 ; KERNEL32.CreateRemoteThread
00000000778CC1ED add rsp,48
00000000778CC1F1 ret
;after CreateThread rsp 000000000006FF18
;on cmp eax,0 rsp 000000000006FF50 <-------
bitRAKE 04 Feb 2010, 03:57
My own code looks like this:
Code:
entry $
        sub rsp,.local
        enter .frame,0

virtual at rbp-.frame
        ; parameter space, 20 qwords maximum for byte offsets
        rq 1 ; rcx / xmm0L
        rq 1 ; rdx / xmm1L
        rq 1 ; r8 / xmm2L
        rq 1 ; r9 / xmm3L
.4 rq 1
.5 rq 1
.6 rq 1
.7 rq 1
.8 rq 1
.9 rq 1
.A rq 1
.B rq 1
.C rq 1
.D rq 1
.E rq 1
.F rq 1
.G rq 1
.H rq 1
.I rq 1
.J rq 1
.frame = NOT 15 AND ($-$$+15)
        rb $$+.frame-$ ; stack alignment

        ; RBP is never restored
.RBP rq 1

.ofn OPENFILENAME
.file rw .file..
.file.. = 4096
.gms MEMORYSTATUS
.pMemory rq 1
.iMemory rq 1 ; bytes
.pFile rq 1
.iFile rq 1 ; bytes
.pBlocks rq 1
.iBlocks rq 1 ; count
.pHash rq 1
.iHash rq 1 ; bytes
.local = NOT 15 AND ($-.RBP+7)
        rb .RBP+8+.local-$ ; stack alignment

.RET0 rq 1 ; to KERNEL32.BaseThreadInitThunk
        rq 4
.RET1 rq 1
end virtual

Code:
.WM_CLOSE:
        enter 8*4,0 ; required shadow space
        ; hWnd in RCX
        xor edx,edx
        call [EndDialog]
        leave
        mov eax,1
        retn
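The two size expressions in the listing above are round-up-to-16 formulas. A short C rendering of the same arithmetic may make them easier to read. The function names are invented here, and the second one simply mirrors the "+7" variant used for .local, which rounds up everything after the 8-byte .RBP slot.

```c
#include <stdint.h>

/* C spelling of "NOT 15 AND (n+15)": round n up to a multiple of 16. */
uint64_t round_up16(uint64_t n) { return (n + 15) & ~(uint64_t)15; }

/* C spelling of "NOT 15 AND (n+7)", as used for .local, where n counts
   bytes from .RBP and therefore includes the 8-byte RBP slot itself.
   For n >= 8 this equals round_up16(n - 8). */
uint64_t local_size(uint64_t n_from_rbp) {
    return (n_from_rbp + 7) & ~(uint64_t)15;
}
```

For the parameter area, 20 qwords are 160 bytes, which is already a multiple of 16, so the padding reserved by the following rb is zero in that case.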
alorent 04 Feb 2010, 08:19
baldr wrote: LocoDelAssembly,
baldr, thanks! You are right! It needs that extra "add rsp, 8" to make it work as expected!!! Thank you all for your help! Anyway, the 16-byte requirement on 64-bit is strange: the stack alignment should be the word size of the platform (8 bytes), just as it is 4 bytes on the 32-bit platform.
Madis731 04 Feb 2010, 11:54
@alorent:
You can pass XMM registers and XMM content on [rsp], and when a function is optimized with movdqa (the aligned form), you'll crash every time this instruction is executed with a misaligned address, i.e. whenever rsp <> (rsp and not 0xF).
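The condition in the post above can be written as a one-line check. This is only a C model of the address test, with invented names and sample addresses. It assumes a callee that keeps a 16-byte local at a fixed offset from RSP, chosen on the expectation that RSP itself is a multiple of 16.

```c
#include <stdint.h>

/* movdqa raises an exception unless its memory operand is 16-byte aligned. */
int movdqa_ok(uint64_t addr) { return addr == (addr & ~(uint64_t)0xF); }

/* Address of a local stored at [rsp+offset]. */
uint64_t local_addr(uint64_t rsp, uint64_t offset) { return rsp + offset; }
```

With RSP at 0x6FF40 a local at offset 0x20 is aligned. With RSP at 0x6FF48 the same offset lands on 0x6FF68, and an aligned store there would fault.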
hopcode 07 Feb 2010, 02:06
Some considerations about the _fastcall model:
(I hope I'm not going off-topic here.)
- Recent versions of MS 64-bit OSes are stack-aware. This is the reason I couldn't reproduce the error on CreateThread with the bad alignment. I think MS is encouraging developers to port to 64-bit.
bitRAKE wrote: Seems like a concise solution.
Yes, I like it. But if I'm not mistaken, it doesn't follow the MS standard. Here is the MS standard: http://msdn.microsoft.com/en-us/library/tawsa7cb%28VS.80%29.aspx
In a few words:
- One cannot use ENTER/LEAVE (the same applies to the epilog of the proc macro).
- In the epilog one must use add rsp,size / lea.
- If you want MS and HLL compatibility, the stack structure must follow this schema: http://msdn.microsoft.com/en-us/library/ew5tede7%28VS.80%29.aspx
- What Madis731 said is very important (if one is about to write a proc macro).
- Important: the official, indirect answer to baldr.
baldr wrote: Always push/pop twice/in pairs?
M$ Prolog and Epilog wrote: The associated unwind data must describe the action of the prolog and must provide the information necessary to undo the effect of the prolog code.
M$ Prolog and Epilog wrote: The epilog code must follow a strict set of rules for the unwind code to reliably unwind through exceptions and interrupts. This reduces the amount of unwind data required, because no extra data is needed to describe each epilog. Instead, the unwind code can determine that an epilog is being executed by scanning forward through a code stream to identify an epilog.
Quote: The stack will always be maintained 16-byte aligned, except within the prolog (for example, after the return address is pushed), and except where indicated in Function Types for a certain class of frame functions.
Then, reading bitRAKE, I got an idea of how to port code32 to code64. It has visible limitations, of course, and I have not tested it. Now here is the hello world example, the un-macroed way. Apart from alloca, it should follow the MS standard.
Code:
format PE64 GUI 5.0
entry start

include 'win64a.inc'
include "..\macro\mrk_macro.inc"

section '.data' data readable writeable

  _title TCHAR 'Win64 program template',0
  _class TCHAR 'FASMWIN64',0
  _error TCHAR 'Startup failed.',0

  wc WNDCLASSEX sizeof.WNDCLASSEX,0,WindowProc,0,0,NULL,NULL,NULL,COLOR_BTNFACE+1,NULL,_class,NULL
  msg MSG

section '.code' code readable executable

start:
  sub rsp,8

  sub rsp,20h
  xor rcx,rcx
  call [GetModuleHandle]
  add rsp,20h
  mov [wc.hInstance],rax

  sub rsp,20h
  mov rdx,IDI_APPLICATION
  xor rcx,rcx
  call [LoadIcon]
  add rsp,20h
  mov [wc.hIcon],rax
  mov [wc.hIconSm],rax

  sub rsp,20h
  mov rdx,IDC_ARROW
  xor rcx,rcx
  call [LoadCursor]
  add rsp,20h
  mov [wc.hCursor],rax

  sub rsp,20h
  mov rcx,wc
  call [RegisterClassEx]
  add rsp,20h
  test rax,rax
  jz error

  xor r10,r10
  push 0
  push [wc.hInstance]
  push 0
  push 0
  push 192
  push 256
  push 128
  push 128
  sub rsp,20h ;using more than 4 parameters push 4paras*8bytes
  mov r9,WS_VISIBLE+WS_DLGFRAME+WS_SYSMENU
  mov r8,_title
  mov rdx,_class
  xor rcx,rcx
  call [CreateWindowEx]
  add rsp,20h+40h ; restore 4paras*8bytes + 8paras*8bytes
  test rax,rax
  jz error

msg_loop:
  sub rsp,20h
  xor r9,r9
  xor r8,r8
  xor rdx,rdx
  mov rcx,msg
  call [GetMessage]
  add rsp,20h
  cmp eax,1
  jb end_loop
  jne msg_loop

  sub rsp,20h
  mov rcx,msg
  call [TranslateMessage]
  add rsp,20h

  sub rsp,20h
  mov rcx,msg
  call [DispatchMessage]
  add rsp,20h
  jmp msg_loop

error:
  sub rsp,20h
  mov r9,MB_ICONERROR+MB_OK
  xor r8,r8
  mov rdx,_error
  xor rcx,rcx
  call [MessageBox]
  add rsp,20h

end_loop:
  sub rsp,20h
  mov rcx,[msg.wParam]
  call [ExitProcess]
  add rsp,20h

WindowProc:
;----------- start prolog --------------
;----- save home registers in the shadow space -------
  mov [rsp+8],rcx
  mov [rsp+16],rdx
  mov [rsp+24],r8
  mov [rsp+32],r9
  push rbp ; <------ this aligns the stack to 0(MOD 16)
  reglist equ rbx rdi rsi r13
  count@regs = 4
;------ save nonvolatile registers ------------------
  irps reg,reglist { forward push reg }
  size@stack = count@regs * 8
  lea rbp,[rsp+(((count@regs+1)*8))]
;---- now rbp points to the top of the stack ----
;---- now [rbp+8] = RCX = [rsp+8]
;---- insert local variables here (possibly taking advantage of
;---- the previous 8(MOD16) odd alignment due to an odd number of saved registers)
;---- now align the stack definitively to 0(MOD16)
  if (size@stack mod 16)
    sub rsp,8
  end if
;----------- end prolog --------------
  cmp rdx,WM_DESTROY
  jz .wm_destroy
  cmp rdx,WM_CREATE
  jnz .defwndproc

.wm_create:
;----- useful to port code from 32 bit ---------
;----- wrapping the func 32 with ENTER/LEAVE -----
  enter 8*4,0 ;create shadow space to save caller useful info
;----- old 32 caller ----------------------------
  push 02020202h
  push 01010101h
  call .with_shadow_frame_func
;------------------------------------------------
  leave
  xor rax,rax
  inc rax
  jmp .epilog

;------- older 32 caller remains relatively untouched ----------
.with_shadow_frame_func:
  mov rax,[esp+8]
  mov rcx,[esp+16]
  ret 16 ;NOTE: RET 0 /RET 16 or RET imm16 has the same effects
;--------------------------------------------------

.wm_destroy:
  xor rcx,rcx
  call [PostQuitMessage]
  xor rax,rax
  jmp .epilog

.defwndproc:
  sub rsp,20h
  call [DefWindowProc]
  add rsp,20h

.epilog:
  lea rsp,[rbp-(((count@regs+1)*8))]
  match reglist,reglist { irps reg,reglist \{ reverse pop reg \} }
  pop rbp
  ret 0

section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL',\
          user32,'USER32.DLL'

  include 'api\kernel32.inc'
  include 'api\user32.inc'
..
Cheers, hopcode
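The alignment bookkeeping in the WindowProc prolog above can be checked with a few lines of arithmetic. The following C model is a sketch under one assumption, namely that RSP is 8 modulo 16 on entry (an aligned caller plus the return address). The function names and the sample entry value are invented for illustration.

```c
#include <stdint.h>

/* RSP after the prolog: "push rbp", n_regs pushes of nonvolatile
   registers, then the conditional "sub rsp,8". */
uint64_t rsp_after_prolog(uint64_t entry_rsp, unsigned n_regs) {
    uint64_t rsp = entry_rsp;
    rsp -= 8;                        /* push rbp                      */
    rsp -= 8u * n_regs;              /* irps reg,reglist { push reg } */
    if ((8u * n_regs) % 16 != 0)     /* if (size@stack mod 16)        */
        rsp -= 8;                    /*   sub rsp,8                   */
    return rsp;
}

/* RBP as computed by lea rbp,[rsp+(count@regs+1)*8] right after the pushes. */
uint64_t rbp_after_prolog(uint64_t entry_rsp, unsigned n_regs) {
    uint64_t rsp = entry_rsp - 8 - 8u * n_regs;
    return rsp + (n_regs + 1) * 8u;
}
```

In this model RSP comes out as a multiple of 16 for even and odd register counts alike, and RBP equals the entry RSP, which is why [rbp+8] addresses the home slot written by mov [rsp+8],rcx.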
bitRAKE 07 Feb 2010, 02:34
Bah, supporting the shadow space is required by the API, but the stack unwinding is optional. The documentation should be corrected as follows:
Quote: Every function that [clip/] uses table-based exception handling must have a prolog whose address limits are described in the unwind data associated with the respective function table entry (see Exception Handling (x64)).
hopcode 07 Feb 2010, 03:14
Quote: Every function that [clip/] ...
I hope you are right. Anyway, this doesn't make life easier for developers who hand-coded their 32-bit SEH. If one uses an HLL compiler for a project written entirely in an HLL, I think porting to 64-bit makes much less difference than it does in the other case. But the other case leads me to think about the unofficial explanation of this 64-bit fastcall way...
bitRAKE 07 Feb 2010, 04:15
Don't take my word for it: look at WIN64 compatible projects compiled with non-MS compilers using AddVectoredExceptionHandler. The table-based exception handling (TEH) can work with VEH - just as SEH worked with VEH. IMHO, assembly should use VEH and then fall-back to the VEH/(SEH/TEH) of the HLL or OS. (SEH always was a (cool) hack. )
hopcode 09 Feb 2010, 11:29
bitRAKE wrote: ...IMHO, assembly should use VEH and then fall-back to the VEH/(SEH/TEH...
Anyway, it's not so simple:
Quote: ...In previous Windows Operation System, almost all system functions have standard stack frame. The first instruction of these function is "PUSH EBP" and its opcode is 55h, that is 1 byte. But in Windows XP SP2, the standard prolog code for SEH stack frame of the system function have been replaced with _SEH_prolog function...
This is an excerpt from exactly the example I needed (even if it is not up to date) to show how difficult it is, even for gurus, to match M$'s tough standards: http://www.codeproject.com/KB/system/VEH.aspx
What I meant is this: porting to 64-bit does not simply mean a complete 64-bit rewrite in every case, even when starting from scratch. (I am not off-topic yet.)
Cheers, hopcode
bitRAKE 09 Feb 2010, 19:08
Programmatically, single-stepping is inherently not portable. Matt Pietrek's code does not know what byte size instruction was skipped; and does nothing for larger instructions -- it is very specific to the implementation he is testing it on:
Code:
if ( pExceptionInfo->ExceptionRecord->ExceptionAddress != (PVOID)((DWORD_PTR)g_pfnLoadLibraryAddress+1) ) {
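The hard-coded +1 in the comparison above corresponds to a one-byte first instruction, such as the PUSH EBP (55h) mentioned in the earlier quote. A hypothetical generalisation, written as a plain C model with invented names and addresses rather than the Windows types, makes that dependency explicit by passing the instruction length in.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of the quoted test: after single-stepping the
   first instruction of a function, the exception address should equal
   the function address plus that instruction's length in bytes. */
int stepped_past_first_insn(uintptr_t exception_addr,
                            uintptr_t fn_addr,
                            size_t first_insn_len) {
    return exception_addr == fn_addr + first_insn_len;
}
```

With a length of 1 the check matches only an exception at fn_addr+1. If the function begins with a two-byte instruction, the step lands at fn_addr+2, and a check that assumes +1 no longer recognises it.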
hopcode 11 Feb 2010, 00:40
Yes, as you prefer, he couldn't have known the new specific API implementation..., but that is exactly my point with the example: new system, new rules. Every demo app that expects an M$ 64-bit conformant prolog/epilog and doesn't find it there will obviously crash. Same song, inverted refrain. I hope you are right, but what has changed is precisely that API-internal and stack-ing implementation, and that is really much, much more HLL... SEH/VEH... 64-bit related. That matters a lot when one writes a proc manually, thinking of only one context. Now, I didn't know the following code until yesterday, when I was browsing this material and excerpting it from another party. I would point to this unwind/SEH-enabled code:
Code:
PROC_FRAME sample
    db 0x48 ; emit a REX prefix to enable hot-patching
    push rbp ; save prospective frame pointer
    [pushreg rbp] ; create unwind data for this rbp register push
    sub rsp,0x40 ; allocate stack space
    [allocstack 0x40] ; create unwind data for this stack allocation
    lea rbp,[rsp+0x20] ; assign the frame pointer with a bias of 32
    [setframe rbp,0x20] ; create unwind data for a frame register in rbp
    movdqa [rbp],xmm7 ; save a non-volatile XMM register
    [savexmm128 xmm7, 0x20] ; create unwind data for an XMM register save
    mov [rbp+0x18],rsi ; save rsi
    [savereg rsi,0x38] ; create unwind data for a save of rsi
    mov [rsp+0x10],rdi ; save rdi
    [savereg rdi, 0x10] ; create unwind data for a save of rdi
[endprolog]
    ; We can change the stack pointer outside of the prologue because we
    ; have a frame pointer. If we didn't have one this would be illegal.
    ; A frame pointer is needed because of this stack pointer modification.
    sub rsp,0x60 ; we are free to modify the stack pointer
    mov rax,0 ; we can unwind this access violation
    mov rax,[rax]
    movdqa xmm7,[rbp] ; restore the registers that weren't saved
    mov rsi,[rbp+0x18] ; with a push; this is not part of the
    mov rdi,[rbp-0x10] ; official epilog
    lea rsp,[rbp-0x20] ; This is the official epilog
    pop rbp
    ret
ENDPROC_FRAME
And yes, that is the yes-asm (yasm), with its one-for-all proc macro facilities.
Now, related to my cases:
- I will not rewrite nor use the proc64 macro, for sure. The proc64 macro is too opaque at the moment, even for an ex-macro-man like me.
- I will not use enter/leave to set up a frame in a 64-bit frame function, I promise. enter/leave doesn't follow the M$ standard stack/API implementation, especially when used in epilogs that are to be unwound.
With a few more bytes in the prolog/epilog you give the system the possibility to automatically unwind a crash in even the best-written, most perfect code. If this is not 64-bit SEH related... (But I do not claim that the proc64 macro needs a fix or an extended version only because it is not fully related to M$ 64-bit or exception handling.)
Here is the link summarizing the toughness/restrictions I mentioned, chap. 14.2: http://www.tortall.net/projects/yasm/manual/html/manual.html#win64-calling-convention
And they do not hope what I hope (in fact the chapter title is "14.2 Structured Exception Handling"):
Quote: Most functions that make use of the stack in 64-bit versions of Windows
Cheers, hopcode
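The offsets in the PROC_FRAME listing above mix rbp-relative addressing in the instructions with rsp-relative offsets in the unwind directives. The small C model below relates the two. It assumes only what the listing shows (push rbp, sub rsp,0x40, lea rbp,[rsp+0x20]), and the function names are invented for the example.

```c
#include <stdint.h>

/* r0 is RSP right after "sub rsp,0x40". */
uint64_t frame_pointer(uint64_t r0) { return r0 + 0x20; }  /* lea rbp,[rsp+0x20] */

/* Convert an rbp-relative displacement into the rsp-relative offset
   recorded by directives such as savereg and savexmm128. */
int64_t rsp_offset_of_rbp_disp(int64_t disp) { return disp + 0x20; }

/* Address of the rbp value saved by "push rbp": just above the
   0x40-byte allocation. */
uint64_t saved_rbp_slot(uint64_t r0) { return r0 + 0x40; }
```

By this arithmetic [rbp+0x18] is offset 0x38, [rbp] is offset 0x20, and [rbp-0x10] is offset 0x10, which agrees with the savereg and savexmm128 operands. The saved rbp works out to rbp+0x20 in the same model, so the displacement in the final lea is worth checking against the YASM manual before reusing the sample.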
bitRAKE 11 Feb 2010, 03:50
Thank you, that was a good read - I hadn't spent any time with the YASM manual before - very impressive work. Honestly, I can only speak of my experience over a very short period of time programming in Win64. The API interface seems clear to me, and the table based exception support is optional. I've stepped through some API functions and have a fair grasp of their assumptions at the instruction level.
Having automated tools to dissect/unwind a program is a good idea, but I'm not willing to support this thin veil of protection at this time. The use of symbolic abstraction within FASM should leave me open to future change if needed (i.e. overloading the ENTER/LEAVE instructions). It would be a worthy project to create macros for table-based exception support. The future can have designs against our efforts, but we will adapt. It is luck if the future greets us without compromise. I will no doubt welcome luck and prepare for compromise.
Copyright © 1999-2024, Tomasz Grysztar.