The hardest part of win64 asm coding is keeping the stack aligned at dqword (align 16).


1. First, we must align the stack at the exe entry point.

Why is rsp misaligned at start?
kernel32.dll:
0000000078D59630 48894C2408		BaseProcessStart: mov [rsp+08],rcx ; put address of exe entrypoint into the stack
0000000078D59635 4883EC28		sub rsp,28 ; 28h = 5 qwords: reserve 4 qwords of stack for API + 1 qword to align stack 16
0000000078D59639 41B908000000		mov r9d,00000008
0000000078D5963F 4C8D442430		lea r8,[rsp+30]
0000000078D59644 418D5101		lea edx,[r9+01]
0000000078D59648 48B9FEFFFFFFFFFFFFFF	mov rcx,FFFFFFFFFFFFFFFE ; -2 = pseudo-handle of the current thread
0000000078D59652 FF151080FEFF		call qword [0000000078D41668] ; []=0000000078EF1330=ntdll.NtSetInformationThread
0000000078D59658 FF542430		call qword [rsp+30] ; call the entrypoint
0000000078D5965C 8BC8			mov ecx,eax ; here we return from executable
0000000078D5965E E8BDDB0000		call 0000000078D67220 ; KERNEL32.ExitThread
0000000078D59663 CC			int3
 
rsp is aligned 16 inside kernel32.dll after sub rsp,28. The call qword [rsp+30] then pushes the 8-byte return address and jumps to the exe entry point, so rsp is 1 qword off 16-byte alignment at the exe entry point.

Knowing the above behaviour of kernel32.dll, we can make the smallest possible win64 executable:

start:
	xor	eax,eax		; return value = 0
	ret

If the return value doesn't matter to you, you can omit zeroing eax and make an executable with only 1 instruction, just ret.

But back to alignment...

I personally like to do this step at the exe entry point:

start:
	sub	rsp,8*(4+11)

This perfectly aligns the stack 16. As a benefit, it leaves 4 qwords of stack space for API and 11 qwords for us.
This is the smallest possible encoding of sub rsp: it has only 4 bytes, 48 83 EC 78, because 78h still fits a sign-extended 8-bit immediate. If you use a bigger number, the instruction needs a 32-bit immediate and has 7 bytes. If you don't plan to call any API in the procedure start, then perhaps the smallest possible solution is e.g.

start:
	push	rax
	call	main
	pop	rax
	xor	eax,eax
	ret

main:

But then the task is to align the stack in procedure main.
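
For the first variant (sub rsp,8*(4+11) at the entry point), a complete program might look like the sketch below. It is only an illustration under some assumptions: it relies on the library and import macros from fasm's win64a.inc (so the INCLUDE path must be set up for fasm), and the label names _text and _caption are made up for this example.

format PE64 GUI
entry start

include 'win64a.inc'	; assumption: fasm INCLUDE directory is reachable

section '.text' code readable executable

start:
	sub	rsp,8*(4+11)	; rsp was 8 mod 16 at entry, now it is 0 mod 16
	xor	ecx,ecx		; hWnd = NULL
	lea	rdx,[_text]	; lpText
	lea	r8,[_caption]	; lpCaption
	xor	r9d,r9d		; uType = MB_OK
	call	[MessageBoxA]	; API finds its 4 qwords at [rsp]..[rsp+1Fh]
	xor	ecx,ecx		; exit code = 0
	call	[ExitProcess]

section '.data' data readable writeable

_text		db 'rsp is aligned 16',0
_caption	db 'win64',0

section '.idata' import data readable writeable

library kernel32,'KERNEL32.DLL',\
	user32,'USER32.DLL'
import kernel32,ExitProcess,'ExitProcess'
import user32,MessageBoxA,'MessageBoxA'

The 11 extra qwords above the API area (from [rsp+8*4] up) stay free for your own local data.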


2. Aligning stack in procedures.

This is my preferred way. It has the disadvantage that you can't use push/pop instructions between proc_prologue_done and proc_epilogue. But do you really need pushes/pops when you have 15 registers??? And if you really need to save something on the stack, you can use mov [rsp+...],reg64 instead!!! This way has 1 small benefit: you can access the stack using the RSP register, you don't need RBP for it, so you have 1 extra free register (RBP)!!!

proc:
; proc_prologue
	push	rcx rdx rbx rsi rdi r8 r9 r10 r11
a=1	; return address
b=9	; number of pushed registers
d=(sizeof.LV_ITEM64+7)/8
e=4	; number of qwords reserved for API
c=(a+b+d+e) and 1	; alignment at dqword
	sub	rsp,8*(c+d+e)
; proc_prologue_done

virtual at rsp+8*e
lvi_ccc	LV_ITEM64
end virtual

; the stack now looks like this (highest address first):
; a <- the highest address, contains return address from procedure
; b <- pushed registers
; c <- it is 1 qword or none depending on a,b,d,e and is used to align 16
; d <- LV_ITEM64 structure
; e <- 4 qwords reserved for API
;   <- current RSP

; instructions of your procedure...
; if you need to obtain RCX pushed at proc_prologue, use mov rcx,[rsp+8*(e+d+c+8)]
; if you need to obtain R11 pushed at proc_prologue, use mov r11,[rsp+8*(e+d+c+0)]

; proc_epilogue
	add	rsp,8*(c+d+e)
	pop	r11 r10 r9 r8 rdi rsi rbx rdx rcx
	ret

That's all!
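
To make the a/b/c/d/e arithmetic concrete, here is a worked instance with made-up numbers: a hypothetical procedure my_proc that pushes 3 registers and needs a 20-byte local buffer. The assert line is only a safety net and assumes a fasm version that knows the assert directive. Note that the parentheses around the sum are needed, because fasm gives and a higher priority than +.

my_proc:			; hypothetical procedure name
	push	rbx rsi rdi
a=1	; return address
b=3	; number of pushed registers
d=(20+7)/8	; = 3 qwords for a hypothetical 20-byte local buffer
e=4	; number of qwords reserved for API
c=(a+b+d+e) and 1	; 1+3+3+4 = 11 is odd, so c = 1
	sub	rsp,8*(c+d+e)	; 8*(1+3+4) = 64 bytes
assert (a+b+c+d+e) and 1 = 0	; 12 qwords in total, an even number, so rsp is aligned 16

	; instructions of the procedure, the local buffer starts at [rsp+8*e]

	add	rsp,8*(c+d+e)
	pop	rdi rsi rbx
	ret

If you later add or remove a pushed register, only b has to be changed and c adjusts itself.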

This way isn't easy, so I thought about how to double-check the stack, because we are all human and we make mistakes!
So I had 4 ideas, and combining 2 or 3 of them may greatly reduce the risk of stack misalignment:
a) check the source code manually again
b) let the program be single-stepped (utility fta16.exe) - usable only for small executables, it can be too slow for big files. Advantage - it scans everything thoroughly, so no hidden misalignment can stay undiscovered!!!
Because Vista dlls are much bigger than xp64 dlls, I strongly recommend using XP64 rather than Vista64 for this. Checking a simple MessageBoxA takes about 1 minute under XP64!!!
c) disassemble the program and check the disassembled output manually or with fxa16.exe (fdisasm.exe your_prog.exe, then fxa16.exe your_prog.d64) - note, you can use it for checking a DLL too, but rename the dll to exe first (fdisasm checks for the exe extension of the input file)
d) put a testing instruction inside the procedure that causes an exception on misalignment, e.g.
movdqa dqword [rsp],xmm0
you can catch the exception in a debugger
someone clever may think of a macro so that movdqa [],xmm is emitted only in the development stage and not in the final (ready to release) build (e.g. by simply switching testing_mode=1 / testing_mode=0 ...)
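
Idea d) with the testing_mode switch could be sketched like this. The macro name check_align16 and the procedure name are made up for the example. Keep in mind that the test instruction overwrites the 16 bytes at [rsp]; with the stack layout from part 2 that is the area reserved for API, so nothing of ours should live there.

testing_mode = 1	; 1 = development build, 0 = release build

macro check_align16	; hypothetical macro name
{
  if testing_mode = 1
	movdqa	dqword [rsp],xmm0	; raises an exception when rsp is not aligned 16
  end if
}

proc_under_test:		; hypothetical procedure
	sub	rsp,8*(1+4)	; 1 qword for alignment + 4 qwords reserved for API
	check_align16		; emits movdqa only when testing_mode = 1
	; instructions of the procedure
	add	rsp,8*(1+4)
	ret

With testing_mode = 0 the macro expands to nothing, so the release build carries no extra bytes.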