flat assembler
Message board for the users of flat assembler.

Win64 calling convention summary with example

Iaaa 09 Dec 2016, 11:21
Hello all,

I have finished part of my code and want to share my experience with calling Win x64 functions. I hope it will be useful for those of you who want to call x64 native code manually. I have also prepared a similar post about the Linux x64 calling convention (it is completely different from the Windows one, and more interesting) /will add url/.

So. There are a lot of resources about Win64 calling, but the information is quite confusing and not as clear as expected. I'll try to summarize all aspects of the calling convention and give example code that covers all the cases.

Let's describe the task: I want a function that can call any Win64 native function with any number of arguments of any type, including floating point (single and double precision) arguments and floating point results.

First, choose the prototype of our universal calling function:
Code:
int64_t x64_call(int64_t argv[], int64_t argc, void* function, int64_t type);
// int64_t (from <stdint.h>) is a 64-bit machine word, void* is a 64-bit pointer; fyi
// (a plain long is only 32 bits wide on Win64)
    

where
* argv: array of arguments, one 64-bit word each, independent of argument type
* argc: count of arguments in the argv array
* function: address of the function to call
* type: return type of the function: 0 means integer, 1 means float and 2 means double
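
In the function source given later in this post, a float result is widened to a double before its bits are copied to rax, so for type 1 and type 2 alike the caller receives the bit pattern of a double. A minimal C sketch of decoding that word, assuming IEEE-754 doubles; the names RET_INT, RET_FLOAT, RET_DOUBLE and result_as_double are hypothetical and not part of the original code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names for the values of the `type` parameter. */
enum { RET_INT = 0, RET_FLOAT = 1, RET_DOUBLE = 2 };

/* Reinterpret the 64-bit word returned by x64_call as a double.
 * Only meaningful for type 1 and 2; for type 0 the word is the integer itself. */
static double result_as_double(int64_t raw, int64_t type)
{
    double d;
    assert(type == RET_FLOAT || type == RET_DOUBLE);
    memcpy(&d, &raw, sizeof d);   /* copy the bits, do not convert the value */
    return d;
}
```

A plain cast such as (double)raw would convert the integer value instead of reinterpreting its bits, which is why memcpy is used here.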

Do I need to explain why every argument is a 64-bit word? An 8-bit char, 16-bit word, 32-bit word or 64-bit word is always passed as a full 64-bit word, possibly with garbage in the unused part of the register. 32-bit floats and 64-bit doubles are passed as full 64-bit words too. Ok?
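
To make the "everything is a 64-bit word" idea concrete, here is a small C sketch of how a caller could fill the argv slots. It is only an illustration under the assumption of IEEE-754 floats; the slot_from_* helpers are hypothetical names, not something from the original code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: each one packs a value into one 64-bit argv slot. */

static uint64_t slot_from_int(int32_t v)
{
    /* Sign-extended here for readability; a callee that expects a
     * 32-bit int only looks at the low 32 bits anyway. */
    return (uint64_t)(int64_t)v;
}

static uint64_t slot_from_float(float v)
{
    uint32_t bits;
    assert(sizeof bits == sizeof v);
    memcpy(&bits, &v, sizeof bits);   /* raw bits of the float */
    return bits;                      /* upper 32 bits are unused */
}

static uint64_t slot_from_double(double v)
{
    uint64_t slot;
    assert(sizeof slot == sizeof v);
    memcpy(&slot, &v, sizeof slot);   /* raw bits of the double */
    return slot;
}

static uint64_t slot_from_ptr(const void *p)
{
    return (uint64_t)(uintptr_t)p;    /* pointers are already 64-bit */
}
```

For a printf-style callee a float has to be widened first, so the slot would be built with slot_from_double((double)f) rather than slot_from_float(f).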

And now the main facts about calling, before I post the function code:
1 The first 4 arguments are always passed in registers. It doesn't matter what type the argument is: integer, float or pointer. Only 4 (four) arguments, not 3 and not 5.
2 Of these first four, the integer arguments (everything except floating point) go in rcx, rdx, r8 and r9, selected by argument position.
3 Of these first four, the floating point arguments go in xmm0, xmm1, xmm2 and xmm3, again selected by argument position. For a variadic function such as printf, a floating point value must also be copied to the integer register of the same position.
4 All other arguments (if there are any) are always passed on the stack, pushed in right-to-left order, and again the type of the argument doesn't matter.
5 Before calling the function you must subtract four 64-bit words from the stack pointer (sub rsp, 32). It seems stupid to me, but it's true.
6 The stack must be cleaned up by the caller (the callee returns without removing any arguments). In other words, rsp must be the same before and after the "call" instruction.
7 rax, rcx, rdx, r8, r9, r10, r11 and xmm0-xmm5 are volatile (they can be destroyed by the called function).
8 Last but most important: at the moment of the call your stack must be aligned to a 16-byte (128-bit) boundary.

Again, only the first 4 (four) arguments are passed in registers. It means that if you want to pass (int,int,float,int,float), then the first 4 arguments go in rcx, rdx, xmm2 and r9, and the 5th argument must be passed on the stack.
If we pass (float,float,float,float,int), then we fill xmm0-xmm3 and the 5th argument must be passed on the stack. It doesn't matter that we have 4 unused registers (rcx,rdx,r8,r9); Win64 doesn't care about this waste of resources and speed. Clear?
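
The position rule can be sketched in a few lines of C. This is only an illustration of the rules listed above; the types arg_class and arg_loc and the function arg_location are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ARG_INT, ARG_FLOAT } arg_class;        /* pointers count as ARG_INT */
typedef enum { IN_INT_REG, IN_XMM_REG, ON_STACK } arg_where;

typedef struct {
    arg_where where;
    size_t    n;   /* register index 0..3, or byte offset from rsp at the call */
} arg_loc;

/* Integer register index 0..3 means rcx, rdx, r8, r9;
 * xmm register index 0..3 means xmm0, xmm1, xmm2, xmm3. */
static arg_loc arg_location(size_t index, arg_class cls)
{
    arg_loc loc;
    if (index < 4) {
        /* the position picks the register number, the class only picks the bank */
        loc.where = (cls == ARG_FLOAT) ? IN_XMM_REG : IN_INT_REG;
        loc.n = index;
    } else {
        /* the 32-byte window comes first, so the 5th argument is at [rsp+32] */
        loc.where = ON_STACK;
        loc.n = index * 8;
    }
    return loc;
}
```

For (int,int,float,int,float) this gives rcx, rdx, xmm2, r9 and [rsp+32]; note that xmm0, xmm1 and r8 are simply skipped.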

That's all for the requirements. Now I post the function source so you can check how it really works, and not just talk about it.

Code:
x64_call: ; argv(rcx), argc(rdx), function(r8), type(r9)
        ; standard function entry
        push  rbp
        mov   rbp, rsp

        push  r9       ; save the return type at [rbp-8] for later use
        and   rsp, -16 ; ! align the stack to a 16-byte (128-bit) boundary

        mov   rax, rdx

        ; point r10 at the last argument, then compare the argument count to 4
        ; if it is less or equal, do not push any data to the stack
        lea   r10, [rcx + rax*8 - 8]
        sub   rax, 4
        jbe   .4

        ; pre-adjust rsp so that it is 16-byte aligned again after the pushes:
        ; odd count of pushed arguments -> rsp-8, even count -> rsp-16
        ; a bit ugly, but works
        mov   rdx, rax
        and   rdx, 1
        lea   rsp, [rsp + rdx*8 - 16]

.1:     ; push the remaining arguments (5th and up) in right-to-left order
        push qword [r10]
        sub  r10, 8
        dec  rax
        jnz  .1
.4:     ; load the first 4 slots into both the integer and the xmm registers,
        ; regardless of the real count and types (this always reads argv[0..3])
        ; comparing the actual number of arguments and skipping the unnecessary
        ; ones would be a lot slower, really
        mov  rax, r8 ; save function address
        movsd xmm3, [rcx+24]
        mov     r9, [rcx+24]
        movsd xmm2, [rcx+16]
        mov     r8, [rcx+16]
        movsd xmm1, [rcx+ 8]
        mov    rdx, [rcx+ 8]
        movsd xmm0, [rcx+ 0]
        mov    rcx, [rcx+ 0]
        ; required! reserve the 32-byte window on the stack before the call
        sub  rsp, 32
        call rax    ; call the function

        ; check whether the function returns a floating point value
        cmp  byte [rbp-8], 1 ; float
        je   .51
        cmp  byte [rbp-8], 2 ; double
        je   .52

        ; the result, whatever its type, is returned in rax, not xmm0
.9:     leave
        ret

.51:
        ; convert the float to a double
        cvtss2sd xmm0, xmm0
.52:
        ; copy the bits of the double to the regular rax
        movsd [rsp], xmm0
        pop  rax
        jmp  .9
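
The alignment arithmetic in x64_call is easy to get wrong, so here is a small C model of it. It is a sketch that only mirrors the instructions above (the lea adjustment, the pushes and the final sub rsp, 32); the function name bytes_below_aligned_rsp is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* How far x64_call moves rsp down, counted from the value it has right
 * after "and rsp, -16", by the time "call rax" is executed. */
static uint64_t bytes_below_aligned_rsp(uint64_t argc)
{
    uint64_t total = 0;
    if (argc > 4) {
        uint64_t on_stack = argc - 4;                /* arguments 5 and up */
        uint64_t adjust = 16 - (on_stack & 1) * 8;   /* lea rsp, [rsp + rdx*8 - 16] */
        total += adjust + on_stack * 8;              /* the adjustment plus the pushes */
    }
    total += 32;                                     /* sub rsp, 32 */
    assert(total % 16 == 0);                         /* rsp is aligned at the call */
    return total;
}
```

For an even count of pushed arguments the lea subtracts a full 16 bytes; that keeps the alignment but is more than strictly needed. The adjustment lands above the pushed arguments, so the 5th argument still sits directly above the 32-byte window.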
    


And here is the full sample code to test a wide range of argument counts. "d" means an integer value in the printf output, "f" means a floating point value and "s" means a string argument (terminated by a 0 byte). For a floating point argument printf requires a "double", not a "float", value to be passed. High-level compilers do this conversion automatically for you.
Code:
format PE64 CONSOLE
entry start

section '.text' code readable executable
x64_call: ; argv(rcx), argc(rdx), function(r8), type(r9)
        ; standard function entry
        push  rbp
        mov   rbp, rsp

        push  r9       ; save the return type at [rbp-8] for later use
        and   rsp, -16 ; ! align the stack to a 16-byte (128-bit) boundary

        mov   rax, rdx

        ; point r10 at the last argument, then compare the argument count to 4
        ; if it is less or equal, do not push any data to the stack
        lea   r10, [rcx + rax*8 - 8]
        sub   rax, 4
        jbe   .4

        ; pre-adjust rsp so that it is 16-byte aligned again after the pushes:
        ; odd count of pushed arguments -> rsp-8, even count -> rsp-16
        ; a bit ugly, but works
        mov   rdx, rax
        and   rdx, 1
        lea   rsp, [rsp + rdx*8 - 16]

.1:     ; push the remaining arguments (5th and up) in right-to-left order
        push qword [r10]
        sub  r10, 8
        dec  rax
        jnz  .1
.4:     ; load the first 4 slots into both the integer and the xmm registers,
        ; regardless of the real count and types (this always reads argv[0..3])
        ; comparing the actual number of arguments and skipping the unnecessary
        ; ones would be a lot slower, really
        mov  rax, r8 ; save function address
        movsd xmm3, [rcx+24]
        mov     r9, [rcx+24]
        movsd xmm2, [rcx+16]
        mov     r8, [rcx+16]
        movsd xmm1, [rcx+ 8]
        mov    rdx, [rcx+ 8]
        movsd xmm0, [rcx+ 0]
        mov    rcx, [rcx+ 0]
        ; required! reserve the 32-byte window on the stack before the call
        sub  rsp, 32
        call rax    ; call the function

        ; check whether the function returns a floating point value
        cmp  byte [rbp-8], 1 ; float
        je   .51
        cmp  byte [rbp-8], 2 ; double
        je   .52

        ; the result, whatever its type, is returned in rax, not xmm0
.9:     leave
        ret

.51:
        ; convert the float to a double
        cvtss2sd xmm0, xmm0
.52:
        ; copy the bits of the double to the regular rax
        movsd [rsp], xmm0
        pop  rax
        jmp  .9

start:
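        and rsp, -16    ; align the stack to a 16-byte boundary
        sub rsp, 32     ; reserve the 32-byte window for the calls made from here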

        lea rbx, [_args]
.1:
        cmp qword [rbx], 0
        jz  .2
        mov rcx, [rbx]
        mov rdx, [rbx+8]
        mov rdx, [rdx]
        mov r8, [printf]
        mov r9, 0
        call x64_call
        add rbx, 16
        jmp .1

.2:
        xor rcx, rcx
        call    [ExitProcess]

section '.data' data readable writeable

  _format1  db "1> d:%d",13,10,0
  _args1 dq _format1, 1
  _argc1 dq (_argc1-_args1)/8

  _format2  db "2> d:%d, f:%f",13,10,0
  _args2 dq _format2, 1, 2.2
  _argc2 dq (_argc2-_args2)/8

  _format3  db "3> d:%d, d:%d, d:%d, d:%d",13,10,0
  _args3 dq _format3, 1, 2, 3, 4
  _argc3 dq (_argc3-_args3)/8

  _format4  db "4> f:%f, f:%f, f:%f, f:%f",13,10,0
  _args4 dq _format4, 1.1, 2.2, 3.3, 4.4
  _argc4 dq (_argc4-_args4)/8

  _format5  db "5> f:%f, d:%d, f:%f, d:%d",13,10,0
  _args5 dq _format5, 1.1, 2, 3.3, 4
  _argc5 dq (_argc5-_args5)/8

  _format6  db "6> f:%f, f:%f, f:%f, f:%f, f:%f, f:%f, f:%f, f:%f, f:%f",13,10,0
  _args6 dq _format6, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9
  _argc6 dq (_argc6-_args6)/8

  _format7  db "7> d:%d, d:%d, d:%d, d:%d, d:%d, d:%d, d:%d, d:%d, d:%d",13,10,0
  _args7 dq _format7, 1, 2, 3, 4, 5, 6, 7, 8, 9
  _argc7 dq (_argc7-_args7)/8

  _format8  db "8> d:%d, d:%d, d:%d, d:%d, d:%d, d:%d, d:%d, d:%d, d:%d",13,10
            db "   f:%f, f:%f, f:%f, f:%f, f:%f, f:%f, f:%f, f:%f, f:%f",13,10,0
  _args8 dq _format8, 1, 2, 3, 4, 5, 6, 7, 8, 9
         dq           1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9
  _argc8 dq (_argc8-_args8)/8

  _format9  db "9> d:%d, f:%f, d:%d, f:%f, d:%d, f:%f, d:%d, f:%f, d:%d",13,10
            db "   f:%f, d:%d, f:%f, d:%d, f:%f, d:%d, f:%f, d:%d, f:%f",13,10
            db "   s:%s",13,10
            db "   f:%f, d:%d, f:%f, d:%d, f:%f, d:%d, f:%f, d:%d, f:%f",13,10
            db "   d:%d, f:%f, d:%d, f:%f, d:%d, f:%f, d:%d, f:%f, d:%d",13,10
            db "done.",13,10,0
  _string9  db "in the middle of 38 arguments list",0
  _args9 dq _format9, 1, 2.2, 3, 4.4, 5, 6.6, 7, 8.8, 9
         dq           1.1, 2, 3.3, 4, 5.5, 6, 7.7, 8, 9.9
         dq           _string9
         dq           9.9, 8, 7.7, 6, 5.5, 4, 3.3, 2, 1.1
         dq           9, 8.8, 7, 6.6, 5, 4.4, 3, 2.2, 1
  _argc9 dq (_argc9-_args9)/8

  _args dq _args1, _argc1
        dq _args2, _argc2
        dq _args3, _argc3
        dq _args4, _argc4
        dq _args5, _argc5
        dq _args6, _argc6
        dq _args7, _argc7
        dq _args8, _argc8
        dq _args9, _argc9
        dq 0

section '.idata' import data readable writeable

  dd 0,0,0,RVA kernel_name,RVA kernel_table
  dd 0,0,0,RVA msvcrt_name,RVA msvcrt_table
  dd 0,0,0,0,0

  kernel_table:
    ExitProcess dq RVA _ExitProcess
    dq 0
  msvcrt_table:
    printf dq RVA _printf
    dq 0

  kernel_name db 'KERNEL32.DLL',0
  msvcrt_name db 'MSVCRT.DLL',0

  _ExitProcess dw 0
    db 'ExitProcess',0
  _printf dw 0
    db 'printf',0
    


Execution results. It works, as you can see. Executed under 64-bit Windows 10.
Code:
1> d:1
2> d:1, f:2.200000
3> d:1, d:2, d:3, d:4
4> f:1.100000, f:2.200000, f:3.300000, f:4.400000
5> f:1.100000, d:2, f:3.300000, d:4
6> f:1.100000, f:2.200000, f:3.300000, f:4.400000, f:5.500000, f:6.600000, f:7.700000, f:8.800000, f:9.900000
7> d:1, d:2, d:3, d:4, d:5, d:6, d:7, d:8, d:9
8> d:1, d:2, d:3, d:4, d:5, d:6, d:7, d:8, d:9
   f:1.100000, f:2.200000, f:3.300000, f:4.400000, f:5.500000, f:6.600000, f:7.700000, f:8.800000, f:9.900000
9> d:1, f:2.200000, d:3, f:4.400000, d:5, f:6.600000, d:7, f:8.800000, d:9
   f:1.100000, d:2, f:3.300000, d:4, f:5.500000, d:6, f:7.700000, d:8, f:9.900000
   s:in the middle of 38 arguments list
   f:9.900000, d:8, f:7.700000, d:6, f:5.500000, d:4, f:3.300000, d:2, f:1.100000
   d:9, f:8.800000, d:7, f:6.600000, d:5, f:4.400000, d:3, f:2.200000, d:1
done.
    


Any feedback, comments, questions?
revolution 09 Dec 2016, 11:33
Iaaa wrote:
5 Before calling the function you must subtract four 64-bit words from the stack pointer (sub rsp, 32). It seems stupid to me, but it's true.
This is not entirely clear. What is actually happening is that you must have space on the stack for all arguments (minimum of 4), but only those arguments that are number 5 and up are actually put into the stack.

For example if you have 6 arguments you can do this:
Code:
sub rsp,6*8 ;make space for all six arguments
mov [rsp+5*8],arg6
mov [rsp+4*8],arg5
mov r9,arg4
mov r8,arg3
mov rdx,arg2
mov rcx,arg1
call function    
Also, this is assembly code so we can have more than one return value. E.g. RAX, RCX and RDX, or whatever we want to define. Obviously this won't apply to system calls, but internal code can use this.
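
The slot layout from the six-argument example can be generalized with a little arithmetic. The C sketch below is an illustration only: the names call_frame_size and arg_slot_offset are hypothetical, and the rounding up to 16 bytes is an assumption added here to satisfy the alignment rule from the first post, on the premise that rsp is 16-byte aligned before the sub:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes to subtract from a 16-byte aligned rsp for a call with argc arguments. */
static size_t call_frame_size(size_t argc)
{
    size_t slots = argc < 4 ? 4 : argc;    /* space for all arguments, minimum of 4 */
    size_t bytes = slots * 8;
    return (bytes + 15) & ~(size_t)15;     /* round up so rsp stays aligned */
}

/* Offset from rsp (after the sub) of the slot of 1-based argument number n.
 * Arguments 1..4 travel in registers, but their slots are still reserved. */
static size_t arg_slot_offset(size_t n)
{
    assert(n >= 1);
    return (n - 1) * 8;
}
```

With 6 arguments this reproduces the example: 48 bytes reserved, arg5 at [rsp+4*8] and arg6 at [rsp+5*8]. With 5 arguments the 40 bytes get rounded up to 48.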
Tomasz Grysztar 09 Dec 2016, 11:46
revolution wrote:
This is not entirely clear. What is actually happening is that you must have space on the stack for all arguments (minimum of 4), but only those arguments that are number 5 and up are actually put into the stack.
You can find a hint of the possible purpose of this requirement in fasm's examples - in EXAMPLES/WIN64/TEMPLATE/TEMPLATE.ASM you may find this fragment:
Code:
proc WindowProc uses rbx rsi rdi, hwnd,wmsg,wparam,lparam

; Note that first four parameters are passed in registers,
; while names given in the declaration of procedure refer to the stack
; space reserved for them - you may store them there to be later accessible
; if the contents of registers gets destroyed. This may look like:
;       mov     [hwnd],rcx
;       mov     [wmsg],edx
;       mov     [wparam],r8
;       mov     [lparam],r9    

Iaaa: you can also find some discussions on these conventions in the old threads on this board, especially from the time when fasm's fastcall/proc macros were created.
Iaaa 09 Dec 2016, 12:01
@Tomasz, thanks for the link.

@revolution, the System V ABI has no requirement to reserve stack space for all arguments. To me it's a waste of space.
revolution 09 Dec 2016, 12:34
Iaaa wrote:
@revolution, the System V ABI has no requirement to reserve stack space for all arguments. To me it's a waste of space.
I wasn't addressing the wastage or otherwise, I just wanted to clarify this Windows requirement, we are stuck with it. You can use other conventions internally within your own code if desired. Personally for my 64-bit code I don't usually use fastcall internally, it is a bit cumbersome and inconsistent for my taste.
Iaaa 09 Dec 2016, 20:30
Ok.