Iaaa
Joined: 27 Mar 2008
Posts: 6
Hello all,
I have finished part of my code and want to share my experience with calling win x64 functions. I hope it will be useful for those of you who want to call x64 native code manually. I have also prepared a similar post about the Linux x64 calling convention (it is completely different from the Windows one, and more interesting) /will add url/.
So. There are a lot of resources about win64 calling, but the information is confusing and not as clear as one would expect. I'll try to summarize all aspects of the calling convention and give example code that covers all cases.
Let's describe the task: I want a function that can call any win64 native function with any number of arguments of any type, including floating point (single and double precision) arguments and floating point function results.
Here is the prototype of our universal calling function:
long x64_call(long argv[], long argc, void* function, long type);
// long here means a 64-bit machine word (with MSVC use long long or __int64, since its long is 32-bit), void* is a 64-bit pointer; fyi
where
* argv: array of arguments, one 64-bit word per argument regardless of its type
* argc: number of arguments in the argv array
* function: address of the function to call
* type: return type of the function; 0 means integer, 1 means float and 2 means double
Do I need to explain why every argument is a 64-bit word? An 8-bit char, 16-bit word, 32-bit word or 64-bit word is always passed as a full 64-bit word, possibly with garbage in the unused part of the register. 32-bit floats and 64-bit doubles are passed as full 64-bit words too. Ok?
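To make the "everything is a 64-bit word" idea concrete, here is a minimal C sketch of how values of different types could be packed into the argv array, and how a floating point result could be read back from the word that x64_call returns. The helper names are made up for illustration and are not part of the code in this post; it assumes a little-endian machine with IEEE 754 floats, which is the case on x64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers, not part of the original x64_call code. */

/* An integer of any width occupies one 64-bit slot (sign-extended here;
   the callee only looks at the low bits it needs). */
uint64_t slot_from_int(int64_t value) {
    return (uint64_t)value;
}

/* A double is stored as its raw 64-bit pattern. */
uint64_t slot_from_double(double value) {
    uint64_t bits;
    assert(sizeof bits == sizeof value);
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

/* A float is stored as its raw 32-bit pattern in the low half of the slot. */
uint64_t slot_from_float(float value) {
    uint32_t bits;
    assert(sizeof bits == sizeof value);
    memcpy(&bits, &value, sizeof bits);
    return (uint64_t)bits;
}

/* A floating point result comes back as the bit pattern of a double. */
double double_from_slot(uint64_t slot) {
    double value;
    memcpy(&value, &slot, sizeof value);
    return value;
}
```

With helpers like these, an argv entry for 2.2 would be slot_from_double(2.2), and the word returned for a function with type 1 or 2 would go through double_from_slot.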
And now the main facts about calling, before I post the function code:
1 The first 4 arguments are always passed in registers. It doesn't matter what type the argument is: integer, float or pointer. Only 4 (four) arguments, not 3 and not 5.
2 Of these first four, the fixed point arguments (everything but floats) go in rcx, rdx, r8 and r9, according to their position.
3 Of these first four, the floating point arguments go in xmm0, xmm1, xmm2 and xmm3, according to their position. For variadic functions such as printf, a floating point value must also be copied into the matching integer register, which is why the code below loads both.
4 All other arguments (if there are any) are always passed on the stack in right-to-left order, again regardless of the argument type.
5 Before calling the function you must subtract four 64-bit words from the stack pointer (sub rsp, 32). This area (the "shadow space") belongs to the callee, which may use it to save the register arguments. It seems stupid to me, but it's true.
6 The stack must be freed by the caller (the callee returns with the same rsp it started with). In other words, rsp must be the same before and after the "call" instruction.
7 rax, rcx, rdx, r8, r9, r10, r11 and xmm0-xmm5 are volatile (can be destroyed by the called function).
8 And last but not least: before calling the function your stack must be aligned on a 16-byte (128-bit) boundary.
Again, only the first 4 (four) arguments are passed in registers. This means that if you want to pass (int,int,float,int,float), only the first 4 arguments will be "registered" (rcx, rdx, xmm2, r9) and the 5th argument must be passed on the stack.
If we pass (float,float,float,float,int), we fill xmm0-xmm3 and the 5th argument must go on the stack. It doesn't matter that we have 4 unused registers (rcx,rdx,r8,r9); win64 doesn't care about this waste of resources and speed. Clear?
That's all for the requirements. Now I'll post the function source so you can check that it really works and isn't just talk.
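The position-based rule can be sketched in a few lines of C. This is only an illustration with made-up function names, not something the assembly code uses: it reports where argument number index (counting from 0) travels, and at which offset from rsp a stack argument sits at the moment of the "call" instruction, assuming the 32-byte window from rule 5 has already been reserved.

```c
#include <assert.h>

/* Hypothetical helper: the place where argument number `index` (0-based)
   is passed. Only the position selects the register number; the type only
   selects between the integer and the xmm register of that position. */
const char *win64_arg_location(int index, int is_floating) {
    static const char *const int_regs[4] = { "rcx", "rdx", "r8", "r9" };
    static const char *const fp_regs[4]  = { "xmm0", "xmm1", "xmm2", "xmm3" };
    assert(index >= 0);
    if (index >= 4)
        return "stack";
    return is_floating ? fp_regs[index] : int_regs[index];
}

/* Hypothetical helper: offset from rsp of a stack argument just before the
   "call" instruction. The first 32 bytes are the window reserved for the
   four register arguments, so the 5th argument (index 4) is at [rsp+32]. */
long win64_stack_offset(int index) {
    assert(index >= 4);
    return 8L * index;
}
```

For (int,int,float,int,float) this gives rcx, rdx, xmm2, r9 and stack; xmm0, xmm1 and r8 simply stay unused for that call.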
x64_call: ; argv(rcx), argc(rdx), function(r8), type(r9)
; standard function entry
push rbp
mov rbp, rsp
push r9 ; save the type for future use
and rsp, -16 ; ! align the stack for 128-bit boundary
mov rax, rdx
; compare the argument count to 4
; if less or equal do not push any data to stack
lea r10, [rcx + rax*8 - 8]
sub rax, 4
jbe .4
; realign the stack depending on whether the number of pushed arguments is odd or even
; a bit ugly, but works
mov rdx, rax
and rdx, 1
lea rsp, [rsp + rdx*8 - 16]
.1: ; push the remaining arguments (beyond the first 4) to the stack in right-to-left order
push qword [r10]
sub r10, 8
dec rax
jnz .1
.4: ; load the first 4 arguments into the registers regardless of
; their real count and type (argv[0..3] is read even when argc < 4)
; comparing the actual number of arguments and skipping the unnecessary
; ones would be a lot slower, really
mov rax, r8 ; save function address
movsd xmm3, [rcx+24]
mov r9, [rcx+24]
movsd xmm2, [rcx+16]
mov r8, [rcx+16]
movsd xmm1, [rcx+ 8]
mov rdx, [rcx+ 8]
movsd xmm0, [rcx+ 0]
mov rcx, [rcx+ 0]
; required! reserve the 32-byte window (shadow space) on the stack before calling the function
sub rsp, 32 ;
call rax ; call the function
; check whether the function returns a floating point value
cmp byte [rbp-8], 1 ; float
je .51
cmp byte [rbp-8], 2 ; double
je .52
; a result of any type is returned in rax, not in xmm0
.9: leave
ret
.51:
; converting float to double
cvtss2sd xmm0, xmm0
.52:
; copy the double to the regular rax
movsd [rsp], xmm0
pop rax
jmp .9
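The stack arithmetic in the listing above can be checked with a small C sketch. The function name is made up for illustration; it mirrors the three places where the assembly lowers rsp after the "and rsp, -16" (the realignment lea, the push loop and the "sub rsp, 32") and returns the total number of bytes, which has to be a multiple of 16 for the stack to stay aligned at the call.

```c
#include <assert.h>

/* Hypothetical helper mirroring the assembly above: bytes by which
   x64_call lowers the already aligned rsp before "call rax". */
long x64_call_stack_bytes(long argc) {
    long pushed, padding;
    assert(argc >= 0);
    pushed = argc > 4 ? argc - 4 : 0;   /* arguments pushed by the .1 loop */
    if (pushed == 0)
        padding = 0;                    /* jbe .4 skips the realignment */
    else
        padding = (pushed & 1) ? 8 : 16; /* lea rsp, [rsp + rdx*8 - 16] */
    return padding + pushed * 8 + 32;   /* plus the window from sub rsp, 32 */
}
```

For an even number of pushed arguments the 16 bytes of padding are more than strictly needed, but they keep the code short and do no harm.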
And here is the full sample code to test a wide range of argument counts. In the printf output, "d" marks an integer value, "f" a floating point value and "s" a string argument (zero-terminated). For a floating point argument printf requires a "double", not a "float", to be passed. High-level compilers do this conversion automatically for you.
format PE64 CONSOLE
entry start
section '.text' code readable executable
x64_call: ; argv(rcx), argc(rdx), function(r8), type(r9)
; standard function entry
push rbp
mov rbp, rsp
push r9 ; save the type for future use
and rsp, -16 ; ! align the stack for 128-bit boundary
mov rax, rdx
; compare the argument count to 4
; if less or equal do not push any data to stack
lea r10, [rcx + rax*8 - 8]
sub rax, 4
jbe .4
; realign the stack depending on whether the number of pushed arguments is odd or even
; a bit ugly, but works
mov rdx, rax
and rdx, 1
lea rsp, [rsp + rdx*8 - 16]
.1: ; push the remaining arguments (beyond the first 4) to the stack in right-to-left order
push qword [r10]
sub r10, 8
dec rax
jnz .1
.4: ; load the first 4 arguments into the registers regardless of
; their real count and type (argv[0..3] is read even when argc < 4)
; comparing the actual number of arguments and skipping the unnecessary
; ones would be a lot slower, really
mov rax, r8 ; save function address
movsd xmm3, [rcx+24]
mov r9, [rcx+24]
movsd xmm2, [rcx+16]
mov r8, [rcx+16]
movsd xmm1, [rcx+ 8]
mov rdx, [rcx+ 8]
movsd xmm0, [rcx+ 0]
mov rcx, [rcx+ 0]
; required! reserve the 32-byte window (shadow space) on the stack before calling the function
sub rsp, 32 ;
call rax ; call the function
; check whether the function returns a floating point value
cmp byte [rbp-8], 1 ; float
je .51
cmp byte [rbp-8], 2 ; double
je .52
; a result of any type is returned in rax, not in xmm0
.9: leave
ret
.51:
; converting float to double
cvtss2sd xmm0, xmm0
.52:
; copy the double to the regular rax
movsd [rsp], xmm0
pop rax
jmp .9
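start:
and rsp, -16 ; align the stack
lea rbx, [_args] ; rbx walks the table of (argv, address of argc) pairs
.1:
cmp qword [rbx], 0 ; a zero entry ends the table
jz .2
mov rcx, [rbx] ; argv
mov rdx, [rbx+8] ; address of argc
mov rdx, [rdx] ; argc
mov r8, [printf] ; function to call
mov r9, 0 ; result type: integer
call x64_call
add rbx, 16
jmp .1
.2:
xor rcx, rcx ; exit code 0
sub rsp, 32 ; the window is required here too
call [ExitProcess]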
section '.data' data readable writeable
_format1 db "1> d:%d",13,10,0
_args1 dq _format1, 1
_argc1 dq (_argc1-_args1)/8
_format2 db "2> d:%d, f:%f",13,10,0
_args2 dq _format2, 1, 2.2
_argc2 dq (_argc2-_args2)/8
_format3 db "3> d:%d, d:%d, d:%d, d:%d",13,10,0
_args3 dq _format3, 1, 2, 3, 4
_argc3 dq (_argc3-_args3)/8
_format4 db "4> f:%f, f:%f, f:%f, f:%f",13,10,0
_args4 dq _format4, 1.1, 2.2, 3.3, 4.4
_argc4 dq (_argc4-_args4)/8
_format5 db "5> f:%f, d:%d, f:%f, d:%d",13,10,0
_args5 dq _format5, 1.1, 2, 3.3, 4
_argc5 dq (_argc5-_args5)/8
_format6 db "6> f:%f, f:%f, f:%f, f:%f, f:%f, f:%f, f:%f, f:%f, f:%f",13,10,0
_args6 dq _format6, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9
_argc6 dq (_argc6-_args6)/8
_format7 db "7> d:%d, d:%d, d:%d, d:%d, d:%d, d:%d, d:%d, d:%d, d:%d",13,10,0
_args7 dq _format7, 1, 2, 3, 4, 5, 6, 7, 8, 9
_argc7 dq (_argc7-_args7)/8
_format8 db "8> d:%d, d:%d, d:%d, d:%d, d:%d, d:%d, d:%d, d:%d, d:%d",13,10
db " f:%f, f:%f, f:%f, f:%f, f:%f, f:%f, f:%f, f:%f, f:%f",13,10,0
_args8 dq _format8, 1, 2, 3, 4, 5, 6, 7, 8, 9
dq 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9
_argc8 dq (_argc8-_args8)/8
_format9 db "9> d:%d, f:%f, d:%d, f:%f, d:%d, f:%f, d:%d, f:%f, d:%d",13,10
db " f:%f, d:%d, f:%f, d:%d, f:%f, d:%d, f:%f, d:%d, f:%f",13,10
db " s:%s",13,10
db " f:%f, d:%d, f:%f, d:%d, f:%f, d:%d, f:%f, d:%d, f:%f",13,10
db " d:%d, f:%f, d:%d, f:%f, d:%d, f:%f, d:%d, f:%f, d:%d",13,10
db "done.",13,10,0
_string9 db "in the middle of 38 arguments list",0
_args9 dq _format9, 1, 2.2, 3, 4.4, 5, 6.6, 7, 8.8, 9
dq 1.1, 2, 3.3, 4, 5.5, 6, 7.7, 8, 9.9
dq _string9
dq 9.9, 8, 7.7, 6, 5.5, 4, 3.3, 2, 1.1
dq 9, 8.8, 7, 6.6, 5, 4.4, 3, 2.2, 1
_argc9 dq (_argc9-_args9)/8
_args dq _args1, _argc1
dq _args2, _argc2
dq _args3, _argc3
dq _args4, _argc4
dq _args5, _argc5
dq _args6, _argc6
dq _args7, _argc7
dq _args8, _argc8
dq _args9, _argc9
dq 0
section '.idata' import data readable writeable
dd 0,0,0,RVA kernel_name,RVA kernel_table
dd 0,0,0,RVA msvcrt_name,RVA msvcrt_table
dd 0,0,0,0,0
kernel_table:
ExitProcess dq RVA _ExitProcess
dq 0
msvcrt_table:
printf dq RVA _printf
dq 0
kernel_name db 'KERNEL32.DLL',0
msvcrt_name db 'MSVCRT.DLL',0
_ExitProcess dw 0
db 'ExitProcess',0
_printf dw 0
db 'printf',0
Execution results. It works, as you can see. Executed under 64-bit Windows 10.
1> d:1
2> d:1, f:2.200000
3> d:1, d:2, d:3, d:4
4> f:1.100000, f:2.200000, f:3.300000, f:4.400000
5> f:1.100000, d:2, f:3.300000, d:4
6> f:1.100000, f:2.200000, f:3.300000, f:4.400000, f:5.500000, f:6.600000, f:7.700000, f:8.800000, f:9.900000
7> d:1, d:2, d:3, d:4, d:5, d:6, d:7, d:8, d:9
8> d:1, d:2, d:3, d:4, d:5, d:6, d:7, d:8, d:9
f:1.100000, f:2.200000, f:3.300000, f:4.400000, f:5.500000, f:6.600000, f:7.700000, f:8.800000, f:9.900000
9> d:1, f:2.200000, d:3, f:4.400000, d:5, f:6.600000, d:7, f:8.800000, d:9
f:1.100000, d:2, f:3.300000, d:4, f:5.500000, d:6, f:7.700000, d:8, f:9.900000
s:in the middle of 38 arguments list
f:9.900000, d:8, f:7.700000, d:6, f:5.500000, d:4, f:3.300000, d:2, f:1.100000
d:9, f:8.800000, d:7, f:6.600000, d:5, f:4.400000, d:3, f:2.200000, d:1
done.
Any feedback, comments, questions?