flat assembler
Message board for the users of flat assembler.

Losing lower 127:0 bits on ymm0-5?

Alphonso



Joined: 16 Jan 2007
Posts: 295
Alphonso 23 May 2011, 13:38
From the little documentation MS provides, it seems that only kernel-mode code (drivers) should need to explicitly ask for xstates to be saved. Yet when I run the code below, ymm0-5 seem to get their lower bits zeroed, presumably after a context switch. Would someone please confirm or explain?

W7 SP1 VHP64 i5-2500k
Code:
format PE64 GUI 4.0
include 'win64a.inc'

;-----------------------------------------------
section '.text' code readable executable
;===============================================
        sub     rsp,5*8
        invoke  SetProcessAffinityMask,-1,1
        vmovdqu ymm0,[ymmreg]           ; set ymm registers
        vmovdqu ymm1,[ymmreg]
        vmovdqu ymm2,[ymmreg]
        vmovdqu ymm3,[ymmreg]
        vmovdqu ymm4,[ymmreg]
        vmovdqu ymm5,[ymmreg]
        vmovdqu ymm6,[ymmreg]
        vmovdqu ymm7,[ymmreg]
        vmovdqu ymm8,[ymmreg]
        vmovdqu ymm9,[ymmreg]
        vmovdqu ymm10,[ymmreg]
        vmovdqu ymm11,[ymmreg]
        vmovdqu ymm12,[ymmreg]
        vmovdqu ymm13,[ymmreg]
        vmovdqu ymm14,[ymmreg]
        vmovdqu ymm15,[ymmreg]

 @@:
        vmovdqu [TestYmm0],ymm0        ; save registers to mem
        vmovdqu [TestYmm1],ymm1
        vmovdqu [TestYmm2],ymm2
        vmovdqu [TestYmm3],ymm3
        vmovdqu [TestYmm4],ymm4
        vmovdqu [TestYmm5],ymm5
        vmovdqu [TestYmm6],ymm6
        vmovdqu [TestYmm7],ymm7
        vmovdqu [TestYmm8],ymm8
        vmovdqu [TestYmm9],ymm9
        vmovdqu [TestYmm10],ymm10
        vmovdqu [TestYmm11],ymm11
        vmovdqu [TestYmm12],ymm12
        vmovdqu [TestYmm13],ymm13
        vmovdqu [TestYmm14],ymm14
        vmovdqu [TestYmm15],ymm15

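        xor     rbx,rbx                         ; rbx = register index (0..15)
        mov     rdi,Buff                        ; rdi = current output position in Buff
        xor     rsi,rsi                         ; rsi = offset of the current TestYmm slot
.NextYmm:
        invoke  wsprintf,rdi,wsformat,rbx,qword[rsi+TestYmm0],qword[rsi+TestYmm0+8h],qword[rsi+TestYmm0+10h],qword[rsi+TestYmm0+18h]
        inc     rbx
        add     rdi,formatlen                   ; next row overwrites the prompt text left by this one
        add     rsi,64                          ; each TestYmm slot is 64 bytes
        cmp     bl,16
        jb      .NextYmm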

        invoke  MessageBox,0,Buff,Tit,MB_YESNO
        cmp     eax,IDNO
        jne     @b                                  ; reload ymm0-15 into memory and lose lower bits of ymm0-5 !!!

 exit:

        invoke  ExitProcess,0

;-----------------------------------------------
section '.data' data readable writeable
;===============================================
  ymmreg:       dq 1111222233334444h,5555666677778888h,99990000aaaabbbbh,0ccccddddeeeeffffh
  rept 16 n:0 { TestYmm#n:    rb 64 }
  Tit           db 'YMM Test',0
  wsformat      db 'ymm%2u',9,'%016I64X%016I64X%016I64X%016I64X',10,10
  formatlen=72
                db 10,'Reload YMM values ?',0
  Buff          rb 800h         ; 16 rows of 72 bytes plus the trailing prompt do not fit in 400h

;----------------------------------------------
section '.idata' import data readable writeable
;===============================================

     library kernel32,'KERNEL32.DLL',\
               user32,'USER32.DLL'

             include 'api\kernel32.inc'
             include 'api\user32.inc'     


Last edited by Alphonso on 26 May 2011, 10:55; edited 1 time in total
bitshifter



Joined: 04 Dec 2007
Posts: 796
Location: Massachusetts, USA
bitshifter 23 May 2011, 16:17
Sorry, I don't have a modern Windoze machine to try this on...

Anyway, IIRC wsprintf is cdecl.
invoke is for stdcall, whereas cinvoke is for cdecl.
invoke will not clean up the stack like cinvoke does...
Code:
invoke wsprintf
    

Not very important in this case, but I just thought I would mention it...
Alphonso



Joined: 16 Jan 2007
Posts: 295
Alphonso 23 May 2011, 17:13
I wondered about that. Do you mean cinvoke should still be used for "correctness"?

Disassembled code section of invoke wsprintf...
Code:
 sub     rsp,64                  
 mov     rcx,rdi                 
 mov     rdx,wsformat            
 mov     r8,rbx                  
 mov     r9,[rsi+TestYmm0]       
 mov     rax,[rsi+TestYmm0+8h]   
 mov     [rsp+20H],rax           
 mov     rax,[rsi+TestYmm0+10h]  
 mov     [rsp+28H],rax           
 mov     rax,[rsi+TestYmm0+18h]  
 mov     [rsp+30H],rax           
 call    near [rel imp_wsprintfA]
 add     rsp, 64                     


In this case the code produced is the same whether cinvoke or invoke is used.
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 23 May 2011, 18:02
Alphonso, perhaps wsprintf itself is destroying your registers? I think the calling convention allows some SSE registers to be destroyed, and if I also remember right, writing to an XMM register clears the upper 128 bits of its YMM counterpart (so even if the calling convention disallows destroying SSE registers, using them and restoring them before returning would still clear the YMM registers).

To test this better, perhaps you should try LOOP $ just above @@: with an RCX high enough to give you time to launch something else that also uses the YMM registers (a second instance of this very same program, for instance, but you'll need to initialize ymmreg with random values and then add code to check for differences).
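
The delay-and-check idea above could look roughly like this. It is only a sketch: the loop count and the .corrupted label are made up for illustration, and it assumes the direction flag is clear, as the Windows x64 ABI expects.

```
        mov     rcx,4000000000          ; hypothetical count, tune until the delay is long enough
        loop    $                       ; spin in place: LOOP decrements rcx and jumps to itself until rcx = 0

        vmovdqu [TestYmm0],ymm0         ; dump ymm0 after the delay
        mov     rsi,ymmreg              ; the pattern that was loaded
        mov     rdi,TestYmm0            ; what the register holds now
        mov     ecx,4                   ; 4 qwords = 256 bits
        repe    cmpsq                   ; compare qword by qword, stop at the first mismatch
        jne     .corrupted              ; hypothetical label: ymm0 changed with no API call in between
```

Because no API call sits between the load and the check, a mismatch here would point at context switching rather than at a callee.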
[edit]
http://www.agner.org/optimize/calling_conventions.pdf wrote:
A preliminary ABI published by Intel (see literature p. 53) is supported by operating systems
and compilers. The YMM registers do not have callee-save status, except for the lower half
of YMM6-YMM15 in 64-bit Windows, where XMM6-XMM15 have callee-save status.
Possible future extensions of the vector registers to 512 bits or more will not have callee-save
status.
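
Given that rule, one way to keep a full YMM value across an API call is for the caller to spill it and reload it. A minimal sketch against the program in the first post, where SaveYmm0 is a hypothetical 32-byte buffer that the original does not define:

```
        vmovdqu [SaveYmm0],ymm0         ; caller saves all 256 bits before the call
        invoke  MessageBox,0,Buff,Tit,MB_YESNO
        vmovdqu ymm0,[SaveYmm0]         ; reload afterwards instead of trusting the callee

; in the data section (hypothetical addition):
  SaveYmm0      rb 32
```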


Also, I've read a little about AVX: using an SSE instruction (with SSE registers) won't modify the upper 128 bits of the YMM register, but when using VEX.128 encoding, preserving the upper half is possible.[/edit]
Alphonso



Joined: 16 Jan 2007
Posts: 295
Alphonso 23 May 2011, 18:31
Maybe, I'll look further into it; thanks for the intuitive feedback. It would be nice if it were documented which registers are volatile in that case. It also makes me wonder about 32-bit, where only the first 8 are available.

EDIT: ^^ good find. I wonder how expensive xsave/xrstor is going to be.

Double Edit: Okay, wsprintf seems fine; it must have been the MessageBox. Even adding a Sleep between two blocks of reading ymm corrupts them, but running a long loop with reg/dec and all threads busy seems okay, so the context switching seems okay. Thanks Loco. :)
Enko



Joined: 03 Apr 2007
Posts: 676
Location: Mar del Plata
Enko 23 May 2011, 19:20
wsprintf is stdcall, it's located in the winapi, not the C standard library. cprintf, from msvcrt.dll, is the one that goes with cinvoke.
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 23 May 2011, 20:11
Quote:

wsprintf is stdcall, it's located in the winapi,
Despite its location, it is still cdecl. This probably has to do with the fact that stdcall doesn't support varargs and that the processor doesn't support RETN with a reg/mem operand (which wouldn't actually matter, as counting how many format specifiers were found isn't enough to know how many parameters were really passed).
http://msdn.microsoft.com/en-us/library/ms647550%28VS.85%29.aspx wrote:
Note: It is important to note that wsprintf uses the C calling convention (_cdecl), rather than the standard call (_stdcall) calling convention. As a result, it is the responsibility of the calling process to pop arguments off the stack, and arguments are pushed on the stack from right to left. In C-language modules, the C compiler performs this task.
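
For 32-bit code, where the distinction actually matters, the caller-side cleanup that MSDN describes looks roughly like this when written out by hand. This is a sketch: fmt is a hypothetical format string label, and Buff is borrowed from the program in the first post.

```
        push    eax                     ; last argument first (right to left)
        push    fmt                     ; hypothetical format string, e.g. db '%u',0
        push    Buff                    ; output buffer
        call    [wsprintf]
        add     esp,3*4                 ; cdecl: the caller pops its three arguments
```

This is what cinvoke does for you in 32-bit fasm, while invoke leaves out the final add esp. On 64-bit Windows there is a single calling convention and the caller always owns the stack space, which fits the earlier observation that invoke and cinvoke generate the same code there.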


[edit]
LocoDelAssembly wrote:

Also, I've read a little about AVX: using an SSE instruction (with SSE registers) won't modify the upper 128 bits of the YMM register, but when using VEX.128 encoding, preserving the upper half is possible.[/edit]
I said it wrong: when using VEX.128 encoding it is possible to **CLEAR**, not preserve, the upper half of the YMM counterpart register.[/edit]
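
A short illustration of that difference, as a sketch reusing ymmreg from the first post:

```
        vmovdqu ymm0,[ymmreg]           ; load all 256 bits of ymm0
        movdqa  xmm0,xmm1               ; legacy SSE encoding: bits 255:128 of ymm0 are left as they were
        vmovdqa xmm0,xmm1               ; VEX.128 encoding: bits 255:128 of ymm0 are zeroed
        vzeroupper                      ; zeroes bits 255:128 of all YMM registers at once
```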
