flat assembler
Message board for the users of flat assembler.
fasmnewbie 25 Jan 2016, 19:14
Or perhaps I got the timer code wrong? Or maybe it's Win10 speed? Now I am not sure anymore.
l_inc 25 Jan 2016, 23:10
fasmnewbie
I'm not sure what exactly you want from the fstp st0 instruction, but it's not like you should kill an item in the FPU stack before you can use the top of stack. It's like popping a return address from the program stack and then wondering why ret doesn't work. Similarly fstp st0 kills the previously stored payload and unbalances the FPU stack, which leads to FPU underflow exceptions, which are in turn very costly (even if masked). So if you really for whatever unhealthy reason want to compare how fstp st0 performs compared to finit you should avoid unbalancing the stack by first putting something into it. Here's a bit saner measurement:

Code:
```asm
format PE64 GUI 4.0
entry start

include 'win64a.inc'

section '.text' code readable executable

proc MyGetTickCount
        invoke  QueryPerformanceCounter,perfCntrVal
        xor     edx,edx
        mov     eax,dword[perfCntrVal]
        mov     ecx,dword[perfCntrRes]
        div     ecx
        ret
endp

Measure_sine1:
        push    rbp
        mov     rbp,rsp
        sub     rsp,512
        fxsave  [rsp]
        call    MyGetTickCount
        push    rax
        mov     rax,qword 3.4
        push    rax
        mov     ecx,$1000000
    @@:
        finit
        fld     qword[rsp]
        fsin
        fstp    qword[rsp]
        add     ecx,-1
        jnz     @B
        pop     rax
        call    MyGetTickCount
        pop     rcx
        sub     rax,rcx
        fxrstor [rsp]
        leave
        ret

Measure_sine2:
        push    rbp
        mov     rbp,rsp
        sub     rsp,512
        fxsave  [rsp]
        call    MyGetTickCount
        push    rax
        mov     rax,qword 3.4
        push    rax
        mov     ecx,$1000000
    @@:
        fldz
        fstp    st0
        fld     qword[rsp]
        fsin
        fstp    qword[rsp]
        add     ecx,-1
        jnz     @B
        pop     rax
        call    MyGetTickCount
        pop     rcx
        sub     rax,rcx
        fxrstor [rsp]
        leave
        ret

start:
        sub     rsp,5*8
        invoke  QueryPerformanceFrequency,perfCntrRes
        test    rax,rax
        jnz     @F
        invoke  MessageBox,NULL,msgNS,titleErr,MB_OK or MB_ICONERROR
        jmp     .exit
    @@:
        xor     edx,edx
        mov     eax,dword[perfCntrRes]
        mov     ecx,1000000             ;resolution in usec
        div     ecx
        test    eax,eax
        jnz     @F
        invoke  MessageBox,NULL,msgTL,titleErr,MB_OK or MB_ICONERROR
        jmp     .exit
    @@:
        mov     dword[perfCntrRes],eax
        call    Measure_sine1
        mov     rbx,rax
        call    Measure_sine2
        cinvoke wsprintf,strBuf,fmtStr,rbx,rax
        invoke  MessageBox,NULL,strBuf,msgtitle,MB_OK
    .exit:
        invoke  ExitProcess,0

section '.data' data readable writeable

        titleErr        db 'Error',0
        msgNS           db 'High resolution performance counter is not supported',0
        msgTL           db 'High resolution performance counter is of very low resolution',0
        msgtitle        db 'Timings',0
        fmtStr          db 'sine1: %u usec; sine2: %u usec',0
        align 16
        perfCntrRes     dq ?
        perfCntrVal     dq ?
        value           dd ?
        strBuf          rb 800h

section '.idata' data readable
data import
        library kernel32,'kernel32.dll',\
                user32,'user32.dll',\
                advapi32,'advapi32.dll'
        include 'api\kernel32.inc'
        include 'api\user32.inc'
        include 'api\advapi32.inc'
end data
```

Though comparing fldz+fstp to finit by having fsin in the background isn't quite sane by itself.
_________________
Faith is a superposition of knowledge and fallacy
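The underflow described in this post can be observed directly in the x87 status word. What follows is only a minimal sketch, not code from the thread: the label `CheckUnderflow` is hypothetical, and the fragment assumes it is assembled into a 64-bit fasm program such as the listing above. It pops from a deliberately empty FPU stack and returns the two relevant flags.

```asm
; Hypothetical helper (not from the thread): pops from an empty x87 stack
; and returns the IE and SF flags of the status word in EAX.
; Destroys the current FPU state, so call it only where that is acceptable.
CheckUnderflow:
        fninit                  ; empty stack, exceptions masked, status flags cleared
        fstp    st0             ; pop with nothing on the stack: stack underflow
        fnstsw  ax              ; AX = x87 status word
        and     eax,41h         ; keep only IE (bit 0) and SF (bit 6)
        fninit                  ; leave the FPU clean for the caller
        ret
```

With the exceptions masked, a stack underflow should show up as both bits set, so the routine is expected to return 41h. Replacing `fstp st0` with the balanced pair `fldz` / `fstp st0` should leave both bits clear, which is the case Measure_sine2 times.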
fasmnewbie 25 Jan 2016, 23:53
l_inc, thanks for stopping by.
Actually, I was assuming the worst-case scenario where the FPU stack is full. So my idea was that if I were to use just one stack register in a routine, the safer and more economical way would be to save the stack first and then pop (fstp st0) only the top of the stack that I want to use, rather than FINITing the entire stack (which I thought was slower). It has nothing to do with fsin, actually. But anyway, it now gives the opposite result on my AMD machine.
fasmnewbie 26 Jan 2016, 00:54
l_inc, I get what you mean. It was the exception itself that caused the 'extra' time on FSTP when the stack is empty. I think the best and safest way isn't using FINIT either (as your code suggested), but rather using FFREE, which won't produce any costly exception even when the stack is empty. It's fast too.
Thanks.
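One way the FFREE idea might look in practice is sketched below. This is an illustration under assumptions rather than the poster's actual code: the label `SineInPlace` and the use of RCX as a pointer argument are hypothetical. It relies on the fact that a load pushes into the register currently addressed as st7, so freeing st7 first makes room even when all eight registers are in use.

```asm
; Hypothetical routine (not from the thread): replaces the double at [rcx]
; with its sine, even if the x87 stack is full on entry.
; Assumes the caller has saved the FPU state if the old st7 value matters,
; because that value is discarded.
SineInPlace:
        ffree   st7             ; tag st7 as empty; never raises an exception
        fld     qword[rcx]      ; push x into the freed slot, no stack overflow
        fsin                    ; st0 = sin(x)
        fstp    qword[rcx]      ; store the result and pop, depth is back to the entry level
        ret
```

Unlike finit, this touches only one register and leaves the control word and the other seven registers as they were.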
l_inc 26 Jan 2016, 12:50
fasmnewbie
Quote:
"I think the best and safest way isn't using FINIT either (as your code suggested), but rather using FFREE, which won't produce any costly exception even when the stack is empty."
That's a better option. But you'd better check whether you really can't replace fxsave with just fsave. In that case, you'd have both at the same time: the stored FPU state as well as a clean new FPU state.
P.S. OK, I think I get why you picked fxsave: it has a much lower latency.
_________________
Faith is a superposition of knowledge and fallacy
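The fsave suggestion could be sketched roughly as follows. This is a hedged illustration, not code from the thread: the label `UseCleanFpu` is hypothetical, and the 108-byte image size assumes the default 32-bit operand size used in 64-bit mode.

```asm
; Hypothetical wrapper (not from the thread): saves the caller's x87 state,
; runs x87 code on a freshly initialized FPU, then restores the saved state.
UseCleanFpu:
        sub     rsp,112         ; room for the 108-byte fsave image, rounded up to a multiple of 8
        fsave   [rsp]           ; store the x87 state, then reinitialize the FPU as finit would
        ; ... x87 code that may use all eight registers goes here ...
        frstor  [rsp]           ; reload the saved x87 state
        add     rsp,112
        ret
```

Note that fsave covers only the x87/MMX state, whereas fxsave also stores the SSE registers and MXCSR and needs a 512-byte area aligned to 16 bytes, so the two are not interchangeable if SSE state has to survive as well.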
Copyright © 1999-2023, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.