flat assembler
Message board for the users of flat assembler.

Why is FINIT much faster than FSTP?

fasmnewbie 25 Jan 2016, 19:05
My original idea was to free only one FPU stack register with FSTP ST0, because the routine uses only one register. But the alternative, which has the same code structure but uses FINIT instead of FSTP, clocks much faster. Why is that?

My platform:
Win10 64-bit
Intel Celeron N2840 2.16GHz (netbook)

The result went against everything I read about FINIT and optimization so far.

Can anybody verify this on their own machine? Here's the entire code so you can test it straight away. Thanks.

Code:
format pe64 console
include 'win64axp.inc'

        cinvoke QPF,freq
        cinvoke QPC,begin      ;start timer
   ;*************************
        ;timing area

        mov rcx,10000          ;loop 10000 times
        mov rax,3.4
        .do:
        call sine
        ;call sine2
        loop .do

   ;*************************
        cinvoke QPC,finish     ;end timer
        push    0d0ah
        mov     rcx,rsp
        cinvoke printf
        add     rsp,8
        mov     rax,[finish]
        mov     rbx,[begin]
        sub     rax,rbx        ;elapsed ticks = finish - begin
        fninit                 ;convert all to double
        fild    [freq]
        fstp    [freq]         
        push    rax
        fild    qword[rsp]     
        fmul    [million]      
        fld     [freq]
        fdivp   st1,st0        
        fdiv    [million]     
        fstp    qword[rsp]
        pop     rdx
        mov     rcx,fmt
        cinvoke printf
        cinvoke getchar
        invoke  exit,0

million dq 1000000.0
begin   dq 0
finish  dq 1
freq    dq 1
fmt     db '%.10fs',0


;------------------------------------
align 8
sine2:
        mov     r15,rsp
        sub     rsp,512
        and     rsp,-16
        fxsave  [rsp]      ;0.00193xx average
        finit              ;clear entire stack
        push    rax
        fld     qword[rsp]
        fsin
        fstp    qword[rsp]
        pop     rax
        fxrstor [rsp]
        mov     rsp,r15
        ret

;------------------------------------
align 8
sine:
        mov     r15,rsp
        sub     rsp,512
        and     rsp,-16
        fxsave  [rsp]      ;0.00276xxxx average???
        fstp    st0        ;clear 1 stack only
        push    rax
        fld     qword[rsp]
        fsin
        fstp    qword[rsp]
        pop     rax
        fxrstor [rsp]
        mov     rsp,r15
        ret


;----------- IMPORT TABLE ------------
data import
     library msvcrt,'msvcrt.dll',\
             kernel32,'kernel32.dll'
     import kernel32,\
            QPC,'QueryPerformanceCounter',\
            QPF,'QueryPerformanceFrequency'
     import msvcrt,\
            exit,'exit',\
            printf,'printf',\
            getchar,'getchar'
end data    
fasmnewbie 25 Jan 2016, 19:14
Or perhaps I got the timer code wrong? Or maybe it's Win10 speed? Now I am not sure anymore.
l_inc 25 Jan 2016, 23:10
fasmnewbie
I'm not sure what exactly you want from the fstp st0 instruction, but it's not like you should kill an item in the FPU stack before you can use the top of stack. It's like popping a return address from the program stack and then wondering why ret doesn't work. Similarly fstp st0 kills the previously stored payload and unbalances the FPU stack, which leads to FPU underflow exceptions, which are in turn very costly (even if masked).
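
To see the underflow described here, one can read the x87 status word right after popping from an empty stack. The following is a minimal sketch and not part of the original thread: the label names and the printf-based output are assumptions made for illustration. It relies only on documented x87 behavior: FNINIT tags all registers empty and masks all exceptions, and a pop from an empty register raises an invalid-operation exception with the stack-fault flag set.

```asm
format PE64 console
entry start

include 'win64a.inc'

section '.text' code readable executable

start:
        sub     rsp,5*8         ; shadow space, keeps RSP 16-byte aligned
        fninit                  ; known state: all eight registers empty, exceptions masked
        fstp    st0             ; pop from an empty stack (the case warned about above)
        fnstsw  ax              ; AX = x87 status word
        and     eax,41h         ; keep IE (bit 0) and SF (bit 6) only
        fnclex                  ; clear the sticky exception flags again
        cinvoke printf,fmt,rax  ; a nonzero value means the pop raised a stack fault
        invoke  ExitProcess,0

section '.data' data readable writeable
fmt     db 'IE/SF bits after fstp st0 on an empty stack: %Xh',13,10,0

section '.idata' import data readable writeable
        library kernel32,'kernel32.dll',\
                msvcrt,'msvcrt.dll'
        import  kernel32,ExitProcess,'ExitProcess'
        import  msvcrt,printf,'printf'
```

Both bits are expected to come back set. How expensive the masked response is may depend on the CPU, which would be consistent with different machines giving different timings.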

So if you really for whatever unhealthy reason want to compare how fstp st0 performs compared to finit you should avoid unbalancing the stack by first putting something into it. Here's a bit saner measurement:

Code:
format PE64 GUI 4.0
entry start

include 'win64a.inc'

section '.text' code readable executable

    proc MyGetTickCount
        invoke QueryPerformanceCounter,perfCntrVal
        xor edx,edx
        mov eax,dword[perfCntrVal]
        mov ecx,dword[perfCntrRes]
        div ecx
    ret
    endp

    Measure_sine1:
        push rbp
        mov rbp,rsp
        sub rsp,512
        fxsave [rsp]
            call MyGetTickCount
            push rax
                mov rax,qword 3.4
                push rax
                mov ecx,$1000000
                @@:
                    finit
                    fld  qword[rsp]
                    fsin
                    fstp qword[rsp]
                add ecx,-1
                jnz @B
                pop rax
            call MyGetTickCount
            pop rcx
            sub rax,rcx
        fxrstor [rsp]
        leave
    ret

    Measure_sine2:
        push rbp
        mov rbp,rsp
        sub rsp,512
        fxsave [rsp]
            call MyGetTickCount
            push rax
                mov rax,qword 3.4
                push rax
                mov ecx,$1000000
                @@:
                    fldz
                    fstp st0
                    fld  qword[rsp]
                    fsin
                    fstp qword[rsp]
                add ecx,-1
                jnz @B
                pop rax
            call MyGetTickCount
            pop rcx
            sub rax,rcx
        fxrstor [rsp]
        leave
    ret

    start:
    sub rsp,5*8
        invoke QueryPerformanceFrequency,perfCntrRes
        test rax,rax
        jnz @F
            invoke MessageBox,NULL,msgNS,titleErr,MB_OK or MB_ICONERROR
            jmp .exit
        @@:

        xor edx,edx
        mov eax,dword[perfCntrRes]
        mov ecx,1000000                 ;resolution in usec
        div ecx

        test eax,eax
        jnz @F
            invoke MessageBox,NULL,msgTL,titleErr,MB_OK or MB_ICONERROR
            jmp .exit
        @@:

        mov dword[perfCntrRes],eax

        call Measure_sine1
        mov rbx,rax
        
        call Measure_sine2

        cinvoke wsprintf,strBuf,fmtStr,rbx,rax
        invoke MessageBox,NULL,strBuf,msgtitle,MB_OK
    .exit:
    invoke ExitProcess,0

section '.data' data readable writeable
    titleErr                db 'Error',0
    msgNS                   db 'High resolution performance counter is not supported',0
    msgTL                   db 'High resolution performance counter is of very low resolution',0
    msgtitle                db 'Timings',0
    fmtStr                  db 'sine1: %u usec; sine2: %u usec',0

    align 16
    perfCntrRes             dq ?
    perfCntrVal             dq ?
    value                   dd ?
    strBuf                  rb 800h

section '.idata' data readable
    data import
        library kernel32,'kernel32.dll',\
                user32,'user32.dll',\
                advapi32,'advapi32.dll'

        include 'api\kernel32.inc'
        include 'api\user32.inc'
        include 'api\advapi32.inc'
    end data    


Though comparing fldz+fstp to finit by having fsin in the background isn't quite sane by itself.

_________________
Faith is a superposition of knowledge and fallacy
fasmnewbie 25 Jan 2016, 23:53
l_inc, thanks for stopping by.

Actually I was assuming the worst-case scenario where the FPU stack is full. So my idea was that if I were to use just one stack register in a routine, the safer and more economical way would be to save the state first and then pop off (fstp st0) only the top register that I want to use, rather than FINITing the entire stack (which I thought was slower). It has nothing to do with fsin, actually.

But anyway, it now gives the opposite result on my AMD machine.
fasmnewbie 26 Jan 2016, 00:54
l_inc, I get what you mean. It was the exception itself that caused the 'extra' time on the FSTP when the stack is empty. I think the best and safer way isn't by using the FINIT either (as your code suggested) but rather use FFREE that won't produce any costly exception even when the stack is empty. It's fast too.
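
A note on which register to free, since it matters for the full-stack case raised earlier: FFREE only changes a register's tag and does not move the top-of-stack pointer, while FLD first decrements the pointer and then writes into the register that used to be ST(7). So the register that has to be empty before a push is ST(7), not ST(0). Below is a minimal sketch of the routine along those lines. The name sine3 is hypothetical, and the calling convention (value passed and returned as a bit pattern in RAX) is carried over from the first post.

```asm
; in:  rax = bit pattern of a double
; out: rax = bit pattern of sin(input)
; Like the original routines, this uses r15 as scratch without saving it.
align 8
sine3:
        mov     r15,rsp
        sub     rsp,512
        and     rsp,-16         ; fxsave needs a 16-byte aligned buffer
        fxsave  [rsp]           ; preserve the caller's x87/SSE state
        ffree   st7             ; tag ST(7) empty so the next push cannot overflow
        push    rax
        fld     qword[rsp]      ; push: the old ST(7) slot becomes the new ST(0)
        fsin
        fstp    qword[rsp]      ; pop the result, leaving the stack depth as it was after ffree
        pop     rax
        fxrstor [rsp]           ; restore the caller's state, including the freed register
        mov     rsp,r15
        ret
```

With a full stack, `ffree st0` followed by `fld` would still overflow, because the push lands on the old ST(7). Freeing ST(7) avoids both the underflow of `fstp st0` on an empty stack and the overflow of a push onto a full one.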

Thanks.
l_inc 26 Jan 2016, 12:50
fasmnewbie
Quote:
I think the best and safer way isn't by using the FINIT either (as your code suggested) but rather use FFREE that won't produce any costly exception even when the stack is empty.

That's a better option. But you'd better check if you really can't replace fxsave with just fsave. In that case you'd get both at once: the stored FPU state as well as a clean new FPU state.

P.S. OK, I think I get why you picked fxsave: it has a much lower latency.
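
For comparison, here is a sketch of the FSAVE variant suggested above. It assumes the routine touches only x87 state, since FSAVE does not store the XMM registers or MXCSR. FSAVE writes a 108-byte image in 64-bit mode and then reinitializes the FPU the way FINIT would, so no separate FFREE or FINIT is needed. The name sine4 is hypothetical.

```asm
; in:  rax = bit pattern of a double
; out: rax = bit pattern of sin(input)
; Like the original routines, this uses r15 as scratch without saving it.
align 8
sine4:
        mov     r15,rsp
        sub     rsp,112         ; FSAVE image is 108 bytes; 112 keeps a multiple of 16
        and     rsp,-16
        fsave   [rsp]           ; store the x87 state, then reset the FPU (empty stack)
        push    rax
        fld     qword[rsp]      ; safe: the stack is empty after fsave
        fsin
        fstp    qword[rsp]
        pop     rax
        frstor  [rsp]           ; bring back the caller's x87 state
        mov     rsp,r15
        ret
```

Whether this ends up faster or slower than the FXSAVE version depends on the latencies mentioned in the P.S., so it would need to be measured on the target machine.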

Copyright © 1999-2023, Tomasz Grysztar. Also on GitHub, YouTube.