flat assembler
Message board for the users of flat assembler.

Index > Examples and Tutorials > fast/approximate special functions

Author
Thread Post new topic Reply to topic
tthsqe



Joined: 20 May 2009
Posts: 721
tthsqe
I think we should have some collection of functions for doing special functions (like exponential, logs, and even square roots and division) that is faster than the built in versions but not as accurate.

in these examples, fxn_N refers to a function that is supposedly accurate to N bits.
a function argument like _PdQ means the rational number P/Q.
a function argument like t_ means the 128 bit version of the 256 bit argument t.
Post 14 Jun 2013, 04:16
View user's profile Send private message Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 721
tthsqe
division
Code:
;a = b / c :
macro vdivpd_36 a,b,c,t,t_,_1 {
   vcvtpd2ps  t_,c
      vrcpps  t_,t_
   vcvtps2pd  t,t_
vfnmadd213pd  c,t,_1
      vmulpd  a,t,b
 vfmadd231pd  c,c,c
 vfmadd231pd  a,a,c
}

;a = b / c :
macro vdivpd_60 a,b,c,t,t_,_1 {
   vcvtpd2ps  t_,c
      vrcpps  t_,t_
   vcvtps2pd  t,t_
vfnmadd213pd  c,t,_1
      vmulpd  a,t,b
     vmovapd  t,_1 
 vfmadd231pd  t,c,c
 vfmadd231pd  c,c,c
      vmulpd  t,t,c
 vfmadd231pd  a,a,t
}                     
Post 14 Jun 2013, 07:03
View user's profile Send private message Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 721
tthsqe
reciprocal
Code:
; a = 1 / c :
macro vrcppd_36 a,a_,c,_1 {
   vcvtpd2ps  a_,c
      vrcpps  a_,a_
   vcvtps2pd  a,a_
vfnmadd213pd  c,a,_1
 vfmadd231pd  c,c,c
 vfmadd231pd  a,a,c
}

;a = 1 / c :
macro vrcppd_60 a,a_,c,t,_1 {
   vcvtpd2ps  a_,c
      vrcpps  a_,a_
   vcvtps2pd  a,a_
     vmovapd  t,_1 
vfnmadd213pd  c,a,_1
 vfmadd231pd  t,c,c
 vfmadd231pd  c,c,c
      vmulpd  t,t,c
 vfmadd231pd  a,a,t
}                
Post 14 Jun 2013, 07:05
View user's profile Send private message Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 721
tthsqe
division in Newton gravity calculation
Code:
; a = b / Sqrt[c] :
macro vdivsqrtpd_36 a,b,c,t,t_,_1,_1d2,_3d8 {
   vcvtpd2ps  t_,c
    vrsqrtps  t_,t_
   vcvtps2pd  t,t_
      vmulpd  a,t,b
      vmulpd  t,t,t
vfnmadd213pd  c,t,_1
     vmovapd  t,_3d8
 vfmadd231pd  t,c,_1d2
      vmulpd  c,c,a
 vfmadd231pd  a,t,c
}

; a = b / Sqrt[c] :
macro vdivsqrtpd_48 a,b,c,t,t_,_1,_1d2,_3d8,_5d16 {
   vcvtpd2ps  t,c
    vrsqrtps  t_,t_
   vcvtps2pd  t,t_
      vmulpd  a,t,b
      vmulpd  t,t,t
vfnmadd213pd  c,t,_1
     vmovapd  t,_5d16
 vfmadd231pd  t,c,_3d8
 vfmadd231pd  t,c,_1d2
      vmulpd  c,c,a
 vfmadd231pd  a,t,c
}


; a = b / Sqrt[c] :
macro vdivsqrtpd_60 a,b,c,t,t_,_1,_1d2,_3d8,_5d16,_35d128 {
   vcvtpd2ps  t_,c
    vrsqrtps  t_,t_
   vcvtps2pd  t,t_
      vmulpd  a,t,b
      vmulpd  t,t,t
vfnmadd213pd  c,t,_1
     vmovapd  t,_35d128
 vfmadd213pd  t,c,_5d16
 vfmadd213pd  t,c,_3d8
 vfmadd213pd  t,c,_1d2
      vmulpd  c,c,a
 vfmadd231pd  a,t,c
}              
Post 14 Jun 2013, 07:07
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16904
Location: In your JS exploiting you and your system
revolution
Please show examples of how to use these macros.

[edit to add] Also, perhaps you should mention the minimum required CPU to execute these instructions.
Post 14 Jun 2013, 09:59
View user's profile Send private message Visit poster's website Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 721
tthsqe
Ok, here is my testing apparatus. Since all of those fmadd's seemed to make you madd, I stuck to plain SSE. However, the speed up is about double with 256bit AVX vectors. The sin and cos is surprisingly accurate.

Code:
format PE64 GUI 5.0
entry Start

include 'win64a.inc'

; these macros perform various operations a = f(b,c)
; they all need extra temporary registers t,u,v,s...
;  and a few rational constants _PdQ ( = vector of P/Q) that can reside in memory
; a, b, c, and t should all be different registers
;
; the macros have names fxn_N where N/12 is the order of the approximation
; N is the THEORETICAL number of correct bits in the result
; for example, divpd_60 uses a quintic algorithm since rccps gives 12 correct bits

; a = b / Sqrt[c] :  10x block throughput: 8 cycles
macro  divsqrtpd_36 a,b,c,t {
    cvtpd2ps  t,c
     rsqrtps  t,t
    cvtps2pd  a,t
      movaps  t,a
       mulpd  a,b
       mulpd  t,t
       mulpd  t,c
      movaps  c,dqword[const_pd_1]
       subpd  c,t
      movaps  t,c
       mulpd  c,dqword[const_pd_3d8]
       addpd  c,dqword[const_pd_1d2]
       mulpd  t,a
       mulpd  c,t
       addpd  a,c
}

; a = b / Sqrt[c] :  10x block throughput: 9 cycles
macro  divsqrtpd_48 a,b,c,t {
    cvtpd2ps  t,c
     rsqrtps  t,t
    cvtps2pd  a,t
      movaps  t,a
       mulpd  a,b
       mulpd  t,t
       mulpd  t,c
      movaps  c,dqword[const_pd_1]
       subpd  c,t
      movaps  t,c
       mulpd  c,dqword[const_pd_5d16]
       addpd  c,dqword[const_pd_3d8]
       mulpd  c,t
       addpd  c,dqword[const_pd_1d2]
       mulpd  t,a
       mulpd  c,t
       addpd  a,c
}

; a = b / Sqrt[c] :  10x block throughput: 10 cycles
macro  divsqrtpd_60 a,b,c,t,_1,_1d2,_3d8,_5d16,_35d128 {
    cvtpd2ps  t,c
     rsqrtps  t,t
    cvtps2pd  a,t
      movaps  t,a
       mulpd  a,b
       mulpd  t,t
       mulpd  t,c
      movaps  c,dqword[const_pd_1]
       subpd  c,t
      movaps  t,c
       mulpd  c,dqword[const_pd_35d128]
       addpd  c,dqword[const_pd_5d16]
       mulpd  c,t
       addpd  c,dqword[const_pd_3d8]
       mulpd  c,t
       addpd  c,dqword[const_pd_1d2]
       mulpd  t,a
       mulpd  c,t
       addpd  a,c
}

;a = 1 / c :  10x block throughput: 5 cycles
macro  rcppd_36 a,c,t {
    cvtpd2ps  t,c
       rcpps  t,t
    cvtps2pd  t,t
      movaps  a,t
       mulpd  t,c
      movaps  c,dqword[const_pd_1]
       subpd  c,t
      movaps  t,c
       mulpd  c,c
       addpd  c,t
       mulpd  c,a
       addpd  a,c
}

;a = 1 / c :  10x block throughput: 6 cycles
macro  rcppd_60 a,c,t {
    cvtpd2ps  t,c
       rcpps  t,t
    cvtps2pd  t,t
      movaps  a,t
       mulpd  t,c
      movaps  c,dqword[const_pd_1]
       subpd  c,t
      movaps  t,c
       mulpd  t,c
       addpd  c,t
       addpd  t,dqword[const_pd_1]
       mulpd  t,c
       mulpd  t,a
       addpd  a,t
}

;a = b / c :  10x block throughput: 6 cycles
macro  divpd_36 a,b,c,t {
    cvtpd2ps  t,c
       rcpps  t,t
    cvtps2pd  a,t
      movaps  t,a
       mulpd  a,b
       mulpd  t,c
      movaps  c,dqword[const_pd_1]
       subpd  c,t
      movaps  t,c
       mulpd  c,c
       addpd  c,t
       mulpd  c,a
       addpd  a,c
}

;a = b / c :  10x block throughput: 7 cycles
macro  divpd_60 a,b,c,t {
    cvtpd2ps  t,c
       rcpps  t,t
    cvtps2pd  a,t
      movaps  t,a
       mulpd  a,b
       mulpd  t,c
      movaps  c,dqword[const_pd_1]
       subpd  c,t
      movaps  t,c
       mulpd  t,c
       addpd  c,t
       addpd  t,dqword[const_pd_1]
       mulpd  t,c
       mulpd  t,a
       addpd  a,t
}

; {a,b} = {Sin[c],Cos[c]}
macro sincospd_50 a,b,c,t,s {        ; two temporaries
                mulpd  c,dqword[const_pd_c0]
             cvtpd2dq  s,c
             cvtdq2pd  t,s
                subpd  c,t
               movaps  t,dqword[const_pd_1d16]
                mulpd  t,c
                mulpd  t,t

                 pand  s,dqword[const_dq_3]
             cvtdq2pd  s,s

               movaps  a,dqword[const_pd_s4]
                mulpd  a,t
                addpd  a,dqword[const_pd_s3]
                mulpd  a,t
                addpd  a,dqword[const_pd_s2]
                mulpd  a,t
                addpd  a,dqword[const_pd_s1]
                mulpd  a,t
                addpd  a,dqword[const_pd_s0]

               movaps  b,dqword[const_pd_c4]
                mulpd  b,t
                addpd  b,dqword[const_pd_c3]
                mulpd  b,t
                addpd  b,dqword[const_pd_c2]
                mulpd  b,t
                addpd  b,dqword[const_pd_c1]
                mulpd  b,t
    repeat 4
               movaps  t,a
                mulpd  t,b
                subpd  a,t
               movapd  t,dqword[const_pd_m2]
                mulpd  t,b
                addpd  t,dqword[const_pd_4]
                mulpd  b,t
    end repeat
               movaps  t,s
                subpd  t,dqword[const_pd_1]
                addpd  s,dqword[const_pd_m2]
                andpd  t,dqword[const_pd_abs]
                andpd  s,dqword[const_pd_abs]
                subpd  t,dqword[const_pd_1]
                subpd  s,dqword[const_pd_1]

                mulpd  a,c
                subpd  b,dqword[const_pd_1]
;a = a*s+b*t
;b   a*t-b*s
               movaps  c,a
                mulpd  c,t
                mulpd  t,b
                mulpd  a,s
                mulpd  s,b
                addpd  a,t
               movaps  b,c
                subpd  b,s
}




macro START_MARKER
{
 mov ebx,111
 db 0x64,0x67,0x90
}

macro END_MARKER
{
 mov ebx,222
 db 0x64,0x67,0x90
}






section '.text' code readable executable


Start:                 push  rbp


LOOPS = 20000000

                        mov  ecx,LOOPS/100
.warmup_loop:        movaps  xmm2,dqword[c]
                sincospd_50  xmm0,xmm1,xmm2,xmm3,xmm4
                      ; now {xmm0,xmm1} = {Sin[xmm2],Cos[xmm2]}
                      ; xmm2,xmm3,xmm4 are garbage
                        sub  ecx,1
                        jnz  .warmup_loop


                     invoke  QueryPerformanceCounter,Time1


                        mov  ecx,LOOPS
                      align  64
.approx_loop:
        repeat 4
                     movaps  xmm1,dqword[b]
                     movaps  xmm2,dqword[c]
               divsqrtpd_48  xmm0,xmm1,xmm2,xmm3
                     movaps  dqword[a],xmm0
        end repeat
                        sub  ecx,1
                        jnz  .approx_loop


                     invoke  QueryPerformanceCounter,Time2


                        mov  ecx,LOOPS
                      align  64
.exact_loop:
        repeat 4
                     movapd  xmm0,dqword[b]
                     sqrtpd  xmm2,dqword[c]
                      divpd  xmm0,xmm2
                     movapd  dqword[A],xmm0
        end repeat
                        sub  ecx,1
                        jnz  .exact_loop



                     invoke  QueryPerformanceCounter,Time3


                        lea  rdi,[Message]
                        mov  rax,'speed up'
                      stosq
                        mov  rax,'factor: '
                      stosq

                       fild  dword[thousand]
                       fild  qword[Time1]
                       fild  qword[Time3]
                       fild  qword[Time2]
                       fsub  st2,st0
                     fsubrp  st1,st0
                     fdivrp  st1,st0
                      fmulp  st1,st0
                      fistp  dword[rsp-8]
                        mov  eax,dword[rsp-8]
                        xor  edx,edx
                        div  dword[thousand]
                       push  rdx
                       call  PrintInteger
                        mov  al,'.'
                      stosb

                        pop  rax
                        xor  edx,edx
                        mov  ecx,100
                        div  ecx
                       push  rdx
                       call  PrintInteger

                        pop  rax
                        xor  edx,edx
                        mov  ecx,10
                        div  ecx
                       push  rdx
                       call  PrintInteger

                        pop  rax
                       call  PrintInteger

                        mov  al,0x0d
                      stosb



irps i, 0 1     {
                        fld  qword[A+8*i]
                       call  PrintBinaryFloat
                        mov  al,0xd
                      stosb
                        fld  qword[a+8*i]
                       call  PrintBinaryFloat
                        mov  al,0xd
                      stosb
                        mov  eax,'----'
                      stosd
                        mov  al,0xd
                      stosb
}

                        mov  al,0
                      stosb

                     invoke  MessageBoxA,0,Message,Caption,MB_OK
                     invoke  ExitProcess,0






PrintBinaryFloat:
                       fstp  tword[rsp-16+4]
                        mov  ax,word[rsp-16+12]
                        mov  byte[rdi],'+'
                         bt  eax,15
                        jnc  @f
                        mov  byte[rdi],'-'
                 @@:    add  rdi,1
                        and  ax,0x7FFF
                         jz  .Zero
                        cmp  ax,0x7FFF
                         je  .Infinity

                        mov  rdx,qword[rsp-16+4]
                        mov  al,'0'
                        shl  rdx,1
                        adc  eax,0
                      stosb

                        mov  al,'.'
                      stosb

                        mov  ecx,52
                @@:     mov  al,'0'
                        shl  rdx,1
                        adc  eax,0
                      stosb
                        sub  ecx,1
                        jnz  @b

                        mov  eax,'*2^'
                      stosd
                        sub  rdi,1
                      movzx  eax,word[rsp-16+12]
                        and  eax,0x07FFF
                        sub  rax,16383
                       call  PrintInteger
                        jmp  .ret


        .Zero:          mov  al,'0'
                      stosb
                        jmp  .ret

        .Infinity:
                        mov  rax,[rsp-16+4]
                       test  rax,rax
                        jns  .Nan
                        mov  eax,'Inf '
                      stosd
                        jmp  .ret

        .Nan:
                        mov  eax,'NaN '
                      stosd
                       fstp  st0
                      ;  jmp  .ret

.ret:
                        ret


PrintInteger:       ; rax: number
                       push  rbp
                       push  rdx
                        mov  rbp,rsp
                       test  rax,rax
                        jns  .l1
                        mov  byte[rdi],'-'
                        add  rdi,1
                        neg  rax
                .l1:    xor  edx,edx
                        div  [OutputBase]
                       push  rdx
                       test  rax,rax
                        jnz  .l1
                .l2:    pop  rax
                        cmp  al,9
                        jbe  @f
                        add  al,'a'-10-'0'
                 @@:    add  al,'0'
                      stosb
                        cmp  rsp,rbp
                         jb  .l2
                        pop  rdx
                        pop  rbp
                        ret


section '.data' data readable writeable

align 16
; results
A: dq ?,?
a  dq ?,?
; operands
b:  dq 3.89,7.98
c:  dq 5.17,6.17
; constants
const_pd_m2     dq -2.0,-2.0
const_pd_4      dq 4.0,4.0
const_pd_1:     dq 1.0,1.0
const_pd_1d2    dq 0.5,0.5
const_pd_3d8    dq 0.375,0.375
const_pd_5d16   dq 0.3125,0.3125
const_pd_1d16   dq 0.0625,0.0625
const_pd_35d128 dq 0.2734375, 0.2734375
const_pd_c0     dq 0.63661977236758134308,0.63661977236758134308
const_pd_c1     dq 1.2337005501361698274,1.2337005501361698274
const_pd_c2     dq -0.25366950790104801364,-0.25366950790104801364
const_pd_c3     dq 0.020863480763352960873,0.020863480763352960873
const_pd_c4     dq -0.00091926027483942658024,-0.00091926027483942658024
const_pd_s0     dq 1.5707963267948966192,1.5707963267948966192
const_pd_s1     dq -0.64596409750624625366,-0.64596409750624625366
const_pd_s2     dq 0.079692626246167045121,0.079692626246167045121
const_pd_s3     dq -0.0046817541353186881007,-0.0046817541353186881007
const_pd_s4     dq 0.00016044118478735982187,0.00016044118478735982187
const_pd_abs    dq 0x7FFFFFFFFFFFFFFF,0x7FFFFFFFFFFFFFFF
const_dq_3      dd 3,3,3,3

align 8
  OutputBase dq 10
  Time1      dq ?
  Time2      dq ?
  Time3      dq ?

align 4
  thousand   dd 1000


align 1
  Caption db 'test',0
  Message rb 1024



section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL',\
          user32,'USER32.DLL',\
          gdi32,'GDI32.DLL',\
          msvcrt,'MSVCRT.DLL'

include 'api\kernel32.inc'
include 'api\user32.inc'
include 'api\gdi32.inc'


 import  msvcrt,\
            sprintf,'sprintf'    
Post 15 Jun 2013, 05:54
View user's profile Send private message Reply with quote
DOS386



Joined: 08 Dec 2006
Posts: 1903
DOS386
Quote:
you should mention the minimum required CPU to execute these instructions.


IIRC I coded an 80386-compatible (no FPU) square root function and a 8080-and-6502-compatible (no DIV instruction) decimal output routine some time ago. You know, programming is art Smile

PS: buying a new PC every week is not art

_________________
Bug Nr.: 12345

Title: Hello World program compiles to 100 KB !!!

Status: Closed: NOT a Bug
Post 15 Jun 2013, 09:48
View user's profile Send private message Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 721
tthsqe
The question is not if it can be done (even without an FPU), but rather if we can get close to the desired accuracy much FASTER than the built-in functions.
I also do not have a haswell cpu - I was just illustrating how the haswell new instructions can simplify things tremendously. Whether or not your solution uses fancy instructions, as long as it is elegant I think it is good art. The solutions in my most recent point will run on the early AMD 64 bit architecture.

PS: would you mind posting your no-FPU no-DIV square root solution? Does it work on single, double, or extended precision? Or is it for integer square roots?
Post 15 Jun 2013, 15:24
View user's profile Send private message Reply with quote
DOS386



Joined: 08 Dec 2006
Posts: 1903
DOS386
tthsqe wrote:
posting your no-FPU no-DIV square root solution?

Or is it for integer square roots?


32-bit integers (and it has DIV, but the 64-bit decimal output is DIV-free).

I also coded a 66-Byte's DIV-free-8080-compatible SINE function:
http://www.freebasic.net/forum/viewtopic.php?t=12490
Post 16 Jun 2013, 02:27
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2019, Tomasz Grysztar.

Powered by rwasa.