flat assembler
Message board for the users of flat assembler.
SomeoneNew 28 Dec 2007, 19:19
from: http://www.beyond3d.com/content/articles/8/
I was wondering whether anyone has ported this code to FASM yet? It appears to produce faster results than 1.0/sqrt(n)! (And yes, I know the code is originally 15 years old; still, it is useful for some stuff.)
_________________
I'm new, sorry if I bothered you with any stupid question
asmfan 28 Dec 2007, 19:24
Try to port it and compare to FPU & SSE versions if interested.
edfed 28 Dec 2007, 19:32
speaking about SQR...
by shifting right the exponent.. we obtain the square root of a fp number?
DJ Mauretto 28 Dec 2007, 20:46
Code:

    x        DD 2.0
    const05  DD 0.5
    i        DD ?
    const15  DD 1.5
    Return   DD ?

                              ; st                   st1
    fld   [x]                 ; x
    fmul  [const05]           ; 0.5*x
    mov   eax, [x]
    sar   eax, 1
    mov   ecx, 5f3759dfH
    sub   ecx, eax
    mov   [i], ecx            ; i = 0x5f3759df - (i>>1)
                              ; st                   st1
    fld   [i]                 ; i                    0.5*x
    fmul  [i]                 ; i*i                  0.5*x
    fxch  st1                 ; 0.5*x                i*i
    fmul  st,st1              ; 0.5*x*i*i            i*i
    fsubr [const15]           ; 1.5-0.5*x*i*i        i*i
    fmul  [i]                 ; i*(1.5-0.5*x*i*i)    i*i
    fstp  [Return]            ; i*i
    fstp  st                  ; Empty

Last edited by DJ Mauretto on 31 Dec 2007, 14:07; edited 1 time in total
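For comparison with the listing above, here is a minimal C sketch of the same steps: halve x, build the initial guess from the bit pattern with the constant 0x5f3759df, then apply one Newton-Raphson step. The function name `fast_rsqrt` is made up for illustration, and the sketch assumes a positive, finite, normal 32-bit IEEE 754 float.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x); assumes x is a positive, finite, normal float. */
float fast_rsqrt(float x)
{
    float half_x = 0.5f * x;            /* fld [x] / fmul [const05] */
    uint32_t bits;
    float y;

    memcpy(&bits, &x, sizeof bits);     /* mov eax, [x]: reinterpret the float's bits */
    bits = 0x5f3759dfu - (bits >> 1);   /* for positive x the unsigned shift matches sar */
    memcpy(&y, &bits, sizeof y);        /* the listing's [i], read back as a float */

    return y * (1.5f - half_x * y * y); /* one Newton-Raphson step */
}
```

Calling `fast_rsqrt(2.0f)` should give a value close to 0.7071, within a fraction of a percent.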
Xorpd! 29 Dec 2007, 01:43
Paul Hsieh's web page is a better source than the one given above for the piecewise-linear approximation. I don't know why you would bother at this point in time as rsqrtps gets you better than one part in 2048 while the piecewise linear method is only about one part in 33.
FrozenKnight 29 Dec 2007, 11:46
If you need accuracy, just use Newton's method; you can adjust the accuracy-to-performance ratio as needed.
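A small C sketch of that trade-off: start from the bit-trick guess and run a caller-chosen number of Newton-Raphson steps, each one y = y * (1.5 - 0.5 * x * y * y). The name `rsqrt_newton` and the `iterations` parameter are made up for this example; the input is assumed to be a positive, normal float.

```c
#include <stdint.h>
#include <string.h>

/* 1/sqrt(x) from the magic-constant guess plus `iterations` Newton steps. */
float rsqrt_newton(float x, int iterations)
{
    uint32_t bits;
    float y;

    memcpy(&bits, &x, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);       /* initial guess */
    memcpy(&y, &bits, sizeof y);

    for (int n = 0; n < iterations; ++n)
        y = y * (1.5f - 0.5f * x * y * y);  /* each step roughly squares the error */

    return y;
}
```

With `iterations` set to 0 this is the raw guess; each extra step costs three multiplications and one subtraction, so the caller pays only for the accuracy that is actually needed.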
Borsuc 31 Dec 2007, 14:16
SomeoneNew wrote: from: http://www.beyond3d.com/content/articles/8/
Here is a different explanation of the method.
I guess creating a 4th-order polynomial (a least-squares fit to 1/sqrt) would produce better results, and it can also be factored by its roots so that it is computed in 4 multiplications and 4 additions (possibly in parallel). A 5th-order polynomial would be more parallelizable and, of course, more accurate (at the cost of 5 muls and 5 adds).
I have a question though: why does the code accept inputs between 0.5 and 1.0? Why not from 0.0 to 1.0 (I know it blows up to infinity at 0)? Do values under 0.5 work? As for values > 1.0, they can be normalized.
mattst88 01 Jan 2008, 19:44
This page (http://olivermcfadden.livejournal.com/15872.html) has links to some very interesting papers/articles regarding reciprocal square roots.
One of the linked papers even finds a better 'magic number' than the original 0x5f3759df.
Borsuc 02 Jan 2008, 16:14
edfed wrote: by shifting right the exponent.. we obtain the square root of a fp number?
For a full sqrt you'll also need the square root of the mantissa, and doing that part fast is the complicated problem.
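A quick way to see both halves of this exchange is to try the shift on the raw bits. The C sketch below (the name `sqrt_by_shift` is made up; positive, normal 32-bit floats are assumed) shifts the whole bit pattern right by one and adds back half of the exponent bias. It returns exactly 2.0 for 4.0 and 4.0 for 16.0, but 1.5 for 2.0, where the true root is about 1.414. That gap is the mantissa part the shift does not handle.

```c
#include <stdint.h>
#include <string.h>

/* Rough sqrt(x): halve the biased exponent and restore half of the bias. */
float sqrt_by_shift(float x)
{
    uint32_t bits;
    float y;

    memcpy(&bits, &x, sizeof bits);
    bits = (bits >> 1) + 0x1fc00000u;  /* 0x1fc00000 = 127 << 22, half the exponent bias */
    memcpy(&y, &bits, sizeof y);
    return y;
}
```

Treat the result as a starting estimate only; a Newton step or a mantissa correction is still needed for a usable square root.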
Madis731 02 Jan 2008, 17:13
I think...
Code:

    4.4 Floating point XMM instructions
    Instruction  Operands  uops fused domain  uops unfused domain  Latency  Reciprocal throughput
    Math
    p0 p1 p01 p2 p3 p4
    RSQRTPS  xmm,xmm   2 3 3 2
    RSQRTPS  xmm,m128  4 2 2 2
revolution 02 Jan 2008, 17:21
IIRC the AMD optimisation manual has code to compute full-precision 1/SQRT using Newton's method after the first RSQRTPS & RSQRTPD
Copyright © 1999-2023, Tomasz Grysztar. Also on GitHub, YouTube.