flat assembler
Message board for the users of flat assembler.

Index > Main > Logarithm in SSE2 ?

Author
Thread Post new topic Reply to topic
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 19 Jan 2008, 23:42
Hi people,

recently came on some interesting fractals called "Lyapunov fractals". It's a very demanding algortihm as it has got an LOG2 in the iteration loop.

I see that the FPU for that instruction (FYL2X) got a latency of 96 for Core2Duo and there's no instruction in SSE2.

Anybody ever implemented a precise enough (close to equal for double precision floats) numerical calculation for LOG2 that would beat the use of FPU in that case (didn't find anything googling) ? Are at least a link to a decent to implement algorithm ? ...or no way to beat the FPU ?
Post 19 Jan 2008, 23:42
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4182
Location: vpcmpistri
bitRAKE 20 Jan 2008, 16:49
Search for "AMaths.zip" by Intel.
Post 20 Jan 2008, 16:49
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 20 Jan 2008, 17:12
Post 20 Jan 2008, 17:12
View user's profile Send private message Reply with quote
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 20 Jan 2008, 17:22
Thanks for the hint...found it...problem is that it's probably not accurate enough and therefore also only for single precision. There's an article about it:
http://softwarecommunity.intel.com/Wiki/DevelopforCoreprocessor/289.htm
and
http://softwarecommunity.intel.com/articles/eng/1795.htm
...seems to be the others are not for free and no source...so I got to search further I think...
Post 20 Jan 2008, 17:22
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 20 Jan 2008, 17:33
Excuss my ignorance but can't am_log_ss be translated to am_log_sd by doing few modifications?
Post 20 Jan 2008, 17:33
View user's profile Send private message Reply with quote
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 20 Jan 2008, 19:16
...I thought immediately the same, and may be it's really worth a try, but the author of the Lib made same statements about limited accuracy.

This might be okay for non-iterative algorithms, but if you for example run 4000 Iterations (with adding results, as required by this fractal code) you also add up the errors. I'll see...
Post 20 Jan 2008, 19:16
View user's profile Send private message Visit poster's website Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2465
Location: Bucharest, Romania
Borsuc 20 Jan 2008, 20:14
What is the range of the parameter you need? (e.g 0....1)

You can always approximate this range with a polynomial of huge degree (depends on precision required), and evaluate it incrementally thereafter. Taylor series usually is poor for such a thing, you'll need something better like a least-squares approach (you can use integrals to find the coefficients).

Now I know I sound confusing and I am prolly wrong with this too, but maybe this sparks some ideas for you.
Post 20 Jan 2008, 20:14
View user's profile Send private message Reply with quote
Jack



Joined: 16 Feb 2005
Posts: 21
Jack 20 Jan 2008, 21:47
can some one figure out how the function log is evaluated in the the AMaths lib?
I see that it uses a rational polynomial approximation but can't figure it out.
Post 20 Jan 2008, 21:47
View user's profile Send private message Reply with quote
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 20 Jan 2008, 22:00
I didn't look much at the code until now, may be it is similar like some iterations on the python code mentioned on
http://en.wikipedia.org/wiki/Binary_logarithm
Post 20 Jan 2008, 22:00
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4182
Location: vpcmpistri
bitRAKE 20 Jan 2008, 22:49
http://www.math.ucl.ac.be/membres/magnus/num1a/remezjs.htm

Seems to be a calculator for polynomial, but I cannot figure out how to use it. Confused
Post 20 Jan 2008, 22:49
View user's profile Send private message Visit poster's website Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 21 Jan 2008, 10:54
I tested this: http://www.flipcode.com/archives/Fast_log_Function.shtml
and I got this:
Code:
format PE64 GUI                                                     
entry start                                                         
                                                                    
include 'win64a.inc'                                                
                                                                    
section '.code' code readable executable                            
start:                                                              
        ;Works best with FDBG notice! Smile                            
        movapd  xmm0,[initial]                                      
        andpd   xmm0,[exponent]                                     
        orpd    xmm0,[exp_and_one]                                  
        movhlps xmm1,xmm0                                           
        psubd   xmm0,[_1023]                                        
        psrld   xmm0,20                                             
        cvtdq2pd xmm0,xmm0                                          
        movsd   xmm0,xmm1                                           
        mulsd   xmm0,[one_third]                                    
        addsd   xmm0,[two]                                          
        mulsd   xmm0,xmm1                                           
        subsd   xmm0,[two_thirds]                                   
        movhlps xmm1,xmm0                                           
        addsd   xmm0,xmm1                                           
int3                                                                
        invoke  ExitProcess,0                                       
                                                                    
                                                                    
section '.idata' import data readable writeable                     
                                                                    
library kernel,'KERNEL32.DLL'                                       
                                                                    
import kernel,\                                                     
        ExitProcess,'ExitProcess'                                   
                                                                    
align 16                                                            
initial:        dq      8255.9990234,           8255.9990234        
result          dq      13.0112266,             13.0112266          
exponent:       dq      7FF0000000000000h,      0FFFFFFFFFFFFFh     
exp_and_one:    dq      0,                      03FF0000000000000h  
one_third:      dq     -0.333333333333333333333,0                   
two_thirds:     dq      0.666666666666666666666,0                   
two:            dq      2.0,                    0                   
_1023:          dd      3FF00000h, 3FF00000h,   3FF00000h, 3FF00000h    


21.3 clocks (RDTSC million iterations / 1000000)
Post 21 Jan 2008, 10:54
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
Jack



Joined: 16 Feb 2005
Posts: 21
Jack 23 Jan 2008, 02:40


Last edited by Jack on 24 Jan 2008, 23:57; edited 1 time in total
Post 23 Jan 2008, 02:40
View user's profile Send private message Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 23 Jan 2008, 15:03
Your proposed solution is not a better one because:
A) it uses loops that is generally bad (unless loop overhead is relatively small)
B) the FXTRACT instruction alone takes 82 clocks with 170 clock latency (compared to respective values are 53 and 96).

I think this paper was written to achieve better accuracy and that's where it fits.
Right now, as I understood, Kuemmel needs good enough precision with relatively good performance.

Btw, I tried to make your code eat less clocks with simple binary arithmetic on the extraction of exponent and mantissa, but I don't understand the language. Where do all these variables come from? And what's with the dots? Razz
Post 23 Jan 2008, 15:03
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
Jack



Joined: 16 Feb 2005
Posts: 21
Jack 23 Jan 2008, 23:56
<japanese removed>


Last edited by Jack on 24 Jan 2008, 23:58; edited 1 time in total
Post 23 Jan 2008, 23:56
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 24 Jan 2008, 02:25
re: Madis
From the flip code link
Quote:
but the maximum error is below 0.007

Thats not nearly precise enough for Double Precision FP numbers (ie: Kuemmel's purpose), although it is totally inline which is good and fast I think we might need a little looping to tune the precision.
========================================

re: Jack
If you're in New York City do you give someone directions in Japanese? If you want to represent an algorithm but can't do it in ASM (this IS an ASM board) for clarities sake just use simple math/pseudo code instead of PureBASIC that not everyone knows (or wants to know) how to read.
========================================

The Taylor series for natural log ( Ln(x) ) for x=0..1 is
LN(x) = (1-x) - (1-x)^2/2 + (1-x)^3/3 - (1-x)^4/4 ...

This can be easily adapted into SSE. The problem is it takes a while to converge and theirs a fair amount of calculation. Using the parallel properties of SSE and all the XMM registers we can iterate ~4 (probably better to unroll) times in ASM but compute to the 16th N of the series. This would give improved percision at a cost of (I'm estimating here) 65 clocks which would be a 1.5x to 2x speed boost over the FPU.

The method Jack uncovered may also be viable for porting to SSE and have the loop parallelized.

But like most algorithms it really depends on the usage, a 1MB Look Up Table may well be the best method, new CPUs have more and more L2 cache now.
Post 24 Jan 2008, 02:25
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Jack



Joined: 16 Feb 2005
Posts: 21
Jack 24 Jan 2008, 03:01
r22, sorry, won't hapen again.
Post 24 Jan 2008, 03:01
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 25 Jan 2008, 03:44
re: jack
There was no need to remove it (a paper and an ambiguous working implementation is better than a paper and nothing). The intent of my comment (which didn't come accross in text form) was more jocular than serious.
=========================================

A link I found about a LUT version of logarithm. It has some speed/accuracy benchmarks on modern CPUs.

http://www.icsi.berkeley.edu/pubs/techreports/TR-07-002.pdf
Lookup table algorithm for log. (the C code is at the end of the article)
Post 25 Jan 2008, 03:44
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.