flat assembler
Message board for the users of flat assembler.
Index
> Main > Logarithm in SSE2 ? 
Author 

Kuemmel 19 Jan 2008, 23:42
Hi people,
recently came on some interesting fractals called "Lyapunov fractals". It's a very demanding algortihm as it has got an LOG2 in the iteration loop. I see that the FPU for that instruction (FYL2X) got a latency of 96 for Core2Duo and there's no instruction in SSE2. Anybody ever implemented a precise enough (close to equal for double precision floats) numerical calculation for LOG2 that would beat the use of FPU in that case (didn't find anything googling) ? Are at least a link to a decent to implement algorithm ? ...or no way to beat the FPU ? 

19 Jan 2008, 23:42 

bitRAKE 20 Jan 2008, 16:49
Search for "AMaths.zip" by Intel.


20 Jan 2008, 16:49 

LocoDelAssembly 20 Jan 2008, 17:12


20 Jan 2008, 17:12 

Kuemmel 20 Jan 2008, 17:22
Thanks for the hint...found it...problem is that it's probably not accurate enough and therefore also only for single precision. There's an article about it:
http://softwarecommunity.intel.com/Wiki/DevelopforCoreprocessor/289.htm and http://softwarecommunity.intel.com/articles/eng/1795.htm ...seems to be the others are not for free and no source...so I got to search further I think... 

20 Jan 2008, 17:22 

LocoDelAssembly 20 Jan 2008, 17:33
Excuss my ignorance but can't am_log_ss be translated to am_log_sd by doing few modifications?


20 Jan 2008, 17:33 

Borsuc 20 Jan 2008, 20:14
What is the range of the parameter you need? (e.g 0....1)
You can always approximate this range with a polynomial of huge degree (depends on precision required), and evaluate it incrementally thereafter. Taylor series usually is poor for such a thing, you'll need something better like a leastsquares approach (you can use integrals to find the coefficients). Now I know I sound confusing and I am prolly wrong with this too, but maybe this sparks some ideas for you. 

20 Jan 2008, 20:14 

Jack 20 Jan 2008, 21:47
can some one figure out how the function log is evaluated in the the AMaths lib?
I see that it uses a rational polynomial approximation but can't figure it out. 

20 Jan 2008, 21:47 

Kuemmel 20 Jan 2008, 22:00
I didn't look much at the code until now, may be it is similar like some iterations on the python code mentioned on
http://en.wikipedia.org/wiki/Binary_logarithm 

20 Jan 2008, 22:00 

bitRAKE 20 Jan 2008, 22:49
http://www.math.ucl.ac.be/membres/magnus/num1a/remezjs.htm
Seems to be a calculator for polynomial, but I cannot figure out how to use it. 

20 Jan 2008, 22:49 

Madis731 21 Jan 2008, 10:54
I tested this: http://www.flipcode.com/archives/Fast_log_Function.shtml
and I got this: Code: format PE64 GUI entry start include 'win64a.inc' section '.code' code readable executable start: ;Works best with FDBG notice! movapd xmm0,[initial] andpd xmm0,[exponent] orpd xmm0,[exp_and_one] movhlps xmm1,xmm0 psubd xmm0,[_1023] psrld xmm0,20 cvtdq2pd xmm0,xmm0 movsd xmm0,xmm1 mulsd xmm0,[one_third] addsd xmm0,[two] mulsd xmm0,xmm1 subsd xmm0,[two_thirds] movhlps xmm1,xmm0 addsd xmm0,xmm1 int3 invoke ExitProcess,0 section '.idata' import data readable writeable library kernel,'KERNEL32.DLL' import kernel,\ ExitProcess,'ExitProcess' align 16 initial: dq 8255.9990234, 8255.9990234 result dq 13.0112266, 13.0112266 exponent: dq 7FF0000000000000h, 0FFFFFFFFFFFFFh exp_and_one: dq 0, 03FF0000000000000h one_third: dq 0.333333333333333333333,0 two_thirds: dq 0.666666666666666666666,0 two: dq 2.0, 0 _1023: dd 3FF00000h, 3FF00000h, 3FF00000h, 3FF00000h 21.3 clocks (RDTSC million iterations / 1000000) 

21 Jan 2008, 10:54 

Jack 23 Jan 2008, 02:40
I found this on the web, http://globemeta.ifh.de:8080/lenya/hpc/live/APE/hardware/APEmille/apemilleInstructions/apemilleMath.html
<Japanese removed> Last edited by Jack on 24 Jan 2008, 23:57; edited 1 time in total 

23 Jan 2008, 02:40 

Madis731 23 Jan 2008, 15:03
Your proposed solution is not a better one because:
A) it uses loops that is generally bad (unless loop overhead is relatively small) B) the FXTRACT instruction alone takes 82 clocks with 170 clock latency (compared to respective values are 53 and 96). I think this paper was written to achieve better accuracy and that's where it fits. Right now, as I understood, Kuemmel needs good enough precision with relatively good performance. Btw, I tried to make your code eat less clocks with simple binary arithmetic on the extraction of exponent and mantissa, but I don't understand the language. Where do all these variables come from? And what's with the dots? 

23 Jan 2008, 15:03 

Jack 23 Jan 2008, 23:56
<japanese removed>
Last edited by Jack on 24 Jan 2008, 23:58; edited 1 time in total 

23 Jan 2008, 23:56 

r22 24 Jan 2008, 02:25
re: Madis
From the flip code link Quote: but the maximum error is below 0.007 Thats not nearly precise enough for Double Precision FP numbers (ie: Kuemmel's purpose), although it is totally inline which is good and fast I think we might need a little looping to tune the precision. ======================================== re: Jack If you're in New York City do you give someone directions in Japanese? If you want to represent an algorithm but can't do it in ASM (this IS an ASM board) for clarities sake just use simple math/pseudo code instead of PureBASIC that not everyone knows (or wants to know) how to read. ======================================== The Taylor series for natural log ( Ln(x) ) for x=0..1 is LN(x) = (1x)  (1x)^2/2 + (1x)^3/3  (1x)^4/4 ... This can be easily adapted into SSE. The problem is it takes a while to converge and theirs a fair amount of calculation. Using the parallel properties of SSE and all the XMM registers we can iterate ~4 (probably better to unroll) times in ASM but compute to the 16th N of the series. This would give improved percision at a cost of (I'm estimating here) 65 clocks which would be a 1.5x to 2x speed boost over the FPU. The method Jack uncovered may also be viable for porting to SSE and have the loop parallelized. But like most algorithms it really depends on the usage, a 1MB Look Up Table may well be the best method, new CPUs have more and more L2 cache now. 

24 Jan 2008, 02:25 

Jack 24 Jan 2008, 03:01
r22, sorry, won't hapen again.


24 Jan 2008, 03:01 

r22 25 Jan 2008, 03:44
re: jack
There was no need to remove it (a paper and an ambiguous working implementation is better than a paper and nothing). The intent of my comment (which didn't come accross in text form) was more jocular than serious. ========================================= A link I found about a LUT version of logarithm. It has some speed/accuracy benchmarks on modern CPUs. http://www.icsi.berkeley.edu/pubs/techreports/TR07002.pdf Lookup table algorithm for log. (the C code is at the end of the article) 

25 Jan 2008, 03:44 

< Last Thread  Next Thread > 
Forum Rules:

Copyright © 19992023, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
Website powered by rwasa.