flat assembler
Message board for the users of flat assembler.

 Index > Main > Code optimization (AKA C vs. Asm) Goto page Previous  1, 2, 3, 4, 5  Next
Author
bitRAKE

Joined: 21 Jul 2003
Posts: 3959
Location: vpcmipstrm
bitRAKE 30 Dec 2009, 15:13
tthsqe wrote:
ritRAKE,
I'm not sure what to make of your post. The equations you give follow directly from the fact that {{1,0},{1,1}}^n is a matrix of f(n-1) f(n) and f(n+1) (I think the anti diagonal elements are both f(n)). Since n was is power of 2, everything I've said points to the conjecture that mathematica is using exactly the equations you have written down.
Itterate over bits of (n), from msb to lsb, to select a path. Don't need to directly calculate fib(2n). So, only squaring and add/sub are needed. Sorry, I misunderstood your question.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
30 Dec 2009, 15:13
Borsuc

Joined: 29 Dec 2005
Posts: 2465
Location: Bucharest, Romania
Borsuc 30 Dec 2009, 17:23
tthsqe wrote:
ritRAKE,
I'm not sure what to make of your post. The equations you give follow directly from the fact that {{1,0},{1,1}}^n is a matrix of f(n-1) f(n) and f(n+1) (I think the anti diagonal elements are both f(n)). Since n was is power of 2, everything I've said points to the conjecture that mathematica is using exactly the equations you have written down.
Except that his post is clear to mere mortals like me instead of math gods.

@f0dder: how is it difficult to implement it in that folder-scanning example? You just replace the call with a jmp but still push/pop the folder handles on the stack.

(of course you could use an array instead of the stack if you wanted but that's not necessarily).

(also replace the ret with a corresponding jmp obviously)

_________________
Previously known as The_Grey_Beast
30 Dec 2009, 17:23
tthsqe

Joined: 20 May 2009
Posts: 767
tthsqe 31 Dec 2009, 00:49
Ah, so you were talking about only keeping track of TWO values. I gave that a shot, but the results were about 50% slower than the built in function:

a = 0; b = 1;
Accumulate[
Table[Timing[c = (b - a)^2; {a, b} = {a^2 + c, b^2 + c};][[1]],
{i, 1, 30}]][[18 ;; 30]]

{0.015, 0.047, 0.078, 0.171, 0.374, 0.78, 1.841, 4.118, 8.798, 18.361, 38.719, 81.713, 178.371}

The built in function probably uses a host of other optimizations as well as the one you mentioned. (The timing must be accumulated because although mathematica caches computed results, this seems not affect the calculation of further results)
31 Dec 2009, 00:49
bitRAKE

Joined: 21 Jul 2003
Posts: 3959
Location: vpcmipstrm
bitRAKE 31 Dec 2009, 04:27
I'd rather code it in x86 and compare.

Mathematica is awesome, but I no longer have a license.
31 Dec 2009, 04:27
tthsqe

Joined: 20 May 2009
Posts: 767
tthsqe 31 Dec 2009, 06:05
Mathematica is awesome, but the new version 7 has some kind debugger deflector, so you can' see what's going on in ollydbg.
Quote:

I'd rather code it in x86 and compare.

Please do - I would not be uninterested in the results. But, you are going to have to code a good multiprecision multiplier (f(2^30) is about 10 million qwords), and I hear that mathematica has adopted the GMP library just for this reason.
31 Dec 2009, 06:05
bitRAKE

Joined: 21 Jul 2003
Posts: 3959
Location: vpcmipstrm
bitRAKE 31 Dec 2009, 15:15
Don't you mean a good specialized squaring routined? Mathematica has adopted much open source on the back-end from what I've read.
31 Dec 2009, 15:15
Borsuc

Joined: 29 Dec 2005
Posts: 2465
Location: Bucharest, Romania
Borsuc 31 Dec 2009, 15:32
Can't you use floating point though?

bitRAKE wrote:
Mathematica is awesome
Mathematica is bloated.

For computations (not symbolic) use PARI. Light and very fast.

_________________
Previously known as The_Grey_Beast
31 Dec 2009, 15:32
bitRAKE

Joined: 21 Jul 2003
Posts: 3959
Location: vpcmipstrm
bitRAKE 01 Jan 2010, 02:59
True, from a certain perspective Mathematica is bloated, but we cannot ignore the wider audience and scope of utility covered by Mathematica - from the initial design Wolfram has had a desire to bring mathematics to a wider audience as well as easing the digital use of existing mathematics in education. I have much respect for this - both the man and his work.
01 Jan 2010, 02:59
tthsqe

Joined: 20 May 2009
Posts: 767
tthsqe 01 Jan 2010, 04:14
borsuc,
what do you have in mind for a floating point approach?
01 Jan 2010, 04:14
bitRAKE

Joined: 21 Jul 2003
Posts: 3959
Location: vpcmipstrm
bitRAKE 01 Jan 2010, 20:48
tthsqe wrote:
f(2^30) is about 10 million qwords
What is your memory bus? It will give me a better idea of how much data I need to plow through in 100 seconds - I'll ignore caching for now. (My system can read/write 4GB/s to main memory - slow compared to modern buses.)

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
01 Jan 2010, 20:48
tthsqe

Joined: 20 May 2009
Posts: 767
tthsqe 02 Jan 2010, 04:18
I'm not sure how to find that out. Here is the cpuz report, though.

02 Jan 2010, 04:18
Alphonso

Joined: 16 Jan 2007
Posts: 295
Alphonso 02 Jan 2010, 04:46
What is this code really about?
Code:
```f:     ;argument passed in eax
cmp eax,1
jbe .1
push ebx
lea ebx,[eax-2]
dec eax
call f
xchg eax,ebx
call f
pop ebx
.1  ret       ```
Since this code can only work up to N=47 then why not use a LUT? Then it would be millions of times quicker than the above. Don't take this the wrong way but TBH I find it hard to believe the above code could beat anything except maybe Loco's code. Keep in mind I'm just a layman so if you could explain while leaving out the iterative, recursion and Binet's etc it would help as I feel I'm not understanding something / everything here.

It also seems the thread has turned into a 'fibber what's his name' optimization. Just out of curiosity how long does it take mathematica to produce an answer for say fib(1,000,000)?

Happy new year to everyone, even though I'm a day late.
02 Jan 2010, 04:46
LocoDelAssembly

Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 02 Jan 2010, 05:53
Alphonso, it computes this:
Code:
```f(0) = 1
f(1) = 1
f(n) = (f(n-1) + f(n-2))(mod 2^32)    ```

The LUT is of course the way to go for computing the real fib inside the bounds of not-astronomically-big unsigned integers but the idea was not to use precomputed values (note that in any moment the discussion was directed to real-life use except for the arbitrary precision).

About the direction of this thread yeah, I was thinking about merging part of it into the other thread that was specific of fib.
02 Jan 2010, 05:53
bitRAKE

Joined: 21 Jul 2003
Posts: 3959
Location: vpcmipstrm
bitRAKE 02 Jan 2010, 06:10
tthsqe wrote:
I'm not sure how to find that out. Here is the cpuz report, though.
Here is an interesting memory speed tester. My old Pentium M laptop does 1.5GB/s - yours might be better than twice that with two cores?

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
02 Jan 2010, 06:10
Alphonso

Joined: 16 Jan 2007
Posts: 295
Alphonso 02 Jan 2010, 07:23
LocoDelAssembly wrote:
The idea was not to use precomputed values
Ok
Code:
```f:   cmp   eax,1
jbe   .1
mov   ecx,eax
xor   eax,eax
mov   ebx,eax
inc   ebx
@@:
push  eax
pop   ebx
dec   ecx
jnz   @b
.1:  ret    ```
This should do the same, limited to N=47, but take nano seconds instead of 'time to go for a coffee' time. Perhaps it's because the original code is limited to 47 that I'm confused whereas your really looking for up to 2^32-1, or maybe not. Probably best if I just keep quiet and let you guys do your stuff.
02 Jan 2010, 07:23
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20176
revolution 02 Jan 2010, 07:52
Alphonso: You may have missed the discussion about sticking with the (inefficient) recursive algorithm for the purposes of testing. So your standard linear approach above, while certainly much faster, was not the intended purpose of all the past posts in here.

Although, I am at a loss to understand what the recursive test is supposed to prove or solve?
02 Jan 2010, 07:52
LocoDelAssembly

Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 02 Jan 2010, 08:11
It started from here when tthsqe was talking about the inefficient Assembly code his compilers generated and that his own was the best. Then I've shown my code which was better on my AMD, then there was also some discussion about using iteration with an stack, and then... Well, you'll have to read
02 Jan 2010, 08:11
tthsqe

Joined: 20 May 2009
Posts: 767
tthsqe 02 Jan 2010, 08:41
Alphonso,
I guess the point was the obvious fact that it the really put the effort in you can gain some improvements over some compilers - they just don't know where to put the variables in the case of that c code.
mathematica timings for various values:
f(1000^2): 0.05
f(1024^2): 0.05
f(1000^3): 106
f(1024^3): 112

bitRAKE,
how did you get that I have two cores? only one 64 bit core trapped in a 32 bit OS ....
anyways, the result for three runs of the program was 1.83, 2.08, 1.84 GB/s
02 Jan 2010, 08:41
Alphonso

Joined: 16 Jan 2007
Posts: 295
Alphonso 02 Jan 2010, 14:12
Thanks for those numbers tthsqe. I tried going up to f(1 million) using the add method and three streams but it took about 2 minutes, so pretty slow really. Thanks for the tip on XADD but it seems to run 25% slower than a push-add-pop on my laptop.

Code:
```;;Test XADD vs Push-Add-Pop

format PE GUI 4.0
include 'win32a.inc'

TestTime=0x10000000

;========================================
;----------------------------------------;Loop Time Check
invoke  GetTickCount
mov     [TStart],eax
xor     eax,eax
mov     ebx,1

align 16
mov     ecx,TestTime
@@:
dec     ecx
jnz     @b

invoke  GetTickCount
sub     eax,[TStart]
mov     [LoopTime],eax
;----------------------------------------;PushPop Time Check
invoke  GetTickCount
mov     [TStart],eax
xor     eax,eax
mov     ebx,1
align 16
mov     ecx,TestTime
@@:
push    eax
pop     ebx
dec     ecx
jnz     @b

invoke  GetTickCount
sub     eax,[TStart]
mov     [PushPopTime],eax
invoke  GetTickCount
mov     [TStart],eax
xor     eax,eax
mov     ebx,1
align 16
mov     ecx,TestTime
@@:
dec     ecx
jnz     @b

invoke  GetTickCount
sub     eax,[TStart]
;----------------------------------------;Results
mov     eax,[LoopTime]
sub     [PushPopTime],eax
invoke  MessageBox,0,Buff,Tit,1 MB_RIGHT or MB_ICONINFORMATION
invoke  ExitProcess,0
;========================================
TStart        dd ?
LoopTime      dd ?
PushPopTime   dd ?
wsformat      db 'Loop Time : %04ums',10,'Push Pop Time : %04ums',10,'Xadd Time : %04ums',0
Buff          rb 128
;========================================
section '.idata' import data readable writeable
library kernel,'KERNEL32.DLL',\
user,'USER32.DLL'
import  kernel,\
GetTickCount,'GetTickCount',\
ExitProcess,'ExitProcess'
import  user,\
MessageBox,'MessageBoxA',\
wsprintf,'wsprintfA'
;========================================
```
02 Jan 2010, 14:12
Borsuc

Joined: 29 Dec 2005
Posts: 2465
Location: Bucharest, Romania
Borsuc 02 Jan 2010, 19:39
My initial suggestion with float was because of the much higher dynamic range, unfortunately addition doesn't do very well in floats and since fibonacci sequence is additions the precision loss might be huge. It may not have been such a good idea.
02 Jan 2010, 19:39
 Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First

 Jump to: Select a forum Official----------------AssemblyPeripheria General----------------MainTutorials and ExamplesDOSWindowsLinuxUnixMenuetOS Specific----------------MacroinstructionsOS ConstructionIDE DevelopmentProjects and IdeasNon-x86 architecturesHigh Level LanguagesProgramming Language DesignCompiler Internals Other----------------FeedbackHeapTest Area
Goto page Previous  1, 2, 3, 4, 5  Next

Forum Rules:
 You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot vote in polls in this forumYou cannot attach files in this forumYou can download files in this forum