flat assembler
Message board for the users of flat assembler.

Index > Main > Code optimization (AKA C vs. Asm)

Goto page Previous  1, 2, 3, 4, 5  Next
Author
Thread Post new topic Reply to topic
bitRAKE



Joined: 21 Jul 2003
Posts: 2913
Location: [RSP+8*5]
bitRAKE
tthsqe wrote:
ritRAKE,
I'm not sure what to make of your post. The equations you give follow directly from the fact that {{1,0},{1,1}}^n is a matrix of f(n-1) f(n) and f(n+1) (I think the anti diagonal elements are both f(n)). Since n was is power of 2, everything I've said points to the conjecture that mathematica is using exactly the equations you have written down. Confused
Itterate over bits of (n), from msb to lsb, to select a path. Don't need to directly calculate fib(2n). So, only squaring and add/sub are needed. Sorry, I misunderstood your question.

_________________
¯\(°_o)/¯ unlicense.org
Post 30 Dec 2009, 15:13
View user's profile Send private message Visit poster's website Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
tthsqe wrote:
ritRAKE,
I'm not sure what to make of your post. The equations you give follow directly from the fact that {{1,0},{1,1}}^n is a matrix of f(n-1) f(n) and f(n+1) (I think the anti diagonal elements are both f(n)). Since n was is power of 2, everything I've said points to the conjecture that mathematica is using exactly the equations you have written down. Confused
Except that his post is clear to mere mortals like me instead of math gods. Wink

@f0dder: how is it difficult to implement it in that folder-scanning example? You just replace the call with a jmp but still push/pop the folder handles on the stack.

(of course you could use an array instead of the stack if you wanted but that's not necessarily).

(also replace the ret with a corresponding jmp obviously)

_________________
Previously known as The_Grey_Beast
Post 30 Dec 2009, 17:23
View user's profile Send private message Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
Ah, so you were talking about only keeping track of TWO values. I gave that a shot, but the results were about 50% slower than the built in function:

a = 0; b = 1;
Accumulate[
Table[Timing[c = (b - a)^2; {a, b} = {a^2 + c, b^2 + c};][[1]],
{i, 1, 30}]][[18 ;; 30]]

{0.015, 0.047, 0.078, 0.171, 0.374, 0.78, 1.841, 4.118, 8.798, 18.361, 38.719, 81.713, 178.371}

The built in function probably uses a host of other optimizations as well as the one you mentioned. (The timing must be accumulated because although mathematica caches computed results, this seems not affect the calculation of further results)
Post 31 Dec 2009, 00:49
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2913
Location: [RSP+8*5]
bitRAKE
I'd rather code it in x86 and compare. Razz

Mathematica is awesome, but I no longer have a license.
Post 31 Dec 2009, 04:27
View user's profile Send private message Visit poster's website Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
Mathematica is awesome, but the new version 7 has some kind debugger deflector, so you can' see what's going on in ollydbg. Sad
Quote:

I'd rather code it in x86 and compare.

Please do - I would not be uninterested in the results. But, you are going to have to code a good multiprecision multiplier (f(2^30) is about 10 million qwords), and I hear that mathematica has adopted the GMP library just for this reason.
Post 31 Dec 2009, 06:05
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2913
Location: [RSP+8*5]
bitRAKE
Don't you mean a good specialized squaring routined? Mathematica has adopted much open source on the back-end from what I've read.
Post 31 Dec 2009, 15:15
View user's profile Send private message Visit poster's website Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
Can't you use floating point though?

bitRAKE wrote:
Mathematica is awesome
Mathematica is bloated.

For computations (not symbolic) use PARI. Light and very fast.

_________________
Previously known as The_Grey_Beast
Post 31 Dec 2009, 15:32
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2913
Location: [RSP+8*5]
bitRAKE
True, from a certain perspective Mathematica is bloated, but we cannot ignore the wider audience and scope of utility covered by Mathematica - from the initial design Wolfram has had a desire to bring mathematics to a wider audience as well as easing the digital use of existing mathematics in education. I have much respect for this - both the man and his work.
Post 01 Jan 2010, 02:59
View user's profile Send private message Visit poster's website Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
borsuc,
what do you have in mind for a floating point approach?
Post 01 Jan 2010, 04:14
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2913
Location: [RSP+8*5]
bitRAKE
tthsqe wrote:
f(2^30) is about 10 million qwords
What is your memory bus? It will give me a better idea of how much data I need to plow through in 100 seconds - I'll ignore caching for now. (My system can read/write 4GB/s to main memory - slow compared to modern buses.)

_________________
¯\(°_o)/¯ unlicense.org
Post 01 Jan 2010, 20:48
View user's profile Send private message Visit poster's website Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
I'm not sure how to find that out. Here is the cpuz report, though.


Description: 32 bit vista
Download
Filename: pc.txt
Filesize: 70.24 KB
Downloaded: 302 Time(s)

Post 02 Jan 2010, 04:18
View user's profile Send private message Reply with quote
Alphonso



Joined: 16 Jan 2007
Posts: 294
Alphonso
What is this code really about?
Code:
f:     ;argument passed in eax
    cmp eax,1
    jbe .1
    push ebx
    lea ebx,[eax-2]
    dec eax
    call f
    xchg eax,ebx
    call f
    add eax,ebx
    pop ebx
.1  ret       
Since this code can only work up to N=47 then why not use a LUT? Then it would be millions of times quicker than the above. Don't take this the wrong way but TBH I find it hard to believe the above code could beat anything except maybe Loco's code. Laughing Keep in mind I'm just a layman so if you could explain while leaving out the iterative, recursion and Binet's etc it would help as I feel I'm not understanding something / everything here. Confused

It also seems the thread has turned into a 'fibber what's his name' optimization. Just out of curiosity how long does it take mathematica to produce an answer for say fib(1,000,000)?

Happy new year to everyone, Very Happy even though I'm a day late.
Post 02 Jan 2010, 04:46
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Alphonso, it computes this:
Code:
f(0) = 1
f(1) = 1
f(n) = (f(n-1) + f(n-2))(mod 2^32)    
Wink

The LUT is of course the way to go for computing the real fib inside the bounds of not-astronomically-big unsigned integers but the idea was not to use precomputed values (note that in any moment the discussion was directed to real-life use except for the arbitrary precision).

About the direction of this thread yeah, I was thinking about merging part of it into the other thread that was specific of fib.
Post 02 Jan 2010, 05:53
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2913
Location: [RSP+8*5]
bitRAKE
tthsqe wrote:
I'm not sure how to find that out. Here is the cpuz report, though.
Here is an interesting memory speed tester. My old Pentium M laptop does 1.5GB/s - yours might be better than twice that with two cores?

_________________
¯\(°_o)/¯ unlicense.org
Post 02 Jan 2010, 06:10
View user's profile Send private message Visit poster's website Reply with quote
Alphonso



Joined: 16 Jan 2007
Posts: 294
Alphonso
LocoDelAssembly wrote:
The idea was not to use precomputed values
Ok
Code:
f:   cmp   eax,1
     jbe   .1
     mov   ecx,eax
     xor   eax,eax
     mov   ebx,eax
     inc   ebx
@@:
     push  eax
     add   eax,ebx
     pop   ebx
     dec   ecx
     jnz   @b
.1:  ret    
This should do the same, limited to N=47, but take nano seconds instead of 'time to go for a coffee' time. Perhaps it's because the original code is limited to 47 that I'm confused whereas your really looking for up to 2^32-1, or maybe not. Probably best if I just keep quiet and let you guys do your stuff. Confused
Post 02 Jan 2010, 07:23
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17270
Location: In your JS exploiting you and your system
revolution
Alphonso: You may have missed the discussion about sticking with the (inefficient) recursive algorithm for the purposes of testing. So your standard linear approach above, while certainly much faster, was not the intended purpose of all the past posts in here.

Although, I am at a loss to understand what the recursive test is supposed to prove or solve?
Post 02 Jan 2010, 07:52
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
It started from here when tthsqe was talking about the inefficient Assembly code his compilers generated and that his own was the best. Then I've shown my code which was better on my AMD, then there was also some discussion about using iteration with an stack, and then... Well, you'll have to read Razz
Post 02 Jan 2010, 08:11
View user's profile Send private message Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
Alphonso,
I guess the point was the obvious fact that it the really put the effort in you can gain some improvements over some compilers - they just don't know where to put the variables in the case of that c code.
mathematica timings for various values:
f(1000^2): 0.05
f(1024^2): 0.05
f(1000^3): 106
f(1024^3): 112
Also, it was mentioned before, but the xadd operation can replace your push, add, pop sequence. don't forget about xadd !

bitRAKE,
how did you get that I have two cores? only one 64 bit core trapped in a 32 bit OS ....
anyways, the result for three runs of the program was 1.83, 2.08, 1.84 GB/s
Post 02 Jan 2010, 08:41
View user's profile Send private message Reply with quote
Alphonso



Joined: 16 Jan 2007
Posts: 294
Alphonso
Thanks for those numbers tthsqe. I tried going up to f(1 million) using the add method and three streams but it took about 2 minutes, so pretty slow really. Thanks for the tip on XADD but it seems to run 25% slower than a push-add-pop on my laptop.

Code:
;;Test XADD vs Push-Add-Pop

format PE GUI 4.0
include 'win32a.inc'

TestTime=0x10000000

;========================================
section '.code' code readable executable
;----------------------------------------;Loop Time Check
        invoke  GetTickCount
        mov     [TStart],eax
        xor     eax,eax
        mov     ebx,1

align 16
        mov     ecx,TestTime
@@:
        dec     ecx
        jnz     @b

        invoke  GetTickCount
        sub     eax,[TStart]
        mov     [LoopTime],eax
;----------------------------------------;PushPop Time Check
        invoke  GetTickCount
        mov     [TStart],eax
        xor     eax,eax
        mov     ebx,1
align 16
        mov     ecx,TestTime
@@:
        push    eax
        add     eax,ebx
        pop     ebx
        dec     ecx
        jnz     @b

        invoke  GetTickCount
        sub     eax,[TStart]
        mov     [PushPopTime],eax
;----------------------------------------;Xadd Time Check
        invoke  GetTickCount
        mov     [TStart],eax
        xor     eax,eax
        mov     ebx,1
align 16
        mov     ecx,TestTime
@@:
        xadd    eax,ebx
        dec     ecx
        jnz     @b

        invoke  GetTickCount
        sub     eax,[TStart]
        mov     [XaddTime],eax
;----------------------------------------;Results
        mov     eax,[LoopTime]
        sub     [PushPopTime],eax
        sub     [XaddTime],eax
        cinvoke wsprintf,Buff,wsformat,eax,[PushPopTime],[XaddTime]
        invoke  MessageBox,0,Buff,Tit,1 MB_RIGHT or MB_ICONINFORMATION
        invoke  ExitProcess,0
;========================================
section '.data' data readable writeable
  TStart        dd ?
  LoopTime      dd ?
  PushPopTime   dd ?
  XaddTime      dd ?
  Tit           db 'P-P vs XADD',0
  wsformat      db 'Loop Time : %04ums',10,'Push Pop Time : %04ums',10,'Xadd Time : %04ums',0
  Buff          rb 128
;========================================
section '.idata' import data readable writeable
  library kernel,'KERNEL32.DLL',\
          user,'USER32.DLL'
  import  kernel,\
          GetTickCount,'GetTickCount',\
          ExitProcess,'ExitProcess'
  import  user,\
          MessageBox,'MessageBoxA',\
          wsprintf,'wsprintfA'
;========================================
    
Post 02 Jan 2010, 14:12
View user's profile Send private message Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
My initial suggestion with float was because of the much higher dynamic range, unfortunately addition doesn't do very well in floats and since fibonacci sequence is additions the precision loss might be huge. It may not have been such a good idea.
Post 02 Jan 2010, 19:39
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2, 3, 4, 5  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar.

Powered by rwasa.