flat assembler
Message board for the users of flat assembler.

Index > Main > Code optimization (AKA C vs. Asm)

Goto page Previous  1, 2, 3, 4, 5  Next
Author
Thread Post new topic Reply to topic
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17278
Location: In your JS exploiting you and your system
revolution
Borsuc wrote:
Recursion by function-calling is NEVER needed. Why push the return address, at the very least, all the time, you know it won't change. Never understood the idea behind it. Confused

If you need a hierarchic recursion, then just use a custom array to hold it -- why pass it as duplicated parameters and return addresses? Confused
Code example please.
Post 29 Dec 2009, 04:17
View user's profile Send private message Visit poster's website Reply with quote
Alphonso



Joined: 16 Jan 2007
Posts: 294
Alphonso
Added an 'align 16' for the f code and got these results on a C2D P8400 @ 3GHz. Best of 3 runs. Exchanged 'dec eax' with 'sub eax,1'

Code:
f using XCHG with 'dec eax'                              Tiempo = 38626 ms ; Resultado = 2971215073

f using XCHG with 'sub eax,1'                               Tiempo = 32807 ms ; Resultado = 2971215073

fib using XCHG with 'dec eax'                               Tiempo = 38439 ms ; Resultado = 2971215073

fib using XCHG with 'sub eax,1'                             Tiempo = 36020 ms ; Resultado = 2971215073
    
A couple of nop's with dec in f code is an improvent as well.


Code:
fib using replacement for XCHG (ecx) with 'dec eax'        Tiempo = 38584 ms ; Resultado = 2971215073

fib using replacement for XCHG (ecx) with 'sub eax,1'       Tiempo = 35989 ms ; Resultado = 2971215073
    

Looking at your results Loco, it seems the Athlon64 doesn't play well with the XCHG opcode. I haven't done any C but I would assume code could only at best be as good as your compiler. If the compiler is written very well then you should be able to produce very good code without spending too much time optimizing. JMO.
Post 29 Dec 2009, 05:09
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Damn, I forgot to delete "f", I actually tested "f" aligned because I put it just after "fib" label, but then I've moved it to test my code again in case the commented timings were wrong (due to changes in the OS and/or environment since the timings were retrieved).
Post 29 Dec 2009, 06:05
View user's profile Send private message Reply with quote
DOS386



Joined: 08 Dec 2006
Posts: 1901
DOS386
This looks interesting Smile Someone please upload the complete testcase: FASM + English, maybe also + C code, OTOH no Espagnol or C++ please Wink
Post 29 Dec 2009, 06:55
View user's profile Send private message Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
revolution wrote:
Code example please.
Code example for what?

Normal (stupid) recursion:
Code:
[return address]
[param1]
[...]

[return address]
[param1]
[...]    


Smart recursion:
Code:
[param1]
[...]
[param1]
[...]    
Furthermore if param1 doesn't change there's no reason to put it on the array/stack (stack is just an array), so let's assume all parameters except for 1 are static, then:

Code:
[...]  ; < the other params

[param1]
[param1]
[param1]    
and of course you would use 'jmp' instead of 'call' I thought that was pretty obvious.

You know actually it's very easy to imagine this even in a HLL: all you need is a simple iterative loop, and an array where you store the hierarchy and expand the array. Example: when scanning folders recursively, you just store the current folder pointer/ID/handle/whatever in the array, then go again and store the (new) current folder in the array (of course you append it to the array). When you run out of folders you simply decrease the array pointer to the previous element and loop again.

That's actually what stupid recursion does except that it also pushes duplicated shit like return addresses and static "local" variables.

_________________
Previously known as The_Grey_Beast
Post 29 Dec 2009, 18:16
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
Simple (single tailed) recursion is just an abstraction of iteration.
Like a recursive factorial calculator F[n>1](n) = n * F(n-1).
In this case recursion is just a convenient means of representation of an iterative process.

However, advanced recursion (two tailed) for instance like a quick sort algorithm would be more complex/slower if implemented iteratively because you'd be using a software implemented stack.
Post 29 Dec 2009, 19:20
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Code:
format PE console 4.0
entry start

include 'win32ax.inc'
N equ 44

macro tester func
{
align 64

      invoke  Sleep, 1000

      xor     eax, eax
      cpuid

      invoke  GetTickCount
      mov     [timestart], eax

      mov     eax, N
      call    func
      push    eax

; Serialize
      xor     eax, eax
      cpuid

      invoke  GetTickCount
      sub     eax, [timestart]

      push    eax
      call    @f
      db      `func, 0

@@:
      push    fmt
      call    [printf]
      add     esp, 12
}
macro fEntry func
{
align 64
func:
}

section '.text' code readable executable

proc start
local ProcessAffinityMask:DWORD, SystemAffinityMask:DWORD

      invoke   GetCurrentProcess
      mov      ebx, eax
      invoke   SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
      invoke   GetCurrentThread
      invoke   SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL

      invoke   GetProcessAffinityMask, ebx, addr ProcessAffinityMask, addr SystemAffinityMask
      test     eax, eax
      jz       .beginTests

; Lets choose only one of the processors allowed for this process
      mov      esi, 1
      bsf      ecx, [ProcessAffinityMask]
      shl      esi, cl

; Now, since Win9x/Me is smart enough to have GetProcessAffinityMask but not its counterpart we must check its existence first

      invoke   GetModuleHandle, 'KERNEL32.DLL'
      test     eax, eax
      jz       .beginTests

      invoke   GetProcAddress, eax, 'SetProcessAffinityMask'
      test     eax, eax
      jz       .beginTests

      stdcall  eax, ebx, esi

;;;;;;;;;;;;;;;;;;;;;;;;;;

.beginTests:
      tester f_xchg_dec
      tester f_xchg_sub

      tester fib_xchg_dec
      tester fib_xchg_sub

      tester fib_no@xchg_dec
      tester fib_no@xchg_sub

      push    _pause
      call    [system]

      add     esp,4
      invoke  ExitProcess, 0
endp

;;;;;;;;;;;;;;;;;;;;;;;;;;

fEntry f_xchg_dec
    cmp eax,1
    jbe .1
    push ebx
    lea ebx,[eax-2]
    dec eax
    call f_xchg_dec
    xchg eax,ebx
    call f_xchg_dec
    add eax,ebx
    pop ebx
.1: ret

fEntry f_xchg_sub
    cmp eax,1
    jbe .1
    push ebx
    lea ebx,[eax-2]
    sub eax, 1
    call f_xchg_sub
    xchg eax,ebx
    call f_xchg_sub
    add eax,ebx
    pop ebx
.1: ret

fEntry fib_xchg_dec
  cmp     eax, 1 ; Base case?
  jbe     .return_eax

  push    ebx

  mov     ebx, eax
  dec     eax
  call    fib_xchg_dec

  xchg    ebx, eax
  sub     eax, 2
  call    fib_xchg_dec

; eax = Fib(n-2) ; ebx = Fib(n-1)

  add     eax, ebx ; eax = Fib(n-1) + Fib(n-2)

  pop     ebx

.return_eax:
  ret

fEntry fib_xchg_sub
  cmp     eax, 1 ; Base case?
  jbe     .return_eax

  push    ebx

  mov     ebx, eax
  sub     eax, 1
  call    fib_xchg_sub

  xchg    ebx, eax
  sub     eax, 2
  call    fib_xchg_sub

; eax = Fib(n-2) ; ebx = Fib(n-1)

  add     eax, ebx ; eax = Fib(n-1) + Fib(n-2)

  pop     ebx

.return_eax:
  ret


fEntry fib_no@xchg_dec
  cmp     eax, 1 ; Base case?
  jbe     .return_eax

  push    ebx

  mov     ebx, eax
  dec     eax
  call    fib_no@xchg_dec

  lea     ecx, [ebx-2]
  mov     ebx, eax
  mov     eax, ecx
  call    fib_no@xchg_dec

; eax = Fib(n-2) ; ebx = Fib(n-1)

  add     eax, ebx ; eax = Fib(n-1) + Fib(n-2)

  pop     ebx

.return_eax:
  ret

fEntry fib_no@xchg_sub
  cmp     eax, 1 ; Base case?
  jbe     .return_eax

  push    ebx

  mov     ebx, eax
  sub     eax, 1
  call    fib_no@xchg_sub

  lea     ecx, [ebx-2]
  mov     ebx, eax
  mov     eax, ecx
  call    fib_no@xchg_sub

; eax = Fib(n-2) ; ebx = Fib(n-1)

  add     eax, ebx ; eax = Fib(n-1) + Fib(n-2)

  pop     ebx

.return_eax:
  ret

section '.data' data readable writable
timestart dd 0

fmt    db "%s: %dms; Result = %u", 10, 0
_pause db "pause", 0

section '.idata' import readable

  library msvcrt, 'msvcrt.dll',\
    kernel32, 'kernel32.dll'
    
  import msvcrt,\
    printf, 'printf',\
    system, 'system'
    
  include 'api/kernel32.inc'    


WinXP SP3; Athlon64 Venice 2.0GHz
Code:
f_xchg_dec: 17562ms; Result = 701408733
f_xchg_sub: 16953ms; Result = 701408733
fib_xchg_dec: 17703ms; Result = 701408733
fib_xchg_sub: 17469ms; Result = 701408733
fib_no@xchg_dec: 8672ms; Result = 701408733
fib_no@xchg_sub: 8156ms; Result = 701408733
Presione una tecla para continuar . . .
f_xchg_dec: 17484ms; Result = 701408733
f_xchg_sub: 16875ms; Result = 701408733
fib_xchg_dec: 17610ms; Result = 701408733
fib_xchg_sub: 17390ms; Result = 701408733
fib_no@xchg_dec: 8625ms; Result = 701408733
fib_no@xchg_sub: 8110ms; Result = 701408733
Presione una tecla para continuar . . .
f_xchg_dec: 17484ms; Result = 701408733
f_xchg_sub: 16875ms; Result = 701408733
fib_xchg_dec: 17610ms; Result = 701408733
fib_xchg_sub: 17390ms; Result = 701408733
fib_no@xchg_dec: 8625ms; Result = 701408733
fib_no@xchg_sub: 8110ms; Result = 701408733
Presione una tecla para continuar . . .    


WARNING: The code will hung your computer for a while, but between tests it will breath. You can also cancel execution with Control-C, and if you press the sequence while the computer is hung the program will be terminated on the next Sleep call. In any case don't shutdown/reset your computer, the program will halt sooner or later.

Borsuc, if possible write your own fib using your recursion method (don't cheat by using the iterative approach though because we already know it is O(n)).

PS: I've reduced N to 44 because it is good enough to see timing differences and is obviously faster.
[edit]Times with N=47
Code:
f_xchg_dec: 74078ms; Result = 2971215073
f_xchg_sub: 71484ms; Result = 2971215073
fib_xchg_dec: 74625ms; Result = 2971215073
fib_xchg_sub: 73703ms; Result = 2971215073
fib_no@xchg_dec: 36562ms; Result = 2971215073
fib_no@xchg_sub: 34375ms; Result = 2971215073    
Post 29 Dec 2009, 19:33
View user's profile Send private message Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
r22 wrote:
However, advanced recursion (two tailed) for instance like a quick sort algorithm would be more complex/slower if implemented iteratively because you'd be using a software implemented stack.
What do you mean by software implemented stack? You mean an array? You can use the stack directly if you want too. Only difference is when you call the function at multiple addresses -- then you may need to pass on some information to identify them.

LocoDelAssembly wrote:
Borsuc, if possible write your own fib using your recursion method (don't cheat by using the iterative approach though because we already know it is O(n)).
this is a case of accumulative recursion so obviously the simplest method is iteration.

But even if you necessarily want to push the numbers on the stack, and since the function calls itself twice, you will have to push an identifier -- ideally 1 bit -- which can be used to index the "return address" instead.

In this particular case, the program might be slower, since the function is very simple (just an addition!), but for large numbers you won't waste as much memory...

_________________
Previously known as The_Grey_Beast


Last edited by Borsuc on 29 Dec 2009, 21:40; edited 1 time in total
Post 29 Dec 2009, 20:45
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2914
Location: [RSP+8*5]
bitRAKE
Fibonacci only needs XADD/LOOP instructions.

"Recursion is self-explanatory" kind of says why recursion is used - the code is simplier. Usually, this costs some in performance. My BlobDrop game has a multi-tail example of this simplicity (works like a flood-fill algorithm):
Code:
.recurse.0:
     push ebp
.recurse:
   cmp [ebp-4*(GAME.COLUMNS+1)],ebx
    je .Up
.rUp:     cmp [ebp+4],ebx
     je .Right
.rRight:cmp [ebp+4*(GAME.COLUMNS+1)],ebx
   je .Down
.rDown: cmp [ebp-4],ebx
     je .Left
.rLeft: pop ebp
     retn

.Up:    mov [ebp-4*(GAME.COLUMNS+1)],eax
    push .rUp
   push ebp
    sub ebp,4*(GAME.COLUMNS+1)
  jmp .recurse

.Right: mov [ebp+4],eax
     push .rRight
        push ebp
    add ebp,4
   jmp .recurse

.Down:  mov [ebp+4*(GAME.COLUMNS+1)],eax
    push .rDown
 push ebp
    add ebp,4*(GAME.COLUMNS+1)
  jmp .recurse

.Left:  mov [ebp-4],eax
     push .rLeft
 push ebp
    sub ebp,4
   jmp .recurse    
...no attempt was made to optimize the code. .Left obviously can be short cutted to .recurse. Also, data adjacency can be exploited to beter use cache.

I can see how creating a common return address would reduce the amount of data on the stack (from 32-bit return address to a 2-bit number), and the least 2-bits of the EBP register would be used to store this index number. Additionally, we would need to add special handling for a halt state. This is not a problem because EBP would never be zero. This would add 5/6 instructions to the loop? It would certainly be faster beyond some depth (beyond practicality for my use).

Could a compiler know how to make this transformation, though? Certainly, not.


Last edited by bitRAKE on 29 Dec 2009, 21:49; edited 1 time in total
Post 29 Dec 2009, 21:22
View user's profile Send private message Visit poster's website Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
bitRAKE wrote:
"Recursion is self-explanatory" kind of says why recursion is used - the code is simplier.
To be honest with you, I find iteration much simpler to comprehend, takes me a while to figure out recursion in more advanced algorithms.

_________________
Previously known as The_Grey_Beast
Post 29 Dec 2009, 21:44
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2914
Location: [RSP+8*5]
bitRAKE
Yeah, now that you mention it - an iterative version might even be simplier (and faster) for the above code. Wouldn't the worst case depth be the same, too? Another approach (used in my Hilbert curve examples) would be reverse the transformation on EBP for each step - requiring only 2-bit storage for each depth step.


Win7 x64; Xeon L5410 2.3GHz, FSB 1333Mhz (important here?),
computer doesn't hang at all. Smile

best of three runs N=44:
Code:
f_xchg_dec: 10093ms; Result = 701408733
f_xchg_sub: 9719ms; Result = 701408733
fib_xchg_dec: 10842ms; Result = 701408733
fib_xchg_sub: 14040ms; Result = 701408733
fib_no@xchg_dec: 11044ms; Result = 701408733
fib_no@xchg_sub: 10546ms; Result = 701408733    
Single run, N=47:
Code:

f_xchg_dec: 42526ms; Result = 2971215073
f_xchg_sub: 41402ms; Result = 2971215073
fib_xchg_dec: 46956ms; Result = 2971215073
fib_xchg_sub: 59452ms; Result = 2971215073
fib_no@xchg_dec: 46925ms; Result = 2971215073
fib_no@xchg_sub: 45069ms; Result = 2971215073    
Post 29 Dec 2009, 22:32
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Quote:

computer doesn't hang at all.
Advantage of multi-core owners Wink (Although for uni-core owners it will not hang the computer with unprivileged accounts neither)

Borsuc, OK, but then remember that you'll have branch misprediction penalties that way, the RET instruction has the so called return address buffer which gives extra help to the prediction of the branch target address.
Post 29 Dec 2009, 23:35
View user's profile Send private message Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
@LocoDelAssembly
I don't say code is slower just by looking at it. Smile I know better than that now.
The ICC output was almost twice as long but executed only about 5-10% slower and VC was about 25-30% slower.

so it looks like intel's cpus are better at xchg here than amd's and that sub has a slight advantage over dec. Does anybody have an implementation that is both faster and smaller than f_xchg_sub? something avoiding all of the return addresses would be of interest, but keep it at exponetial running time (naive recursion).

PS: a very good way of actually computing these numbers using integers is to take the n th power of the matrix 11 10 using exponentiation by squaring (or something else with operation complexity O(log(n))
Post 30 Dec 2009, 04:01
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Quote:

PS: a very good way of actually computing these numbers using integers is to take the n th power of the matrix 11 10 using exponentiation by squaring (or something else with operation complexity O(log(n))
I got very exited when reading this but now I'm in doubt. Can this method actually be faster than iteration? Matrix multiplication is somewhat expensive so you'll need to consider a big N to see the benefits. But then, since fib goes out of range pretty fast you'll need bignums too which in the iterative method you'll have O(n^2) and with the matrix method O((n+n^2)*log(n)) = O(n^2*log(n)) (assuming naive multiplication here). If the best known multiplication algorithm really is O(n*log(n)) then it would be O((n+n*log(n))*log(n)) = O(n*log(n)*log(n)).

Obviously O(n^2)>O(n*log(n)*log(n)), but how long N should be to see the benefits in practice?
Post 30 Dec 2009, 06:35
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17278
Location: In your JS exploiting you and your system
revolution
Binet's formula = O(1) or O(log(n))?
Post 30 Dec 2009, 06:45
View user's profile Send private message Visit poster's website Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
@LocoDelAssembly:
Not all of those multiplications are as big as the last ones, so it shouldn't it be a little less? [Edit: doing the math, assuming that we have:
1. log(n) steps
2. at the the i th step, c^i digit numbers
3. a nice n log n multiplication
we get
sum_{i=1}^{log n}{i*c^i}, and this is O(n log(n)). End Edit]

Anyways, here is mathematica 7 on a 1.6 GHz Celeron M:

Table[Timing[Fibonacci[2^n];][[1]], {n, 18, 30}]
{0.015, 0.016, 0.047, 0.093, 0.219, 0.514, 1.217, 2.637, 5.678, 11.762, 24.742, 52.307, 113.741}

%/(10^-9/9 2^Range[18, 30] Range[18, 30]^2)
{1.58946, 0.760826, 1.00851, 0.905017, 0.970915, 1.04246, 1.13342, 1.13168, 1.12645, 1.0819, 1.05809, 1.04264, 1.0593}

I would assume it uses some something based on the matrix power. Also, it looks like your prediction of O(n log(n)^2) is right on. (the running lime here seems to be assmptotic to 1.1*^-10*n*log(n)^2). [Edit: n log n also looks good. End Edit]

The question here is: can we beat mathematica? Smile

@revolution:
in order to compute the integer part of phi^n, we have to know phi to O(n) decimal places, so it still comes down to multiprecision multiplications, and the time complexity is higher than the operation complexity.


Last edited by tthsqe on 30 Dec 2009, 10:51; edited 1 time in total
Post 30 Dec 2009, 08:49
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 2914
Location: [RSP+8*5]
bitRAKE
tthsqe wrote:
The question here is: can we beat mathematica? Smile
Sure.
  • fib(2n) = fib(n-1) * fib(n) + fib(n) * fib(n+1)
  • fib(2n+1) = fib(n)^2 + fib(n+1)^2
  • fib(2n-1) = fib(n-1)^2 + fib(n)^2

Which means that, given fib(n-1), fib(n) and fib(n+1), we can calculate without iteration fib(2n-1), fib(2n) and fib(2n+1). The code of the 'fibonacci-matrix' function can be replaced to use and return these three values rather than a matrix.

_________________
¯\(°_o)/¯ unlicense.org
Post 30 Dec 2009, 09:31
View user's profile Send private message Visit poster's website Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 724
tthsqe
ritRAKE,
I'm not sure what to make of your post. The equations you give follow directly from the fact that {{1,0},{1,1}}^n is a matrix of f(n-1) f(n) and f(n+1) (I think the anti diagonal elements are both f(n)). Since n was is power of 2, everything I've said points to the conjecture that mathematica is using exactly the equations you have written down. Confused
Post 30 Dec 2009, 09:57
View user's profile Send private message Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
Borsuc wrote:
You know actually it's very easy to imagine this even in a HLL: all you need is a simple iterative loop, and an array where you store the hierarchy and expand the array. Example: when scanning folders recursively, you just store the current folder pointer/ID/handle/whatever in the array, then go again and store the (new) current folder in the array (of course you append it to the array). When you run out of folders you simply decrease the array pointer to the previous element and loop again.

For something like scanning folder trees, you don't really gain much (except code complexity!) from implementing in an iterative way... what's the largest tree depth you've seen on normal systems? Smile

r22 wrote:
However, advanced recursion (two tailed) for instance like a quick sort algorithm would be more complex/slower if implemented iteratively because you'd be using a software implemented stack.
Microsofts libc has a hybrid quicksort implementation with tri-median pivot selection (uses insertion sort after a certain cutoff size), and the quicksort part is implemented iteratively... you eliminate both CALL/RET overhead and stack usage, as well as a bunch of local variables.
Post 30 Dec 2009, 12:23
View user's profile Send private message Visit poster's website Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
@fodder - agreed, quicksort was a poor choice for an example/illustration.
Post 30 Dec 2009, 14:36
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2, 3, 4, 5  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar.

Powered by rwasa.