flat assembler
Message board for the users of flat assembler.
Index
> Main > Code optimization (AKA C vs. Asm)
Author 

Alphonso
Added an 'align 16' for the f code and got these results on a C2D P8400 @ 3GHz. Best of 3 runs. Exchanged 'dec eax' with 'sub eax,1'.
Code:
f using XCHG with 'dec eax'     Time = 38626 ms ; Result = 2971215073
f using XCHG with 'sub eax,1'   Time = 32807 ms ; Result = 2971215073
fib using XCHG with 'dec eax'   Time = 38439 ms ; Result = 2971215073
fib using XCHG with 'sub eax,1' Time = 36020 ms ; Result = 2971215073
Code:
fib using replacement for XCHG (ecx) with 'dec eax'   Time = 38584 ms ; Result = 2971215073
fib using replacement for XCHG (ecx) with 'sub eax,1' Time = 35989 ms ; Result = 2971215073
Looking at your results Loco, it seems the Athlon64 doesn't play well with the XCHG opcode. I haven't done any C, but I would assume code could at best only be as good as your compiler. If the compiler is written very well then you should be able to produce very good code without spending too much time optimizing. JMO.

29 Dec 2009, 05:09 

LocoDelAssembly
Damn, I forgot to delete "f". I actually tested "f" aligned, because I put it just after the "fib" label, but then I moved it so I could test my code again in case the commented timings were wrong (due to changes in the OS and/or environment since the timings were taken).


29 Dec 2009, 06:05 

DOS386
This looks interesting. Someone please upload the complete test case: FASM + English, maybe also + C code. OTOH, no Spanish or C++ please.


29 Dec 2009, 06:55 

Borsuc
revolution wrote: Code example please.
Normal (stupid) recursion:
Code:
[return address]
[param1]
[...]
[return address]
[param1]
[...]
Smart recursion:
Code:
[param1]
[...]
[param1]
[...]
Code:
[...] ; < the other params
[param1]
[param1]
[param1]
You know, actually it's very easy to imagine this even in a HLL: all you need is a simple iterative loop and an array where you store the hierarchy, expanding the array as you go. Example: when scanning folders recursively, you just store the current folder pointer/ID/handle/whatever in the array, then go again and store the (new) current folder in the array (of course you append it to the array). When you run out of folders you simply decrease the array pointer to the previous element and loop again. That's actually what stupid recursion does, except that it also pushes duplicated shit like return addresses and static "local" variables.
_________________
Previously known as The_Grey_Beast

29 Dec 2009, 18:16 

r22
Simple (single-tailed) recursion is just an abstraction of iteration.
Like a recursive factorial calculator, F(n) = n * F(n-1) for n > 1. In this case recursion is just a convenient representation of an iterative process. However, advanced recursion (two-tailed), for instance a quicksort algorithm, would be more complex/slower if implemented iteratively, because you'd be using a software-implemented stack. 

29 Dec 2009, 19:20 

LocoDelAssembly
Code:
format PE console 4.0
entry start
include 'win32ax.inc'

N equ 44

macro tester func {
        align   64
        invoke  Sleep, 1000
        xor     eax, eax
        cpuid
        invoke  GetTickCount
        mov     [timestart], eax
        mov     eax, N
        call    func
        push    eax             ; Serialize
        xor     eax, eax
        cpuid
        invoke  GetTickCount
        sub     eax, [timestart]
        push    eax
        call    @f
        db      `func, 0
@@:
        push    fmt
        call    [printf]
        add     esp, 12
}

macro fEntry func {
        align   64
func:
}

section '.text' code readable executable

proc start
        local   ProcessAffinityMask:DWORD, SystemAffinityMask:DWORD
        invoke  GetCurrentProcess
        mov     ebx, eax
        invoke  SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
        invoke  GetCurrentThread
        invoke  SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL
        invoke  GetProcessAffinityMask, ebx, addr ProcessAffinityMask, addr SystemAffinityMask
        test    eax, eax
        jz      .beginTests
        ; Let's choose only one of the processors allowed for this process
        mov     esi, 1
        bsf     ecx, [ProcessAffinityMask]
        shl     esi, cl
        ; Now, since Win9x/Me is smart enough to have GetProcessAffinityMask
        ; but not its counterpart, we must check its existence first
        invoke  GetModuleHandle, 'KERNEL32.DLL'
        test    eax, eax
        jz      .beginTests
        invoke  GetProcAddress, eax, 'SetProcessAffinityMask'
        test    eax, eax
        jz      .beginTests
        stdcall eax, ebx, esi
;;;;;;;;;;;;;;;;;;;;;;;;;;
.beginTests:
        tester  f_xchg_dec
        tester  f_xchg_sub
        tester  fib_xchg_dec
        tester  fib_xchg_sub
        tester  fib_no@xchg_dec
        tester  fib_no@xchg_sub

        push    _pause
        call    [system]
        add     esp, 4
        invoke  ExitProcess, 0
endp
;;;;;;;;;;;;;;;;;;;;;;;;;;
fEntry f_xchg_dec
        cmp     eax, 1
        jbe     .1
        push    ebx
        lea     ebx, [eax-2]
        dec     eax
        call    f_xchg_dec
        xchg    eax, ebx
        call    f_xchg_dec
        add     eax, ebx
        pop     ebx
.1:     ret

fEntry f_xchg_sub
        cmp     eax, 1
        jbe     .1
        push    ebx
        lea     ebx, [eax-2]
        sub     eax, 1
        call    f_xchg_sub
        xchg    eax, ebx
        call    f_xchg_sub
        add     eax, ebx
        pop     ebx
.1:     ret

fEntry fib_xchg_dec
        cmp     eax, 1          ; Base case?
        jbe     .return_eax
        push    ebx
        mov     ebx, eax
        dec     eax
        call    fib_xchg_dec
        xchg    ebx, eax
        sub     eax, 2
        call    fib_xchg_dec
        ; eax = Fib(n-2)
        ; ebx = Fib(n-1)
        add     eax, ebx        ; eax = Fib(n-1) + Fib(n-2)
        pop     ebx
.return_eax:
        ret

fEntry fib_xchg_sub
        cmp     eax, 1          ; Base case?
        jbe     .return_eax
        push    ebx
        mov     ebx, eax
        sub     eax, 1
        call    fib_xchg_sub
        xchg    ebx, eax
        sub     eax, 2
        call    fib_xchg_sub
        ; eax = Fib(n-2)
        ; ebx = Fib(n-1)
        add     eax, ebx        ; eax = Fib(n-1) + Fib(n-2)
        pop     ebx
.return_eax:
        ret

fEntry fib_no@xchg_dec
        cmp     eax, 1          ; Base case?
        jbe     .return_eax
        push    ebx
        mov     ebx, eax
        dec     eax
        call    fib_no@xchg_dec
        lea     ecx, [ebx-2]
        mov     ebx, eax
        mov     eax, ecx
        call    fib_no@xchg_dec
        ; eax = Fib(n-2)
        ; ebx = Fib(n-1)
        add     eax, ebx        ; eax = Fib(n-1) + Fib(n-2)
        pop     ebx
.return_eax:
        ret

fEntry fib_no@xchg_sub
        cmp     eax, 1          ; Base case?
        jbe     .return_eax
        push    ebx
        mov     ebx, eax
        sub     eax, 1
        call    fib_no@xchg_sub
        lea     ecx, [ebx-2]
        mov     ebx, eax
        mov     eax, ecx
        call    fib_no@xchg_sub
        ; eax = Fib(n-2)
        ; ebx = Fib(n-1)
        add     eax, ebx        ; eax = Fib(n-1) + Fib(n-2)
        pop     ebx
.return_eax:
        ret

section '.data' data readable writable
timestart dd 0
fmt       db "%s: %dms; Result = %u", 10, 0
_pause    db "pause", 0

section '.idata' import readable
library msvcrt, 'msvcrt.dll',\
        kernel32, 'kernel32.dll'
import msvcrt,\
       printf, 'printf',\
       system, 'system'
include 'api/kernel32.inc'

WinXP SP3; Athlon64 Venice 2.0GHz
Code:
f_xchg_dec: 17562ms; Result = 701408733
f_xchg_sub: 16953ms; Result = 701408733
fib_xchg_dec: 17703ms; Result = 701408733
fib_xchg_sub: 17469ms; Result = 701408733
fib_no@xchg_dec: 8672ms; Result = 701408733
fib_no@xchg_sub: 8156ms; Result = 701408733
Press any key to continue . . .

f_xchg_dec: 17484ms; Result = 701408733
f_xchg_sub: 16875ms; Result = 701408733
fib_xchg_dec: 17610ms; Result = 701408733
fib_xchg_sub: 17390ms; Result = 701408733
fib_no@xchg_dec: 8625ms; Result = 701408733
fib_no@xchg_sub: 8110ms; Result = 701408733
Press any key to continue . . .

WARNING: The code will hang your computer for a while, but between tests it will breathe. You can also cancel execution with Ctrl+C, and if you press the sequence while the computer is hung, the program will be terminated on the next Sleep call. In any case don't shut down/reset your computer; the program will halt sooner or later.

Borsuc, if possible write your own fib using your recursion method (don't cheat by using the iterative approach though, because we already know it is O(n)).

PS: I've reduced N to 44 because it is good enough to see timing differences and is obviously faster.

[edit] Times with N=47
Code:
f_xchg_dec: 74078ms; Result = 2971215073
f_xchg_sub: 71484ms; Result = 2971215073
fib_xchg_dec: 74625ms; Result = 2971215073
fib_xchg_sub: 73703ms; Result = 2971215073
fib_no@xchg_dec: 36562ms; Result = 2971215073
fib_no@xchg_sub: 34375ms; Result = 2971215073

29 Dec 2009, 19:33 

Borsuc
r22 wrote: However, advanced recursion (two-tailed), for instance a quicksort algorithm, would be more complex/slower if implemented iteratively, because you'd be using a software-implemented stack.
LocoDelAssembly wrote: Borsuc, if possible write your own fib using your recursion method (don't cheat by using the iterative approach though, because we already know it is O(n)).
But even if you necessarily want to push the numbers on the stack, then since the function calls itself twice, you will have to push an identifier (ideally 1 bit) which can be used to index the "return address" instead. In this particular case the program might be slower, since the function is very simple (just an addition!), but for large numbers you won't waste as much memory...
_________________
Previously known as The_Grey_Beast
Last edited by Borsuc on 29 Dec 2009, 21:40; edited 1 time in total 

29 Dec 2009, 20:45 

bitRAKE
Fibonacci only needs XADD/LOOP instructions.
"Recursion is self-explanatory" kind of says why recursion is used: the code is simpler. Usually, this costs some performance. My BlobDrop game has a multi-tail example of this simplicity (works like a flood-fill algorithm):
Code:
.recurse.0:
        push    ebp
.recurse:
        cmp     [ebp-4*(GAME.COLUMNS+1)],ebx
        je      .Up
.rUp:   cmp     [ebp+4],ebx
        je      .Right
.rRight: cmp    [ebp+4*(GAME.COLUMNS+1)],ebx
        je      .Down
.rDown: cmp     [ebp-4],ebx
        je      .Left
.rLeft: pop     ebp
        retn
.Up:    mov     [ebp-4*(GAME.COLUMNS+1)],eax
        push    .rUp
        push    ebp
        sub     ebp,4*(GAME.COLUMNS+1)
        jmp     .recurse
.Right: mov     [ebp+4],eax
        push    .rRight
        push    ebp
        add     ebp,4
        jmp     .recurse
.Down:  mov     [ebp+4*(GAME.COLUMNS+1)],eax
        push    .rDown
        push    ebp
        add     ebp,4*(GAME.COLUMNS+1)
        jmp     .recurse
.Left:  mov     [ebp-4],eax
        push    .rLeft
        push    ebp
        sub     ebp,4
        jmp     .recurse
I can see how creating a common return address would reduce the amount of data on the stack (from a 32-bit return address to a 2-bit number), and the low 2 bits of the EBP register would be used to store this index number. Additionally, we would need to add special handling for a halt state. This is not a problem because EBP would never be zero. This would add 5/6 instructions to the loop? It would certainly be faster beyond some depth (beyond practicality for my use). Could a compiler know how to make this transformation, though? Certainly not. Last edited by bitRAKE on 29 Dec 2009, 21:49; edited 1 time in total 

29 Dec 2009, 21:22 

Borsuc
bitRAKE wrote: "Recursion is self-explanatory" kind of says why recursion is used: the code is simpler.
_________________
Previously known as The_Grey_Beast 

29 Dec 2009, 21:44 

bitRAKE
Yeah, now that you mention it, an iterative version might even be simpler (and faster) for the above code. Wouldn't the worst-case depth be the same, too? Another approach (used in my Hilbert curve examples) would be to reverse the transformation on EBP for each step, requiring only 2 bits of storage for each depth step.
Win7 x64; Xeon L5410 2.3GHz, FSB 1333MHz (important here?); the computer doesn't hang at all. Best of three runs, N=44:
Code:
f_xchg_dec: 10093ms; Result = 701408733
f_xchg_sub: 9719ms; Result = 701408733
fib_xchg_dec: 10842ms; Result = 701408733
fib_xchg_sub: 14040ms; Result = 701408733
fib_no@xchg_dec: 11044ms; Result = 701408733
fib_no@xchg_sub: 10546ms; Result = 701408733
N=47:
Code:
f_xchg_dec: 42526ms; Result = 2971215073
f_xchg_sub: 41402ms; Result = 2971215073
fib_xchg_dec: 46956ms; Result = 2971215073
fib_xchg_sub: 59452ms; Result = 2971215073
fib_no@xchg_dec: 46925ms; Result = 2971215073
fib_no@xchg_sub: 45069ms; Result = 2971215073

29 Dec 2009, 22:32 

LocoDelAssembly
Quote:
Borsuc, OK, but then remember that you'll have branch misprediction penalties that way; the RET instruction has the so-called return address buffer behind it, which gives extra help to the prediction of the branch target address. 

29 Dec 2009, 23:35 

tthsqe
@LocoDelAssembly
I don't say code is slower just by looking at it; I know better than that now. The ICC output was almost twice as long but executed only about 5-10% slower, and VC was about 25-30% slower. So it looks like Intel's CPUs are better at XCHG here than AMD's, and that SUB has a slight advantage over DEC. Does anybody have an implementation that is both faster and smaller than f_xchg_sub? Something avoiding all of the return addresses would be of interest, but keep it at exponential running time (naive recursion). PS: a very good way of actually computing these numbers using integers is to take the n-th power of the matrix {{1,1},{1,0}} using exponentiation by squaring (or something else with operation complexity O(log(n))).

30 Dec 2009, 04:01 

LocoDelAssembly
Quote:
Obviously O(n^2) > O(n*log(n)*log(n)), but how large would N have to be to see the benefit in practice? 

30 Dec 2009, 06:35 

revolution
Binet's formula = O(1) or O(log(n))?


30 Dec 2009, 06:45 

tthsqe
@LocoDelAssembly:
Not all of those multiplications are as big as the last ones, so shouldn't it be a little less? [Edit: doing the math, assuming that we have: 1. log(n) steps; 2. at the i-th step, c^i-digit numbers; 3. a nice n log n multiplication, we get sum_{i=1}^{log n} i*c^i, and this is O(n log(n)). End Edit] Anyway, here is Mathematica 7 on a 1.6 GHz Celeron M:
Table[Timing[Fibonacci[2^n];][[1]], {n, 18, 30}]
{0.015, 0.016, 0.047, 0.093, 0.219, 0.514, 1.217, 2.637, 5.678, 11.762, 24.742, 52.307, 113.741}
%/(10^-9/9 2^Range[18, 30] Range[18, 30]^2)
{1.58946, 0.760826, 1.00851, 0.905017, 0.970915, 1.04246, 1.13342, 1.13168, 1.12645, 1.0819, 1.05809, 1.04264, 1.0593}
I would assume it uses something based on the matrix power. Also, it looks like your prediction of O(n log(n)^2) is right on (the running time here seems to be asymptotic to 1.1*^-10*n*log(n)^2). [Edit: n log n also looks good. End Edit] The question here is: can we beat Mathematica? @revolution: in order to compute the integer part of phi^n, we have to know phi to O(n) decimal places, so it still comes down to multiprecision multiplications, and the time complexity is higher than the operation complexity. Last edited by tthsqe on 30 Dec 2009, 10:51; edited 1 time in total 

30 Dec 2009, 08:49 

bitRAKE
tthsqe wrote: The question here is: can we beat mathematica?
Which means that, given fib(n-1), fib(n) and fib(n+1), we can calculate without iteration fib(2n-1), fib(2n) and fib(2n+1). The code of the 'fibonacci-matrix' function can be replaced to use and return these three values rather than a matrix. 

30 Dec 2009, 09:31 

tthsqe
bitRAKE,
I'm not sure what to make of your post. The equations you give follow directly from the fact that {{1,0},{1,1}}^n is a matrix of f(n-1), f(n) and f(n+1) (I think the anti-diagonal elements are both f(n)). Since n was a power of 2, everything I've said points to the conjecture that Mathematica is using exactly the equations you have written down. 

30 Dec 2009, 09:57 

f0dder
Borsuc wrote: You know actually it's very easy to imagine this even in a HLL: all you need is a simple iterative loop, and an array where you store the hierarchy and expand the array. Example: when scanning folders recursively, you just store the current folder pointer/ID/handle/whatever in the array, then go again and store the (new) current folder in the array (of course you append it to the array). When you run out of folders you simply decrease the array pointer to the previous element and loop again. For something like scanning folder trees, you don't really gain much (except code complexity!) from implementing in an iterative way... what's the largest tree depth you've seen on normal systems? r22 wrote: However, advanced recursion (two tailed) for instance like a quick sort algorithm would be more complex/slower if implemented iteratively because you'd be using a software implemented stack. 

30 Dec 2009, 12:23 

r22
@f0dder: agreed, quicksort was a poor choice for an example/illustration.


30 Dec 2009, 14:36 


Copyright © 1999-2020, Tomasz Grysztar.
Powered by rwasa.