flat assembler
Message board for the users of flat assembler.

A question on performance

system error



Joined: 01 Sep 2013
Posts: 670
system error 12 Jan 2017, 15:14
revolution,

that's the true spirit of assembly programming. Do your own 'ret' handler by handling your own frame and details. Do you have a problem with that code? I believe that code still qualifies as 'assembly programming'.

That may not be the actual microcode but that shows the amount of WORK (cpu cycles) to actually implement a RET instruction.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 12 Jan 2017, 15:19
system error wrote:
That may not be the actual microcode but that shows the amount of WORK (cpu cycles) to actually implement a RET instruction.
Not really. With the special internal call/ret stack things are very different at the microcode level. call/ret pairs can be very efficient if you don't try to mess with it by being "clever" with RISC alternatives. But I would still like an answer to my question: Have you tested it? What was the difference measured?
system error 12 Jan 2017, 15:39
revolution wrote:
system error wrote:
That may not be the actual microcode but that shows the amount of WORK (cpu cycles) to actually implement a RET instruction.
Not really. With the special internal call/ret stack things are very different at the microcode level. call/ret pairs can be very efficient if you don't try to mess with it by being "clever" with RISC alternatives. But I would still like an answer to my question: Have you tested it? What was the difference measured?


CALL/RET is directly supported by hardware, I agree. But that's not the point. The issue is LOOP versus SUB/JNZ, which takes us to the discussion of simple vs. complex instructions. I did time my code (see the earlier snippets). Replacing LOOP with SUB/JNZ yields a better timing over 10,000,000 iterations. (You must have been sleeping then.) My point: if you are aiming for speed, the general strategy is to minimize complex instructions and replace them with their plain counterparts.

You can conduct your own tests on STOSB, LOOP if you want. Nobody is stopping you.


And another point on optimization I'd like to share:

2. Use the registers the way they were originally designed to be used. For example, you should use ECX/RCX for the loop count instead of other registers like R15 or EDI. This yields faster results. Try it :D
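
The two loop forms being compared can be sketched as follows. This is a minimal illustration, not the code that was timed earlier in the thread: the label names and the empty loop bodies are assumptions, and the 10,000,000 count is taken from the post above.

```asm
use64
; Assumed sketch: two equivalent countdown loops in 64-bit mode.

   mov rcx, 10000000
loop_form:
   ; loop body goes here
   loop loop_form          ; decrements RCX, jumps while RCX <> 0, flags untouched

   mov rcx, 10000000
subjnz_form:
   ; loop body goes here
   sub rcx, 1              ; decrements RCX and updates the flags
   jnz subjnz_form         ; jumps while RCX <> 0
```

The forms are not fully interchangeable: LOOP leaves the flags alone and only has a short (8-bit displacement) encoding, while SUB/JNZ modifies the flags and works with any register. Which one is faster depends on the processor, so it has to be measured on the target machine.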
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 13 Jan 2017, 05:49
I haven't written any FASM in a while, so I thought I would try to get back in practice. In the original snippets of this thread, the processor wasn't clearly defined and the results were given in terms of wall clock times rather than the more useful processor cycles.

I attempted to remedy the situation by offering code that times empty code and both snippets each 5 times and prints out the results in processor cycles.
Code:
format PE64 console

entry start

section '.text' code readable executable

start:
   sub rsp, 8*5
; Get Console handle
   mov ecx, -11 ; STD_OUTPUT_HANDLE
   call [GetStdHandle]
   mov [ConsoleHandle], rax
; Write header for logo
   and qword [rsp+8*4], 0 ; lpReserved = NULL (5th argument, above the shadow space)
   mov r9, _written
   mov r8, logo_size
   mov rdx, _logo
   mov rcx, [ConsoleHandle]
   call [WriteConsole]
; Get processor brand string (CPUID leaves 80000002h..80000004h)
   mov eax, 80000002h
   cpuid
   mov [_brand], EAX
   mov [_brand+4], EBX
   mov [_brand+8], ECX
   mov [_brand+12], EDX
   mov eax, 80000003h
   cpuid
   mov [_brand+16], EAX
   mov [_brand+20], EBX
   mov [_brand+24], ECX
   mov [_brand+28], EDX
   mov eax, 80000004h
   cpuid
   mov [_brand+32], EAX
   mov [_brand+36], EBX
   mov [_brand+40], ECX
   mov [_brand+44], EDX
; Write brand string
   mov rcx, _brand
   call [lstrlen]
   and qword [rsp+8*4], 0 ; lpReserved = NULL (5th argument)
   mov r9, _written
   mov r8, rax
   mov rdx, _brand
   mov rcx, [ConsoleHandle]
   call [WriteConsole]
   call newline

; Get family, model, stepping
   mov eax, 1
   cpuid
   mov [_fms], eax

; write family
   mov rcx, _brand
   mov rdx, _family_fmt
   mov r8d, [_fms]
   shr r8d, 8
   and r8d, 0fh
   call [wsprintf]
   mov rcx, _brand
   call [lstrlen]
   and qword [rsp+8*4], 0 ; lpReserved = NULL (5th argument)
   mov r9, _written
   mov r8, rax
   mov rdx, _brand
   mov rcx, [ConsoleHandle]
   call [WriteConsole]

; write extended family
   mov rcx, _brand
   mov rdx, _extended_family_fmt
   mov r8d, [_fms]
   shr r8d, 20
   and r8d, 0ffh
   call [wsprintf]
   mov rcx, _brand
   call [lstrlen]
   and qword [rsp+8*4], 0 ; lpReserved = NULL (5th argument)
   mov r9, _written
   mov r8, rax
   mov rdx, _brand
   mov rcx, [ConsoleHandle]
   call [WriteConsole]

; write model
   mov rcx, _brand
   mov rdx, _model_fmt
   mov r8d, [_fms]
   shr r8d, 4
   and r8d, 0fh
   call [wsprintf]
   mov rcx, _brand
   call [lstrlen]
   and qword [rsp+8*4], 0 ; lpReserved = NULL (5th argument)
   mov r9, _written
   mov r8, rax
   mov rdx, _brand
   mov rcx, [ConsoleHandle]
   call [WriteConsole]

; write extended model
   mov rcx, _brand
   mov rdx, _extended_model_fmt
   mov r8d, [_fms]
   shr r8d, 16
   and r8d, 0fh
   call [wsprintf]
   mov rcx, _brand
   call [lstrlen]
   and qword [rsp+8*4], 0 ; lpReserved = NULL (5th argument)
   mov r9, _written
   mov r8, rax
   mov rdx, _brand
   mov rcx, [ConsoleHandle]
   call [WriteConsole]

; write stepping
   mov rcx, _brand
   mov rdx, _stepping_fmt
   mov r8d, [_fms]
   and r8d, 0fh ; stepping is EAX bits 3:0, so no shift (shr 8 would reprint the family)
   call [wsprintf]
   mov rcx, _brand
   call [lstrlen]
   and qword [rsp+8*4], 0 ; lpReserved = NULL (5th argument)
   mov r9, _written
   mov r8, rax
   mov rdx, _brand
   mov rcx, [ConsoleHandle]
   call [WriteConsole]

   mov ebx, 0
; Time successive RDTSCs
   empty:
      rdtsc
      mov dword [_times+8*rbx], eax
      mov dword [_times+8*rbx+4], edx
      rdtsc
      sub eax, dword [_times+8*rbx]
      sbb edx, dword [_times+8*rbx+4]
      mov dword [_times+8*rbx], eax
      mov dword [_times+8*rbx+4], edx
      inc ebx
      cmp ebx, 5
      jb empty

; Time with single instruction in loop
   op1:
      mov rax, 3ff0000000000000h
      movq xmm0, rax
      inc rax
      movq xmm1, rax
      rdtsc
      mov dword [_times+8*rbx], eax
      mov dword [_times+8*rbx+4], edx
      mov rcx, 10000000
      .spin:
         mulsd xmm0, xmm1
      loop .spin
      rdtsc
      sub eax, dword [_times+8*rbx]
      sbb edx, dword [_times+8*rbx+4]
      mov dword [_times+8*rbx], eax
      mov dword [_times+8*rbx+4], edx
      inc ebx
      cmp ebx, 10
      jb op1

; Time with 2 instructions in loop
   op2:
      mov rax, 3ff0000000000000h
      movq xmm0, rax
      inc rax
      movq xmm1, rax
      rdtsc
      mov dword [_times+8*rbx], eax
      mov dword [_times+8*rbx+4], edx
      mov rcx, 10000000
      .spin:
         mulsd xmm0, xmm1
         cvtsd2si rax, xmm0
      loop .spin
      rdtsc
      sub eax, dword [_times+8*rbx]
      sbb edx, dword [_times+8*rbx+4]
      mov dword [_times+8*rbx], eax
      mov dword [_times+8*rbx+4], edx
      inc ebx
      cmp ebx, 15
      jb op2

; Write out times for empty code
   and qword [rsp+8*4], 0 ; lpReserved = NULL (5th argument)
   mov r9, _written
   mov r8, empty_size
   mov rdx, _empty
   mov rcx, [ConsoleHandle]
   call [WriteConsole]
   mov ebx, 0
   write_empty:
      mov rcx, _brand
      mov rdx, _decimal_fmt
      mov r8d, dword [_times+8*rbx]
      call [wsprintf]
      mov rcx, _brand
      call [lstrlen]
      and qword [rsp+8*4], 0 ; lpReserved = NULL (5th argument)
      mov r9, _written
      mov r8, rax
      mov rdx, _brand
      mov rcx, [ConsoleHandle]
      call [WriteConsole]
      inc ebx
      cmp ebx, 5
      jb write_empty
   call newline      
      
; Write out times for single instruction loop
   and qword [rsp+8*4], 0 ; lpReserved = NULL (5th argument)
   mov r9, _written
   mov r8, op1_size
   mov rdx, _op1
   mov rcx, [ConsoleHandle]
   call [WriteConsole]
   write_op1:
      mov rcx, _brand
      mov rdx, _decimal_fmt
      mov r8d, dword [_times+8*rbx]
      call [wsprintf]
      mov rcx, _brand
      call [lstrlen]
      and qword [rsp+8*4], 0 ; lpReserved = NULL (5th argument)
      mov r9, _written
      mov r8, rax
      mov rdx, _brand
      mov rcx, [ConsoleHandle]
      call [WriteConsole]
      inc ebx
      cmp ebx, 10
      jb write_op1
   call newline      
      
; Write out times for 2 instruction loop
   and qword [rsp+8*4], 0 ; lpReserved = NULL (5th argument)
   mov r9, _written
   mov r8, op2_size
   mov rdx, _op2
   mov rcx, [ConsoleHandle]
   call [WriteConsole]
   write_op2:
      mov rcx, _brand
      mov rdx, _decimal_fmt
      mov r8d, dword [_times+8*rbx]
      call [wsprintf]
      mov rcx, _brand
      call [lstrlen]
      and qword [rsp+8*4], 0 ; lpReserved = NULL (5th argument)
      mov r9, _written
      mov r8, rax
      mov rdx, _brand
      mov rcx, [ConsoleHandle]
      call [WriteConsole]
      inc ebx
      cmp ebx, 15
      jb write_op2
   call newline      

   xor ecx, ecx ; uExitCode = 0
   call [ExitProcess]

; Output newline
newline:

   sub rsp, 8*5
   and qword [rsp+8*4], 0 ; lpReserved = NULL (5th argument)
   mov r9, _written
   mov r8, crlf_size
   mov rdx, _crlf
   mov rcx, [ConsoleHandle]
   call [WriteConsole]
   add rsp, 8*5
   ret
   
section '.data' data readable writeable

   ConsoleHandle rq 1
   _written rq 1
   _times rq 15
   _brand rd 12
   _fms rd 1
   _logo db 'Processor: '
   logo_size = $-_logo
   _crlf db 13,10
   crlf_size = $-_crlf
   _family_fmt db 'Family: %X',10,0
   _extended_family_fmt db 'Extended Family: %X',10,0
   _model_fmt db 'Model: %X',10,0
   _extended_model_fmt db 'Extended Model: %X',10,0
   _stepping_fmt db 'Stepping: %X',10,0
   _empty db 'Empty:'
   empty_size = $-_empty
   _op1 db '1 operation:'
   op1_size = $-_op1
   _op2 db '2 operations:'
   op2_size = $-_op2
   _decimal_fmt db ' %ld',0

section '.idata' import data readable writeable

   dd 0,0,0,RVA kernel_name,RVA kernel_table
   dd 0,0,0,RVA user_name,RVA user_table
   dd 0,0,0,0,0 ; null descriptor terminates the import directory table

   kernel_table:
      GetStdHandle dq RVA _GetStdHandle
      WriteConsole dq RVA _WriteConsole
      lstrlen dq RVA _lstrlen
      ExitProcess dq RVA _ExitProcess
      dq 0

   user_table:
      wsprintf dq RVA _wsprintf
      dq 0

   kernel_name db 'kernel32.dll',0
   user_name db 'user32.dll',0

   _GetStdHandle dw 0
      db 'GetStdHandle', 0
   _WriteConsole dw 0
      db 'WriteConsoleA', 0
   _lstrlen dw 0
      db 'lstrlenA', 0
   _wsprintf dw 0
      db 'wsprintfA', 0
   _ExitProcess dw 0
      db 'ExitProcess', 0
    

The results on my Atom processor:
Code:
D:\FASM\code\atom_test>atom_test
Processor:       Intel(R) Atom(TM) x5-Z8330  CPU @ 1.44GHz
Family: 6
Extended Family: 0
Model: C
Extended Model: 4
Stepping: 6
Empty: 36 36 36 18 36
1 operation: 111028680 98019792 98329248 97719498 97695216
2 operations: 82669392 83004282 82680390 82665396 82668150
    
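
Dividing one representative reading from each row by the 10,000,000 iterations gives a rough per-iteration cost. This is a back-of-the-envelope figure in time-stamp counter ticks, which need not equal core clock cycles:

```latex
\frac{97\,695\,216}{10^{7}} \approx 9.8 \ \text{ticks per iteration (1 operation)},
\qquad
\frac{82\,665\,396}{10^{7}} \approx 8.3 \ \text{ticks per iteration (2 operations)}
```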
system error 13 Jan 2017, 11:57
xorpd

If you replace LOOP with SUB/JNZ, the result will be the opposite. So the real culprit behind most slowdowns is the complex and 'wrapper' instructions - not SSE instructions as I initially thought. They always give people a false impression of 'cleaner and shorter' code without considering the real history behind such ABBA-era, Vietnam War mnemonics.

But the worst kind are the ASM programmers from that era who force people to stick to 'their' old-school 'conventions' on modern CPUs, with complete disregard for how those old instructions actually perform on 'our' CPUs. That's where you get most of the negative responses every time you want to talk about optimization on this board.
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8351
Location: Kraków, Poland
Tomasz Grysztar 13 Jan 2017, 13:26
system error wrote:
2. Use the registers the way they were originally designed to be used. For example, you should use ECX/RCX for the loop count instead of other registers like R15 or EDI. This yields faster results. Try it :D
When you use instructions like LOOP you are naturally forced to adhere to such standards. And when used in the right context they may add to the elegance of code. But that also depends on the overall methods of structuring the program that you use. I know this may be a highly subjective matter (compare the classic old discussions about xor reg,reg on this board).

My personal method is to optimize individual instructions for speed only in the critical sections of code, and only when it gives an actual gain on the machines where the code is going to be used. Otherwise, I go mostly for the size optimization. Changing instruction to a longer but faster equivalent in a non-critical routine usually gives no noticeable speed gain, while using shorter code may often add up to a noticeable effect overall and is a more universal advantage.

For example, you may find LOOP used in a few places in fasmg (even more in old fasm), usually where I considered it the aesthetically pleasing choice. Replacing them with alternatives yields no perceivable difference in execution time (at least not on the machines I've been working with). On the other hand, the shorter length of code adds up to measurable differences, sometimes to an important effect (like when I strove to keep fasm working with the 64k code segment limit of unreal mode).
Xorpd! 13 Jan 2017, 16:15
@system error, I was hoping you would just try running my code and post the results so those following the thread could at last determine which processor you have and see the performance in processor cycles.
system error 13 Jan 2017, 17:58
xorpd, I've given you the suggestion to run the code for the other case (using SUB/JNZ in place of LOOP). Since you are already at it, then show us the result for both cases. What's 'wrong' with you?

Your code is a bit off for my liking. I mean, what kind of hippopotamus would perform 2 COSTLY non-critical memory write instructions between 2 RDTSCs? Don't you think they will add even more CPU cycles to the result than the ones we are actually interested in measuring? What's wrong with keeping the values in two registers? Assigning rcx before RDTSC is even smarter, my son.
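
The suggestion above (keep the start time stamp in a register and set up the counter before the first RDTSC) could look roughly like this. It is a sketch under assumptions, not code posted in the thread: the choice of R8, the label name, and the empty SUB/JNZ loop are all illustrative.

```asm
use64
; Assumed sketch: time a loop without touching memory between the two RDTSCs.

   mov rcx, 10000000       ; counter set up outside the timed region
   rdtsc                   ; EDX:EAX = start time stamp (upper halves of RAX/RDX cleared)
   shl rdx, 32
   or  rax, rdx
   mov r8, rax             ; R8 = 64-bit start time stamp
spin:
   ; code under test goes here
   sub rcx, 1
   jnz spin
   rdtsc                   ; EDX:EAX = end time stamp
   shl rdx, 32
   or  rax, rdx
   sub rax, r8             ; RAX = elapsed time-stamp ticks
```

RDTSC is not a serializing instruction, so out-of-order execution can still blur very short measurements; over millions of iterations the effect is small.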
system error 13 Jan 2017, 18:07
Tomasz Grysztar wrote:
system error wrote:
2. Use the registers the way they were originally designed to be used. For example, you should use ECX/RCX for the loop count instead of other registers like R15 or EDI. This yields faster results. Try it :D
When you use instructions like LOOP you are naturally forced to adhere to such standards. And when used in the right context they may add to the elegance of code. But that also depends on the overall methods of structuring the program that you use. I know this may be a highly subjective matter (compare the classic old discussions about xor reg,reg on this board).

My personal method is to optimize individual instructions for speed only in the critical sections of code, and only when it gives an actual gain on the machines where the code is going to be used. Otherwise, I go mostly for the size optimization. Changing instruction to a longer but faster equivalent in a non-critical routine usually gives no noticeable speed gain, while using shorter code may often add up to a noticeable effect overall and is a more universal advantage.

For example, you may find LOOP used in a few places in fasmg (even more in old fasm), usually where I considered it the aesthetically pleasing choice. Replacing them with alternatives yields no perceivable difference in execution time (at least not on the machines I've been working with). On the other hand, the shorter length of code adds up to measurable differences, sometimes to an important effect (like when I strove to keep fasm working with the 64k code segment limit of unreal mode).


Well, since the rest of us here are non-compiler writers, I am quite sure this discussion is quite irrelevant to FASM internal design.
Tomasz Grysztar 13 Jan 2017, 18:57
system error wrote:
Well, since the rest of us here are non-compiler writers, I am quite sure this discussion is quite irrelevant to FASM internal design.
Did you want to discuss some other specific application then? Or would you consider compiler writing (even though it was just an example in my post) different from any other programming altogether?
system error 13 Jan 2017, 19:32
TG, I don't see any relevance of this topic to what you're doing in FASM's internals. This is application-level optimization. The FASM compiler basically translates user code into its machine counterpart. If the programmer chose to use slower instructions, the compiler translates them exactly as written - can't blame the assembler. Do you have something else in mind?
Trinitek



Joined: 06 Nov 2011
Posts: 257
Trinitek 13 Jan 2017, 19:53
system error wrote:
TG, I don't see any relevance of this topic to what you're doing in FASM's internals. This is application-level optimization. The FASM compiler basically translates user code into its machine counterpart. If the programmer chose to use slower instructions, the compiler translates them exactly as written - can't blame the assembler. Do you have something else in mind?
Surely preprocessing and macro parsing have something relevant to the topic.
Tomasz Grysztar 13 Jan 2017, 19:54
system error wrote:
TG, I don't see any relevance of this topic to what you're doing in FASM's internals. This is application-level optimization. The FASM compiler basically translates user code into its machine counterpart. If the programmer chose to use slower instructions, the compiler translates them exactly as written - can't blame the assembler. Do you have something else in mind?
You may have missed the fact that fasm is itself written in assembly. ;)
system error 13 Jan 2017, 20:30
Trinitek wrote:
system error wrote:
TG, I don't see any relevance of this topic to what you're doing in FASM's internals. This is application-level optimization. The FASM compiler basically translates user code into its machine counterpart. If the programmer chose to use slower instructions, the compiler translates them exactly as written - can't blame the assembler. Do you have something else in mind?
Surely preprocessing and macro parsing have something relevant to the topic.


If I give you one .obj file to be linked by a linker, how can you tell that it was compiled with FASM and not with NASM? See what I mean by relevance here?
system error 13 Jan 2017, 20:37
Tomasz Grysztar wrote:
system error wrote:
TG, I don't see any relevance of this topic to what you're doing in FASM's internals. This is application-level optimization. The FASM compiler basically translates user code into its machine counterpart. If the programmer chose to use slower instructions, the compiler translates them exactly as written - can't blame the assembler. Do you have something else in mind?
You may have missed the fact that fasm is itself written in assembly. ;)


Well, of course. That's why I am a big fan of FASM :D

Here's the thing, Tomasz - the fact that you've been keeping yourself away from optimization discussions all this time gave me the false impression that you were not particularly interested in this type of bloodbath. So when you stepped in, it gave me the creeps and was quite shocking, tbh. Well, that's quite natural, I think, considering your position. So my general idea is to keep you away! hahaha :D
Trinitek 13 Jan 2017, 21:00
system error wrote:
Trinitek wrote:
system error wrote:
TG, I don't see any relevance of this topic to what you're doing in FASM's internals. This is application-level optimization. The FASM compiler basically translates user code into its machine counterpart. If the programmer chose to use slower instructions, the compiler translates them exactly as written - can't blame the assembler. Do you have something else in mind?
Surely preprocessing and macro parsing have something relevant to the topic.
If I give you one .obj file to be linked by a linker, how can you tell that it was compiled with FASM and not with NASM? See what I mean by relevance here?
No, I don't follow. In this thread about performance, concerns about compilation speed (i.e. a compiler's performance) in an assembler are irrelevant because it's "different" from other programs?
system error 13 Jan 2017, 23:18
Trinitek

The discussions on performance are the same across different compilers. The talking points we are discussing here also apply to the MASM, NASM, and POASM boards, because once the sources have been turned into binaries, they all look the same to the CPU/linker regardless of the compiler/assembler used to produce them. At that point, the assembler/compiler used is irrelevant.

If you want to talk about compiler optimization (time, size), you go to Compiler Internals as I already did here.
fragment



Joined: 11 Jan 2017
Posts: 3
Location: Berlin
fragment 14 Jan 2017, 14:59
AsmGuru62 wrote:
Welcome to the forum

thanx ...
so you use 'Loop' as representation for other higher level instructions to optimize your approach (easier coding, more readable source etc.) and 'sub reg,1/jcc' constructions as representation for instructions which lack a bit on this aspect but optimize the code ...

let's look at the result: most of your code is unoptimized/suboptimal/less optimal (the one you considered as less relevant) except for a few fragments which you considered as relevant ...

if I asked you now about this:

a) easier coding but suboptimal code
b) suboptimal coding but optimal code
c) mixing a) and b) and changing what you prioritize
d) generally as easy as possible coding AND generally the best possible code

what would you choose now?

Quote:
I must add also that your figure of ~100% is probably not correct.
I measured once my code where I used LOOP vs SUB/JNZ and I came up with ~15% slowdown.

I did not say/mean that the LOOP instruction is ~100% slower on all computers - which would not fit my statement about my i7 or my general conclusion anyway. But in case I was a bit unclear: this ~100% was about my laptop etc.
Tomasz Grysztar 14 Jan 2017, 15:55
fragment wrote:
if I asked you now about this:

a) easier coding but suboptimal code
b) suboptimal coding but optimal code
c) mixing a) and b) and changing what you prioritize
d) generally as easy as possible coding AND generally the best possible code

what would you choose now?
For me it's never about "easy" coding. When I want my code to "look good" (in my own perception, at least) it is often hard work to get it right - you could call it another kind of optimization, in a different aspect than the speed of individual code blocks. However, these two often correlate, because the code that I deem pretty is usually the simplest and most well-structured, and this often results in programs that execute well overall.

And even when the speed is your only priority, just like there may exist many possible solutions for code resolving and fasm arrives at just one, there may exist many different variants of code that would be just as "optimal" (that is: just as fast). For instance it might happen that when you replace all the complex instructions in a block of code with their faster equivalents, you make the entire block longer just as much as to make it not fit well in cache - and it may turn out that both variants execute equally fast (though for completely different reasons). As revolution likes to point out, this is all very context dependent and any rule of thumb may be completely wrong in some cases.

So my approach is: write good, well-structured code and this should pay off all by itself. When the speed of a particular bunch of instructions has a negligible effect on the overall execution, choose the variant that fits the structure of your program well and is the most elegant (this is subjective, of course - use your own style). The "hard" instruction-level optimization should come last and only where and when it is really required.
AsmGuru62



Joined: 28 Jan 2004
Posts: 1619
Location: Toronto, Canada
AsmGuru62 14 Jan 2017, 16:08
I am not sure I understand your question on a,b,c,d.
I write code currently using FASM and I will do the following:

1. If any of my loops or functions are dealing with a lot of data - I will optimize as per the Intel manuals: no complex instructions (LOOP is out), using mostly registers, aligning labels, etc.

2. If any of my loops or functions are NOT dealing with a lot of data, then I will prefer the smaller (in bytes) code. I will use LOOPs - yes.

However, before coding I will design the proper algorithm.
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
