flat assembler
Message board for the users of flat assembler.
String length
HaHaAnonymous 05 Nov 2013, 00:28
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 19:14; edited 1 time in total
HaHaAnonymous 05 Nov 2013, 01:32
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 19:13; edited 1 time in total
ejamesr 05 Nov 2013, 18:22
revolution asked for timing measurements; the differences are slight. Is it worth it? Probably not. But... I've found that the gain from using a negative counter in an inner loop is much greater in smaller loops, where eliminating a single instruction accounts for a larger percentage of the total time.
Here's what I measured when testing 100*1000*1000 reps, five times:

Code:
This was based on my Core2 Duo 2.66 GHz HP HDX-16t laptop. Timings based on
100 million reps, five sets; compared performance with different numbers of
white-space chars in front of the token; avg CPU clocks displayed:

Proc            0 w/s chars   4 w/s chars   8 w/s chars   12 w/s chars
----            -----------   -----------   -----------   ------------
FindToken        431af95e      842e51f2      bf0e6ce8      1:05d1ecf8
FindToken_rev    4325fefa      72642f88      bc81d24c        ff5d254a

When you look at my code, I reversed the return values used to denote 'no token' and 'too big' in order to take advantage of what was happening at the register level. Here's my test code:

Code:
format PE CONSOLE 4.0
stack 4000h, 4000h
entry Start

; Constants to modify
NUMREPS       = 1000*1000*100
NUMSETS       = 5
NUMWHITESPACE = 12              ; test values 0, 4, 8, 12
MAXTOKENLEN   = 20
TESTPROC equ FindToken
TESTPROC equ FindToken_rev      ; comment out to test 'FindToken'

include 'Win32A.Inc'
include 'Macro\If.Inc'

section '.data' data readable writeable

Divisor    dd NUMSETS
TStart     dq ?
TCum       dq 0
tempToken: times NUMWHITESPACE db ' '
           db 'Token', 0

align 16
proc FindToken Offset,MaxLen    ; original by Sasha
        mov     eax,[Offset]
        mov     ecx,[MaxLen]
        add     ecx,eax
        xor     edx,edx
        dec     eax
  .loop:
        inc     eax
        cmp     eax,ecx
        ja      .finish_too_big
        mov     dl,byte[eax]
        cmp     edx,020h
        je      .loop
        cmp     edx,09h
        je      .loop
        or      edx,edx
        je      .finish_no_token
  .finish:                      ; Returns an offset in eax
        ret
  .finish_too_big:
        or      eax,-1
        ret
  .finish_no_token:
        xor     eax,eax
        ret
endp

align 16
proc FindToken_rev Offset,MaxLen ; revised by Eric
        mov     ecx,[Offset]
        mov     eax,[MaxLen]    ; use eax as counter
        add     ecx,eax         ; point to end of data
        neg     eax             ; eax is now a negative counter
        dec     eax
  .loop:
        inc     eax             ; point to next byte AND test for MaxLen
        jz      .finish_too_big ; when eax becomes positive, the end was reached
        mov     dl,byte[eax+ecx]
        cmp     dl,020h         ; using 'edx' instead of 'dl' incurs 'partial-register-stall' penalty
        je      .loop
        cmp     dl,09h
        je      .loop
        test    dl,dl           ; in some cases this is faster than 'or dl,dl'
        jz      .finish_no_token
  .finish:                      ; Returns an offset in eax
        lea     eax,[eax+ecx]   ; quickly compute the offset
        ret
  .finish_too_big:              ; at this point, eax is 0... if you use 0 to
                                ; indicate "too big" you can save another instruction here
        ret
  .finish_no_token:
        or      eax,-1          ; use -1 to indicate no token
        ret
endp

Start:
        xor     eax,eax
        mov     dword [TCum],eax
        mov     dword [TCum+4],eax
        rdtsc
        mov     dword [TStart],eax
        mov     dword [TStart+4],edx
        mov     edi,NUMSETS
  .newSet:
        mov     esi,NUMREPS
  .newRep:
        stdcall TESTPROC, tempToken, MAXTOKENLEN
        dec     esi             ; do next rep
        jnz     .newRep

        ; All reps done, record time
        rdtsc
        push    eax
        push    edx

        ; Get elapsed time
        sub     eax,dword [TStart]
        sbb     edx,dword [TStart+4]

        ; Accumulate time
        add     dword [TCum],eax
        adc     dword [TCum+4],edx

        ; reset start time...
        pop     edx
        pop     eax
        mov     dword [TStart],eax
        mov     dword [TStart+4],edx

        ; See if another set
        dec     edi
        jnz     .newSet

        ; Finished, calculate average time
        mov     eax,dword [TCum+4]
        xor     edx,edx
        div     [Divisor]
        push    eax             ; save quotient, combine after next step
        mov     eax,dword [TCum]
        div     [Divisor]
        add     edx,[esp]
        add     esp,4           ; clean up stack
        int3

        invoke  ExitProcess, 0

section '.idata' import data readable writeable
  library kernel32,'KERNEL32.DLL',user32,'USER32.DLL',gdi32,'GDI32.DLL'
  include 'API\Kernel32.Inc'
  include 'API\User32.Inc'
  include 'API\Gdi32.Inc'

section '.edata' export data readable
  export 'Program.exe',\
         Start,'Start',\
         Start.newSet,'Start_newSet',\
         Start.newRep,'Start_newRep',\
         FindToken,'FindToken',\
         FindToken_rev,'FindToken_rev',\
         Divisor,'Divisor',\
         TStart,'TStart',\
         TCum,'TCum'

section '.reloc' fixups data readable discardable

Gotta get back to work now.
Eric
Sasha 05 Nov 2013, 18:55
I've made all the changes.
Code:
proc StrzLen4 Strz,MaxLen
        mov     eax,[MaxLen]
        mov     ecx,[Strz]
        add     ecx,eax
        neg     eax
        dec     eax
        jmp     @f
  .loop:
        cmp     byte[ecx+eax],0
        jz      .finish_ok
  @@:
        inc     eax
        js      .loop
  .too_big:
        or      eax,-1
        ret
  .finish_ok:
        add     eax,[MaxLen]    ; Fixed. Using eax as a counter now.
        ret
endp

I thought about benchmarking, too. But I need another proc to compare with, to get relative values. And I need some real strings to parse, otherwise it is a spherical cow. :)

Last edited by Sasha on 05 Nov 2013, 20:15; edited 2 times in total
ejamesr 05 Nov 2013, 19:40
Looks good, but the 'lea' line should probably be something like 'add ecx, [MaxLen]'.
If speed is important to you, benchmarking can be your best friend. But... benchmarking itself can produce inconsistent results, because it depends on what is sitting in the data and instruction caches, and other processes can 'contaminate' the cache state your program has built up during normal execution.

I see your code jumps around .loop. That makes another little improvement available that can sometimes speed things up: add 'align 16' just in front of .loop. Since there's a jump around it, the 'nop' padding never gets executed (a quick sketch is below). Or, a more sophisticated way would be to align the start of the procedure so that .loop lands exactly on a 16-byte boundary. This helps ensure that the instructions of the small loop fit in as few cache lines as possible, so they run as fast as they can.

Back to work for me now...
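P.S. A minimal sketch of the 'align 16' placement (untested; 'StrzLen4a' is just a copy of Sasha's StrzLen4 with the alignment line added, for illustration only):

Code:
proc StrzLen4a Strz,MaxLen      ; hypothetical copy of StrzLen4
        mov     eax,[MaxLen]
        mov     ecx,[Strz]
        add     ecx,eax
        neg     eax
        dec     eax
        jmp     @f              ; this jump skips over the alignment padding
align 16                        ; nop padding inserted here is never executed
  .loop:
        cmp     byte[ecx+eax],0
        jz      .finish_ok
  @@:
        inc     eax
        js      .loop
  .too_big:
        or      eax,-1
        ret
  .finish_ok:
        add     eax,[MaxLen]
        ret
endp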
Sasha 05 Nov 2013, 22:27
The benchmark showed that this code is only very slightly faster than the previous one.
Code:
proc StrzLen5 Strz,MaxLen
        mov     eax,[MaxLen]
        mov     ecx,[Strz]
        add     ecx,eax
        neg     eax
        dec     eax
  .loop:
        inc     eax
        jns     .too_big
        cmp     byte[ecx+eax],0
        jnz     .loop
  .finish_ok:
        add     eax,[MaxLen]
        ret
  .too_big:
        or      eax,-1
        ret
endp

I just removed the jmp. So even such small differences can be seen.
Sasha 05 Nov 2013, 22:57
This is my benchmaker)
ejamesr 05 Nov 2013, 23:18
Sasha,
I'm glad you are willing to test these things... it makes you better and your code faster.

Here's another thing to test. On some computers, a command like:

Code:
        cmp     byte[ecx+eax],0

can be slower than loading the byte into a register first and doing the compare there:

Code:
        mov     dl, [ecx+eax]
        test    dl, dl          ; or 'cmp reg, immValue'

(A sketch of how this might look in StrzLen5 is at the end of this post.) I always use 'test' rather than 'or' because I suspect it's slightly faster, but I've never tested it.

HOWEVER, everything changes; what is true today may change tomorrow as technology advances. It's good to retest things periodically to see whether new CPU architectures have changed anything. As you've seen, sometimes very small differences can be measured, and they can add up to a noticeable difference. And there is almost ALWAYS a way to make things still faster.

Happy programming!
Eric
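Here is that sketch (untested; the name 'StrzLen6' and the exact arrangement are only for illustration, and like the earlier code it clobbers dl). Only a benchmark on your own CPU will tell whether it beats the memory-operand 'cmp' version:

Code:
proc StrzLen6 Strz,MaxLen
        mov     eax,[MaxLen]
        mov     ecx,[Strz]
        add     ecx,eax
        neg     eax
        dec     eax
  .loop:
        inc     eax
        jns     .too_big
        mov     dl,[ecx+eax]    ; load the byte into a register first...
        test    dl,dl           ; ...then do the zero test on the register
        jnz     .loop
  .finish_ok:
        add     eax,[MaxLen]
        ret
  .too_big:
        or      eax,-1
        ret
endp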
Sasha 06 Nov 2013, 00:14
"test dl,dl" and "test edx,edx" are almost 3x faster than "or dl,dl" and "or edx,edx"!!!
I used this approach:

Code:
        ;Put your code here. Do not use ebx
        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx

I think the explanation is that in this case "test" doesn't have to wait for the previous "test" to finish, so the instructions can be paired, while "or" does have to wait (probably because "or edx,edx" also writes edx, so each one depends on the previous write, whereas "test" only reads edx and writes the flags). But such code will never appear in the real world, so yes, only real code should be benchmarked.
revolution 06 Nov 2013, 13:39
It is good to see some proper testing going on here. Now you will know whether your changes make any significant difference or not.
But I do encourage you to do a proper utility analysis. ejamesr asked above, "Is it worth it?", and that question is very important and should not be dismissed lightly. Is it worth spending two weeks optimising your code for speed (and testing that speed on only one CPU), to save 1 μs per execution, only to find that over the lifetime of the program (say five years and 1 billion executions) you save just 1000 seconds in total (less than one hour)?
HaHaAnonymous 06 Nov 2013, 15:43
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 19:13; edited 2 times in total
ejamesr 06 Nov 2013, 18:13
Good points from revolution.
Optimization may make little sense if the majority of your program's execution is spent in a wait loop waiting for user input (can you even build a "faster" wait loop??). And it may make little sense to try to speed up a core inner loop if the data being accessed is not in the L2 cache, where you may end up waiting hundreds of nanoseconds for the data to become accessible (try accessing elements of a 500 MB array randomly, compared to accessing a similar 500 KB array in the same way -- I've seen the larger array come out about 10x to 100x slower, and even worse).

But... having a sense of a "need for speed" can help motivate you to think up better algorithms that provide the serious performance leaps you want. It doesn't hurt to improve your understanding of what actually happens inside the different flavors of CPUs, and it's good to keep learning faster/better ways to do things. Making it work properly is most important... but still not good enough, for me. It's probably a genetic defect that causes me to keep wanting to make things faster and faster.

The reality, however, is that you have to find the proper balance. For example, I put off learning how to use the FASM 'match' directive until a couple of weeks ago. Now, with a far better (but still imperfect) understanding, I've created several helper macros that are already saving me time and maintenance headaches when programming. (And I'll either be able to get stuff out the door faster, OR have more time to optimize the critical areas.) Getting projects "out the door quickly" is sometimes more important than having them execute another 4.67% faster; sometimes "getting it done NOW" is where the real need for speed is.

But it's always fun to try shaving off a little more execution time; and sometimes that's when the real blockbuster innovations come.
Sasha 18 Jun 2014, 18:26
JohnFound wrote:
If you want to optimize for speed you definitely should not compare byte by byte. It is the slowest possible solution. Below is the version of the StrLen procedure from FreshLib. As you can see, it reads and compares the string by double words.

I decided to come back to this question, and finally this is what I've got:

Code:
align 32
proc strlen uses ebx esi edi, str
        mov     eax,[str]
        mov     ebx,-01010101h
  .aligning:
        test    eax,3
        jz      .scan
        mov     dl,[eax]
        test    dl,dl
        jz      .found
        inc     eax
        jmp     .aligning

align 32
  .scan:
        mov     esi,[eax]
        mov     edi,[eax+4]
        lea     eax,[eax+8]
        lea     ecx,[esi+ebx]   ;!
        lea     edx,[edi+ebx]
        not     esi
        not     edi
        and     ecx,esi
        and     edx,edi
        and     ecx,$80808080
        and     edx,$80808080
        test    ecx,ecx         ;!!
        jnz     .sub8
        test    edx,edx
        jz      .scan

        ; byte 0 was found: so search by bytes.
        lea     eax,[eax-4]
        mov     ecx,edx
        jmp     .bytesearch
  .sub8:
        lea     eax,[eax-8]
  .bytesearch:                  ;!!!
        test    cl,cl
        jnz     .found
        inc     eax
        test    ch,ch
        jnz     .found
        shr     ecx,16
        inc     eax
        test    cl,cl
        jnz     .found
        inc     eax
  .found:
        sub     eax,[str]
        ret
endp
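The heart of the scan is the well-known zero-byte test for a dword: (v - 01010101h) and (not v) and 80808080h is nonzero exactly when the dword v contains a zero byte, and its lowest set bit sits in the first zero byte. A minimal commented sketch of just that test (my restatement for readers, not part of the proc above):

Code:
        ; esi = the dword just loaded from the string, ebx = -01010101h
        lea     ecx,[esi+ebx]   ; ecx = esi - 01010101h: every zero byte produces a borrow
        not     esi             ; bit 7 of each byte of (not esi) is set only if that byte was below 80h
        and     ecx,esi         ; keep the borrowed bit 7s only where the byte was below 80h
        and     ecx,$80808080   ; isolate bit 7 of every byte
        ; ecx <> 0  <=>  the original dword contained a zero byte;
        ; the lowest set bit of ecx marks the first zero byte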
JohnFound 19 Jun 2014, 06:34
Sasha, did you make some benchmarks of this code?
r22 19 Jun 2014, 11:22
This topic has been discussed in the past if you use the forum's search. http://board.flatassembler.net/topic.php?t=7710
Sasha 19 Jun 2014, 11:27
Yes, I made one after every change, to know whether I'm going in the right direction. But this is only for my computer.
Sasha 19 Jun 2014, 11:42
r22 wrote:
This topic has been discussed in the past if you use the forum's search. http://board.flatassembler.net/topic.php?t=7710

I saw it, but then I focused on optimizing FreshLib's routine. I'll review that thread for other ideas.
JohnFound 19 Jun 2014, 11:52
Sasha wrote:
Yes, I made one after every change, to know whether I'm going in the right direction. But this is only for my computer.

Hm, the result is interesting. The performance gain is small, but the graph for the optimized code is much smoother than for the original. I am curious which change caused this smoothness?

_________________
Tox ID: 48C0321ADDB2FE5F644BB5E3D58B0D58C35E5BCBC81D7CD333633FEDF1047914A534256478D9
Sasha 19 Jun 2014, 12:12
The smoothness is what I really wanted to gain. The worst thing was that "sub eax,9" followed by scanning again.
I tried to extract the information about where the zero byte is without scanning again. I saw that all those 8 bytes are already in registers.
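One possible way to go even further (an untested idea, not something benchmarked in this thread) would be to skip the byte-by-byte .bytesearch and let bsf read the answer straight out of the mask:

Code:
        ; ecx = the $80808080-style mask from the scan loop,
        ; eax = address of the dword that contained the zero byte
        bsf     edx,ecx         ; index of the lowest set bit: 7, 15, 23 or 31
        shr     edx,3           ; divide by 8 to get the byte index 0..3
        add     eax,edx         ; eax now points at the terminating zero byte
        sub     eax,[str]       ; length, exactly as in .found
        ret

Whether this beats the unrolled tests depends on the CPU (bsf has historically been slow on some microarchitectures), so it would need the same kind of benchmarking.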