flat assembler
Message board for the users of flat assembler.
![]() |
Author |
|
kohlrak 10 Oct 2009, 20:25
Benchmarks are really, really misleading. For this reason i only use rdtsc anymore and run a ton of runs, but that's not always going to be accurate for that program. The playing field significantly changes for size of program as well (due to caching).
I must point out, though, that you have 3 branches in your code (and branch prediction sucks more than you'd think), and you are using the string functions. The reason cmps is so slow because all the string functions have huge microcode and for some reason, it doesn't seem to get cached right. I've seen other similar phenomena (especially with smaller looping routines) that has led me to believe people when they say, the x86 is optimized for bloat. edit: a submission: no i didn't test these, but if they work, tell me how well. Code: mov esi, szOne mov edi, szTwo and ebx, 0 and eax, 0 @@: mov dl, [esi+ebx] cmp dl, [edi+ebx] jne @f inc ebx or dl, dl jnz @b @@: ret Code: mov esi, szOne mov edi, szTwo and eax, 0 @@: inc esi inc edi mov dl, [esi] cmp dl, [edi] jne @f or dl, dl jnz @b @@: ret I have this unfortunate feeling that there's a bug there somewhere... |
|||
![]() |
|
bitshifter 10 Oct 2009, 21:16
Conditional moves would speed it up a bit.
Agner fog has a pretty fast set of string procs... http://www.agner.org/optimize/#asmlib Get the Subroutine Library and check 'em out. |
|||
![]() |
|
LocoDelAssembly 10 Oct 2009, 21:34
Well, I couldn't found much in the forum about this (perhaps it was not discussed much actually?). There is more info about strlen but that if of little help.
http://board.flatassembler.net/topic.php?t=2633 |
|||
![]() |
|
pal 10 Oct 2009, 23:32
kohlrak: Good point about the benchmarks. I will whip up a quick rdtsc code tomorrow. Do you just subtract the final value from the start value to get the number of ticks elapsed in the time? Or do you then convert it to seconds? If so surely use the high precision timers that Windows offers?
I also tested your codes. First code - 9.934704 seconds. Second code - 10.894187 seconds. I edited them slightly so that they are stdcall in a procedure. bitshifter: Yeah, I think I have seen them before, cheers. I actually thought of him when I came to think of testing methods as he made some codes to test codes if I recall correctly. Loco: I started this thread as there was like 3 threads in the search when you do "stcmp". I am gonna have a go at some SSE ones tomorrow. There are some instruction mnemonics for it, but the instructions themselves are only available for i7 processors (SSE4.2 I believe), and I don't have SSE4. Madis linked them on the forum, but I cant remember the link. |
|||
![]() |
|
kohlrak 11 Oct 2009, 00:30
Quote: kohlrak: Good point about the benchmarks. I will whip up a quick rdtsc code tomorrow. Do you just subtract the final value from the start value to get the number of ticks elapsed in the time? Or do you then convert it to seconds? If so surely use the high precision timers that Windows offers? I honestly don't feel that any benchmark can truely be accurate. They all vary in speed, and actually rdtsc takes whole miliseconds by itself just to work. I use rdtsc because it's usable on most x86 machines out there today and i don't have to rely on windows to use it. Truth is, most of the advice given years ago on how to optimize the CPU is still valid today, with a few additions. For instance, our "machine code" sort of like x86's equivalent to java's bytecode. This makes compatability easier, but at the cost of some things. Mainly, more commonly used instructions are often faster than the more complex instructions which are specialized for the task (CPU microcode assignment is based on freaquency). It's a real shame, because, IMO, those string operations can save space in the long run (you could make print macro that take up less space than calling a particular function that does the same thing). The only way for specialized instructions to succeed in x86 is if they are improvements that outweigh the disadvantages of the worst microcode they can be assigned. SSE is an example of something that works, mainly because it gives an overall algorithmic advantage, but even still, if you compare the advantages, it's not as big of an advantage as one would actually expect (though still significant, believe me). Theoretically, the improvement could reach 5 times (i'm stretching to 5 times because you don't have to move things to a specific register before pushing onto or popping off a register) the speed of fpu equivalents, though the reality is different because it would require some really expensive hardware and really intellegent math to do that. Quote: I also tested your codes. Actually, i expected a little more from the first code, but i kinda figured the second would be that much slower than the first. This shows you that the extra instruction to do increment slows down more than the actual addition itself (decoding process and cache limitations). Almost a whole second gained... Don't forget, though, that the first code doesn't mean an overall faster program. Remember that the second code gives you the benefit of an extra register not being messed with. If you're coding in an environment where you feel that you can rely on that procedure to never change (like making your own kernel or always using that function because you yourself maintain it), you can deffinately benefit from the second function. Depending on how often you use that function and the benefits of having an extra register available, it may or may not be more beneficial to use one over the other. Try to keep this in mind when trying to find the "fastest way" to do anything. |
|||
![]() |
|
pal 11 Oct 2009, 11:24
I never realised that the string instructions are actually slower in cases as such. Also the problem with using the string instructions is that you have to find the length of the string first. If you don't a code like so:
Code: mov esi,[szOne] mov edi,[szTwo] xor ecx,ecx not ecx repe cmpsb jne @F Wont work well as if the strings point to the same location then it will be an infinite (well not infinite) loop as all characters will be the same. This I have found is also the problem with SSE. You will have to find the length of the string, unless there is some instruction mnemonic I am not aware of. I am using pcmpeqb and pmovmskb to compare the strings, but again this could turn into an infinite loop. Any solutions to this? The only thing I can think of is checking if the next byte is 0 or not, but this means doing a byte by byte comparison, and hence there is no need for SSE. I cannot think of any way to compare more than one byte at once. |
|||
![]() |
|
kohlrak 11 Oct 2009, 13:14
I'm not sure about sse (i'm not very good with that myself), but for regular string functions, your rep won't work unless, as you said, you have your string lengths. So, take it out and make a regular loop, determined by whether or not either [edi] or [esi] is 0 for that run (doesn't matter if you compare the null terminator or not, since if the strings are equal, they should be 0... actually you should do that anyway, because if you have "hello" and "hello2," they're not the same but if you don't compare the null terminators, they'll come back as the same.)
|
|||
![]() |
|
r22 12 Oct 2009, 15:48
If your going to benchmark code
- make sure it's all in the same format/calling convention. - make sure all the variants are CORRECT in ALL CASES. ;; INVOKE str.cmp, <'test',0>, <'test',0> ;; = TRUE ;; INVOKE str.cmp, <0>, <0> ;; = TRUE ;; INVOKE str.cmp, <'a',0>, <0> ;; = FALSE ;; INVOKE str.cmp, <0>, <'a',0> ;; = FALSE ;; INVOKE str.cmp, <'testa',0>, <'testb',0> ;; = FALSE ;; INVOKE str.cmp, <'testa',0>, <'testab',0> ;; = FALSE ;; INVOKE str.cmp, <'testab',0>, <'testa',0> ;; = FALSE - define the tested input. If your testing a strings the input should be pseudo randomly selected strings of varying length. Whats the mean length of the strings you think will be used as input? If the strings are short unroll the loop and inline it, if they are long then a separate function will be necessary. Will the strings be padded to a certain alignment for trailing NULLs? Not padded <'Hello',0> vs padded to dword <'Hello',0,0,0> the latter can be optimized much better. |
|||
![]() |
|
Borsuc 12 Oct 2009, 19:35
yeah. Hold the string length as a prefix instead of a null terminator.
_________________ Previously known as The_Grey_Beast |
|||
![]() |
|
pal 12 Oct 2009, 20:42
r22 wrote: - make sure it's all in the same format/calling convention. Done. I have made them all into stdcall. I haven't tested the strcmp in MSVCRT.DLL, although I have a feeling is it quite bloated. r22 wrote: - make sure all the variants are CORRECT in ALL CASES. I didn't check that many cases. A good option I guess would be to use a PRNG to input values into it for x amount of times and check to make sure it works. r22 wrote: If your testing a strings the input should be pseudo randomly selected strings of varying length. Surely the data inputted has to be the same for each test otherwise you could have one function comparing a 9 character string whilst one compares a 4 character string. Do you do like a large amount of tests then average it out? Heh, the above you answered it in the next sentence. I should learn to read ahead. r22 wrote: If the strings are short unroll the loop and inline it, if they are long then a separate function will be necessary. Yeah, I guess two functions could be made and a macro determines which function shall be used, or something like that? r22 wrote: Will the strings be padded to a certain alignment for trailing NULLs? Not padded <'Hello',0> vs padded to dword <'Hello',0,0,0> the latter can be optimized much better. I was just doing with one null byte after, although padding to dword can be done easily I guess. What advantage does this present over not doing this? |
|||
![]() |
|
r22 12 Oct 2009, 23:35
@pal
Padding and optimization If you know that your parameters are null padded to dword (4 bytes) then you can compare your strings 4 characters at a time. Code: MOV EAX, dword[ESI + ECX * 4] CMP EAX, dword[EDI + ECX * 4] 4 times as much work gets done per iteration. Borsuc's point about storing the strings length can also allow for greater optimization. Because you don't know the lengths of your zero terminated strings you need to do a check for null in your comparison iteration, this check causes extra branching and a slow down of the algorithm. If you are reading your strings from a file or data stream (ie socket) or from user input, then storing the length before hand will help a lot when optimizing the comparison routine. Now if you just wanted the fastest null terminated unaligned generic string compare. The best you could do is unroll the comparison iteration and make sure the loop label is aligned (8 bytes for 32bit, 16 for 64bit program). |
|||
![]() |
|
revolution 12 Oct 2009, 23:42
If you want higher/lower output results then use the bswap for dword byte-string compares.
Code: mov eax,[source1] mov ecx,[source2] bswap eax bswap ecx cmp eax,ecx jz next_set_of_dwords_in_string ja ... ;above jb ... ;below |
|||
![]() |
|
r22 13 Oct 2009, 00:50
Here is my entry into Pal's optimized string.equals contest.
I optimized for the generic unaligned null terminated string use case. Code: format PE Console entry start include 'win32ax.inc' macro Pad8 { virtual align 8 a = $-$$ end virtual if a=1 db 90h end if if a=2 db 66h,90h end if if a=3 db 66h,66h,90h end if if a=4 db 66h,66h,66h,90h end if if a=5 db 66h,66h,90h,66h,90h end if if a=6 db 66h,66h,90h,66h,66h,90h end if if a=7 db 66h,66h,66h,90h,66h,66h,90h end if } BENCH_CTR = 0FFFFFFh BENCH_RPT = 5 section '.code' code readable executable start: ;;correctness stdcall str.equ, test1,test1 push eax ;T stdcall str.equ, test5,test5 push eax ;T stdcall str.equ, test1,test2 push eax ;F stdcall str.equ, test2,test1 push eax ;F stdcall str.equ, test2,test3 push eax ;F stdcall str.equ, test4,test5 push eax ;F stdcall str.equ, test5,test4 push eax ;F push fmtCorrect call [printf] add esp, 8*4 ;;benchmark mov ebx, BENCH_CTR mov ebp, BENCH_RPT .BENCH_START: call [GetTickCount] push eax Pad8 .BENCH: stdcall str.equ, test1,test1 stdcall str.equ, test5,test5 stdcall str.equ, test1,test2 stdcall str.equ, test2,test1 stdcall str.equ, test2,test3 stdcall str.equ, test4,test5 stdcall str.equ, test5,test4 ;; dec ebx jnz .BENCH call [GetTickCount] pop ecx sub eax,ecx cinvoke printf, <'str.equ: %d ms',13,10,0>,eax mov ebx,BENCH_CTR dec ebp jnz .BENCH_START cinvoke system,<'pause',0> invoke ExitProcess, 0 align 16 ;;************************* ;;Generic null terminated string EQUALS ;;************************* str.equ: ;;(s1, s2) RETURN IF s1==s2 THEN TRUE ELSE FALSE push esi edi mov esi, [esp + 12] mov edi, [esp + 16] xor ecx, ecx movzx eax, byte[esi + ecx] movzx edx, byte[edi + ecx] Pad8 .CMP: inc ecx cmp eax, edx jne .END movzx edx, byte[edi + ecx] test eax, eax jz .END movzx eax, byte[esi + ecx] inc ecx cmp eax, edx jne .END movzx edx, byte[edi + ecx] test eax, eax movzx eax, byte[esi + ecx] jnz .CMP Pad8 .END: mov ecx, 0 mov eax, 1 cmovne eax, ecx pop edi esi ret 8 section '.data' readable writeable test1 db 0 test2 db 'a',0 test3 db 'b',0 test4 db 'test',0 test5 db 'test@',0 fmtCorrect db '0=%d 0=%d 0=%d 0=%d 0=%d 1=%d 1=%d',13,10,0 section '.idata' import data readable writeable library kernel32,'KERNEL32.DLL',\ msvcrt,'MSVCRT.DLL' include 'api/kernel32.inc' import msvcrt,\ printf,'printf',\ system,'system' section '.reloc' fixups data readable discardable |
|||
![]() |
|
r22 14 Oct 2009, 14:47
I guess I win by default ... *cricket* ... *cricket*
|
|||
![]() |
|
pal 14 Oct 2009, 15:45
![]() I changed BENCH_CTR to 111111111 (1 billion/7 rounded down) and I got this output: Code: 0=0 0=0 0=0 0=0 0=0 1=1 1=1 str.equ: 6084 ms str.equ: 6053 ms str.equ: 6084 ms str.equ: 6068 ms str.equ: 6069 ms So so far I would say that your method is the fastest, but I have yet to test it the way I have tested the others... |
|||
![]() |
|
f0dder 21 Oct 2009, 07:27
Fastest string compare?
First, stop using nul-terminated strings ![]() |
|||
![]() |
|
< Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.