flat assembler
Message board for the users of flat assembler.

Index > Main > Fastest String Comparison

Author
Thread Post new topic Reply to topic
pal



Joined: 26 Aug 2008
Posts: 227
pal
I have noticed a few of you like the mini competitions on here and code optimisation, and I was playing around today. I was wondering what the fastest string comparison code that you could get was.

I did 1 billion repetitions of comparing the two strings "Hello" and "Hello" (yes they are the same), and the quickest time I got was 12.879806 seconds. That code is:

Code:
    proc    strcmp,szOne,szTwo
          xor             eax,eax
             mov             esi,[szOne]
         mov             edx,[szTwo]
         xor             ecx,ecx
     @@: lodsb
               test    al,al
               jz              @F
          cmp             byte [edx+ecx],0
            jz              strcmpEnd
           cmp             al,byte [edx+ecx]
           setne   al
          jne             strcmpEnd
           inc     ecx
         jmp             @B
  @@: cmp             byte [edx+ecx],0
            setne   al
  strcmpEnd:
              xor             eax,1
               ret
 endp
    


I'm sure this can be a lot faster. I tested the speed using this code by the way:

Code:
  push    lpFrequency
 call    [QueryPerformanceFrequency]
 test    eax,eax
     jnz             @F
  ret
 ; Execute the code
@@:   push    lpStart
     call    [QueryPerformanceCounter]
   
    mov             ebx,1000000000
      
@@:     push    szStringTwo
 push    szStringOne
 call    strcmp
      dec             ebx
 jnz             @B
  
    push    lpStop
      call    [QueryPerformanceCounter]
   fild    qword [lpStop]
      fild    qword [lpStart]
     fsubp
       fild    qword [lpFrequency]
 fdivp
       fstp    qword [lpTime]
      push    dword [lpTime+4]
    push    dword [lpTime]
      push    szTime
      call    [printf]
    add             esp,12
      ret 
    


Not sure how good that code is, it was a quick job.

I was wondering what codes you guys can come up with. I did a rep* cmps* one but it was really slow, so I wont even bother posting it.

Any methods are allowed i.e. you can use SSE. Look forward to seeing if anyone posts anything Razz

pal.

P.S. I do realise that my time may be quite slow.

P.P.S. Weird thing, but in my testing code, when I have certain API calls preset the call to QueryPerformanceFrequency messes up. E.g. I have tried it with a MessageBox and also strcmp (from MSVCRT.DLL) and the return is 0 (from the Frequency API), but when they are removed the return goes back to being 1... Strange in my opinion...
Post 10 Oct 2009, 18:58
View user's profile Send private message Reply with quote
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
kohlrak
Benchmarks are really, really misleading. For this reason i only use rdtsc anymore and run a ton of runs, but that's not always going to be accurate for that program. The playing field significantly changes for size of program as well (due to caching).

I must point out, though, that you have 3 branches in your code (and branch prediction sucks more than you'd think), and you are using the string functions. The reason cmps is so slow because all the string functions have huge microcode and for some reason, it doesn't seem to get cached right. I've seen other similar phenomena (especially with smaller looping routines) that has led me to believe people when they say, the x86 is optimized for bloat.

edit: a submission:

no i didn't test these, but if they work, tell me how well.

Code:
      mov esi, szOne
      mov edi, szTwo
      and ebx, 0
  and eax, 0
@@:   mov dl, [esi+ebx]
   cmp dl, [edi+ebx]
   jne @f
      inc ebx
     or dl, dl
   jnz @b
@@:       ret
    


Code:
  mov esi, szOne
      mov edi, szTwo
      and eax, 0
@@:   inc esi
     inc edi
     mov dl, [esi]
       cmp dl, [edi]
       jne @f
      or dl, dl
   jnz @b
@@:       ret
    


I have this unfortunate feeling that there's a bug there somewhere...
Post 10 Oct 2009, 20:25
View user's profile Send private message Visit poster's website AIM Address Yahoo Messenger MSN Messenger Reply with quote
bitshifter



Joined: 04 Dec 2007
Posts: 764
Location: Massachusetts, USA
bitshifter
Conditional moves would speed it up a bit.
Agner fog has a pretty fast set of string procs...
http://www.agner.org/optimize/#asmlib
Get the Subroutine Library and check 'em out.
Post 10 Oct 2009, 21:16
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Well, I couldn't found much in the forum about this (perhaps it was not discussed much actually?). There is more info about strlen but that if of little help.

http://board.flatassembler.net/topic.php?t=2633
Post 10 Oct 2009, 21:34
View user's profile Send private message Reply with quote
pal



Joined: 26 Aug 2008
Posts: 227
pal
kohlrak: Good point about the benchmarks. I will whip up a quick rdtsc code tomorrow. Do you just subtract the final value from the start value to get the number of ticks elapsed in the time? Or do you then convert it to seconds? If so surely use the high precision timers that Windows offers?

I also tested your codes.

First code - 9.934704 seconds.
Second code - 10.894187 seconds.

I edited them slightly so that they are stdcall in a procedure.

bitshifter: Yeah, I think I have seen them before, cheers. I actually thought of him when I came to think of testing methods as he made some codes to test codes if I recall correctly.

Loco: I started this thread as there was like 3 threads in the search when you do "stcmp".

I am gonna have a go at some SSE ones tomorrow. There are some instruction mnemonics for it, but the instructions themselves are only available for i7 processors (SSE4.2 I believe), and I don't have SSE4. Madis linked them on the forum, but I cant remember the link.
Post 10 Oct 2009, 23:32
View user's profile Send private message Reply with quote
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
kohlrak
Quote:
kohlrak: Good point about the benchmarks. I will whip up a quick rdtsc code tomorrow. Do you just subtract the final value from the start value to get the number of ticks elapsed in the time? Or do you then convert it to seconds? If so surely use the high precision timers that Windows offers?


I honestly don't feel that any benchmark can truely be accurate. They all vary in speed, and actually rdtsc takes whole miliseconds by itself just to work. I use rdtsc because it's usable on most x86 machines out there today and i don't have to rely on windows to use it. Truth is, most of the advice given years ago on how to optimize the CPU is still valid today, with a few additions. For instance, our "machine code" sort of like x86's equivalent to java's bytecode. This makes compatability easier, but at the cost of some things. Mainly, more commonly used instructions are often faster than the more complex instructions which are specialized for the task (CPU microcode assignment is based on freaquency). It's a real shame, because, IMO, those string operations can save space in the long run (you could make print macro that take up less space than calling a particular function that does the same thing). The only way for specialized instructions to succeed in x86 is if they are improvements that outweigh the disadvantages of the worst microcode they can be assigned. SSE is an example of something that works, mainly because it gives an overall algorithmic advantage, but even still, if you compare the advantages, it's not as big of an advantage as one would actually expect (though still significant, believe me). Theoretically, the improvement could reach 5 times (i'm stretching to 5 times because you don't have to move things to a specific register before pushing onto or popping off a register) the speed of fpu equivalents, though the reality is different because it would require some really expensive hardware and really intellegent math to do that.

Quote:
I also tested your codes.

First code - 9.934704 seconds.
Second code - 10.894187 seconds.


Actually, i expected a little more from the first code, but i kinda figured the second would be that much slower than the first. This shows you that the extra instruction to do increment slows down more than the actual addition itself (decoding process and cache limitations). Almost a whole second gained... Don't forget, though, that the first code doesn't mean an overall faster program. Remember that the second code gives you the benefit of an extra register not being messed with. If you're coding in an environment where you feel that you can rely on that procedure to never change (like making your own kernel or always using that function because you yourself maintain it), you can deffinately benefit from the second function. Depending on how often you use that function and the benefits of having an extra register available, it may or may not be more beneficial to use one over the other. Try to keep this in mind when trying to find the "fastest way" to do anything.
Post 11 Oct 2009, 00:30
View user's profile Send private message Visit poster's website AIM Address Yahoo Messenger MSN Messenger Reply with quote
pal



Joined: 26 Aug 2008
Posts: 227
pal
I never realised that the string instructions are actually slower in cases as such. Also the problem with using the string instructions is that you have to find the length of the string first. If you don't a code like so:

Code:
mov            esi,[szOne]
mov              edi,[szTwo]
xor              ecx,ecx
not          ecx
repe cmpsb
jne                @F
    


Wont work well as if the strings point to the same location then it will be an infinite (well not infinite) loop as all characters will be the same.

This I have found is also the problem with SSE. You will have to find the length of the string, unless there is some instruction mnemonic I am not aware of. I am using pcmpeqb and pmovmskb to compare the strings, but again this could turn into an infinite loop.

Any solutions to this? The only thing I can think of is checking if the next byte is 0 or not, but this means doing a byte by byte comparison, and hence there is no need for SSE. I cannot think of any way to compare more than one byte at once.
Post 11 Oct 2009, 11:24
View user's profile Send private message Reply with quote
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
kohlrak
I'm not sure about sse (i'm not very good with that myself), but for regular string functions, your rep won't work unless, as you said, you have your string lengths. So, take it out and make a regular loop, determined by whether or not either [edi] or [esi] is 0 for that run (doesn't matter if you compare the null terminator or not, since if the strings are equal, they should be 0... actually you should do that anyway, because if you have "hello" and "hello2," they're not the same but if you don't compare the null terminators, they'll come back as the same.)
Post 11 Oct 2009, 13:14
View user's profile Send private message Visit poster's website AIM Address Yahoo Messenger MSN Messenger Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
If your going to benchmark code
- make sure it's all in the same format/calling convention.
- make sure all the variants are CORRECT in ALL CASES.
;; INVOKE str.cmp, <'test',0>, <'test',0>
;; = TRUE
;; INVOKE str.cmp, <0>, <0>
;; = TRUE
;; INVOKE str.cmp, <'a',0>, <0>
;; = FALSE
;; INVOKE str.cmp, <0>, <'a',0>
;; = FALSE
;; INVOKE str.cmp, <'testa',0>, <'testb',0>
;; = FALSE
;; INVOKE str.cmp, <'testa',0>, <'testab',0>
;; = FALSE
;; INVOKE str.cmp, <'testab',0>, <'testa',0>
;; = FALSE
- define the tested input.

If your testing a strings the input should be pseudo randomly selected strings of varying length. Whats the mean length of the strings you think will be used as input? If the strings are short unroll the loop and inline it, if they are long then a separate function will be necessary. Will the strings be padded to a certain alignment for trailing NULLs? Not padded <'Hello',0> vs padded to dword <'Hello',0,0,0> the latter can be optimized much better.
Post 12 Oct 2009, 15:48
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
yeah. Hold the string length as a prefix instead of a null terminator.

_________________
Previously known as The_Grey_Beast
Post 12 Oct 2009, 19:35
View user's profile Send private message Reply with quote
pal



Joined: 26 Aug 2008
Posts: 227
pal
r22 wrote:
- make sure it's all in the same format/calling convention.


Done. I have made them all into stdcall. I haven't tested the strcmp in MSVCRT.DLL, although I have a feeling is it quite bloated.

r22 wrote:
- make sure all the variants are CORRECT in ALL CASES.


I didn't check that many cases. A good option I guess would be to use a PRNG to input values into it for x amount of times and check to make sure it works.

r22 wrote:
If your testing a strings the input should be pseudo randomly selected strings of varying length.


Surely the data inputted has to be the same for each test otherwise you could have one function comparing a 9 character string whilst one compares a 4 character string. Do you do like a large amount of tests then average it out?

Heh, the above you answered it in the next sentence. I should learn to read ahead.

r22 wrote:
If the strings are short unroll the loop and inline it, if they are long then a separate function will be necessary.


Yeah, I guess two functions could be made and a macro determines which function shall be used, or something like that?

r22 wrote:
Will the strings be padded to a certain alignment for trailing NULLs? Not padded <'Hello',0> vs padded to dword <'Hello',0,0,0> the latter can be optimized much better.


I was just doing with one null byte after, although padding to dword can be done easily I guess. What advantage does this present over not doing this?
Post 12 Oct 2009, 20:42
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
@pal
Padding and optimization

If you know that your parameters are null padded to dword (4 bytes) then you can compare your strings 4 characters at a time.
Code:
MOV EAX, dword[ESI + ECX * 4]
CMP EAX, dword[EDI + ECX * 4]
    

4 times as much work gets done per iteration.

Borsuc's point about storing the strings length can also allow for greater optimization. Because you don't know the lengths of your zero terminated strings you need to do a check for null in your comparison iteration, this check causes extra branching and a slow down of the algorithm.

If you are reading your strings from a file or data stream (ie socket) or from user input, then storing the length before hand will help a lot when optimizing the comparison routine.

Now if you just wanted the fastest null terminated unaligned generic string compare. The best you could do is unroll the comparison iteration and make sure the loop label is aligned (8 bytes for 32bit, 16 for 64bit program).
Post 12 Oct 2009, 23:35
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17715
Location: In your JS exploiting you and your system
revolution
If you want higher/lower output results then use the bswap for dword byte-string compares.
Code:
mov eax,[source1]
mov ecx,[source2]
bswap eax
bswap ecx
cmp eax,ecx
jz next_set_of_dwords_in_string
ja ... ;above
jb ... ;below    
Useful for sorting algorithms.
Post 12 Oct 2009, 23:42
View user's profile Send private message Visit poster's website Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
Here is my entry into Pal's optimized string.equals contest.

I optimized for the generic unaligned null terminated string use case.

Code:
format PE Console
entry start 
 
include 'win32ax.inc'

macro Pad8
{
    virtual
        align 8
        a = $-$$
    end virtual
    if a=1
       db 90h
    end if
    if a=2
       db 66h,90h
    end if
    if a=3
       db 66h,66h,90h
    end if
    if a=4
       db 66h,66h,66h,90h
    end if
    if a=5
       db 66h,66h,90h,66h,90h
    end if
    if a=6
       db 66h,66h,90h,66h,66h,90h
    end if
    if a=7
       db 66h,66h,66h,90h,66h,66h,90h
    end if
}

BENCH_CTR = 0FFFFFFh
BENCH_RPT = 5
section '.code' code readable executable
 
  start:
;;correctness
        stdcall str.equ, test1,test1
        push    eax ;T
        stdcall str.equ, test5,test5
        push    eax ;T
        stdcall str.equ, test1,test2
        push    eax ;F
        stdcall str.equ, test2,test1
        push    eax ;F
        stdcall str.equ, test2,test3
        push    eax ;F
        stdcall str.equ, test4,test5
        push    eax ;F
        stdcall str.equ, test5,test4
        push    eax ;F
        push    fmtCorrect
        call    [printf]
        add     esp, 8*4
;;benchmark
        mov     ebx, BENCH_CTR
        mov     ebp, BENCH_RPT
.BENCH_START:
        call    [GetTickCount]
        push    eax
Pad8
.BENCH:
        stdcall str.equ, test1,test1
        stdcall str.equ, test5,test5
        stdcall str.equ, test1,test2
        stdcall str.equ, test2,test1
        stdcall str.equ, test2,test3
        stdcall str.equ, test4,test5
        stdcall str.equ, test5,test4
        ;;
        dec     ebx
        jnz     .BENCH
        call    [GetTickCount]
        pop     ecx
        sub     eax,ecx
        cinvoke printf, <'str.equ: %d ms',13,10,0>,eax
        mov     ebx,BENCH_CTR
        dec     ebp
        jnz     .BENCH_START


        cinvoke system,<'pause',0>
        invoke  ExitProcess, 0

align 16
;;*************************
;;Generic null terminated string EQUALS
;;*************************
str.equ: ;;(s1, s2) RETURN IF s1==s2 THEN TRUE ELSE FALSE
        push    esi edi
        mov     esi, [esp + 12]
        mov     edi, [esp + 16]
        xor     ecx, ecx
        movzx   eax, byte[esi + ecx]
        movzx   edx, byte[edi + ecx]
Pad8
.CMP:
        inc     ecx
        cmp     eax, edx
        jne     .END
        movzx   edx, byte[edi + ecx]
        test    eax, eax
        jz      .END
        movzx   eax, byte[esi + ecx]
        inc     ecx
        cmp     eax, edx
        jne     .END
        movzx   edx, byte[edi + ecx]
        test    eax, eax
        movzx   eax, byte[esi + ecx]
        jnz     .CMP
Pad8
.END:
        mov     ecx, 0
        mov     eax, 1
        cmovne  eax, ecx
        pop     edi esi
        ret 8

section '.data' readable writeable
test1 db 0
test2 db 'a',0
test3 db 'b',0
test4 db 'test',0
test5 db 'test@',0

fmtCorrect db '0=%d 0=%d 0=%d 0=%d 0=%d 1=%d 1=%d',13,10,0

section '.idata' import data readable writeable
 
  library kernel32,'KERNEL32.DLL',\
          msvcrt,'MSVCRT.DLL'
         
  include 'api/kernel32.inc' 

  import msvcrt,\
         printf,'printf',\
         system,'system' 
 
section '.reloc' fixups data readable discardable
    
Post 13 Oct 2009, 00:50
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
I guess I win by default ... *cricket* ... *cricket*
Post 14 Oct 2009, 14:47
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
pal



Joined: 26 Aug 2008
Posts: 227
pal
Razz Sorry, I was trying to figure out how to get your code into a form like I have called all the rest with.

I changed BENCH_CTR to 111111111 (1 billion/7 rounded down) and I got this output:

Code:
0=0 0=0 0=0 0=0 0=0 1=1 1=1
str.equ: 6084 ms
str.equ: 6053 ms
str.equ: 6084 ms
str.equ: 6068 ms
str.equ: 6069 ms
    


So so far I would say that your method is the fastest, but I have yet to test it the way I have tested the others...
Post 14 Oct 2009, 15:45
View user's profile Send private message Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
Fastest string compare?

First, stop using nul-terminated strings Smile
Post 21 Oct 2009, 07:27
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.