flat assembler
Message board for the users of flat assembler.

Index > Tutorials and Examples > sse 4.2 and sse 2 strcmp and strlen

Author
Thread Post new topic Reply to topic
Ali.Z



Joined: 08 Jan 2018
Posts: 819
Ali.Z 10 Jun 2024, 22:27
sse 4.2:
https://www.strchr.com/strcmp_and_strlen_using_sse_4.2

sse 2:
https://github.com/aklomp/sse-strings/blob/master/src/strcmpeq_sse2.asm
same repo also contain sse 4.2, but much more compact than the above sse 4.2 implementation.

_________________
Asm For Wise Humans
Post 10 Jun 2024, 22:27
View user's profile Send private message Reply with quote
Jessé



Joined: 03 May 2025
Posts: 20
Jessé 11 May 2025, 21:19
I know this is old topic, but it is very interesting.
I've create some time ago an application that enables me to benchmark instructions, procedures, etc.. in pure assembly code. I think I'll try compare these two with my GP champion here:

Code:
        Fast_StrCmp             endbr64                 ; rdi = string1, rsi = string2
                                mov     rdx, rdi
                                mov     ecx, -1         ; DANGER! (Up to 4GB string comparison!)
                                xor     al, al
                                repne   scasb           ; determine length
                                not     ecx             ; Length including null termination
                                mov     rdi, rdx
                                repe    cmpsb
                                sete    al              ; Return boolean (al): TRUE = strings are equal
                                ret                     ; FALSE = strings aren't equal
                                                        ; or, if you prefer, Z flag contains the result
                                                        ; Z = 1 equal; Z = 0 not equal

    
Post 11 May 2025, 21:19
View user's profile Send private message Reply with quote
AsmGuru62



Joined: 28 Jan 2004
Posts: 1708
Location: Toronto, Canada
AsmGuru62 11 May 2025, 22:59
It looks like this function detects if strings are equal or not, so it cannot be used for sorting strings.
But that is OK, still a good implementation.
Once someone posted a function to get STRLEN with the code using 4 byte comparison at a time.
I decided to measure if REPNE SCASB will be faster.
Curious fact, but REPNE SCASB principle was slower by around 30%!
Just interesting stuff!
The function scans every character twice from the looks of it: first SCASB, then CMPSB.
I would have went for a single pass over the buffers:
Code:
align 16
String_IsEqual:
; ---------------------------------------------------------------------------
; Input:
;   rsi = pointer to buffer #1
;   rdi = pointer to buffer #2
; Output:
;    CF = 1 for EQUAL (easier to JNC/JC after CALL)
; ---------------------------------------------------------------------------
    xor     ecx, ecx    ; index into both buffers
@@:
    mov     al, [rsi + rcx]
    mov     dl, [rdi + rcx]

    cmp     al, dl
    jne     .not_equal

    test    al, al
    jz      .equal

    inc     ecx
    jmp     @r

.equal:
    stc
    ret

.not_equal:
    clc
    ret
    
Post 11 May 2025, 22:59
View user's profile Send private message Send e-mail Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 1145
Location: Russia
macomics 12 May 2025, 03:15
Code:
; Input:
;   rsi = pointer to buffer #1
;   rdi = pointer to buffer #2
; Output:
;   rfl = result of comparision, rax = length
;   easier to JNZ/JZ/JA/JB/etc after CALL
; Uses: rcx
; Works in both directions (DF)
align 16
String_IsEqual:
    push rsi
    xor  ecx, ecx
    push rdi
    sub  rcx, rsi
  @@:
    lods byte [rsi]
    scas byte [rdi]
    jnz  @f
    or   al, al
    jnz  @b
  @@:
    lea  rax, [rsi+rcx] ; length (add will affect flags)
    pop  rdi
    pop  rsi
    retn    
Post 12 May 2025, 03:15
View user's profile Send private message Reply with quote
Jessé



Joined: 03 May 2025
Posts: 20
Jessé 12 May 2025, 08:51
I've tested all 3 of them, scanning 128 byte equal buffers (so they are guaranteed to go equally to the end), function is called with rdi and rsi pointers, strings in different locations, and the results are:

1st: AsmGuru62 -> 255 cycles average*
2nd: Jessé -> 570 cycles average*
3rd: macomics -> 652 cycles average*

While reading the history and evolution of x86 processors some hours ago, I've figured out that string instructions are heavy weight on the early models, and, indeed, this continue to be slightly true nowadays.
Tests were on my AMD Ryzen 7 4800HS, soon I'll test on my Intel (Core i5 and Core i7) too see how they differ. Cycle by cycle, Intel usually wins most of the time whenever instructions went outside the basics.

The measurements are related do TSC, clock is limited to 3608 MHz maximum.

Well done, AsmGuru62!

I'll review this against the SSE4.x and SSE2 codes and come back here sooner.

* the tests were made under ring3, multitask environment, 1 single thread forced on the same CPU, 200 million loops.

Edit: I must add that testing on a multitask environment leads to coarse measurement. So, this should not be accounted to be 100% precise.
Post 12 May 2025, 08:51
View user's profile Send private message Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 1938
Roman 12 May 2025, 09:08
Code:
eqTxt:
    xor     ecx,ecx    ; index into both buffers
    xor     edx,edx
@@:
    mov     al, [rsi + rcx] ;we can compare rax(8 bytes) or eax(4 bytes) or ax(2 bytes)
    cmp     al, [rdi + rcx]

    jne     .not_equal
    inc     ecx
    cmp     al, ' '      ;if any symbol 13,10,32,9=tab,0 done.
    ja      @b


.equal:
    inc dl
.not_equal:
    ret

txt    db 'classics movies'
find   db 'classics '
;in code
mov rsi,txt
mov rdi,find
call eqTxt
cmp dl,1
jnz @f
MsgBox 'equal'
@@:
    


Last edited by Roman on 13 May 2025, 07:04; edited 3 times in total
Post 12 May 2025, 09:08
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20621
Location: In your JS exploiting you and your system
revolution 12 May 2025, 09:57
If you use sub al, [address] instead of CMP then you can return AL directly and the value in AL will indicate the difference above=+/below=-/equal=0, and the vaue in RCX can be the position of difference. Can be useful for sorting.
Post 12 May 2025, 09:57
View user's profile Send private message Visit poster's website Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 1938
Roman 12 May 2025, 10:27
Quote:
sub al , cmp, ecx

In good case we needed one asm command cmptxt Smile
And not does not spoil the registers. Only set z flag.
Code:
symbol equ !
cmptxt txt1,txt2,symbol ;exit if 0 or symbol , or . or | or @ or # or space

    

lazy Intel.


Last edited by Roman on 12 May 2025, 11:07; edited 5 times in total
Post 12 May 2025, 10:27
View user's profile Send private message Reply with quote
Jessé



Joined: 03 May 2025
Posts: 20
Jessé 12 May 2025, 10:30
In the meantime, I've decided also include libc strcmp() function in my tests. And, the results were horrible (for us!)... 😂
The libc function wins by being 30 ~ 40 x faster than our challengers here. But there's a catch: in my system, libc is using AVX2 to do stuff, and, indeed, it uses a very clever way of comparing strings, using vpcmpeqb, also comparing them to 0 to locate the null termination, and using tzcnt to locate end of string. Also it has some weirdness in it, because they just jump to a block that is equal to the AVX2 processing block, but in a different location, 3 times! They also get length out of it, I suppose.
Its win is because, in every chunk, it can process 32 bytes of data. So, if you multiply by 32, the results were similar to what we've reached here, in a per byte basis.
But a win is a win! And this means the guy doing libc C code probably eats AVX in his breakfast every day! 🤓
I'll try to create a version of it using SSEx myself, before I read the docs above. Just for fun, and self-taught learning.
Debugging is life, guys!

"In tempo": I'm quite used to see weird jumps in C generated code, and know this is a compiler thing, rather than been programactically intentional.
Post 12 May 2025, 10:30
View user's profile Send private message Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 1938
Roman 12 May 2025, 10:43
Some users not have cpu with avx2
Good solution check CPU avx2,avx,sse support.
And have several functions for sse,avx,avx2 and simple variant.

cmpTxtProc dd eqTxt ;if cpu supported avx2 set this to avx2cmpStrings

;in code do
call [cmpTxtProc ]
Post 12 May 2025, 10:43
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20621
Location: In your JS exploiting you and your system
revolution 12 May 2025, 10:51
All code needs to be tuned to the CPU/GPU and the data set it is run on.

There is no generic code that can outperform custom tuned code specifically designed for its narrow purpose.
Post 12 May 2025, 10:51
View user's profile Send private message Visit poster's website Reply with quote
Ali.Z



Joined: 08 Jan 2018
Posts: 819
Ali.Z 12 May 2025, 10:58
Roman wrote:
Some users not have cpu with avx2


correct, SSEx should be bare minimum IMO.

Windows 2000 had support for SSE2, Windows 98 had support for SSE1.

_________________
Asm For Wise Humans
Post 12 May 2025, 10:58
View user's profile Send private message Reply with quote
Jessé



Joined: 03 May 2025
Posts: 20
Jessé 12 May 2025, 11:02
Yep. I guess libc detects and tunes itself accordingly to what is available on every system.
Or maybe the dynamic linker does it beforehand, but, this is unknown territory for me for now. I can only speculate about.
But have in mind that cpuid is a heavy instruction, so, put it on a startup and set some flags in a variable to which is avaible will be a better approach than calling it to check for things frequently. In my tests, it consumes between 100 to 200 cycles (again based in TSC) for every leaf.
I usually do this in all my applications for cpuid features.
Post 12 May 2025, 11:02
View user's profile Send private message Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 1145
Location: Russia
macomics 12 May 2025, 11:23
revolution wrote:
If you use sub al, [address] instead of CMP then you can return AL directly and the value in AL will indicate the difference above=+/below=-/equal=0, and the vaue in RCX can be the position of difference. Can be useful for sorting.
It is enough to do this and everything will be in the flags at once.
Code:
strcmp:
    or rcx, -1
@@:
    inc     rcx
    mov     al, [rsi + rcx]
    cmp     al, [rdi + rcx]
    jne     @f
    cmp     al, 0        ;if any symbol
    jnz      @b
@@:
    retn

    lea rsi, [str0]
    lea rdi, [str1]
    call strcmp
    ja .strabove
    jc .strbelow
  .strequal:
...
  .strabove:
...
  .strbelow:
...    
Post 12 May 2025, 11:23
View user's profile Send private message Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 1145
Location: Russia
macomics 12 May 2025, 11:35
Jessé wrote:
But have in mind that cpuid is a heavy instruction, so, put it on a startup and set some flags in a variable to which is avaible will be a better approach than calling it to check for things frequently. In my tests, it consumes between 100 to 200 cycles (again based in TSC) for every leaf.

Here Roman has already shown you how to do it better instead of using flags.

After cpuid, you can simply configure a set of variables to perform indirect function calls.
Roman wrote:
Code:
cmpTxtProc dd eqTxt ;if cpu supported avx2 set this to avx2cmpStrings

;in code do
call [cmpTxtProc ]    
Post 12 May 2025, 11:35
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.