flat assembler
Message board for the users of flat assembler.
Tutorials and Examples > sse 4.2 and sse 2 strcmp and strlen
Ali.Z 10 Jun 2024, 22:27
SSE 4.2:
https://www.strchr.com/strcmp_and_strlen_using_sse_4.2

SSE 2:
https://github.com/aklomp/sse-strings/blob/master/src/strcmpeq_sse2.asm

The same repo also contains an SSE 4.2 version, but it is much more compact than the SSE 4.2 implementation above.
_________________
Asm For Wise Humans
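The SSE 4.2 approach in the first link is built around pcmpistri. A minimal sketch of a strlen in that style (not the article's exact code; it assumes the pointer arrives in rdi and that the buffer is padded so a 16-byte read never runs into an unmapped page):

Code:
EQUAL_EACH = 1000b                      ; imm8: unsigned bytes, "equal each" aggregation

align 16
strlen_sse42:
        mov     rax, -16
        pxor    xmm0, xmm0              ; all zeroes, so matches mark the terminator
@@:
        add     rax, 16
        pcmpistri xmm0, dqword [rdi + rax], EQUAL_EACH
        jnz     @b                      ; ZF = 0: no zero byte in this 16-byte block
        add     rax, rcx                ; rcx = index of the zero byte within the block
        ret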
AsmGuru62 11 May 2025, 22:59
It looks like this function detects if strings are equal or not, so it cannot be used for sorting strings.
But that is OK, still a good implementation.

Someone once posted a function to get STRLEN with code that compared 4 bytes at a time. I decided to measure whether REPNE SCASB would be faster. Curious fact: the REPNE SCASB approach was slower by around 30%! Just interesting stuff!

The function scans every character twice from the looks of it: first SCASB, then CMPSB. I would have gone for a single pass over the buffers:

Code:
align 16
String_IsEqual:
; ---------------------------------------------------------------------------
; Input:
;   rsi = pointer to buffer #1
;   rdi = pointer to buffer #2
; Output:
;   CF = 1 for EQUAL (easier to JNC/JC after CALL)
; ---------------------------------------------------------------------------
        xor     ecx, ecx                ; index into both buffers
@@:
        mov     al, [rsi + rcx]
        mov     dl, [rdi + rcx]
        cmp     al, dl
        jne     .not_equal
        test    al, al
        jz      .equal
        inc     ecx
        jmp     @r
.equal:
        stc
        ret
.not_equal:
        clc
        ret
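For reference, the REPNE SCASB flavour of STRLEN mentioned above looks roughly like this; a minimal sketch (not the code that was actually measured), assuming the pointer comes in rdi and DF is clear:

Code:
align 16
strlen_scasb:
        mov     rdx, rdi                ; remember the start of the string
        xor     eax, eax                ; AL = 0, the byte to search for
        or      rcx, -1                 ; effectively unlimited count
        repne   scasb                   ; rdi stops one past the zero byte
        lea     rax, [rdi - 1]
        sub     rax, rdx                ; length, not counting the terminator
        ret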
macomics 12 May 2025, 03:15
Code:
; Input:
;   rsi = pointer to buffer #1
;   rdi = pointer to buffer #2
; Output:
;   rfl = result of comparison, rax = length
;   (easier to JNZ/JZ/JA/JB/etc after CALL)
; Uses: rcx
; Works in both directions (DF)
align 16
String_IsEqual:
        push    rsi
        xor     ecx, ecx
        push    rdi
        sub     rcx, rsi
@@:
        lods    byte [rsi]
        scas    byte [rdi]
        jnz     @f
        or      al, al
        jnz     @b
@@:
        lea     rax, [rsi + rcx]        ; length (add would affect the flags)
        pop     rdi
        pop     rsi
        retn
Jessé 12 May 2025, 08:51
I've tested all 3 of them, scanning 128-byte equal buffers (so they are guaranteed to run all the way to the end). Each function is called with rdi and rsi pointers, the strings are in different locations, and the results are:
1st: AsmGuru62 -> 255 cycles average*
2nd: Jessé -> 570 cycles average*
3rd: macomics -> 652 cycles average*

While reading about the history and evolution of x86 processors some hours ago, I figured out that string instructions were heavyweight on the early models, and, indeed, this continues to be slightly true nowadays.

Tests were on my AMD Ryzen 7 4800HS; soon I'll test on my Intel (Core i5 and Core i7) too, to see how they differ. Cycle by cycle, Intel usually wins most of the time whenever instructions go outside the basics. The measurements are based on the TSC; the clock is limited to 3608 MHz maximum.

Well done, AsmGuru62! I'll review this against the SSE4.x and SSE2 codes and come back here soon.

* the tests were made under ring 3, in a multitasking environment, with 1 single thread forced onto the same CPU, 200 million loops.

Edit: I must add that testing in a multitasking environment leads to coarse measurement. So, this should not be taken to be 100% precise.
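For context, this is roughly what a single TSC-based measurement looks like; a minimal sketch only (an assumption about the method, not Jessé's actual harness), with String_IsEqual standing in for the routine under test and rsi/rdi set up by the caller; a real run would repeat this in a loop and average:

Code:
measure_once:
        push    rbx
        xor     eax, eax
        cpuid                           ; serializing fence before the first read
        rdtsc                           ; EDX:EAX = start of the time-stamp counter
        shl     rdx, 32
        or      rax, rdx
        mov     rbx, rax                ; keep the start value
        call    String_IsEqual          ; routine under test
        rdtscp                          ; waits for earlier instructions before reading the TSC
        shl     rdx, 32
        or      rax, rdx
        sub     rax, rbx                ; RAX = elapsed reference cycles
        pop     rbx
        ret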
Roman 12 May 2025, 09:08
Code:
eqTxt:
        xor     ecx, ecx                ; index into both buffers
        xor     edx, edx
@@:
        mov     al, [rsi + rcx]         ; we can compare rax (8 bytes) or eax (4 bytes) or ax (2 bytes)
        cmp     al, [rdi + rcx]
        jne     .not_equal
        inc     ecx
        cmp     al, ' '                 ; if any symbol 13, 10, 32, 9 = tab, 0: done
        ja      @b
.equal:
        inc     dl
.not_equal:
        ret

txt     db 'classics movies'
find    db 'classics '

; in code:
        mov     rsi, txt
        mov     rdi, find
        call    eqTxt
        cmp     dl, 1
        jnz     @f
        MsgBox  'equal'
@@:

Last edited by Roman on 13 May 2025, 07:04; edited 3 times in total
revolution 12 May 2025, 09:57
If you use sub al, [address] instead of CMP then you can return AL directly, and the value in AL will indicate the difference (above = +, below = -, equal = 0), and the value in RCX can be the position of the difference. Can be useful for sorting.
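A minimal sketch of that suggestion, under the same rsi/rdi convention as the routines above (String_Compare is an illustrative name, not code from the thread):

Code:
align 16
String_Compare:
        xor     ecx, ecx                ; rcx = current position
@@:
        mov     al, [rsi + rcx]
        sub     al, [rdi + rcx]         ; AL = difference: + above, - below, 0 equal
        jnz     .done
        cmp     byte [rsi + rcx], 0     ; bytes were equal; was it the terminator?
        je      .done
        inc     rcx
        jmp     @b
.done:
        ret                             ; AL = difference, RCX = position of the difference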
Roman 12 May 2025, 10:27
Quote: sub al, cmp, ecx

Ideally we would need just one asm instruction, cmptxt, that does not spoil the registers and only sets the Z flag.

Code:
symbol  equ !
        cmptxt  txt1, txt2, symbol      ; exit if 0 or symbol , or . or | or @ or # or space

Lazy Intel.

Last edited by Roman on 12 May 2025, 11:07; edited 5 times in total
Jessé 12 May 2025, 10:30
In the meantime, I've decided to also include the libc strcmp() function in my tests. And the results were horrible (for us!)... 😂
The libc function wins by being 30 ~ 40 x faster than our challengers here. But there's a catch: on my system, libc is using AVX2 to do the work, and, indeed, it uses a very clever way of comparing strings, using vpcmpeqb, also comparing them to 0 to locate the null termination, and using tzcnt to locate the end of the string.

It also has some weirdness in it, because they just jump to a block that is equal to the AVX2 processing block, but in a different location, 3 times! They also get the length out of it, I suppose.

Its win comes from the fact that, in every chunk, it can process 32 bytes of data. So, if you multiply by 32, the results are similar to what we've reached here on a per-byte basis. But a win is a win! And this means the guy writing the libc C code probably eats AVX for breakfast every day! 🤓

I'll try to create a version of it using SSEx myself, before I read the docs above. Just for fun, and self-taught learning. Debugging is life, guys!

"In tempo": I'm quite used to seeing weird jumps in C-generated code, and I know this is a compiler thing rather than being programmatically intentional.
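For illustration, a rough sketch of the idea described above (this is not the libc code; it only mirrors the vpcmpeqb / zero-test / tzcnt scheme, assumes rsi and rdi point to the strings, and assumes that reading a whole 32-byte block past the terminator is safe for these buffers):

Code:
align 16
strcmp_avx2_sketch:
        vpxor   ymm2, ymm2, ymm2                ; all zeroes, to spot the null terminator
        xor     ecx, ecx                        ; block offset
@@:
        vmovdqu ymm0, [rsi + rcx]
        vpcmpeqb ymm1, ymm0, [rdi + rcx]        ; FF where the two strings match
        vpcmpeqb ymm0, ymm0, ymm2               ; FF where string #1 has a zero byte
        vpandn  ymm1, ymm0, ymm1                ; matching AND not at the terminator
        vpmovmskb eax, ymm1
        xor     eax, -1                         ; set bits mark a mismatch or the terminator
        jnz     @f
        add     ecx, 32
        jmp     @b
@@:
        tzcnt   eax, eax                        ; first interesting byte within the block
        add     rax, rcx                        ; absolute index of mismatch or terminator
        movzx   edx, byte [rdi + rax]
        movzx   eax, byte [rsi + rax]
        sub     eax, edx                        ; < 0 below, 0 equal, > 0 above
        vzeroupper
        ret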
Roman 12 May 2025, 10:43
Some users don't have a CPU with AVX2.
A good solution is to check CPU support for AVX2/AVX/SSE and to have several functions: one each for SSE, AVX, AVX2, plus a simple variant.

Code:
cmpTxtProc dq eqTxt             ; if the cpu supports avx2, set this to avx2cmpStrings

; in code do:
        call    [cmpTxtProc]
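A minimal sketch of the startup check that fills in Roman's cmpTxtProc pointer (an assumption, not code from the thread; avx2cmpStrings is a placeholder name following his example, and a complete check would also confirm the maximum CPUID leaf and OSXSAVE/XGETBV so the OS actually preserves the YMM registers):

Code:
init_dispatch:
        push    rbx
        mov     eax, 7
        xor     ecx, ecx
        cpuid                           ; leaf 7, sub-leaf 0: extended features
        bt      ebx, 5                  ; EBX bit 5 = AVX2
        jnc     @f
        lea     rax, [avx2cmpStrings]
        mov     [cmpTxtProc], rax       ; switch the dispatch pointer to the AVX2 version
@@:
        pop     rbx
        ret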
revolution 12 May 2025, 10:51
All code needs to be tuned to the CPU/GPU and the data set it is run on.
There is no generic code that can outperform custom-tuned code specifically designed for its narrow purpose.
Ali.Z 12 May 2025, 10:58
Roman wrote: Some users don't have a CPU with AVX2.

Correct; SSEx should be the bare minimum IMO. Windows 2000 had support for SSE2, Windows 98 had support for SSE1.
_________________
Asm For Wise Humans
Jessé 12 May 2025, 11:02
Yep. I guess libc detects and tunes itself according to what is available on every system.
Or maybe the dynamic linker does it beforehand, but this is unknown territory for me for now; I can only speculate about it.

But bear in mind that cpuid is a heavy instruction, so running it once at startup and setting some flags in a variable according to what is available is a better approach than calling it frequently to check for things. In my tests, it consumes between 100 and 200 cycles (again based on the TSC) for every leaf. I usually do this in all my applications for cpuid features.
macomics 12 May 2025, 11:23
revolution wrote: If you use sub al, [address] instead of CMP then you can return AL directly, and the value in AL will indicate the difference (above = +, below = -, equal = 0), and the value in RCX can be the position of the difference. Can be useful for sorting.

Code:
strcmp:
        or      rcx, -1
@@:
        inc     rcx
        mov     al, [rsi + rcx]
        cmp     al, [rdi + rcx]
        jne     @f
        cmp     al, 0                   ; if any symbol
        jnz     @b
@@:
        retn

        lea     rsi, [str0]
        lea     rdi, [str1]
        call    strcmp
        ja      .strabove
        jc      .strbelow
.strequal:
        ...
.strabove:
        ...
.strbelow:
        ...
macomics 12 May 2025, 11:35
Jessé wrote: But bear in mind that cpuid is a heavy instruction, so running it once at startup and setting some flags in a variable according to what is available is a better approach than calling it frequently to check for things. In my tests, it consumes between 100 and 200 cycles (again based on the TSC) for every leaf.

Here Roman has already shown you how to do it better than using flags: after cpuid, you can simply configure a set of variables to perform indirect function calls.
wht36 09 Dec 2025, 05:26
For a basic memcmp without needing to know which is bigger, I use this:
Code:
align 16
memcmp:                                 ; RCX left, RDX right, R8 length
                                        ; simple memcmp, only returns match/no match
                                        ; returns RAX 0 if match; r9 destroyed
        xor     r9, r9
.cmp:
        cmp     r9, r8
        jnb     .eq
        vmovdqu ymm0, [rcx + r9]        ; loop comparing 32 bytes
        vpcmpeqb ymm0, ymm0, [rdx + r9] ; compare 32 bytes, set the corresponding byte to 1s if equal
        vpmovmskb eax, ymm0             ; convert result to a byte mask to find the position of a mismatch
        xor     eax, -1                 ; not eax would not set flags
        jnz     .diff                   ; difference found
        add     r9, 32                  ; match, try to compare the next 32 bytes
        jmp     .cmp
.diff:
        tzcnt   eax, eax                ; get position of first bit set
        add     eax, r9d
        cmp     eax, r8d                ; is the mismatch within length?
        jbe     .ret
.eq:
        xor     eax, eax
.ret:
        vzeroupper                      ; zero upper bits of ymm and zmm registers
        ret
bitRAKE 09 Dec 2025, 08:21
Plug your memcmp into this test program - there are three error cases:
Code:
format PE64 console
entry start
include 'win64ax.inc'

section '.data' data readable writeable

; Simulate buffers on page boundaries to test strict alignment safety:
align 4096
buf1 db 4096 dup 'A'
buf2 db 4096 dup 'A'

msg_pass db 'PASS', 13, 10, 0
msg_fail db 'FAIL (Expected match=%d, Got RAX=%d)', 13, 10, 0
msg_head db 'Test Len: %-3d | Diff @ %-3d : ', 0

section '.text' code readable executable

start:
        sub     rsp, 40                 ; Shadow space

; ---------------------------------------------------------
; Macro: run_test
; params: length, diff_index (-1 for exact match)
; ---------------------------------------------------------
macro run_test len, diff_idx {
        ; Declare all jump targets as local to this expansion
        local fail, pass, done

        ; 1. Print Header
        lea     rcx, [msg_head]
        mov     rdx, len
        mov     r8, diff_idx
        invoke  printf, rcx, rdx, r8

        ; 2. Inject Mismatch
        if diff_idx >= 0
                mov byte [buf2 + diff_idx], 'B'
        end if

        ; 3. Setup Call
        lea     rcx, [buf1]
        lea     rdx, [buf2]
        mov     r8, len
        call    memcmp_avx2
        mov     rbx, rax

        ; 4. Restore Buffer
        if diff_idx >= 0
                mov byte [buf2 + diff_idx], 'A'
        end if

        ; 5. Verify Logic
        if diff_idx = -1
                test rbx, rbx
                jz pass
        else
                test rbx, rbx
                jnz pass
        end if
    fail:
        lea     rcx, [msg_fail]
        mov     rdx, diff_idx
        mov     r8, rbx
        invoke  printf
        jmp     done
    pass:
        invoke  printf, msg_pass
    done:
}

; --- Test Suite ---

; 1. Zero Length
        run_test 0, -1          ; Should Match

; 2. Small Buffers (< 32, Scalar Path)
        run_test 10, -1         ; Match
        run_test 10, 0          ; Diff @ 0 (Catches the "tzcnt returns 0" bug)
        run_test 10, 9          ; Diff @ End

; 3. Exact Vector Size (32 bytes)
        run_test 32, -1         ; Match
        run_test 32, 31         ; Diff @ End

; 4. Large Buffers (> 32, AVX Path + Tail)
        run_test 64, -1         ; Match
        run_test 64, 0          ; Diff @ Start (AVX path)
        run_test 64, 32         ; Diff @ Middle
        run_test 64, 63         ; Diff @ Tail (Overlapping load check)

; 5. Page Boundary Safety Simulation
; Testing a 5-byte read at the very end of a 4096-byte page.
; If the AVX path is wrongly taken, this would read past buf1 into buf2 (or crash).
        invoke  printf, <"Boundary Check (Len 5 at offset 4091): ", 0>
        lea     rcx, [buf1 + 4096 - 5]
        lea     rdx, [buf2 + 4096 - 5]
        mov     r8, 5
        call    memcmp_avx2
        test    rax, rax
        jz      .safe_pass
        invoke  printf, msg_fail, -1, rax
        jmp     .exit
.safe_pass:
        invoke  printf, msg_pass
.exit:
        invoke  ExitProcess, 0

; ...
; Paste memcmp_avx2 here
; ...

section '.idata' import data readable writeable
        library kernel32, 'KERNEL32.DLL', msvcrt, 'MSVCRT.DLL'
        include 'api\kernel32.inc'
        import  msvcrt, printf, 'printf'

_________________
¯\(°_o)/¯ AI may [not] have aided with the above reply.
wht36 09 Dec 2025, 14:16
Ah thanks, hopefully the bugs are fixed now, but there is still no boundary check, so it can cause a page fault.
Code:
memcmp_avx2:                            ; RCX left, RDX right, R8 length
                                        ; simple memcmp, only returns match/no match
                                        ; returns RAX 0 if match; r9 destroyed
        xor     r9, r9
.cmp:
        cmp     r9, r8
        jnb     .eq
        vmovdqu ymm0, [rcx + r9]        ; loop comparing 32 bytes
        vpcmpeqb ymm0, ymm0, [rdx + r9] ; compare 32 bytes, set the corresponding byte to 1s if equal
        vpmovmskb eax, ymm0             ; convert result to a byte mask to find the position of a mismatch
        xor     eax, -1                 ; not eax would not set flags
        jnz     .diff                   ; difference found
        add     r9, 32                  ; match, try to compare the next 32 bytes
        jmp     .cmp
.eq:
        xor     eax, eax
        vzeroupper                      ; zero upper bits of ymm and zmm registers
        ret
.diff:
        tzcnt   eax, eax                ; get position of first bit set
        add     eax, r9d
        cmp     eax, r8d                ; is the mismatch within length?
        jae     .eq
        or      al, 1
        vzeroupper                      ; zero upper bits of ymm and zmm registers
        ret
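One possible way around the over-read (a sketch under the same RCX/RDX/R8 convention, not part of wht36's code; it additionally clobbers r10): run the vector loop only over whole 32-byte blocks and finish the remainder a byte at a time, so nothing past the given length is ever read.

Code:
align 16
memcmp_safe:                            ; RCX left, RDX right, R8 length
        xor     r9, r9                  ; running offset
        mov     r10, r8
        and     r10, -32                ; length rounded down to whole 32-byte blocks
.vec:
        cmp     r9, r10
        jnb     .tail
        vmovdqu ymm0, [rcx + r9]
        vpcmpeqb ymm0, ymm0, [rdx + r9] ; FF where the bytes match
        vpmovmskb eax, ymm0
        xor     eax, -1                 ; set bits mark a mismatch
        jnz     .diff
        add     r9, 32
        jmp     .vec
.tail:
        cmp     r9, r8                  ; 0..31 bytes remain
        jnb     .eq
        mov     al, [rcx + r9]
        cmp     al, [rdx + r9]
        jne     .diff
        inc     r9
        jmp     .tail
.eq:
        xor     eax, eax                ; 0 = match
        jmp     .done
.diff:
        mov     eax, 1                  ; non-zero = mismatch
.done:
        vzeroupper                      ; harmless even if the vector loop never ran
        ret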