flat assembler
Message board for the users of flat assembler.
Tutorials and Examples > sse 4.2 and sse 2 strcmp and strlen
Ali.Z 10 Jun 2024, 22:27
SSE 4.2:
https://www.strchr.com/strcmp_and_strlen_using_sse_4.2

SSE 2:
https://github.com/aklomp/sse-strings/blob/master/src/strcmpeq_sse2.asm

The same repo also contains an SSE 4.2 version, but it is much more compact than the SSE 4.2 implementation above.
_________________
Asm For Wise Humans
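The SSE 4.2 approach in the first link is built around pcmpistri. A minimal sketch of a strlen in that style (not the article's exact code; it assumes the pointer arrives in rdi and that the buffer is padded so a 16-byte read never runs into an unmapped page):

Code:
EQUAL_EACH = 1000b                      ; imm8: unsigned bytes, "equal each" aggregation

align 16
strlen_sse42:
        mov     rax, -16
        pxor    xmm0, xmm0              ; all zeroes, so matches mark the terminator
@@:
        add     rax, 16
        pcmpistri xmm0, dqword [rdi + rax], EQUAL_EACH
        jnz     @b                      ; ZF = 0: no zero byte in this 16-byte block
        add     rax, rcx                ; rcx = index of the zero byte within the block
        ret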
AsmGuru62 11 May 2025, 22:59
It looks like this function detects if strings are equal or not, so it cannot be used for sorting strings.
But that is OK, still a good implementation.

Someone once posted a function to get STRLEN with code that compared 4 bytes at a time. I decided to measure whether REPNE SCASB would be faster. Curious fact: the REPNE SCASB approach was slower by around 30%! Just interesting stuff!

The function scans every character twice from the looks of it: first SCASB, then CMPSB. I would have gone for a single pass over the buffers:

Code:
align 16
String_IsEqual:
; ---------------------------------------------------------------------------
; Input:
;   rsi = pointer to buffer #1
;   rdi = pointer to buffer #2
; Output:
;   CF = 1 for EQUAL (easier to JNC/JC after CALL)
; ---------------------------------------------------------------------------
        xor     ecx, ecx                ; index into both buffers
@@:
        mov     al, [rsi + rcx]
        mov     dl, [rdi + rcx]
        cmp     al, dl
        jne     .not_equal
        test    al, al
        jz      .equal
        inc     ecx
        jmp     @r
.equal:
        stc
        ret
.not_equal:
        clc
        ret
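For reference, the REPNE SCASB flavour of STRLEN mentioned above looks roughly like this; a minimal sketch (not the code that was actually measured), assuming the pointer comes in rdi and DF is clear:

Code:
align 16
strlen_scasb:
        mov     rdx, rdi                ; remember the start of the string
        xor     eax, eax                ; AL = 0, the byte to search for
        or      rcx, -1                 ; effectively unlimited count
        repne   scasb                   ; rdi stops one past the zero byte
        lea     rax, [rdi - 1]
        sub     rax, rdx                ; length, not counting the terminator
        ret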
macomics 12 May 2025, 03:15
Code:
; Input:
;   rsi = pointer to buffer #1
;   rdi = pointer to buffer #2
; Output:
;   rfl = result of comparison, rax = length
;   (easier to JNZ/JZ/JA/JB/etc after CALL)
; Uses: rcx
; Works in both directions (DF)
align 16
String_IsEqual:
        push    rsi
        xor     ecx, ecx
        push    rdi
        sub     rcx, rsi
@@:
        lods    byte [rsi]
        scas    byte [rdi]
        jnz     @f
        or      al, al
        jnz     @b
@@:
        lea     rax, [rsi + rcx]        ; length (add would affect the flags)
        pop     rdi
        pop     rsi
        retn
Jessé 12 May 2025, 08:51
I've tested all 3 of them, scanning 128-byte equal buffers (so they are guaranteed to run all the way to the end). Each function is called with rdi and rsi pointers, the strings are in different locations, and the results are:
1st: AsmGuru62 -> 255 cycles average*
2nd: Jessé -> 570 cycles average*
3rd: macomics -> 652 cycles average*

While reading about the history and evolution of x86 processors some hours ago, I figured out that string instructions were heavyweight on the early models, and, indeed, this continues to be slightly true nowadays.

Tests were on my AMD Ryzen 7 4800HS; soon I'll test on my Intel (Core i5 and Core i7) too, to see how they differ. Cycle by cycle, Intel usually wins most of the time whenever instructions go outside the basics. The measurements are based on the TSC; the clock is limited to 3608 MHz maximum.

Well done, AsmGuru62! I'll review this against the SSE4.x and SSE2 codes and come back here soon.

* the tests were made under ring 3, in a multitasking environment, with 1 single thread forced onto the same CPU, 200 million loops.

Edit: I must add that testing in a multitasking environment leads to coarse measurement. So, this should not be taken to be 100% precise.
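For context, this is roughly what a single TSC-based measurement looks like; a minimal sketch only (an assumption about the method, not Jessé's actual harness), with String_IsEqual standing in for the routine under test and rsi/rdi set up by the caller; a real run would repeat this in a loop and average:

Code:
measure_once:
        push    rbx
        xor     eax, eax
        cpuid                           ; serializing fence before the first read
        rdtsc                           ; EDX:EAX = start of the time-stamp counter
        shl     rdx, 32
        or      rax, rdx
        mov     rbx, rax                ; keep the start value
        call    String_IsEqual          ; routine under test
        rdtscp                          ; waits for earlier instructions before reading the TSC
        shl     rdx, 32
        or      rax, rdx
        sub     rax, rbx                ; RAX = elapsed reference cycles
        pop     rbx
        ret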
Roman 12 May 2025, 09:08
Code:
eqTxt:
        xor     ecx, ecx                ; index into both buffers
        xor     edx, edx
@@:
        mov     al, [rsi + rcx]         ; we can compare rax (8 bytes) or eax (4 bytes) or ax (2 bytes)
        cmp     al, [rdi + rcx]
        jne     .not_equal
        inc     ecx
        cmp     al, ' '                 ; if any symbol 13, 10, 32, 9 = tab, 0: done
        ja      @b
.equal:
        inc     dl
.not_equal:
        ret

txt     db 'classics movies'
find    db 'classics '

; in code:
        mov     rsi, txt
        mov     rdi, find
        call    eqTxt
        cmp     dl, 1
        jnz     @f
        MsgBox  'equal'
@@:

Last edited by Roman on 13 May 2025, 07:04; edited 3 times in total
revolution 12 May 2025, 09:57
If you use sub al, [address] instead of CMP then you can return AL directly, and the value in AL will indicate the difference (above = +, below = -, equal = 0), and the value in RCX can be the position of the difference. Can be useful for sorting.
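A minimal sketch of that suggestion, under the same rsi/rdi convention as the routines above (String_Compare is an illustrative name, not code from the thread):

Code:
align 16
String_Compare:
        xor     ecx, ecx                ; rcx = current position
@@:
        mov     al, [rsi + rcx]
        sub     al, [rdi + rcx]         ; AL = difference: + above, - below, 0 equal
        jnz     .done
        cmp     byte [rsi + rcx], 0     ; bytes were equal; was it the terminator?
        je      .done
        inc     rcx
        jmp     @b
.done:
        ret                             ; AL = difference, RCX = position of the difference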
Roman 12 May 2025, 10:27
Quote: sub al, cmp, ecx

Ideally we would need just one asm instruction, cmptxt, that does not spoil the registers and only sets the Z flag.

Code:
symbol  equ !
        cmptxt  txt1, txt2, symbol      ; exit if 0 or symbol , or . or | or @ or # or space

Lazy Intel.

Last edited by Roman on 12 May 2025, 11:07; edited 5 times in total
Jessé 12 May 2025, 10:30
In the meantime, I've decided to also include the libc strcmp() function in my tests. And the results were horrible (for us!)... 😂
The libc function wins by being 30 ~ 40 x faster than our challengers here. But there's a catch: on my system, libc is using AVX2 to do the work, and, indeed, it uses a very clever way of comparing strings, using vpcmpeqb, also comparing them to 0 to locate the null termination, and using tzcnt to locate the end of the string.

It also has some weirdness in it, because they just jump to a block that is equal to the AVX2 processing block, but in a different location, 3 times! They also get the length out of it, I suppose.

Its win comes from the fact that, in every chunk, it can process 32 bytes of data. So, if you multiply by 32, the results are similar to what we've reached here on a per-byte basis. But a win is a win! And this means the guy writing the libc C code probably eats AVX for breakfast every day! 🤓

I'll try to create a version of it using SSEx myself, before I read the docs above. Just for fun, and self-taught learning. Debugging is life, guys!

"In tempo": I'm quite used to seeing weird jumps in C-generated code, and I know this is a compiler thing rather than being programmatically intentional.
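For illustration, a rough sketch of the idea described above (this is not the libc code; it only mirrors the vpcmpeqb / zero-test / tzcnt scheme, assumes rsi and rdi point to the strings, and assumes that reading a whole 32-byte block past the terminator is safe for these buffers):

Code:
align 16
strcmp_avx2_sketch:
        vpxor   ymm2, ymm2, ymm2                ; all zeroes, to spot the null terminator
        xor     ecx, ecx                        ; block offset
@@:
        vmovdqu ymm0, [rsi + rcx]
        vpcmpeqb ymm1, ymm0, [rdi + rcx]        ; FF where the two strings match
        vpcmpeqb ymm0, ymm0, ymm2               ; FF where string #1 has a zero byte
        vpandn  ymm1, ymm0, ymm1                ; matching AND not at the terminator
        vpmovmskb eax, ymm1
        xor     eax, -1                         ; set bits mark a mismatch or the terminator
        jnz     @f
        add     ecx, 32
        jmp     @b
@@:
        tzcnt   eax, eax                        ; first interesting byte within the block
        add     rax, rcx                        ; absolute index of mismatch or terminator
        movzx   edx, byte [rdi + rax]
        movzx   eax, byte [rsi + rax]
        sub     eax, edx                        ; < 0 below, 0 equal, > 0 above
        vzeroupper
        ret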
Roman 12 May 2025, 10:43
Some users don't have a CPU with AVX2.
A good solution is to check CPU support for AVX2/AVX/SSE and to have several functions: one each for SSE, AVX, AVX2, plus a simple variant.

Code:
cmpTxtProc dq eqTxt             ; if the cpu supports avx2, set this to avx2cmpStrings

; in code do:
        call    [cmpTxtProc]
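A minimal sketch of the startup check that fills in Roman's cmpTxtProc pointer (an assumption, not code from the thread; avx2cmpStrings is a placeholder name following his example, and a complete check would also confirm the maximum CPUID leaf and OSXSAVE/XGETBV so the OS actually preserves the YMM registers):

Code:
init_dispatch:
        push    rbx
        mov     eax, 7
        xor     ecx, ecx
        cpuid                           ; leaf 7, sub-leaf 0: extended features
        bt      ebx, 5                  ; EBX bit 5 = AVX2
        jnc     @f
        lea     rax, [avx2cmpStrings]
        mov     [cmpTxtProc], rax       ; switch the dispatch pointer to the AVX2 version
@@:
        pop     rbx
        ret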
revolution 12 May 2025, 10:51
All code needs to be tuned to the CPU/GPU and the data set it is run on.
There is no generic code that can outperform custom-tuned code specifically designed for its narrow purpose.
Ali.Z 12 May 2025, 10:58
Roman wrote: Some users don't have a CPU with AVX2.

Correct; SSEx should be the bare minimum IMO. Windows 2000 had support for SSE2, Windows 98 had support for SSE1.
_________________
Asm For Wise Humans
Jessé 12 May 2025, 11:02
Yep. I guess libc detects and tunes itself according to what is available on every system.
Or maybe the dynamic linker does it beforehand, but this is unknown territory for me for now; I can only speculate about it.

But bear in mind that cpuid is a heavy instruction, so running it once at startup and setting some flags in a variable according to what is available is a better approach than calling it frequently to check for things. In my tests, it consumes between 100 and 200 cycles (again based on the TSC) for every leaf. I usually do this in all my applications for cpuid features.
macomics 12 May 2025, 11:23
revolution wrote: If you use sub al, [address] instead of CMP then you can return AL directly, and the value in AL will indicate the difference (above = +, below = -, equal = 0), and the value in RCX can be the position of the difference. Can be useful for sorting.

Code:
strcmp:
        or      rcx, -1
@@:
        inc     rcx
        mov     al, [rsi + rcx]
        cmp     al, [rdi + rcx]
        jne     @f
        cmp     al, 0                   ; if any symbol
        jnz     @b
@@:
        retn

        lea     rsi, [str0]
        lea     rdi, [str1]
        call    strcmp
        ja      .strabove
        jc      .strbelow
.strequal:
        ...
.strabove:
        ...
.strbelow:
        ...
macomics 12 May 2025, 11:35
Jessé wrote: But bear in mind that cpuid is a heavy instruction, so running it once at startup and setting some flags in a variable according to what is available is a better approach than calling it frequently to check for things. In my tests, it consumes between 100 and 200 cycles (again based on the TSC) for every leaf.

Here Roman has already shown you how to do it better than using flags: after cpuid, you can simply configure a set of variables to perform indirect function calls.
wht36 09 Dec 2025, 05:26
For a basic memcmp without needing to know which is bigger, I use this:
Code:
align 16
memcmp:                                 ; RCX left, RDX right, R8 length
                                        ; simple memcmp, only returns match/no match
                                        ; returns RAX 0 if match; r9 destroyed
        xor     r9, r9
.cmp:
        cmp     r9, r8
        jnb     .eq
        vmovdqu ymm0, [rcx + r9]        ; loop comparing 32 bytes
        vpcmpeqb ymm0, ymm0, [rdx + r9] ; compare 32 bytes, set the corresponding byte to 1s if equal
        vpmovmskb eax, ymm0             ; convert result to a byte mask to find the position of a mismatch
        xor     eax, -1                 ; not eax would not set flags
        jnz     .diff                   ; difference found
        add     r9, 32                  ; match, try to compare the next 32 bytes
        jmp     .cmp
.diff:
        tzcnt   eax, eax                ; get position of first bit set
        add     eax, r9d
        cmp     eax, r8d                ; is the mismatch within length?
        jbe     .ret
.eq:
        xor     eax, eax
.ret:
        vzeroupper                      ; zero upper bits of ymm and zmm registers
        ret
bitRAKE 09 Dec 2025, 08:21
Plug your memcmp into this test program - there are three error cases:
Code:
format PE64 console
entry start
include 'win64ax.inc'

section '.data' data readable writeable

; Simulate buffers on page boundaries to test strict alignment safety:
align 4096
buf1 db 4096 dup 'A'
buf2 db 4096 dup 'A'

msg_pass db 'PASS', 13, 10, 0
msg_fail db 'FAIL (Expected match=%d, Got RAX=%d)', 13, 10, 0
msg_head db 'Test Len: %-3d | Diff @ %-3d : ', 0

section '.text' code readable executable

start:
        sub     rsp, 40                 ; Shadow space

; ---------------------------------------------------------
; Macro: run_test
; params: length, diff_index (-1 for exact match)
; ---------------------------------------------------------
macro run_test len, diff_idx {
        ; Declare all jump targets as local to this expansion
        local fail, pass, done

        ; 1. Print Header
        lea     rcx, [msg_head]
        mov     rdx, len
        mov     r8, diff_idx
        invoke  printf, rcx, rdx, r8

        ; 2. Inject Mismatch
        if diff_idx >= 0
                mov byte [buf2 + diff_idx], 'B'
        end if

        ; 3. Setup Call
        lea     rcx, [buf1]
        lea     rdx, [buf2]
        mov     r8, len
        call    memcmp_avx2
        mov     rbx, rax

        ; 4. Restore Buffer
        if diff_idx >= 0
                mov byte [buf2 + diff_idx], 'A'
        end if

        ; 5. Verify Logic
        if diff_idx = -1
                test rbx, rbx
                jz pass
        else
                test rbx, rbx
                jnz pass
        end if
    fail:
        lea     rcx, [msg_fail]
        mov     rdx, diff_idx
        mov     r8, rbx
        invoke  printf
        jmp     done
    pass:
        invoke  printf, msg_pass
    done:
}

; --- Test Suite ---

; 1. Zero Length
        run_test 0, -1          ; Should Match

; 2. Small Buffers (< 32, Scalar Path)
        run_test 10, -1         ; Match
        run_test 10, 0          ; Diff @ 0 (Catches the "tzcnt returns 0" bug)
        run_test 10, 9          ; Diff @ End

; 3. Exact Vector Size (32 bytes)
        run_test 32, -1         ; Match
        run_test 32, 31         ; Diff @ End

; 4. Large Buffers (> 32, AVX Path + Tail)
        run_test 64, -1         ; Match
        run_test 64, 0          ; Diff @ Start (AVX path)
        run_test 64, 32         ; Diff @ Middle
        run_test 64, 63         ; Diff @ Tail (Overlapping load check)

; 5. Page Boundary Safety Simulation
; Testing a 5-byte read at the very end of a 4096-byte page.
; If the AVX path is wrongly taken, this would read past buf1 into buf2 (or crash).
        invoke  printf, <"Boundary Check (Len 5 at offset 4091): ", 0>
        lea     rcx, [buf1 + 4096 - 5]
        lea     rdx, [buf2 + 4096 - 5]
        mov     r8, 5
        call    memcmp_avx2
        test    rax, rax
        jz      .safe_pass
        invoke  printf, msg_fail, -1, rax
        jmp     .exit
.safe_pass:
        invoke  printf, msg_pass
.exit:
        invoke  ExitProcess, 0

; ...
; Paste memcmp_avx2 here
; ...

section '.idata' import data readable writeable
        library kernel32, 'KERNEL32.DLL', msvcrt, 'MSVCRT.DLL'
        include 'api\kernel32.inc'
        import  msvcrt, printf, 'printf'

_________________
¯\(°_o)/¯ AI may [not] have aided with the above reply.
wht36 09 Dec 2025, 14:16
Ah thanks, hopefully the bugs are fixed now, but there is still no boundary check, so it can cause a page fault.
Code:
memcmp_avx2:                            ; RCX left, RDX right, R8 length
                                        ; simple memcmp, only returns match/no match
                                        ; returns RAX 0 if match; r9 destroyed
        xor     r9, r9
.cmp:
        cmp     r9, r8
        jnb     .eq
        vmovdqu ymm0, [rcx + r9]        ; loop comparing 32 bytes
        vpcmpeqb ymm0, ymm0, [rdx + r9] ; compare 32 bytes, set the corresponding byte to 1s if equal
        vpmovmskb eax, ymm0             ; convert result to a byte mask to find the position of a mismatch
        xor     eax, -1                 ; not eax would not set flags
        jnz     .diff                   ; difference found
        add     r9, 32                  ; match, try to compare the next 32 bytes
        jmp     .cmp
.eq:
        xor     eax, eax
        vzeroupper                      ; zero upper bits of ymm and zmm registers
        ret
.diff:
        tzcnt   eax, eax                ; get position of first bit set
        add     eax, r9d
        cmp     eax, r8d                ; is the mismatch within length?
        jae     .eq
        or      al, 1
        vzeroupper                      ; zero upper bits of ymm and zmm registers
        ret
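One possible way around the over-read (a sketch under the same RCX/RDX/R8 convention, not part of wht36's code; it additionally clobbers r10): run the vector loop only over whole 32-byte blocks and finish the remainder a byte at a time, so nothing past the given length is ever read.

Code:
align 16
memcmp_safe:                            ; RCX left, RDX right, R8 length
        xor     r9, r9                  ; running offset
        mov     r10, r8
        and     r10, -32                ; length rounded down to whole 32-byte blocks
.vec:
        cmp     r9, r10
        jnb     .tail
        vmovdqu ymm0, [rcx + r9]
        vpcmpeqb ymm0, ymm0, [rdx + r9] ; FF where the bytes match
        vpmovmskb eax, ymm0
        xor     eax, -1                 ; set bits mark a mismatch
        jnz     .diff
        add     r9, 32
        jmp     .vec
.tail:
        cmp     r9, r8                  ; 0..31 bytes remain
        jnb     .eq
        mov     al, [rcx + r9]
        cmp     al, [rdx + r9]
        jne     .diff
        inc     r9
        jmp     .tail
.eq:
        xor     eax, eax                ; 0 = match
        jmp     .done
.diff:
        mov     eax, 1                  ; non-zero = mismatch
.done:
        vzeroupper                      ; harmless even if the vector loop never ran
        ret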