flat assembler
Message board for the users of flat assembler.
  
sse 4.2 and sse 2 strcmp and strlen

Ali.Z 10 Jun 2024, 22:27
SSE 4.2: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2
SSE2: https://github.com/aklomp/sse-strings/blob/master/src/strcmpeq_sse2.asm
The same repo also contains an SSE 4.2 version, but it is much more compact than the SSE 4.2 implementation above.
_________________
Asm For Wise Humans

AsmGuru62 11 May 2025, 22:59
It looks like this function only detects whether the strings are equal or not, so it cannot be used for sorting strings. But that is OK, it is still a good implementation.

Once someone posted a STRLEN function that compared 4 bytes at a time. I decided to measure whether REPNE SCASB would be faster. Curious fact: the REPNE SCASB approach was slower by around 30%! Just interesting stuff!

From the looks of it, the function scans every character twice: first SCASB, then CMPSB. I would have gone for a single pass over the buffers:

Code:
align 16
String_IsEqual:
; ---------------------------------------------------------------------------
; Input:
;   rsi = pointer to buffer #1
;   rdi = pointer to buffer #2
; Output:
;   CF = 1 for EQUAL (easier to JNC/JC after CALL)
; ---------------------------------------------------------------------------
    xor     ecx, ecx            ; index into both buffers
@@:
    mov     al, [rsi + rcx]
    mov     dl, [rdi + rcx]
    cmp     al, dl
    jne     .not_equal
    test    al, al
    jz      .equal
    inc     ecx
    jmp     @r
.equal:
    stc
    ret
.not_equal:
    clc
    ret

macomics 12 May 2025, 03:15
Code:
; Input:
;   rsi = pointer to buffer #1
;   rdi = pointer to buffer #2
; Output:
;   rfl = result of comparison, rax = length
;   easier to JNZ/JZ/JA/JB/etc after CALL
; Uses: rcx
; Works in both directions (DF)
align 16
String_IsEqual:
    push    rsi
    xor     ecx, ecx
    push    rdi
    sub     rcx, rsi
@@:
    lods    byte [rsi]
    scas    byte [rdi]
    jnz     @f
    or      al, al
    jnz     @b
@@:
    lea     rax, [rsi+rcx]      ; length (add will affect flags)
    pop     rdi
    pop     rsi
    retn

Jessé 12 May 2025, 08:51
I've tested all 3 of them, scanning 128-byte equal buffers (so they are guaranteed to run all the way to the end). The function is called with rdi and rsi pointers, with the strings in different locations, and the results are:

1st: AsmGuru62 -> 255 cycles average*
2nd: Jessé -> 570 cycles average*
3rd: macomics -> 652 cycles average*

While reading about the history and evolution of x86 processors some hours ago, I found out that string instructions were heavyweight on the early models, and, indeed, this continues to be slightly true nowadays.

Tests were on my AMD Ryzen 7 4800HS; soon I'll test on my Intel machines (Core i5 and Core i7) to see how they differ. Cycle by cycle, Intel usually wins most of the time whenever instructions go beyond the basics. The measurements are based on the TSC, and the clock is limited to 3608 MHz maximum.

Well done, AsmGuru62! I'll review this against the SSE4.x and SSE2 codes and come back here soon.

* The tests were made in ring 3, in a multitasking environment, with a single thread pinned to the same CPU, 200 million loops.

Edit: I must add that testing in a multitasking environment leads to coarse measurements, so these numbers should not be taken as 100% precise.

Roman 12 May 2025, 09:08
Code:
eqTxt:
    xor     ecx,ecx             ; index into both buffers
    xor     edx,edx
@@:
    mov     al, [rsi + rcx]     ;we can compare rax(8 bytes) or eax(4 bytes) or ax(2 bytes)
    cmp     al, [rdi + rcx]
    jne     .not_equal
    inc     ecx
    cmp     al, ' '             ;if any symbol 13,10,32,9=tab,0 done.
    ja      @b
.equal:
    inc     dl
.not_equal:
    ret

txt  db 'classics movies'
find db 'classics '

;in code
    mov     rsi,txt
    mov     rdi,find
    call    eqTxt
    cmp     dl,1
    jnz     @f
    MsgBox 'equal'
@@:

Last edited by Roman on 13 May 2025, 07:04; edited 3 times in total

revolution 12 May 2025, 09:57
If you use sub al, [address] instead of CMP, then you can return AL directly. The value in AL will indicate the difference (above = +, below = -, equal = 0), and the value in RCX can be the position of the difference. This can be useful for sorting.

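A minimal sketch of this idea in C rather than assembly, for illustration only. The function name `compare_with_pos` and its `pos` out-parameter are hypothetical and do not come from the thread. The return value plays the role of AL and `*pos` plays the role of RCX. One difference worth noting: the C version widens each byte to `int` before subtracting, so the sign of the result stays reliable for all byte values, whereas an 8-bit difference can wrap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: single pass over both strings.
 * Returns <0, 0 or >0 (first differing bytes compared as unsigned values)
 * and stores the index of the first difference, or of the terminator
 * when the strings are equal, in *pos. */
int compare_with_pos(const char *a, const char *b, size_t *pos)
{
    size_t i = 0;
    int diff;

    assert(a != NULL && b != NULL && pos != NULL);

    for (;;) {
        /* Widen to int before subtracting so the sign cannot wrap. */
        diff = (int)(unsigned char)a[i] - (int)(unsigned char)b[i];
        if (diff != 0 || a[i] == '\0')
            break;          /* mismatch, or both strings ended together */
        i++;
    }

    *pos = i;
    return diff;
}
```

A sort routine could use the sign of the result directly, and the position tells the caller how long the common prefix is.
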
Roman 12 May 2025, 10:27
Quote: sub al, cmp, ecx

Ideally we would need just one asm instruction, cmptxt, that does not clobber any registers and only sets the Z flag.

Code:
symbol equ !
cmptxt txt1,txt2,symbol ;exit if 0 or symbol , or . or | or @ or # or space

Lazy Intel.

Last edited by Roman on 12 May 2025, 11:07; edited 5 times in total

Jessé 12 May 2025, 10:30
In the meantime, I've decided to also include the libc strcmp() function in my tests. And the results were horrible (for us!)... 😂

The libc function wins by being 30 to 40 times faster than our challengers here. But there's a catch: on my system, libc uses AVX2 to do its work, and, indeed, it uses a very clever way of comparing strings: vpcmpeqb to compare the chunks, another comparison against 0 to locate the null terminator, and tzcnt to locate the end of the string. It also has some weirdness in it, because it just jumps to a block that is identical to the AVX2 processing block, but in a different location, 3 times! They also get the length out of it, I suppose.

It wins because it can process 32 bytes of data in every chunk. So, if you multiply by 32, the results are similar to what we've reached here on a per-byte basis. But a win is a win! And this means the guy writing the libc C code probably eats AVX for breakfast every day! 🤓

I'll try to create a version of it using SSEx myself, before I read the docs above. Just for fun, and for self-taught learning. Debugging is life, guys!

"In tempo": I'm quite used to seeing weird jumps in C-generated code, and I know this is a compiler thing rather than being programmatically intentional.

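The chunked approach described above can be sketched portably in C. This is not the libc code and it does not use AVX2: as a stand-in it compares 8 bytes at a time with plain 64-bit integer operations (a SWAR technique), then resolves the exact position byte by byte instead of using tzcnt. The name `chunked_strcmp` is hypothetical. The sketch assumes, loudly, that both buffers are readable up to the next multiple of 8 bytes from their start, since a whole chunk is loaded even when the terminator sits in the middle of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOW7  0x7F7F7F7F7F7F7F7FULL
#define HIGHS 0x8080808080808080ULL

/* 0x80 in every byte of v that is zero, 0x00 in every other byte. */
static uint64_t zero_bytes(uint64_t v)
{
    return ~(((v & LOW7) + LOW7) | v) & HIGHS;
}

/* Hypothetical sketch. ASSUMPTION: a and b are each readable in whole
 * 8-byte chunks up to and including the chunk holding the terminator. */
int chunked_strcmp(const char *a, const char *b)
{
    size_t i = 0;

    assert(a != NULL && b != NULL);

    /* Skip chunks that are identical and contain no terminator. */
    for (;;) {
        uint64_t wa, wb;
        memcpy(&wa, a + i, sizeof wa);   /* unaligned-safe load */
        memcpy(&wb, b + i, sizeof wb);
        if (wa != wb || zero_bytes(wa) != 0)
            break;
        i += 8;
    }

    /* The current chunk holds a mismatch or a terminator: find it. */
    for (;; i++) {
        unsigned char ca = (unsigned char)a[i];
        unsigned char cb = (unsigned char)b[i];
        if (ca != cb)
            return (ca > cb) - (ca < cb);
        if (ca == 0)
            return 0;
    }
}
```

The same structure carries over to wider vectors: only the chunk size and the "mismatch or zero in this chunk" test change.
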
Roman 12 May 2025, 10:43
Some users do not have a CPU with AVX2.

A good solution is to check the CPU for AVX2, AVX and SSE support, and to have several functions: one each for SSE, AVX and AVX2, plus a simple variant.

cmpTxtProc dq eqTxt ;qword pointer for 64-bit code; if cpu supports avx2 set this to avx2cmpStrings

;in code do
call [cmpTxtProc]

revolution 12 May 2025, 10:51
All code needs to be tuned to the CPU/GPU and the data set it is run on.

There is no generic code that can outperform custom tuned code specifically designed for its narrow purpose.

Ali.Z 12 May 2025, 10:58
Roman wrote: Some users do not have a CPU with AVX2.

Correct, SSEx should be the bare minimum IMO. Windows 2000 had support for SSE2, Windows 98 had support for SSE1.
_________________
Asm For Wise Humans

Jessé 12 May 2025, 11:02
Yep. I guess libc detects what is available on each system and tunes itself accordingly. Or maybe the dynamic linker does it beforehand, but this is unknown territory for me for now. I can only speculate.

But keep in mind that cpuid is a heavy instruction, so running it once at startup and setting some flags in a variable for what is available is a better approach than calling it frequently to check for things. In my tests, it consumes between 100 and 200 cycles (again based on the TSC) for every leaf. I usually do this for cpuid features in all my applications.

macomics 12 May 2025, 11:23
revolution wrote: If you use sub al, [address] instead of CMP then you can return AL directly and the value in AL will indicate the difference above=+/below=-/equal=0, and the value in RCX can be the position of difference. Can be useful for sorting.

Code:
strcmp:
    or      rcx, -1
@@:
    inc     rcx
    mov     al, [rsi + rcx]
    cmp     al, [rdi + rcx]
    jne     @f
    cmp     al, 0               ;if any symbol
    jnz     @b
@@:
    retn

    lea     rsi, [str0]
    lea     rdi, [str1]
    call    strcmp
    ja      .strabove
    jc      .strbelow
.strequal:
    ...
.strabove:
    ...
.strbelow:
    ...

macomics 12 May 2025, 11:35
Jessé wrote: But keep in mind that cpuid is a heavy instruction, so running it once at startup and setting some flags in a variable for what is available is a better approach than calling it frequently to check for things. In my tests, it consumes between 100 and 200 cycles (again based on the TSC) for every leaf.

Here Roman has already shown you a better way to do it than using flags. After cpuid, you can simply configure a set of variables to perform indirect function calls.

Roman wrote:

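The indirect-call setup described above might look roughly like this in C. It is only a sketch: `cmp_txt_proc`, `select_cmp`, `cmp_simple` and `cmp_wide` are hypothetical names, the `cpu_has_avx2` argument is assumed to come from a one-time cpuid check done elsewhere at startup, and `cmp_wide` is just a placeholder that calls the C library where a real SSE/AVX routine would go.

```c
#include <assert.h>
#include <string.h>

typedef int (*cmp_fn)(const char *, const char *);

/* Simple byte-by-byte variant: always available. */
int cmp_simple(const char *a, const char *b)
{
    for (;; a++, b++) {
        unsigned char ca = (unsigned char)*a;
        unsigned char cb = (unsigned char)*b;
        if (ca != cb)
            return (ca > cb) - (ca < cb);
        if (ca == 0)
            return 0;
    }
}

/* Placeholder for a vectorised variant (ASSUMPTION: stands in for
 * an SSE/AVX2 routine; here it simply defers to the C library). */
int cmp_wide(const char *a, const char *b)
{
    int r = strcmp(a, b);
    return (r > 0) - (r < 0);
}

/* The dispatch variable: defaults to the simple variant. */
static cmp_fn cmp_txt_proc = cmp_simple;

/* Call once at startup with the result of the CPU feature check. */
void select_cmp(int cpu_has_avx2)
{
    cmp_txt_proc = cpu_has_avx2 ? cmp_wide : cmp_simple;
}

/* Every later call goes through the pointer; no feature test needed. */
int cmp_txt(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return cmp_txt_proc(a, b);
}
```

After `select_cmp` has run, callers pay for one indirect call and never touch cpuid or a flags variable again.
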
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.