flat assembler
Message board for the users of flat assembler.
Index
> Windows > Don't want advice on optimizing Goto page Previous 1, 2 |
Author |
|
macomics 17 Apr 2022, 12:38
AE wrote: The only truly helpful comment in the thread from bitRAKE about SIMD, but this must be tested also, in addition, the downside of using a number of simd instructions is unsupported on CPUs older than 2005 though of course this is not a priority to support it... |
|||
17 Apr 2022, 12:38 |
|
AE 18 Apr 2022, 14:11
macomics wrote: If you are writing a 64-bit program, then SSE3 can be used at least. Even if there is an ancient dinosaur that supports 64-bit code and does not support SSE3, it is unlikely that it will use your code. I've checked and can safely use instructions up to SSE3. Suddenly I couldn't find a single relevant example on the web (there was some SSE4.2\AVX but not for SSE2) So I had to make it from scratch according to my own understanding of how it should work. The first test version uses SSE instructions completely useless because the step stay the same (2b), but even so, it was a little faster than the standard approach. The second version is now with 16 bytes stepping, although it requires preloading four! masks instead of two, however, the speed is several times faster. It's far from perfect, and I could use a couple of tips. The main problem - there is a bug, if the size of the string changes the second algo begins to produce a small inaccuracy. This can easily be checked by changing in code '184' on '185' etc. Code: Loops: 40000 StrStrIW......................................663.93ms (From ntdll) SISD...........................................11.00ms (Version from the first post) SIMD (dumb algo)................................9.00ms SIMD............................................2.00ms Code: format PE64 console entry start include 'win64w.inc' section '.text' code readable executable start: sub rsp, 60h cinvoke wprintf, <'%s'>, 'String Addr: ' cinvoke wprintf, <'%d Size: '>, string cinvoke wprintf, <'%d bytes',13,10>, stringlength call Search mov [retval], rax cinvoke wprintf, <'%s'>, 'Search1: ' cinvoke wprintf, <'Addr: 0x%llx Pos: '>, [retval] sub [retval], string shr [retval], 1h cinvoke wprintf, <'%d',13,10>, [retval] call Search2 mov [retval], rax cinvoke wprintf, <'%s'>, 'Search2: ' cinvoke wprintf, <'Addr: 0x%llx Pos: '>, [retval] sub [retval], string shr [retval], 1h cinvoke wprintf, <'%d'>, [retval] cinvoke getch invoke ExitProcess,0 Search: ; (dumb algo) mov rcx, string mov r10, stringlength mov rdx, PatU mov r15, PatL mov r8, rcx lea r10, [rcx+r10-16] movdqu xmm1, [rdx] movdqu xmm2, [r15] @@SRLoop: cmp r8, r10 ja @@Exit_ movdqu xmm0, [r8+2] movdqu xmm3, xmm0 add r8, 2h pcmpeqw xmm0, xmm1 pcmpeqw xmm3, xmm2 pxor xmm0, xmm3 pmovmskb eax, xmm0 xor eax, 0x0000FFFF test eax, eax jnz @@SRLoop @@Found_: bsf eax, eax mov r15, rax lea rax, [r8+r15] ret @@Exit_: xor rax, rax ret Search2: ; V2 mov r8, string ; pointer to string mov r10, stringlength ; length of string mov rdx, PatU ; substring uppercase mask mov r15, PatL ; substring lowercase mask mov r12, FCHU ; First char uppercase mask mov r13, FCHL ; First char lowercase mask lea r10, [r8+r10-16] ; Stop searching address movdqu xmm1, [r13] ; Load First char lowercase mask movdqu xmm2, [r12] ; Load First char uppercase mask movdqu xmm4, [rdx] ; Load substring uppercase mask movdqu xmm5, [r15] ; Load substring lowercase mask @SearchFirstChr: movdqu xmm0, [r8] ; Load current 16 bytes from string movdqu xmm3, xmm0 ; Duplicate it to xmm3 cmp r8, r10 ; check current position in string ja @Exit_2 ; EOS --> Not Found add r8, 16 ; Step forward +16 (substring size) pcmpeqw xmm0, xmm1 ; Compare string chunk vs first substring lcase char pcmpeqw xmm3, xmm2 ; Compare string chunk vs first substring ucase char pxor xmm0, xmm3 ; Combine to one pmovmskb eax, xmm0 ; Move byte mask to GPR 128-->8-bit test eax, eax ; If no bits are set - continue (nothing found) jz @SearchFirstChr @CharFound_: bsf eax, eax ; get position lea r15, [r8+rax-16] ; we need step back 16 (step forward has already happened) ; and add bsf offset (bad code, but I haven't come up with a neat solution yet) movdqu xmm0, [r15] ; Load current 16 bytes from string (movdqU ! unaligned here) movdqu xmm3, xmm0 ; Duplicate it to xmm3 pcmpeqw xmm0, xmm4 ; Compare string chunk vs full substring ucase mask pcmpeqw xmm3, xmm5 ; Compare string chunk vs full substring lcase mask pxor xmm0, xmm3 ; Combine to one pmovmskb eax, xmm0 ; Move byte mask to GPR 128-->8-bit xor eax, 0x0000FFFF ; If it is a complete match eax = 0x0000FFFF and xor will zero the register test eax, eax jnz @SearchFirstChr ; If any bits are set - continue @FoundMatch: bsf eax, eax ; get position mov r15, rax lea rax, [r8+r15-16] ; calc offset with match ret @Exit_2: ; Nothing xor rax, rax ret section '.data' data readable writeable ; the string string: du 184 dup('.'),'Substing',16 dup('.'),0 stringlength = $-string ; the substring PatU: du 'SUBSTING' PatL: du 'substing' FCHU: du 'SSSSSSSS' FCHL: du 'ssssssss' retval dq NULL section '.idata' import data readable library kernel32,'KERNEL32.DLL', msvcrt, 'msvcrt.dll' import msvcrt, wprintf, 'wprintf', getch, '_getch' import kernel32, ExitProcess, 'ExitProcess' |
|||
18 Apr 2022, 14:11 |
|
macomics 18 Apr 2022, 15:06
Code: ; eax = compare result @CharFound_: bsf eax, eax ; get position ----> eax = bit position lea r15, [r8+rax-16] ; we need step back 16 (step forward has already happened) ; and add bsf offset (bad code, but I haven't come up with a neat solution yet) movdqu xmm0, [r15] ; Load current 16 bytes from string (movdqU ! unaligned here) movdqu xmm3, xmm0 ; Duplicate it to xmm3 pcmpeqw xmm0, xmm4 ; Compare string chunk vs full substring ucase mask pcmpeqw xmm3, xmm5 ; Compare string chunk vs full substring lcase mask pxor xmm0, xmm3 ; Combine to one pmovmskb eax, xmm0 ; Move byte mask to GPR 128-->8-bit ; xor eax, 0x0000FFFF ; If it is a complete match eax = 0x0000FFFF and xor will zero the register ; test eax, eax cmp eax, 0x0000FFFF jnz @SearchFirstChr ; If any bits are set - continue ; rax = 0x0000000000000000 in your text @FoundMatch: ; bsf eax, eax ; get position rax = 0x0000000000000000 in your text ; mov r15, rax ; lea rax, [r8+r15-16] ; calc offset with match mov rax, r15 ; r15 already contains the required offset ret Placing unused labels in the code instead of comments is a bad job. Then you can confuse yourself: Code: @@Found_: ... @FoundMatch: |
|||
18 Apr 2022, 15:06 |
|
AE 19 Apr 2022, 04:22
So obvious...
Thank you, I will try to be more attentive |
|||
19 Apr 2022, 04:22 |
|
bitRAKE 19 Apr 2022, 07:08
Code: movdqu xmm0,[r15] ; get characters por xmm0,xmm1 ; convert to lowercase pcmpeqw xmm0,xmm2 ; match against lowercase substring pmovmskb eax,xmm0 ; result bits _________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup |
|||
19 Apr 2022, 07:08 |
|
bitRAKE 19 Apr 2022, 08:25
I'd also like to note that you are reading beyond the end of the string in the line after label @SearchFirstChr - this is generally considered bad form, but I do it all the time myself (so very bad ).
|
|||
19 Apr 2022, 08:25 |
|
AE 20 Apr 2022, 09:19
bitRAKE wrote: I'd also like to note that you are reading beyond the end of the string in the line after label @SearchFirstChr - this is generally considered bad form, but I do it all the time myself (so very bad ). Thx for your help! I already rewrite it (move position check before load) but nevertheless I do not see the specified error in the code above, because when we enter the procedure we subtract from the length of the string the length of the substring. At least I can't reproduce a situation in which an overrun of the buffer occurred. |
|||
20 Apr 2022, 09:19 |
|
bitRAKE 20 Apr 2022, 16:52
If you put the string on the boundary of an unallocated memory range then you would see the error.
_________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup |
|||
20 Apr 2022, 16:52 |
|
macomics 20 Apr 2022, 17:15
Only if the pointer to the beginning of the string is aligned by at least 16 bytes, this will not happen.
The tail will always be in the range of the allocated memory page. |
|||
20 Apr 2022, 17:15 |
|
Furs 21 Apr 2022, 13:01
Once again I'm asking if you considered other, better algorithms than the naive one. You realize there are better algorithms for SnR out there, right? No amount of optimization on this will change its worst-case algorithm complexity.
|
|||
21 Apr 2022, 13:01 |
|
AE 21 Apr 2022, 13:50
Furs wrote: if you considered other, better algorithms than the naive one If you mean Knuth-Morris-Pratt, Boyer–Moore, Rabin–Karp etc algorithms, they all have preprocessing time and need additional memory (I guess these two points potentially guarantee a loss to naive solution in terms of speed) so I did not consider them seriously for current case, in the same time I do not have a working implementation of any of these algorithms on x64 assembler therefore I can not compare the speed to make some final conclusions... |
|||
21 Apr 2022, 13:50 |
|
Goto page Previous 1, 2 < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.