flat assembler
Message board for the users of flat assembler.

Index > Windows > Don't want advice on optimizing

Goto page Previous  1, 2
Author
Thread Post new topic Reply to topic
bitRAKE



Joined: 21 Jul 2003
Posts: 3387
Location: vpcmipstrm
bitRAKE
tbh, it looks like homework.

The array being capped at 384 bytes really limits the problem. No magic algorithm to save the day here - just simple optimizations. This might be what revolution was alluding to - the inherent contradiction in what is being said here.

Pass me a beer.

_________________
¯\(°_o)/¯ unlicense.org
Post 17 Apr 2022, 12:14
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 483
Location: Russia
macomics
AE wrote:
The only truly helpful comment in the thread from bitRAKE about SIMD, but this must be tested also, in addition, the downside of using a number of simd instructions is unsupported on CPUs older than 2005 though of course this is not a priority to support it...
If you are writing a 64-bit program, then SSE3 can be used at least. Even if there is an ancient dinosaur that supports 64-bit code and does not support SSE3, it is unlikely that it will use your code.
Post 17 Apr 2022, 12:38
View user's profile Send private message Reply with quote
AE



Joined: 07 Apr 2022
Posts: 29
AE
macomics wrote:
If you are writing a 64-bit program, then SSE3 can be used at least. Even if there is an ancient dinosaur that supports 64-bit code and does not support SSE3, it is unlikely that it will use your code.

I've checked and can safely use instructions up to SSE3.
Suddenly I couldn't find a single relevant example on the web Shocked (there was some SSE4.2\AVX but not for SSE2)
So I had to make it from scratch according to my own understanding of how it should work. The first test version uses SSE instructions completely useless because the step stay the same (2b), but even so, it was a little faster than the standard approach. The second version is now with 16 bytes stepping, although it requires preloading four! masks instead of two, however, the speed is several times faster.
It's far from perfect, and I could use a couple of tips.
The main problem - there is a bug, if the size of the string changes the second algo begins to produce a small inaccuracy. This can easily be checked by changing in code '184' on '185' etc.

Code:
Loops: 40000 
StrStrIW......................................663.93ms (From ntdll)
SISD...........................................11.00ms (Version from the first post)
SIMD (dumb algo)................................9.00ms
SIMD............................................2.00ms    


Code:
format PE64 console
entry start

include 'win64w.inc'

section '.text' code readable executable

start:
        sub    rsp, 60h
        cinvoke wprintf, <'%s'>, 'String Addr: '
        cinvoke wprintf, <'%d Size: '>, string
        cinvoke wprintf, <'%d bytes',13,10>, stringlength
        call   Search
        mov    [retval], rax
        cinvoke wprintf, <'%s'>, 'Search1: '
        cinvoke wprintf, <'Addr: 0x%llx  Pos: '>, [retval]
        sub [retval], string
        shr [retval], 1h
        cinvoke wprintf, <'%d',13,10>, [retval]
        call   Search2
        mov    [retval], rax
        cinvoke wprintf, <'%s'>, 'Search2: '
        cinvoke wprintf, <'Addr: 0x%llx  Pos: '>, [retval]
        sub [retval], string
        shr [retval], 1h
        cinvoke wprintf, <'%d'>, [retval]
        cinvoke getch
        invoke  ExitProcess,0



Search: ; (dumb algo)
        mov     rcx, string
        mov     r10, stringlength
        mov     rdx, PatU
        mov     r15, PatL
        mov     r8, rcx
        lea     r10, [rcx+r10-16]
        movdqu xmm1,   [rdx]
        movdqu xmm2,   [r15]
@@SRLoop:
        cmp    r8, r10
        ja     @@Exit_
        movdqu xmm0, [r8+2]
        movdqu xmm3,  xmm0
        add r8, 2h   
        pcmpeqw xmm0, xmm1
        pcmpeqw xmm3, xmm2
        pxor   xmm0, xmm3
        pmovmskb eax, xmm0
        xor  eax, 0x0000FFFF
        test eax, eax
        jnz @@SRLoop
@@Found_:
        bsf eax, eax
        mov r15, rax
        lea rax, [r8+r15]
        ret 
@@Exit_:
        xor rax, rax
        ret


Search2: ; V2
        mov    r8,   string                ; pointer to string 
        mov    r10,  stringlength          ; length of string
        mov    rdx,  PatU                  ; substring  uppercase mask
        mov    r15,  PatL                  ; substring  lowercase mask
        mov    r12,  FCHU                  ; First char uppercase mask
        mov    r13,  FCHL                  ; First char lowercase mask

        lea    r10, [r8+r10-16]            ; Stop searching address

        movdqu xmm1,  [r13]                ; Load First char lowercase mask
        movdqu xmm2,  [r12]                ; Load First char uppercase mask
        movdqu xmm4,  [rdx]                ; Load substring  uppercase mask
        movdqu xmm5,  [r15]                ; Load substring  lowercase mask


@SearchFirstChr:
        movdqu xmm0,  [r8]                 ; Load current 16 bytes from string
        movdqu xmm3,  xmm0                 ; Duplicate it to xmm3
        cmp    r8, r10                     ; check current position in string
        ja     @Exit_2                     ; EOS --> Not Found

        add    r8, 16                      ; Step forward +16 (substring size)
        pcmpeqw xmm0, xmm1                 ; Compare string chunk vs first substring lcase char
        pcmpeqw xmm3, xmm2                 ; Compare string chunk vs first substring ucase char
        pxor    xmm0, xmm3                 ; Combine to one
        pmovmskb eax, xmm0                 ; Move byte mask to GPR 128-->8-bit

        test eax, eax                      ; If no bits are set - continue (nothing found)
        jz  @SearchFirstChr

@CharFound_:
        bsf    eax, eax                    ; get position
        lea    r15, [r8+rax-16]            ; we need step back 16 (step forward has already happened) 
        ;                                    and add bsf offset (bad code, but I haven't come up with a neat solution yet)
        movdqu  xmm0, [r15]                ; Load current 16 bytes from string (movdqU ! unaligned here)
        movdqu  xmm3, xmm0                 ; Duplicate it to xmm3
        pcmpeqw xmm0, xmm4                 ; Compare string chunk vs full substring ucase mask
        pcmpeqw xmm3, xmm5                 ; Compare string chunk vs full substring lcase mask
        pxor    xmm0, xmm3                 ; Combine to one
        pmovmskb eax, xmm0                 ; Move byte mask to GPR 128-->8-bit
        xor      eax, 0x0000FFFF           ; If it is a complete match eax = 0x0000FFFF and xor will zero the register
        test     eax, eax
        jnz @SearchFirstChr                ; If any bits are set - continue

@FoundMatch:
        bsf    eax, eax                   ; get position
        mov    r15, rax
        lea    rax, [r8+r15-16]           ; calc offset with match
        ret

@Exit_2:                                  ; Nothing
        xor rax, rax
        ret


section '.data' data readable  writeable
    ; the string
    string:       du 184 dup('.'),'Substing',16 dup('.'),0
    stringlength = $-string

    ; the substring
    PatU:          du 'SUBSTING'
    PatL:          du 'substing'
    FCHU:          du 'SSSSSSSS'
    FCHL:          du 'ssssssss'
    retval         dq NULL


section '.idata' import data readable
    library kernel32,'KERNEL32.DLL', msvcrt, 'msvcrt.dll'
    import msvcrt,   wprintf, 'wprintf', getch, '_getch'
    import kernel32, ExitProcess, 'ExitProcess'


    
Post 18 Apr 2022, 14:11
View user's profile Send private message Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 483
Location: Russia
macomics
Code:
; eax = compare result
@CharFound_:
        bsf    eax, eax                    ; get position ----> eax = bit position
        lea    r15, [r8+rax-16]            ; we need step back 16 (step forward has already happened) 
        ;                                    and add bsf offset (bad code, but I haven't come up with a neat solution yet)
        movdqu  xmm0, [r15]                ; Load current 16 bytes from string (movdqU ! unaligned here)
        movdqu  xmm3, xmm0                 ; Duplicate it to xmm3
        pcmpeqw xmm0, xmm4                 ; Compare string chunk vs full substring ucase mask
        pcmpeqw xmm3, xmm5                 ; Compare string chunk vs full substring lcase mask
        pxor    xmm0, xmm3                 ; Combine to one
        pmovmskb eax, xmm0                 ; Move byte mask to GPR 128-->8-bit
;       xor      eax, 0x0000FFFF           ; If it is a complete match eax = 0x0000FFFF and xor will zero the register
;       test     eax, eax
        cmp      eax, 0x0000FFFF
        jnz @SearchFirstChr                ; If any bits are set - continue
; rax = 0x0000000000000000 in your text
@FoundMatch:
;       bsf    eax, eax                   ; get position rax = 0x0000000000000000 in your text
;       mov    r15, rax
;       lea    rax, [r8+r15-16]           ; calc offset with match
        mov     rax, r15 ; r15 already contains the required offset
        ret    

Placing unused labels in the code instead of comments is a bad job. Then you can confuse yourself:
Code:
@@Found_:
...
@FoundMatch:
    
Post 18 Apr 2022, 15:06
View user's profile Send private message Reply with quote
AE



Joined: 07 Apr 2022
Posts: 29
AE
So obvious...
Thank you, I will try to be more attentive Smile
Post 19 Apr 2022, 04:22
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 3387
Location: vpcmipstrm
bitRAKE
Code:
        movdqu          xmm0,[r15]      ; get characters
        por             xmm0,xmm1       ; convert to lowercase
        pcmpeqw         xmm0,xmm2       ; match against lowercase substring
        pmovmskb        eax,xmm0        ; result bits    
... it's not faster per se - just less code/data - same dependency chain length. XMM1 would be 0x0020 to convert characters to lowercase.

_________________
¯\(°_o)/¯ unlicense.org
Post 19 Apr 2022, 07:08
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 3387
Location: vpcmipstrm
bitRAKE
I'd also like to note that you are reading beyond the end of the string in the line after label @SearchFirstChr - this is generally considered bad form, but I do it all the time myself (so very bad Embarassed).
Post 19 Apr 2022, 08:25
View user's profile Send private message Visit poster's website Reply with quote
AE



Joined: 07 Apr 2022
Posts: 29
AE
bitRAKE wrote:
I'd also like to note that you are reading beyond the end of the string in the line after label @SearchFirstChr - this is generally considered bad form, but I do it all the time myself (so very bad Embarassed).

Thx for your help!
I already rewrite it (move position check before load) but nevertheless I do not see the specified error in the code above, because when we enter the procedure we subtract from the length of the string the length of the substring. At least I can't reproduce a situation in which an overrun of the buffer occurred.
Post 20 Apr 2022, 09:19
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 3387
Location: vpcmipstrm
bitRAKE
If you put the string on the boundary of an unallocated memory range then you would see the error.

_________________
¯\(°_o)/¯ unlicense.org
Post 20 Apr 2022, 16:52
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 483
Location: Russia
macomics
Only if the pointer to the beginning of the string is aligned by at least 16 bytes, this will not happen.
The tail will always be in the range of the allocated memory page.
Post 20 Apr 2022, 17:15
View user's profile Send private message Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 1772
Furs
Once again I'm asking if you considered other, better algorithms than the naive one. You realize there are better algorithms for SnR out there, right? No amount of optimization on this will change its worst-case algorithm complexity.
Post 21 Apr 2022, 13:01
View user's profile Send private message Reply with quote
AE



Joined: 07 Apr 2022
Posts: 29
AE
Furs wrote:
if you considered other, better algorithms than the naive one
I am open to considering any alternative.

If you mean Knuth-Morris-Pratt, Boyer–Moore, Rabin–Karp etc algorithms, they all have preprocessing time and need additional memory (I guess these two points potentially guarantee a loss to naive solution in terms of speed) so I did not consider them seriously for current case, in the same time I do not have a working implementation of any of these algorithms on x64 assembler therefore I can not compare the speed to make some final conclusions...
Post 21 Apr 2022, 13:50
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.