flat assembler
Message board for the users of flat assembler.

Index > Windows > Don't want advice on optimizing

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
AE



Joined: 07 Apr 2022
Posts: 29
AE
Need tips on how to optimize the procedure "SnR" for speed (or maybe for size).

Remarks
  • I intentionally did not use AVX, SSE instructions because the range of the string does not exceed 192 bytes and they will not have a significant speed advantage.
  • I have tried to document the source code as much as possible to make it easier to understand
  • For convenience, the code is presented as a working example


Code:
format PE64 console
entry start

include 'win64a.inc'

section '.text' code readable executable

start:
        sub    rsp, 20h
        mov    rcx, string
        mov    rdx, stringlength
        mov    r8,  substring
        mov    r9,  substringlength
        call   SnR

        ; Dump first 16 bytes for a replacement quick test
        cinvoke printf, <'0x%llx | '>, string
        cinvoke printf, <'%llx', 32>, qword [string]
        cinvoke printf, <'%llx', 32, '|', 32>, qword [string+8]
        xor    r15, r15
        @PLoop:
        cinvoke printf, <'%c'>, qword [string+r15]
        add    r15, 1
        cmp    r15, 15
        jb     @PLoop

        cinvoke getch
        invoke  ExitProcess,0

SnR:
        ; Search (case insensitive) and replace all substrings in a string
        ; for the sake of speed, the procedure is intentionally limited at some points
        ;
        ; rcx = pointer to string
        ; rdx = length of string (bytes)  can be in range 16~384
        ; r8  = pointer to substring
        ; r9  = length of substring (always = 16 bytes here)
        ;
        sub     r9,  2h               ; 

        ; cmp    rdx, r9              ; ommited - the size is known and string always > substring
        ; cmp    rdx, 0               ; ommited - the size is known and always > 0

        xor r10, r10 ; buff ^
        xor r11, r11 ; patt ^

        ; Enter the compare loop
    @Loop:
        movzx eax, word [rcx + r10] ; eax Instead of ax for breaking dependences
        movzx ebx, word [r8  + r11] ; ebx Instead of bx for breaking dependences

        cmp eax, ebx                ; not [xor ax, bx + jz] since it will change 'ax' and the next check (if present) will be invalid
        je  @Match
        xor eax, $20                ; | 
        xor eax, ebx                ; | Uncomment for case insensitive search (works for us since substring contains letters only [A-Z]|[a-z])
        jz  @Match                  ; |
      
    @Mismatch:
        xor r11, r11                ; reset the position in the substring (look for the first character again)
        lea r10, [r10+2h]           ; shift to the next string character  (same as add r10, 2h)
        cmp r10, rdx                ; check if we have reached the end of the string
        jb  @Loop                   ; keep searching
        jmp @NotFound               ; end of the string
      
    @Match:
        lea r10, [r10+2h]           ; shift to the next string character    (same as add r10, 2h)
        lea r11, [r11+2h]           ; shift to the next substring character (same as add r11, 2h)
        cmp r11, r9                 ; check if we have reached the end of the substring
        jae @Found
        cmp r10, rdx                ; check if we have reached the end of the string
        jb  @Loop                   ; keep searching

    @NotFound:
        xor rax, rax
        ret
      
    @Found:
        ; for each match, replace the substring 
        lea rsi, [replacement]      ; src
        lea rdi, [rcx]              ; dst
        add rdi, r10                ; Calculate the absolute address of the current search position in the string
        sub rdi, r9                 ; Shift to the beginning of the match
        movsq                       ; replace the found instance
        movsq                       ; twice to replace all 16 bytes
        jmp @Mismatch               ; keep searching



section '.data' data readable  writeable
    ; the string
    string:       du 'Substing',176 dup('x'),'Substing',0
    stringlength = $-string

    ; the substring
    substring:    du 'Substing'
    substringlength = $-substring

    ; the replacement
    replacement:  du 'REPLACED' ; always same size as substring


section '.idata' import data readable ; writeable
    library kernel32,'KERNEL32.DLL', msvcrt, 'msvcrt.dll'
    import msvcrt,   printf, 'printf', getch, '_getch'
    import kernel32, ExitProcess, 'ExitProcess'    
Post 16 Apr 2022, 08:18
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18487
Location: In your JS exploiting you and your system
revolution
Without knowing the characteristics of the target systems, and the strings in question, it is difficult to know what to suggest.

If the runtime is memory bound then optimise for fewer memory accesses. Perhaps make a LUT for each character of the search string, to make a skip table or something. Try to make better use of caches and other in-CPU buffers.

In general it is hard to say really.

But whatever you do, always test the final performance with real data in a real application. Synthetic tests are mostly not relevant except in some very special circumstances.

Also, make sure to profile the final app to see where the real bottlenecks are, and not just assume a particular function is the hot spot. Wasting time optimising the wrong place is the worst.
Post 16 Apr 2022, 08:28
View user's profile Send private message Visit poster's website Reply with quote
AE



Joined: 07 Apr 2022
Posts: 29
AE
revolution, Thanks.
revolution wrote:
In general it is hard to say really.

But let's skip the general optimization tips, they are known.
There is more than enough background details to advise on this particular implementation.
Post 16 Apr 2022, 08:57
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18487
Location: In your JS exploiting you and your system
revolution
AE wrote:
There is more than enough background details to advise on this particular implementation.
With only a single example string and search pattern given I don't think it will help for any real application. Unless the real app only uses that single string and search pattern.

Code runs at different speeds depending upon the CPU type, the mobo, the RAM, the OS, other tasks running, the cache state, etc.

There are too many unknowns. Someone could optimise that one example above, for their system, and other people get different results, on their other systems. Another person could add other search examples, and get opposite performance just because the strings have changed.
Post 16 Apr 2022, 10:06
View user's profile Send private message Visit poster's website Reply with quote
AE



Joined: 07 Apr 2022
Posts: 29
AE
revolution wrote:
With only a single example string and search pattern given I don't think it will help for any real application. Unless the real app only uses that single string and search pattern.
As mentioned in source code:
1. substring always = 16 bytes and contain only english chars [A-Z]|[a-z]
2. string always = 16-384 bytes and contain [A-Z]|[a-z]|[\]
We can confidently presume that the test application almost completely reflects the real context.
revolution wrote:
CPU type, the mobo, the RAM, the OS, other tasks running, the cache state, etc
OS - Windows x64 Win7++
As for the rest, you're being a bit disingenuous here, applications are not written for any particular PC and the basics of code optimization is quite general.
It makes absolutely no difference how much memory you have on your PC to find a string of 8 characters in a string of 192 characters, isn't it?
Post 16 Apr 2022, 10:28
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18487
Location: In your JS exploiting you and your system
revolution
RAM speed has an effect on code performance. But also RAM size. If other tasks are taking memory, perhaps there a lot of swapping, or TLB misses, or cache misses, etc.

CPU has an effect. An AMD E350 will give different performance results than an Intel 10900X, you will need to use different instructions to get the best performance for each.

32-bit vs 64-bit OS has an effect. Some code runs better in 64-bit, some is better in 32-bit, but the OS used can limit the choice and force one to use slower code.

Other OS tasks can have higher priorities causing starvation of runtime for other apps.

The length of the string matters for cache performance. If it is always one byte more than the cache line width then you can get bad performance. If it is some mixture of variable lengths, then the performance will change depending upon the nature of the input data.
Post 16 Apr 2022, 11:57
View user's profile Send private message Visit poster's website Reply with quote
DimonSoft



Joined: 03 Mar 2010
Posts: 1013
Location: Belarus
DimonSoft
revolution wrote:
RAM speed has an effect on code performance. But also RAM size. If other tasks are taking memory, perhaps there a lot of swapping, or TLB misses, or cache misses, etc.

Since the app in question cannot influence these anyway, it’s just a matter of talking about which solution would be better for low-memory conditions. Generally having more free memory just improves the situation.

revolution wrote:
CPU has an effect. An AMD E350 will give different performance results than an Intel 10900X, you will need to use different instructions to get the best performance for each.

There are code differences that cause same or different direction of performance change. If some change that works for improving performance for any hardware, it’s worth discussing. If the performance change varies, it’s worth discussing in a conditional manner.

revolution wrote:
32-bit vs 64-bit OS has an effect. Some code runs better in 64-bit, some is better in 32-bit, but the OS used can limit the choice and force one to use slower code.

Other OS tasks can have higher priorities causing starvation of runtime for other apps.

The code provided by the topic-starter explicitly sets up the code bitness. The rest is not controllable by an application.

revolution wrote:
The length of the string matters for cache performance. If it is always one byte more than the cache line width then you can get bad performance. If it is some mixture of variable lengths, then the performance will change depending upon the nature of the input data.

Which gives a hint that it’s worth trying to put the data in such a way that it takes as few cache lines as possible without algorithm degradation. And see if the increased code complexity is worth it.

Yes, cache line size differs for different processors. No, generic tips are still worth it, like “if the data pieces are next to each other, it generally has better performance impact”. Yes, depending on how the cache is implemented and how the cache lines have been used before the exact numbers may increase or decrease but the factors are not controllable, so only the general direction is what really matters.

P.S. But, of course, it makes such a conversation just a collection of quotes from optimization manuals. Still, reading optimization manual is one thing but applying the information in particular case is another one. I guess, this case is the second one.
Post 16 Apr 2022, 14:50
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18487
Location: In your JS exploiting you and your system
revolution
DimonSoft wrote:
Generally ... for any hardware ... generic ... the general direction is what really matters ...
Hmmm.
AE wrote:
But let's skip the general optimization tips, they are known.
Now you see the problem.

Specific cases are impossible to apply if you don't know the target system, or the nature of the source data.

We don't optimise the code, that is just a consequence of the process. We optimise for the system, the entire system, including the hardware, OS, usage, data, etc. Any change in those and we need to re-optimise, or at the very least re-test to make sure it is still optimal on the new system.

No single piece of code will be optimal in all cases. My system might require different code than your system. We won't know until the full app is up and running, and we can test it.
Post 16 Apr 2022, 18:59
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 492
Location: Russia
macomics
Code:
        ; Search (case insensitive) and replace all substrings in a string
        ; for the sake of speed, the procedure is intentionally limited at some points
        ;
        ; rcx = pointer to string
        ; rdx = length of string (bytes)  can be in range 16~384
        ; r8  = pointer to substring
        ; r9  = length of substring (always = 16 bytes here)    
If you are setting a task related to strings, it is more reasonable to use lengths in characters instead of bytes. Otherwise, you complicate your own life. The use of Unicode strings in Windows is associated with certain rules. For example: rdx = 21 (bytes) is an invalid string or a 10-character string. ( r8 | rcx ) & 1 == 0 is also a requirement.
Post 16 Apr 2022, 19:24
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18487
Location: In your JS exploiting you and your system
revolution
macomics wrote:
The use of Unicode strings in Windows ...
The Windows kernel generally uses UCS-2 encoding internally.

Unicode is not a string format. Something like UTF-8, or ASCII, are. You need to define an encoding before you can use Unicode.
Post 16 Apr 2022, 19:34
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 492
Location: Russia
macomics
revolution wrote:
Unicode is not a string format. Something like UTF-8, or ASCII, are. You need to define an encoding before you can use Unicode.
So this is also a necessary requirement. I didn't write that I specified all the conditions. Only in the source text, not exactly Windows kernel is used, but msvcrt. This library can also perform manipulations with the source data.
Post 16 Apr 2022, 19:41
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 3388
Location: vpcmipstrm
bitRAKE
SIMD can work effectively here by loading blocks of characters and converting to the case of the substring to match -- we want to increase the constant factors in the code, moving from unknown-to-known. Two comparison/branch per character is the base (i.e. slow) case.

String operations like MOVSQ should be used in two cases: size optimization, and with REP* prefix of known large value (modern processors optimize this case, test). Case sensitive matching might benefit from REPNZ SCASQ if the likelihood of matching is low/singular.

When the structure of the code is optimized you will have one loop and one exit condition (predictable). Within the main loop will be another branch (unpredictable) around the match case.

[fasmg] Replace use of "string".

_________________
¯\(°_o)/¯ unlicense.org


Last edited by bitRAKE on 16 Apr 2022, 21:28; edited 1 time in total
Post 16 Apr 2022, 20:41
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18487
Location: In your JS exploiting you and your system
revolution
bitRAKE wrote:
When the structure of the code is optimized you will have one loop and one exit condition (predictable). Within the main loop will be another branch (unpredictable) around the match case.
Why? Perhaps a LUT+skip algorithm is more optimal for the actual source data.

Indeed for the example given a linear search is sub-optimal since the locations are known at compile time, we can simply do a fixed position replacement.
Post 16 Apr 2022, 20:58
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 3388
Location: vpcmipstrm
bitRAKE
revolution wrote:
Perhaps a LUT+skip algorithm is more optimal for the actual source data.
The LUT+skip code would also fit the model I suggest. The match method is irrelevant. I'm speaking generally, and didn't even consider the static case presented.

_________________
¯\(°_o)/¯ unlicense.org
Post 16 Apr 2022, 21:33
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18487
Location: In your JS exploiting you and your system
revolution
bitRAKE wrote:
]The LUT+skip code would also fit the model I suggest. The match method is irrelevant.
It requires an extra setup stage. And the code layout is different, new data structures and things. The match method is vital, it is the heart of the task.
bitRAKE wrote:
I'm speak generally, and didn't even consider the static case presented.
AE wrote:
But let's skip the general optimization tips, they are known.
Now you also see the problem.
Post 16 Apr 2022, 21:43
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 3388
Location: vpcmipstrm
bitRAKE
revolution wrote:
The match method is vital, it is the heart of the task.
Yes, the heart fits within the loop. Try again.

revolution wrote:
Now you also see the problem.
That you pick and choose to say whatever you want. Yeah, I know that problem.

_________________
¯\(°_o)/¯ unlicense.org
Post 16 Apr 2022, 21:53
View user's profile Send private message Visit poster's website Reply with quote
AE



Joined: 07 Apr 2022
Posts: 29
AE
Is the function written so well that no one has been able to suggest optimization even for a single instruction in it?
Apparently I deserve a beer Image
Post 17 Apr 2022, 09:09
View user's profile Send private message Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 492
Location: Russia
macomics
AE wrote:
Is the function written so well that no one has been able to suggest optimization even for a single instruction in it?
Apparently I deserve a beer Image
We have written to you how to optimize it. Just didn't write the code for you.
Post 17 Apr 2022, 09:51
View user's profile Send private message Reply with quote
AE



Joined: 07 Apr 2022
Posts: 29
AE
macomics wrote:
We have written to you how to optimize it
The only thing you wrote was
macomics wrote:
use lengths in characters instead of bytes...For example: rdx = 21 (bytes) is an invalid string or a 10-character string
And in this context I do not see any use for it since the function's input string is always correct.
The only truly helpful comment in the thread from bitRAKE about SIMD, but this must be tested also, in addition, the downside of using a number of simd instructions is unsupported on CPUs older than 2005 though of course this is not a priority to support it...
The rest is philosophy and abstract discussing around of the optimization subject and hardware diversity. No offense Wink
Post 17 Apr 2022, 10:28
View user's profile Send private message Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 1779
Furs
Have you considered a better algorithm? Or must you necessarily use the naive SnR algorithm?
Post 17 Apr 2022, 11:43
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.