flat assembler
Message board for the users of flat assembler.
Index
> Windows > Don't want advice on optimizing Goto page 1, 2 Next |
Author |
|
revolution 16 Apr 2022, 08:28
Without knowing the characteristics of the target systems, and the strings in question, it is difficult to know what to suggest.
If the runtime is memory bound then optimise for fewer memory accesses. Perhaps make a LUT for each character of the search string, to make a skip table or something. Try to make better use of caches and other in-CPU buffers. In general it is hard to say really. But whatever you do, always test the final performance with real data in a real application. Synthetic tests are mostly not relevant except in some very special circumstances. Also, make sure to profile the final app to see where the real bottlenecks are, and not just assume a particular function is the hot spot. Wasting time optimising the wrong place is the worst. |
|||
16 Apr 2022, 08:28 |
|
AE 16 Apr 2022, 08:57
revolution, Thanks.
revolution wrote: In general it is hard to say really. But let's skip the general optimization tips, they are known. There is more than enough background details to advise on this particular implementation. |
|||
16 Apr 2022, 08:57 |
|
revolution 16 Apr 2022, 10:06
AE wrote: There is more than enough background details to advise on this particular implementation. Code runs at different speeds depending upon the CPU type, the mobo, the RAM, the OS, other tasks running, the cache state, etc. There are too many unknowns. Someone could optimise that one example above, for their system, and other people get different results, on their other systems. Another person could add other search examples, and get opposite performance just because the strings have changed. |
|||
16 Apr 2022, 10:06 |
|
AE 16 Apr 2022, 10:28
revolution wrote: With only a single example string and search pattern given I don't think it will help for any real application. Unless the real app only uses that single string and search pattern. 1. substring always = 16 bytes and contain only english chars [A-Z]|[a-z] 2. string always = 16-384 bytes and contain [A-Z]|[a-z]|[\] We can confidently presume that the test application almost completely reflects the real context. revolution wrote: CPU type, the mobo, the RAM, the OS, other tasks running, the cache state, etc As for the rest, you're being a bit disingenuous here, applications are not written for any particular PC and the basics of code optimization is quite general. It makes absolutely no difference how much memory you have on your PC to find a string of 8 characters in a string of 192 characters, isn't it? |
|||
16 Apr 2022, 10:28 |
|
revolution 16 Apr 2022, 11:57
RAM speed has an effect on code performance. But also RAM size. If other tasks are taking memory, perhaps there a lot of swapping, or TLB misses, or cache misses, etc.
CPU has an effect. An AMD E350 will give different performance results than an Intel 10900X, you will need to use different instructions to get the best performance for each. 32-bit vs 64-bit OS has an effect. Some code runs better in 64-bit, some is better in 32-bit, but the OS used can limit the choice and force one to use slower code. Other OS tasks can have higher priorities causing starvation of runtime for other apps. The length of the string matters for cache performance. If it is always one byte more than the cache line width then you can get bad performance. If it is some mixture of variable lengths, then the performance will change depending upon the nature of the input data. |
|||
16 Apr 2022, 11:57 |
|
DimonSoft 16 Apr 2022, 14:50
revolution wrote: RAM speed has an effect on code performance. But also RAM size. If other tasks are taking memory, perhaps there a lot of swapping, or TLB misses, or cache misses, etc. Since the app in question cannot influence these anyway, it’s just a matter of talking about which solution would be better for low-memory conditions. Generally having more free memory just improves the situation. revolution wrote: CPU has an effect. An AMD E350 will give different performance results than an Intel 10900X, you will need to use different instructions to get the best performance for each. There are code differences that cause same or different direction of performance change. If some change that works for improving performance for any hardware, it’s worth discussing. If the performance change varies, it’s worth discussing in a conditional manner. revolution wrote: 32-bit vs 64-bit OS has an effect. Some code runs better in 64-bit, some is better in 32-bit, but the OS used can limit the choice and force one to use slower code. The code provided by the topic-starter explicitly sets up the code bitness. The rest is not controllable by an application. revolution wrote: The length of the string matters for cache performance. If it is always one byte more than the cache line width then you can get bad performance. If it is some mixture of variable lengths, then the performance will change depending upon the nature of the input data. Which gives a hint that it’s worth trying to put the data in such a way that it takes as few cache lines as possible without algorithm degradation. And see if the increased code complexity is worth it. Yes, cache line size differs for different processors. No, generic tips are still worth it, like “if the data pieces are next to each other, it generally has better performance impact”. Yes, depending on how the cache is implemented and how the cache lines have been used before the exact numbers may increase or decrease but the factors are not controllable, so only the general direction is what really matters. P.S. But, of course, it makes such a conversation just a collection of quotes from optimization manuals. Still, reading optimization manual is one thing but applying the information in particular case is another one. I guess, this case is the second one. |
|||
16 Apr 2022, 14:50 |
|
revolution 16 Apr 2022, 18:59
DimonSoft wrote: Generally ... for any hardware ... generic ... the general direction is what really matters ... AE wrote: But let's skip the general optimization tips, they are known. Specific cases are impossible to apply if you don't know the target system, or the nature of the source data. We don't optimise the code, that is just a consequence of the process. We optimise for the system, the entire system, including the hardware, OS, usage, data, etc. Any change in those and we need to re-optimise, or at the very least re-test to make sure it is still optimal on the new system. No single piece of code will be optimal in all cases. My system might require different code than your system. We won't know until the full app is up and running, and we can test it. |
|||
16 Apr 2022, 18:59 |
|
macomics 16 Apr 2022, 19:24
Code: ; Search (case insensitive) and replace all substrings in a string ; for the sake of speed, the procedure is intentionally limited at some points ; ; rcx = pointer to string ; rdx = length of string (bytes) can be in range 16~384 ; r8 = pointer to substring ; r9 = length of substring (always = 16 bytes here) |
|||
16 Apr 2022, 19:24 |
|
revolution 16 Apr 2022, 19:34
macomics wrote: The use of Unicode strings in Windows ... Unicode is not a string format. Something like UTF-8, or ASCII, are. You need to define an encoding before you can use Unicode. |
|||
16 Apr 2022, 19:34 |
|
macomics 16 Apr 2022, 19:41
revolution wrote: Unicode is not a string format. Something like UTF-8, or ASCII, are. You need to define an encoding before you can use Unicode. |
|||
16 Apr 2022, 19:41 |
|
bitRAKE 16 Apr 2022, 20:41
SIMD can work effectively here by loading blocks of characters and converting to the case of the substring to match -- we want to increase the constant factors in the code, moving from unknown-to-known. Two comparison/branch per character is the base (i.e. slow) case.
String operations like MOVSQ should be used in two cases: size optimization, and with REP* prefix of known large value (modern processors optimize this case, test). Case sensitive matching might benefit from REPNZ SCASQ if the likelihood of matching is low/singular. When the structure of the code is optimized you will have one loop and one exit condition (predictable). Within the main loop will be another branch (unpredictable) around the match case. [fasmg] Replace use of "string". _________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup Last edited by bitRAKE on 16 Apr 2022, 21:28; edited 1 time in total |
|||
16 Apr 2022, 20:41 |
|
revolution 16 Apr 2022, 20:58
bitRAKE wrote: When the structure of the code is optimized you will have one loop and one exit condition (predictable). Within the main loop will be another branch (unpredictable) around the match case. Indeed for the example given a linear search is sub-optimal since the locations are known at compile time, we can simply do a fixed position replacement. |
|||
16 Apr 2022, 20:58 |
|
bitRAKE 16 Apr 2022, 21:33
revolution wrote: Perhaps a LUT+skip algorithm is more optimal for the actual source data. _________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup |
|||
16 Apr 2022, 21:33 |
|
revolution 16 Apr 2022, 21:43
bitRAKE wrote: ]The LUT+skip code would also fit the model I suggest. The match method is irrelevant. bitRAKE wrote: I'm speak generally, and didn't even consider the static case presented. AE wrote: But let's skip the general optimization tips, they are known. |
|||
16 Apr 2022, 21:43 |
|
bitRAKE 16 Apr 2022, 21:53
revolution wrote: The match method is vital, it is the heart of the task. revolution wrote: Now you also see the problem. _________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup |
|||
16 Apr 2022, 21:53 |
|
AE 17 Apr 2022, 09:09
Is the function written so well that no one has been able to suggest optimization even for a single instruction in it?
Apparently I deserve a beer |
|||
17 Apr 2022, 09:09 |
|
macomics 17 Apr 2022, 09:51
AE wrote: Is the function written so well that no one has been able to suggest optimization even for a single instruction in it? |
|||
17 Apr 2022, 09:51 |
|
AE 17 Apr 2022, 10:28
macomics wrote: We have written to you how to optimize it macomics wrote: use lengths in characters instead of bytes...For example: rdx = 21 (bytes) is an invalid string or a 10-character string The only truly helpful comment in the thread from bitRAKE about SIMD, but this must be tested also, in addition, the downside of using a number of simd instructions is unsupported on CPUs older than 2005 though of course this is not a priority to support it... The rest is philosophy and abstract discussing around of the optimization subject and hardware diversity. No offense |
|||
17 Apr 2022, 10:28 |
|
Furs 17 Apr 2022, 11:43
Have you considered a better algorithm? Or must you necessarily use the naive SnR algorithm?
|
|||
17 Apr 2022, 11:43 |
|
Goto page 1, 2 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.