flat assembler
Message board for the users of flat assembler.
Index
> Main > 256 possible order. What is the fastest code? Goto page Previous 1, 2, 3, 4, 5, 6 Next |
Author |
|
Fastestcodes 23 Aug 2022, 10:09
La00 Laff works
La0000 La00ff works. L00 Lff does't compile. Why? Q00 Qff works. |
|||
23 Aug 2022, 10:09 |
|
revolution 23 Aug 2022, 10:22
Works for me.
Code: L00: Lff: Code: flat assembler version 1.73.08 (4040500 kilobytes memory) 1 passes, 0 bytes. |
|||
23 Aug 2022, 10:22 |
|
macomics 23 Aug 2022, 10:34
Fastestcodes wrote: L00 Lff does't compile. Why? The name of one or more labels matches the name of the command. |
|||
23 Aug 2022, 10:34 |
|
Furs 23 Aug 2022, 13:41
donn wrote: Haven't read this entire thread, but seeing the good ol' optimization and performance arguments from revolution and furs again. |
|||
23 Aug 2022, 13:41 |
|
macomics 23 Aug 2022, 14:24
Furs wrote: Suppose you test and find out binary search is better. You use it. Later you re-test and find out linear search is better, cause stuff changed (?) or you tested differently. Now you rewrite your code? Code: uselss_cmp_chain objectively_better_binary LUTs_are_awesome computed_goto 2 passes, 1139 bytes. 2 passes, 1137 bytes. 2 passes, 1151 bytes. 3 passes, 1135 bytes. 1.89 user 1.87 user 1.88 user 1.87 user 0.00 system 0.00 system 0.00 system 0.00 system 0:01.90 elapsed 0:01.87 elapsed 0:01.88 elapsed 0:01.88 elapsed Code: LUTs_are_awesome | computed_goto 2 passes, 1151 bytes. 2 passes, 1151 bytes. | 3 passes, 1135 bytes. 2 passes, 1137 bytes. 18.86 user 18.80 user | 18.75 user 18.77 user 0.00 system 0.34 system | 0.00 system 0.28 system 0:18.88 elapsed 0:19.16 elapsed | 0:18.77 elapsed 0:19.07 elapsed ;---------------------------------------------------------------------------------------------- uselss_cmp_chain | objectively_better_binary 2 passes, 1139 bytes. 2 passes, 1141 bytes. | 2 passes, 1137 bytes. 2 passes, 1139 bytes. 18.89 user 32.81 user | 18.72 user 19.26 user 0.57 system 0.01 system | 0.75 system 0.00 system 0:19.48 elapsed 0:32.85 elapsed | 0:19.50 elapsed 0:19.28 elapsed Code: LUTs_are_awesome | computed_goto 2 passes, 1148 bytes. 2 passes, 1148 bytes. | 3 passes, 1132 bytes. 2 passes, 1134 bytes. 27.65 user 27.70 user | 27.63 user 27.68 user 0.00 system 0.96 system | 0.01 system 1.02 system 0:27.67 elapsed 0:28.70 elapsed | 0:27.66 elapsed 0:28.75 elapsed ;---------------------------------------------------------------------------------------------- uselss_cmp_chain | objectively_better_binary 2 passes, 1136 bytes. 2 passes, 1138 bytes. | 2 passes, 1134 bytes. 2 passes, 1136 bytes. 27.65 user 1:24.48 user | 32.23 user 2:15.47 user 0.00 system 0.06 system | 0.02 system 0.00 system 0:27.67 elapsed 1:24.58 elapsed | 0:32.28 elapsed 2:15.55 elapsed Furs wrote: Take a step back and look at your advice here. What did this lead to the poor coder in question to decide for himself: Which code do you write here? Which one do you choose? Any algorithm requires estimates. The estimate can be given analytically (for example, O^log(n)) and practically (for example, see test results). And the fact that new hardware is coming out doesn't force developers to create new drivers for it? |
|||
23 Aug 2022, 14:24 |
|
revolution 23 Aug 2022, 14:54
Furs wrote: You don't see the problem? Furs wrote: How many times do you rewrite this piece of code? Furs wrote: And more importantly, since you have different results everytime, you can't even pick one to satisfy them all! And for each system we can test it, and then do final assembly, and now we save money and the boss gives us all bonuses. $$$$ If you have never tested code then you won't be aware of all the awesome ways it is useful. |
|||
23 Aug 2022, 14:54 |
|
bitRAKE 23 Aug 2022, 18:05
Increase the data buffer to larger than L1 data cache - then LUT, not so good.
_________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup |
|||
23 Aug 2022, 18:05 |
|
DimonSoft 23 Aug 2022, 21:18
revolution wrote: And for each system we can test it, and then do final assembly, and now we save money and the boss gives us all bonuses. $$$$ If you have never tested code then you won't be aware of all the awesome ways it is useful. And since we’ve spent like 5–10–100 times more time to write and test 5–10–100 implementations to choose from (and have then changed the overall program architecture so that the improvements achieved by analysing test results got finally lost), the boss more likely gives us a big kick out of the company. If someone asks on possible ways to improve performance, the answer to test it is the worst in terms of usefulness. To test something you need to write it. Just brute-forcing all possible implementations with tiny differences having nearly the same perormance doesn’t improve performace. Knowledge of what changes might result in significant performance impact is. This knowledge is not limited to what is written in optimization manuals. One might know a lot about RSs, ROBs, BTBs, caches and stuff but still be unable to see how to use it in their particular case. If one has never heard/thought about LUTs, they might be unable to even suggest it as one of possible implementations. So, what should they test then? Good old joke wrote: A man is flying in a hot air balloon and realizes he is lost. He reduces height and spots a man down below. He lowers the balloon further and shouts: |
|||
23 Aug 2022, 21:18 |
|
macomics 23 Aug 2022, 22:10
DimonSoft wrote:
|
|||
23 Aug 2022, 22:10 |
|
revolution 24 Aug 2022, 03:44
DimonSoft wrote: ...the boss more likely gives us a big kick out of the company. DimonSoft wrote: Knowledge of what changes might result in significant performance impact is. |
|||
24 Aug 2022, 03:44 |
|
Fastestcodes 24 Aug 2022, 04:58
Good old joke wrote: A man is flying in a hot air balloon and realizes he is lost. He reduces height and spots a man down below. He lowers the balloon further and shouts: D Code: if question=notgoodenough then answer=notgoodenough What is the name of Evi....this place? Underballoon square, replies the man. |
|||
24 Aug 2022, 04:58 |
|
sinsi 24 Aug 2022, 05:57
Quote: The man below says, "You are in a hot air balloon." He's probably (hopefully) in a basket suspended from a hot air balloon, shirley? We beat optimization to death in the masm32 forum. Simply changing the alignment of a proc from para to byte would speed it up on some CPUs sometimes. To see the madness that optimization can produce, the function for copying memory is a fun read (from the Windows SDK or VS code? can't remember) but it's hundreds of lines of asm which basically trims the start and end to force the remainder to be copied with SSE/AVX As for the original, the smallest for 256 values might be Code: dec al js lff dec al js lfe ...etc |
|||
24 Aug 2022, 05:57 |
|
Furs 24 Aug 2022, 13:14
sinsi wrote: As for the original, the smallest for 256 values might be Another classic is calling a function 100 times that does all the work versus using a loop and a table with 100 entries that provides all the information (the parameters you pass to the function etc), the latter is far better. |
|||
24 Aug 2022, 13:14 |
|
Furs 24 Aug 2022, 13:19
macomics wrote: Did the tests give such contradictory results? Maybe you'll try to analyze the results first before you continue to carry this foolishness of yours. A further test showed that two of the 4 options are objectively better than the others. So when you decide to write such algorithms you have to write both, then write tests, preferably test on many different CPUs to arrive at the same result, and then pick qsort. ...Or you could've just picked qsort from the beginning. I guess it's a good way to get paid for doing no useful work. Actually this reminds me of the pure scientific method meme who want to prove/test everything themselves. You know, can't trust what other scientists discovered, you gotta literally prove everything yourself, or it's religion. ...even if you don't have the necessary equipment. (in this case, all the different CPUs) |
|||
24 Aug 2022, 13:19 |
|
revolution 24 Aug 2022, 13:28
Quicksort is not always faster than bubble sort. The data set matters.
|
|||
24 Aug 2022, 13:28 |
|
macomics 24 Aug 2022, 14:08
Furs wrote: ...Or you could've just picked qsort from the beginning. |
|||
24 Aug 2022, 14:08 |
|
revolution 24 Aug 2022, 14:16
All sorts are slower than some other sort, and all sorts are faster than some other sort. It all depends on your usage.
The data set will dictate which sort works best. If you blindly follow the O(....) notation as the gospel truth to the bestest ever sort, then you might be missing out on something better because you never tried other options. |
|||
24 Aug 2022, 14:16 |
|
macomics 24 Aug 2022, 14:24
revolution wrote: All sorts are slower than some other sort, and all sorts are faster than some other sort. It all depends on your usage. I agree with that. I was talking about the choice when I had already come to the dilemma of which algorithm to choose, when it was already clear that a common method would have to be used for sorting. The advantage of tsort is that it performs fewer comparison operations compared to qsort and fewer permutations in general. |
|||
24 Aug 2022, 14:24 |
|
DimonSoft 24 Aug 2022, 20:52
revolution wrote:
Ha-ha, cool ad hominem case. Nothing said about the cost of writing and functionally testing multiple implementations chosen blindly before actually doing measurements (or do you always write code with no bugs for even corner cases, or measure performance for code that doesn’t yet work right?). I’d love to see the world where every boss is ready to pay for unlimited research, yet most (luckily not all) companies I’m aware of are more willing to release some good-enough-but-as-fast-as-possible-within-the-time-limits version. revolution wrote:
Yep. And then, after you’ve tested it, you make some changes in supposedly irrelevant piece of your program and — boom! — certain test results are not valid anymore due to changes to cache usage, data patterns, etc. When people ask for help in optimization they expect some guidance on (a) better algorithms/data structures, (b) possible assembly-level tricks and their corner cases, both applicable to the particular case. There should be a disclaimer that everything said might not work in particular case and should be tested, but not the disclaimer only. If one wants to help, they would better ask additional questions on what is known about the most frequent use cases for the implementation, or at least suggest some generic stuff to be tried. And “a good advice comes with a rationale”, so the suggestion goes with notes about when it might be faster and when might not. And then this is the place to say that only testing will let one know for sure. Still, you’d better also say here that testing will only let you know something about performance with this particular hardware, not about the common case for future users, but that’s at least something. Also it’s worth keeping in mind that algorithmic optimizations tend to have more significant impact than lower-level ones, so just suggesting an O(N) algorithm instead of O(N^2) might be enough to even avoid discussing low-level optimizations, unless the one who asks has all his users run some very custom CPU than executes orders more instructions faster than a few ones. Only saying that one should test (with no guidance on what possible alternatives there might be) is no better than writing “I don’t know”. |
|||
24 Aug 2022, 20:52 |
|
Goto page Previous 1, 2, 3, 4, 5, 6 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.