flat assembler
Message board for the users of flat assembler.

Index > Main > 256 possible order. What is the fastest code?

Goto page Previous  1, 2, 3, 4, 5, 6  Next
Author
Thread Post new topic Reply to topic
Fastestcodes



Joined: 13 Jun 2022
Posts: 75
Fastestcodes 23 Aug 2022, 10:09
La00 Laff works
La0000 La00ff works.
L00 Lff does't compile. Why?
Q00 Qff works.
Post 23 Aug 2022, 10:09
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 23 Aug 2022, 10:22
Works for me.
Code:
L00:
Lff:    
Code:
flat assembler  version 1.73.08  (4040500 kilobytes memory)
1 passes, 0 bytes.    
Post 23 Aug 2022, 10:22
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 1042
Location: Russia
macomics 23 Aug 2022, 10:34
Fastestcodes wrote:
L00 Lff does't compile. Why?
For example, because of Lea

The name of one or more labels matches the name of the command.
Post 23 Aug 2022, 10:34
View user's profile Send private message Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2565
Furs 23 Aug 2022, 13:37
revolution wrote:
But in reality you have no clue about the internal CPU state, and many other things, so any predictions are rendered invalid.
Are they?

You test on one CPU, under one system configuration, and may even mess up in the process or be influenced by other lucky factors (alignment, cache, background processes, etc as you said). Cache matters if the workload done for each item invalidates it, for example.

I write software for the general audience, or even me in the future. I can't test on future CPUs for obvious reasons, or on everyone's CPUs and systems.

revolution wrote:
Besides, what is the problem with testing? Why so much resistance to it? Is everyone simply afraid to be proved wrong? By testing we do something called science. You test your assumptions and hypotheses to determine if they are real or imagined. Otherwise you just have a religion. X is better because Furs said it is. Don't ever test it lest we show the folly of such proclamations. Laughing Optimisation will be done by fiat! SmileSmile
Testing is fine but it shouldn't be something done routinely to decide what to code, unless you have strong reason or just want it for fun. As mentioned above there's far too many factors influencing the result.

Suppose you test and find out binary search is better. You use it. Later you re-test and find out linear search is better, cause stuff changed (?) or you tested differently. Now you rewrite your code?

Then you test on another CPU and find out some other method is better.

You don't see the problem? How many times do you rewrite this piece of code? And more importantly, since you have different results everytime, you can't even pick one to satisfy them all!

Take a step back and look at your advice here. What did this lead to the poor coder in question to decide for himself: Which code do you write here? Which one do you choose?

(my take: obviously binary search because it has far less worst case no matter how skewed the data pattern is; or a LUT which also is similar)

That's the problem with testing, it's not universal result. You say it's objective, but it's not, it depends on too many factors, same as coding by "rule of thumbs". But the latter allows you to write the code, once.


Last edited by Furs on 23 Aug 2022, 13:42; edited 1 time in total
Post 23 Aug 2022, 13:37
View user's profile Send private message Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2565
Furs 23 Aug 2022, 13:41
donn wrote:
Haven't read this entire thread, but seeing the good ol' optimization and performance arguments from revolution and furs again.

Noticing a pattern with revolution: you seem to err on the side of 'performance is not predictable until you test it.' And you give valid arguments. Not as familiar with furs' take, but seems to be more of the opposite, that testing is less significant. I agree more with this, but am happy to change my stance.

Question for revolution: There are articles on AMD Manuals such as "Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors". Do these articles have any value in your opinion, if so why or why not?

One quote from article:
Quote:
"Many factors affect instruction execution time. For instance, when a source operand must be loaded from a memory location, the time required to read the operand from system memory adds to the execution time."


Which is your point, but they detail these factors and give specific performance suggestions, such as:

Quote:
"Avoid branches/jumps in the calculation of values."

Performance is a risk assessment, no? It's 100% true that you can't know the exact performance, but you can predict probabilities based on principles, no? I've been doing a lot of Vulkan and SPIR-V lately and I keep having to learn concepts to get frame rates to render faster, like a 30 second clip rendered at 4 seconds per frame instead of it taking 24 hours total.

If I jumped into this thread under the wrong pretext, apologies, feel free to ignore.
Yeah I fully agree, there's a reason Agner Fog's optimization manuals and the official ones from the manufacturers are considered gold standard for anyone who wants to write optimized code (in any shape). Smile
Post 23 Aug 2022, 13:41
View user's profile Send private message Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 1042
Location: Russia
macomics 23 Aug 2022, 14:24
Furs wrote:
Suppose you test and find out binary search is better. You use it. Later you re-test and find out linear search is better, cause stuff changed (?) or you tested differently. Now you rewrite your code?
Code:
uselss_cmp_chain        objectively_better_binary       LUTs_are_awesome        computed_goto
2 passes, 1139 bytes.   2 passes, 1137 bytes.           2 passes, 1151 bytes.   3 passes, 1135 bytes.
   1.89 user               1.87 user                       1.88 user               1.87 user
   0.00 system             0.00 system                     0.00 system             0.00 system
0:01.90 elapsed         0:01.87 elapsed                 0:01.88 elapsed         0:01.88 elapsed    
Did the tests give such contradictory results? Maybe you'll try to analyze the results first before you continue to carry this foolishness of yours. A further test showed that two of the 4 options are objectively better than the others.
Code:
LUTs_are_awesome                                | computed_goto
2 passes, 1151 bytes.   2 passes, 1151 bytes.   | 3 passes, 1135 bytes.   2 passes, 1137 bytes.
  18.86 user              18.80 user            |   18.75 user              18.77 user
   0.00 system             0.34 system          |    0.00 system             0.28 system
0:18.88 elapsed         0:19.16 elapsed         | 0:18.77 elapsed         0:19.07 elapsed
;----------------------------------------------------------------------------------------------
uselss_cmp_chain                                | objectively_better_binary
2 passes, 1139 bytes.   2 passes, 1141 bytes.   | 2 passes, 1137 bytes.   2 passes, 1139 bytes.
  18.89 user              32.81 user            |   18.72 user              19.26 user
   0.57 system             0.01 system          |    0.75 system             0.00 system
0:19.48 elapsed         0:32.85 elapsed         | 0:19.50 elapsed         0:19.28 elapsed    
Even when increasing the length of the test combination
Code:
LUTs_are_awesome                                | computed_goto
2 passes, 1148 bytes.   2 passes, 1148 bytes.   | 3 passes, 1132 bytes.   2 passes, 1134 bytes.
  27.65 user              27.70 user            |   27.63 user              27.68 user
   0.00 system             0.96 system          |    0.01 system             1.02 system
0:27.67 elapsed         0:28.70 elapsed         | 0:27.66 elapsed         0:28.75 elapsed
;----------------------------------------------------------------------------------------------
uselss_cmp_chain                                | objectively_better_binary
2 passes, 1136 bytes.   2 passes, 1138 bytes.   | 2 passes, 1134 bytes.   2 passes, 1136 bytes.
  27.65 user            1:24.48 user            |   32.23 user            2:15.47 user
   0.00 system             0.06 system          |    0.02 system             0.00 system
0:27.67 elapsed         1:24.58 elapsed         | 0:32.28 elapsed         2:15.55 elapsed    
I understand, they were betting on an outsider. It was necessary to solve the problem that I gave you as an example.

Furs wrote:
Take a step back and look at your advice here. What did this lead to the poor coder in question to decide for himself: Which code do you write here? Which one do you choose?
Fastestcodes made the right choice in the direction of LUT, which was objectively evident from the test results.

Any algorithm requires estimates. The estimate can be given analytically (for example, O^log(n)) and practically (for example, see test results). And the fact that new hardware is coming out doesn't force developers to create new drivers for it?
Post 23 Aug 2022, 14:24
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 23 Aug 2022, 14:54
Furs wrote:
You don't see the problem?
There is a problem?
Furs wrote:
How many times do you rewrite this piece of code?
Once. I already posted my code. There is no need to re-write, you can add more tests to your hearts content. And then test them later on whatever system interests you.
Furs wrote:
And more importantly, since you have different results everytime, you can't even pick one to satisfy them all!
And that is why we test. To show if something is better or if it makes no difference. That is better than living in ignorance and guessing/hoping/pretending it will be fine.

And for each system we can test it, and then do final assembly, and now we save money and the boss gives us all bonuses. $$$$ Smile If you have never tested code then you won't be aware of all the awesome ways it is useful.
Post 23 Aug 2022, 14:54
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4073
Location: vpcmpistri
bitRAKE 23 Aug 2022, 18:05
Increase the data buffer to larger than L1 data cache - then LUT, not so good.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Post 23 Aug 2022, 18:05
View user's profile Send private message Visit poster's website Reply with quote
DimonSoft



Joined: 03 Mar 2010
Posts: 1228
Location: Belarus
DimonSoft 23 Aug 2022, 21:18
revolution wrote:
And for each system we can test it, and then do final assembly, and now we save money and the boss gives us all bonuses. $$$$ Smile If you have never tested code then you won't be aware of all the awesome ways it is useful.

And since we’ve spent like 5–10–100 times more time to write and test 5–10–100 implementations to choose from (and have then changed the overall program architecture so that the improvements achieved by analysing test results got finally lost), the boss more likely gives us a big kick out of the company.

If someone asks on possible ways to improve performance, the answer to test it is the worst in terms of usefulness. To test something you need to write it. Just brute-forcing all possible implementations with tiny differences having nearly the same perormance doesn’t improve performace. Knowledge of what changes might result in significant performance impact is.

This knowledge is not limited to what is written in optimization manuals. One might know a lot about RSs, ROBs, BTBs, caches and stuff but still be unable to see how to use it in their particular case. If one has never heard/thought about LUTs, they might be unable to even suggest it as one of possible implementations. So, what should they test then?

Good old joke wrote:
A man is flying in a hot air balloon and realizes he is lost. He reduces height and spots a man down below. He lowers the balloon further and shouts:

"Excuse me, can you help me? Where am I?"

The man below says, "You are in a hot air balloon."

"You must be a programmer," says the balloonist.

"I am," replies the man. "How did you know?"

"Well," says the balloonist, "what you have told me is technically correct, yet totally f&#king worthless."
Post 23 Aug 2022, 21:18
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 1042
Location: Russia
macomics 23 Aug 2022, 22:10
DimonSoft wrote:
Good old joke wrote:
A man is flying in a hot air balloon and realizes he is lost. He reduces height and spots a man down below. He lowers the balloon further and shouts:

"Excuse me, can you help me? Where am I?"

The man below says, "You are in a hot air balloon."

"You must be a programmer," says the balloonist.

"I am," replies the man. "How did you know?"

"Well," says the balloonist, "what you have told me is technically correct, yet totally f&#king worthless."
As he formulated the question, he received such an answer. If he changed the wording and added a little more details, then even a programmer could help him with solving his problem. It's all about the wrong technical task!
Post 23 Aug 2022, 22:10
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 24 Aug 2022, 03:44
DimonSoft wrote:
...the boss more likely gives us a big kick out of the company.
If you think I should be fired then you'll have to convince my boss with some fantastic argument, because up till now my boss is very pleased to save money. My boss doesn't trust anyone to simply proclaim something superior without proof. And rightly so, we are not a religious organisation.
DimonSoft wrote:
Knowledge of what changes might result in significant performance impact is.
Of course, but the key word there is "might". You won't know until you test it. Otherwise just guessing and merely claiming it is awesome, without any evidence, would get you a very bad review in my company.
Post 24 Aug 2022, 03:44
View user's profile Send private message Visit poster's website Reply with quote
Fastestcodes



Joined: 13 Jun 2022
Posts: 75
Fastestcodes 24 Aug 2022, 04:58
Good old joke wrote:
A man is flying in a hot air balloon and realizes he is lost. He reduces height and spots a man down below. He lowers the balloon further and shouts:

"Excuse me, can you help me? Where am I?"

The man below says, "You are in a hot air balloon."

"You must be a programmer," says the balloonist.

"I am," replies the man. "How did you know?"

"Well," says the balloonist, "what you have told me is technically correct, yet totally f&#king worthless."

Very HappyD
Code:
if question=notgoodenough then answer=notgoodenough    

What is the name of Evi....this place?
Underballoon square, replies the man.Smile
Post 24 Aug 2022, 04:58
View user's profile Send private message Reply with quote
sinsi



Joined: 10 Aug 2007
Posts: 794
Location: Adelaide
sinsi 24 Aug 2022, 05:57
Quote:
The man below says, "You are in a hot air balloon."

He's probably (hopefully) in a basket suspended from a hot air balloon, shirley?

We beat optimization to death in the masm32 forum. Simply changing the alignment of a proc from para to byte would speed it up on some CPUs sometimes.

To see the madness that optimization can produce, the function for copying memory is a fun read (from the Windows SDK or VS code? can't remember) but it's hundreds of lines of asm which basically trims the start and end to force the remainder to be copied with SSE/AVX Rolling Eyes

As for the original, the smallest for 256 values might be
Code:
    dec al
    js lff
    dec al
    js lfe
    ...etc
    
Post 24 Aug 2022, 05:57
View user's profile Send private message Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2565
Furs 24 Aug 2022, 13:14
sinsi wrote:
As for the original, the smallest for 256 values might be
A lookup table is smaller. For size, if you can, it's usually best to use data-driven approaches rather than code-based (instruction) approaches, because instruction encodings are extra space.

Another classic is calling a function 100 times that does all the work versus using a loop and a table with 100 entries that provides all the information (the parameters you pass to the function etc), the latter is far better.
Post 24 Aug 2022, 13:14
View user's profile Send private message Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2565
Furs 24 Aug 2022, 13:19
macomics wrote:
Did the tests give such contradictory results? Maybe you'll try to analyze the results first before you continue to carry this foolishness of yours. A further test showed that two of the 4 options are objectively better than the others.
I was talking to revolution and using a hypothetical example. No matter what he'll always ask to test, he'll ask to test even bubble sort vs qsort on 100000 items (or equivalent algorithms).

So when you decide to write such algorithms you have to write both, then write tests, preferably test on many different CPUs to arrive at the same result, and then pick qsort.

...Or you could've just picked qsort from the beginning.

I guess it's a good way to get paid for doing no useful work.



Actually this reminds me of the pure scientific method meme who want to prove/test everything themselves. You know, can't trust what other scientists discovered, you gotta literally prove everything yourself, or it's religion. Wink

...even if you don't have the necessary equipment. (in this case, all the different CPUs)
Post 24 Aug 2022, 13:19
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 24 Aug 2022, 13:28
Quicksort is not always faster than bubble sort. The data set matters.
Post 24 Aug 2022, 13:28
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 1042
Location: Russia
macomics 24 Aug 2022, 14:08
Furs wrote:
...Or you could've just picked qsort from the beginning.
And here you are cheating on yourself for some reason. I'll probably follow my experience and choose tsort, which often turns out to be faster than qsort. Just in sorting approaches, a binary tree is just the best choice, but with switch .. case it doesn't work so well anymore.
Post 24 Aug 2022, 14:08
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 24 Aug 2022, 14:16
All sorts are slower than some other sort, and all sorts are faster than some other sort. It all depends on your usage.

The data set will dictate which sort works best.

If you blindly follow the O(....) notation as the gospel truth to the bestest ever sort, then you might be missing out on something better because you never tried other options.
Post 24 Aug 2022, 14:16
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 1042
Location: Russia
macomics 24 Aug 2022, 14:24
revolution wrote:
All sorts are slower than some other sort, and all sorts are faster than some other sort. It all depends on your usage.

The data set will dictate which sort works best.

I agree with that. I was talking about the choice when I had already come to the dilemma of which algorithm to choose, when it was already clear that a common method would have to be used for sorting. The advantage of tsort is that it performs fewer comparison operations compared to qsort and fewer permutations in general.
Post 24 Aug 2022, 14:24
View user's profile Send private message Reply with quote
DimonSoft



Joined: 03 Mar 2010
Posts: 1228
Location: Belarus
DimonSoft 24 Aug 2022, 20:52
revolution wrote:
DimonSoft wrote:
...the boss more likely gives us a big kick out of the company.
If you think I should be fired then you'll have to convince my boss with some fantastic argument, because up till now my boss is very pleased to save money. My boss doesn't trust anyone to simply proclaim something superior without proof. And rightly so, we are not a religious organisation.

Ha-ha, cool ad hominem case. Nothing said about the cost of writing and functionally testing multiple implementations chosen blindly before actually doing measurements (or do you always write code with no bugs for even corner cases, or measure performance for code that doesn’t yet work right?). I’d love to see the world where every boss is ready to pay for unlimited research, yet most (luckily not all) companies I’m aware of are more willing to release some good-enough-but-as-fast-as-possible-within-the-time-limits version.

revolution wrote:
DimonSoft wrote:
Knowledge of what changes might result in significant performance impact is.
Of course, but the key word there is "might". You won't know until you test it. Otherwise just guessing and merely claiming it is awesome, without any evidence, would get you a very bad review in my company.

Yep. And then, after you’ve tested it, you make some changes in supposedly irrelevant piece of your program and — boom! — certain test results are not valid anymore due to changes to cache usage, data patterns, etc.

When people ask for help in optimization they expect some guidance on (a) better algorithms/data structures, (b) possible assembly-level tricks and their corner cases, both applicable to the particular case. There should be a disclaimer that everything said might not work in particular case and should be tested, but not the disclaimer only.

If one wants to help, they would better ask additional questions on what is known about the most frequent use cases for the implementation, or at least suggest some generic stuff to be tried. And “a good advice comes with a rationale”, so the suggestion goes with notes about when it might be faster and when might not. And then this is the place to say that only testing will let one know for sure. Still, you’d better also say here that testing will only let you know something about performance with this particular hardware, not about the common case for future users, but that’s at least something.

Also it’s worth keeping in mind that algorithmic optimizations tend to have more significant impact than lower-level ones, so just suggesting an O(N) algorithm instead of O(N^2) might be enough to even avoid discussing low-level optimizations, unless the one who asks has all his users run some very custom CPU than executes orders more instructions faster than a few ones.

Only saying that one should test (with no guidance on what possible alternatives there might be) is no better than writing “I don’t know”.
Post 24 Aug 2022, 20:52
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2, 3, 4, 5, 6  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.