flat assembler
Message board for the users of flat assembler.

clearing the bit 63 of a register

Furs 26 Sep 2022, 14:25
revolution wrote:
And as with all of these techniques, testing will guide better than guessing or assuming (or counting cycles from Agner Fog's files).
I always wonder what you would do if you were to write a compiler optimizer for a target architecture (i.e. low level optimizations, like instruction scheduling, like GCC's RTL, not GIMPLE), especially for a "generic" target tuning so it can run reasonably fast on as many CPUs as possible. It's a serious thing to ponder. ;)

macomics 26 Sep 2022, 14:41
In order to make an optimization of the "generic" type, you must first formulate at least two other optimizations for specific CPUs. Only then will the general points become clear and it will be possible to compile a "generic" mechanism. Your example is bad. Come up with more.

revolution 26 Sep 2022, 22:45
Furs wrote:
I always wonder what you would do if you were to write a compiler optimizer for a target architecture (i.e. low level optimizations, like instruction scheduling, like GCC's RTL, not GIMPLE), especially for a "generic" target tuning so it can run reasonably fast on as many CPUs as possible. It's a serious thing to ponder. ;)
You are still focussed on the instructions. In many cases it won't matter which instructions you use. Instead focus on the data flows.

Changing an instruction may give you some minor improvements here and there. Changing your data flows can give you major improvements everywhere.

Write your HLL code to optimise the data flows; this will likely be the most fruitful thing you can do to get good performance.

Then let the compiler do the minor things like instruction selection. This may give a small amount of improvement, or it may make no difference, or it may get it wrong and give a slowdown. But these differences will likely be minor compared to getting the data flows running well.

DimonSoft 27 Sep 2022, 10:08
revolution wrote:
DimonSoft wrote:
... and the possibility of multiple instructions running faster than a single one is limited to the processor’s internals, especially the number of pipeline stages and pipelines
That isn't what I meant.

You can, for example, do stream pre-loads, then process data with more instructions (that theoretically take longer) while the data are in the CPU, and stream-store the results. Bonus points for overlapping loads with processing of the previous data. Compare that to using fewer instructions on a single datum, with multiple rounds into memory and back for each layer of computation.

You can also do more with multiple fields in a data structure while you have it in a register, to save storing it as separate fields. Sometimes this needs more instructions to complete the computation, but is an overall time saver if memory accesses are optimised well.

And as with all of these techniques, testing will guide better than guessing or assuming (or counting cycles from Agner Fog's files).
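
The quoted idea of doing more work per datum while it is in the CPU, instead of several rounds through memory, can be sketched roughly in C. This is only an illustration under stated assumptions: the function names and the scale/offset/clamp arithmetic are made up for the example, inputs are assumed small enough not to overflow, and an optimizing compiler may well fuse the loops on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: three separate passes, so every element
   is loaded from and stored to memory three times. */
static void scale_offset_clamp_multipass(int32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) v[i] *= 3;
    for (size_t i = 0; i < n; i++) v[i] += 7;
    for (size_t i = 0; i < n; i++) if (v[i] > 100) v[i] = 100;
}

/* Same result in one fused pass: each element is loaded once,
   fully processed in a local (likely a register), and stored once. */
static void scale_offset_clamp_fused(int32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int32_t x = v[i];
        x = x * 3 + 7;
        if (x > 100) x = 100;
        v[i] = x;
    }
}
```

Both functions compute the same values; whether the fused form is actually faster on a given machine and data size is something only a measurement can show.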

All nice and cool, but if people refused to analyze and predict, we’d have no science at all. The predictive capability of science is what makes it useful. We know the target architecture (it’s either specified explicitly or implied) and are discussing possible implementations. Measuring gives information about a particular CPU model, no more, no less. But with such cool things as previous experience (take a look in the dictionary) and knowledge of how things work, it’s perfectly valid to attempt to predict the results not only for a particular model but also for a wider range of CPUs not readily available for measurement and, to some extent, for future models as well.

The suggestion to look at data flow is all good and great, but after that is done, the same question arises: is this implementation good enough for this particular data flow, or can something be done to improve it with little effort and great effect? In fact, every performance-related question really implies this, along with the possibility that, if the implementation shown is obviously bad from a data flow POV, somebody will suggest an alternative.

People ask to solve their particular problem within their particular limitations, conditions and factors. And they expect guidance for their particular case, meanwhile learning something new. Suggestions to measure and watch for data flow are good in general but have little to no value by themselves.

It’s like, when faced with a question of what speed might the falling ball have when it touches the ground, saying “different types of matter tend to move, different forces tend to affect the moving objects, throw it and see”. Thanks, Captain, but I needed the answer to prevent the ball from exploding and it was unique. “If you fall down without a parachute, is it better to gather your arms and legs as close as possible or to spread them as wide as possible?” (“to survive” implied, “falling onto the Earth” implied, “average human being with average clothes” implied). “Go on and try both!” Thanks, I might not have a chance to do the second measurement.

“Is it worth taking a large stone to the Moon to prevent the spaceship from being moved by winds?”—“Pay billions more to be able to take it, then take it and try”. Really explains why people tend to doubt if people have ever been to the Moon.

revolution 27 Sep 2022, 11:29
It's about directing limited resources to the places that matter the most.

If the CPU outpaces the memory system by a handy margin (like every system made today) then improving your CPU instruction latency/throughput is kind of dumb when you can instead get the data flowing better, and not have to worry so deeply about what the CPU does because the CPU is spending most of its time waiting for data.

If you have a very niche application (like say Prime95 or ycruncher) and testing shows that the CPU performance is also going to be important, then go for it, and keep trying your options for instruction choice. And when you have the CPU once again taking less time to compute than it takes to get the data to/from the memory then you can stop wasting time with the CPU stuff. Use testing to show you when that point is reached.

If you are memory bound (like most common apps are) then improve your memory accesses. If you are CPU bound then work on that area. Only testing can show you which situation you are in.

If it is important enough to "get it faster" then it is important to make sure it really is faster.

Furs 27 Sep 2022, 13:22
revolution wrote:
You are still focussed on the instructions. In many cases it won't matter which instructions you use. Instead focus on the data flows.

Changing an instruction may give you some minor improvements here and there. Changing your data flows can give you major improvements everywhere.

Write your HLL code to optimise the data flows; this will likely be the most fruitful thing you can do to get good performance.

Then let the compiler do the minor things like instruction selection. This may give a small amount of improvement, or it may make no difference, or it may get it wrong and give a slowdown. But these differences will likely be minor compared to getting the data flows running well.
That's not answering the question. In fact, this goes totally against your earlier advice to test.

I'm saying that you, or the compiler, or whoever, has to emit an instruction for this purpose. The question is which? bt? shl/shr? Decide quick, without testing, because that's far from quick.
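
For reference, the two ways of clearing bit 63 being weighed here can be sketched in C. This is a hedged illustration, not code from the thread: the function names are hypothetical, and which instruction a compiler emits for either form (an AND with a mask, a bit-reset such as btr, or a shl/shr pair) is up to the compiler and target. Note that on x86 plain bt only tests a bit, so the register-clearing variant is presumably btr.

```c
#include <assert.h>
#include <stdint.h>

/* Clear bit 63 by masking it off (the "bit instruction" style). */
static uint64_t clear_bit63_mask(uint64_t x)
{
    return x & ~(UINT64_C(1) << 63);
}

/* Clear bit 63 by shifting it out and back in as zero (the shl/shr style). */
static uint64_t clear_bit63_shift(uint64_t x)
{
    return (x << 1) >> 1;
}

/* Hypothetical helper: returns the masked value and checks in debug
   builds that both forms agree. */
static uint64_t clear_bit63(uint64_t x)
{
    uint64_t r = clear_bit63_mask(x);
    assert(r == clear_bit63_shift(x));
    return r;
}
```

Both forms give the same result for every 64-bit input, so the choice between them is purely about size and speed on the target CPU.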

My take: Agner Fog's manuals say bt is fast on register operands, so go with bt.
You: Test and waste your time even though it doesn't matter because data flow matters more.

Seriously, you don't see how bad your advice is in general? If it doesn't matter then that's even more reason to use rules of thumb or "quickies" from Agner Fog's manuals. It's exactly how I'd write a compiler optimizer as well.

Only in truly critical spots do you go out and test them!

bt is also smaller so if you code for size it's still the best advice. ;)

revolution 27 Sep 2022, 13:32
Furs wrote:
I'm saying that you, or the compiler, or whoever, has to emit an instruction for this purpose. The question is which? bt? shl/shr? Decide quick, without testing, because that's far from quick.
You are still focussed on the instruction choice. I said it probably won't matter unless you have some niche application where the CPU is the bottleneck.

So use whatever you want. In almost all cases you won't be able to see any difference, so it doesn't matter which you choose.

For aesthetic reasons, I would choose BT, simply because it is the most obvious about its function. If later testing shows I have a bad CPU bottleneck then I can revisit that choice.

sylware 28 Sep 2022, 13:46
Getting a program to run "efficiently" on different flavors of a hardware architecture (which improves and evolves over time due to hardware improvement) can become a technically expensive problem to solve.

Basically, a "generic" bootstrap code would have to "build and install" the hardware flavor "optimized" code... hello gentoo elf/linux like distros (aka source based distro). Everything else would be half-baked, technically very expensive, solutions for this problem.

This is what the linux kernel is trying to do with runtime code patching; it is technically horribly expensive (and a bloody mess). This is "bad", really nasty. (Probably what doz does, but not rottenfruit since they handle their own hardware).

This is why, in the end, rottenfruit OS is supposed to "beat" doz: they deploy hardware flavor optimized code. Their objective-C and C code is compiled directly for a specific hardware flavor, and for "the average load", runtime patching cannot really "beat" that, or only by investing much more effort than "recompiling".

My own opinion on the matter, very roughly, is to go for the simplest assembly code; then, IF LOOKING FOR GAIN IS PERTINENT AND SIGNIFICANT, run real-life and _pertinent_ load profiles in order to "optimize" them, then write hardware flavor optimized versions of the "bottleneck" code paths and... then try to find the "right" way to offer those hardware flavor optimized code paths: runtime switch, runtime patching (usually a mess and expensive in the general case), or build and install time.

For the real free software world, the least bad compromise would be, in my opinion, to go for fully generic and simple assembly, cherry-pick some very specific code paths to optimize, and offer most of them at build/install time and, if inexpensive (unlike mega binary patching all over the code), with a runtime switch. For instance, for a kernel I would offer the alternative code paths at build/install time, but for a video codec, I could offer them with a runtime switch (like AV1 dav1d).
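
A "runtime switch" of this kind is often built as a function pointer chosen once at startup. The following is a minimal C sketch of that general pattern, not dav1d's actual code: every name is hypothetical, the "optimized" variant is just a 4-way unrolled loop standing in for a hardware-specific path, and the has_fast_path flag stands in for real CPU feature detection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*sum_fn)(const uint8_t *, size_t);

/* Generic, simple code path that runs everywhere. */
static uint32_t sum_generic(const uint8_t *p, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Stand-in for a hardware flavor optimized path (here: 4-way unrolled). */
static uint32_t sum_unrolled4(const uint8_t *p, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += p[i];     s1 += p[i + 1];
        s2 += p[i + 2]; s3 += p[i + 3];
    }
    for (; i < n; i++)  /* leftover tail elements */
        s0 += p[i];
    return s0 + s1 + s2 + s3;
}

/* Pick an implementation; has_fast_path is an assumed detection result. */
static sum_fn select_sum(int has_fast_path)
{
    return has_fast_path ? sum_unrolled4 : sum_generic;
}

static sum_fn sum_bytes = NULL;  /* the runtime switch itself */

static void init_dispatch(int has_fast_path)
{
    sum_bytes = select_sum(has_fast_path);
}

/* Callers go through the pointer and never name a specific variant. */
static uint32_t checksum(const uint8_t *p, size_t n)
{
    assert(sum_bytes != NULL);  /* init_dispatch must have run first */
    return sum_bytes(p, n);
}
```

The cost of this approach is one indirect call per use, which is why it tends to suit coarse-grained routines (a codec's inner kernels) better than tiny code paths scattered all over a program.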

In the end, it depends on the context of those hardware flavor optimized code paths... AND the technical cost and efficiency of the "infrastructure" to propose them.

And no, compilers are not welcome to solve this question because they bring way too many issues, super toxic issues, of their own, which make them a no-no given the perspective we have after half a century of experience.

DimonSoft 30 Sep 2022, 13:44
sylware wrote:
Basically, a "generic" bootstrap code would have to "build and install" the hardware flavor "optimized" code... hello gentoo elf/linux like distros (aka source based distro). Everything else would be half-baked, technically very expensive, solutions for this problem.

I always liked how the C# and Java creators decided that installing a compiler with all the infrastructure (.NET/Java) on the end user’s machine would be a good idea. Especially if the end users don’t even know or need to think about it.

Since then I can’t really understand what’s wrong with installing miners silently and without asking the end user.

Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.