flat assembler
Message board for the users of flat assembler.

Index > Compiler Internals > final pass stage for call-graph ordered addressing

Author
Thread Post new topic Reply to topic
redsock



Joined: 09 Oct 2009
Posts: 360
Location: Australia
redsock
Hey Tomasz,

So here I was implementing the HTTP/2 spec that was very recently finalised, and in my local testing of some of the functions, I was taken aback by the speed difference based solely on function location.

When one of my supporting functions was located _before_ the primary decoding function, calling it (which was a backward reference from the calling function), versus calling it forward, resulted in nearly a 40% speed difference. (backward slower, forward reference == 40% faster).

I appreciate that this may be the result of very CPU-specific optimisation rules (cache poisoning and so on), but the way I built my library was to include all of the library _before_ the program entry point, and all of the library (thanks to fasm's forward/backward referencing passes) is in an "arbitrary" order (determined by me, when I include all of the library's include files).

Surely I can compose a "preprocessor" stage that doesn't present any of these difficulties to fasm directly, but it would be really cool if I could specify some option to lay out the actual function blocks in "call graph" (aka execution order). _start being the topmost, -> so on and so forth.

At least with every Intel and AMD CPU I have access to, these rules very much apply. I wonder even if the same rules (cache poisoning, etc) apply to static table data. Admittedly I have not researched this in greater detail.

If simply by reordering code on modern processors, tremendous speed gains can be achieved, it seems that third party tools to preprocess the code are less than optimal.

Does it help if I say please loud enough?

Smile Smile Smile

Cheers
Post 27 Feb 2015, 07:08
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17332
Location: In your JS exploiting you and your system
revolution
I have always laid out my code manually with a top down approach (which seems to be the opposite of most C code I have seen) so my entry point is first and all subordinate functions are below the caller.

As far as doing it automatically, either forward or backward depending upon the users desires, I would expect macros would be the answer here. Perhaps extend the "proc" macro in some way and then "postpone" to figure it all out later.
Post 27 Feb 2015, 07:26
View user's profile Send private message Visit poster's website Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 360
Location: Australia
redsock
revolution wrote:
I have always laid out my code manually with a top down approach (which seems to be the opposite of most C code I have seen) so my entry point is first and all subordinate functions are below the caller.

As far as doing it automatically, either forward or backward depending upon the users desires, I would expect macros would be the answer here. Perhaps extend the "proc" macro in some way and then "postpone" to figure it all out later.
I agree in part, except that the "if used" construct allows me special privileges as it were. Specifically, that I can include them in an arbitrary way, and with multiple passes, the compiler figures out which ones to actually include. With logic very similar to this, a call-graph (or "if used" graph if you will) could be used to generate the code automatically in this fashion, without having to specifically code each with macros and/or postpone.

All non-local symbols could be subjected to the same rules, thereby also including static table data as well as functions.

If I end up having to write a preprocessor for it, it will ultimately have to follow the same logic as the "if used" component, and then start with the topmost (and as you say go forward or backward depending on user preference).

I guess my case is the "if used" code already does similar things as to my call graph request.

Smile
Post 27 Feb 2015, 07:32
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17332
Location: In your JS exploiting you and your system
revolution
For the non-local variables you can build a list:

http://board.flatassembler.net/topic.php?t=12012
Post 27 Feb 2015, 07:41
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17332
Location: In your JS exploiting you and your system
revolution
You would also need to override call/stdcall so as to know who is calling whom. Plus be careful with recursive functions calling themselves and/or each other.
Post 27 Feb 2015, 07:47
View user's profile Send private message Visit poster's website Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 360
Location: Australia
redsock
revolution wrote:
For the non-local variables you can build a list:

http://board.flatassembler.net/topic.php?t=12012
This is the same technique I used for my global dataseg variables... for function ordering of course, this isn't possible. My HTTP/2 code at the moment requires 2 tables for decoding, and 2 tables for encoding. If I am careful about how I lay the functions and tables out inside my include files, I get the aforementioned speed gains. Given the call-graph dependency of my entire library though, I wonder what the overall difference would be if it were done by the compiler itself. I would only need to write a preprocessor to do the same functionality because there are ~1200 functions that are not necessarily dependent on each other. Call graph compiler-decides-who-goes-where would be way better than an "include everything and reference willy-nilly". Smile

I agree with you though, most of the code I write that isn't a library (and is one-off) gets done top-down anyway. I suppose because of my design choices with the library itself, call-graph ordered would potentially be MUCH better.

The "if used" logic of the compiler means that all of the symbols I define based on the "if used" directive don't have fixed addresses until the final stage.

If at the final stage, those same symbols could be ordered by the compiler (instead of my writing a preprocessor to do the same, which would not need the if used directive in the first place), that would be much better. Even better still if you could specify backward or forward order in same.
Post 27 Feb 2015, 07:55
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17332
Location: In your JS exploiting you and your system
revolution
There are still more opportunities for local variables. You can order larger structures first and smaller structures last so that [ebp-offset] has a better chance of fitting in one signed byte. Also have some way tell the assembler which is a highly used member and prioritise it to the end of the list.
Post 27 Feb 2015, 08:02
View user's profile Send private message Visit poster's website Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 360
Location: Australia
redsock
Oh, I thought I should point out:

In order for the "if used" logic to work as it does across multiple passes, it must by default have knowledge of the call-graph in question.

Whether the compiler is currently keeping track of this order or not, well, it most likely is not. And this is the crux of the feature I am requesting.

When the "if used" directive is employed (especially across the board as I have done), if the compiler kept even a simple list of dependencies, the call graph is done then and there in the normal course of operations when using the "if used" directive.

From there, at the "right before we commit and write some code" stage, that same dependency list could be traversed to determine function order (whether forward or backward), _this_ is what I am after.

It isn't a far stretch from what is there already, and would be very useful IMO for code generation.

Smile Smile
Cheers again, and revolution, you are a legend as always.
Post 27 Feb 2015, 08:06
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17332
Location: In your JS exploiting you and your system
revolution
The "if used" feature is very simple minded and doesn't build any call graph or dependencies AFICT. I think it is a red-herring and independent to what you want to do.
Post 27 Feb 2015, 08:10
View user's profile Send private message Visit poster's website Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 360
Location: Australia
redsock
revolution wrote:
"if used" is very simple minded and doesn't build any call graph or dependencies AFICT. I think it is a red-herring and independent to what you want to do.
How fortuitous that we are both lurking at the same moment in time Smile

The "if used" directive, if indeed it is simple minded as you say, allows me to declare 80 functions in a single include file, and only the ones I reference _outside_ get included the resultant binary.

Similar rules apply to all of my data structures. So, in my library there are a huge number of data tables, but only the ones that are actually referenced end up in a given binary.

As such, I categorically deny that there is no notion of a call graph, there MUST be (or the compiler could not pull this feat off to begin with).

Perhaps it is semantics insofar as my using the word "call graph" when I really mean "symbol dependency graph, whether function or table or data or whatever graph".

Either way, fasm already knows about what I am doing dependency-wise, and that is precisely what I want ordered.

Smile
Post 27 Feb 2015, 08:15
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17332
Location: In your JS exploiting you and your system
revolution
If just keeps a boolean flag to say whether it is referenced. By default they all start as not referenced and each successive pass will set more of the flags as more functions get referenced. The assembler keeps no data about who referenced it or if more than one function referenced it.
Post 27 Feb 2015, 08:18
View user's profile Send private message Visit poster's website Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 360
Location: Australia
redsock
revolution wrote:
If just keeps a boolean flag to say whether it is referenced. By default they all start as not referenced and each successive pass will set more of the flags as more functions get referenced. The assembler keeps no data about who referenced it or if more than one function referenced it.
Hence my request Smile
Post 27 Feb 2015, 08:22
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17332
Location: In your JS exploiting you and your system
revolution
Okay, I misunderstood. You want to enhance the "used" flag to store more details. But can't the macro feature also do this for you?
Post 27 Feb 2015, 08:28
View user's profile Send private message Visit poster's website Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 360
Location: Australia
redsock
revolution wrote:
Okay, I misunderstood. You want to enhance the "used" flag to store more details. But can't the macro feature also do this for you?
The "if used" feature has some interesting features above and beyond macro capabilities, such as:
Code:
call somefunction    
, in this case, I can easily use macros to determine some kind of call graph (though inside the context of "if used", I am not entirely sure how this would really work). When I do things instead like
Code:
mov rax, [someglobalsymbol    
, and said global symbol is enclosed in the "if used" directive, how then do I use macros to logically order all of my program symbols based on dependency, when in fact the compiler already has knowledge during its multiple passes of this same dependency?

I don't see how macros could solve symbol dependency based code generation, unless the code I wrote across the board were horrifically done, or a preprocessor is used to filter the assembly prior to fasm seeing it in the first place.

I am certainly open to suggestions if there is an easy way to get around that. All up there are ~1800 symbols if I set the "include_everything" flag in my library (in which case, inside the context of this feature request, code would end up as-is, or better put, as I specified it in the code/compile-time).
Post 27 Feb 2015, 08:37
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17332
Location: In your JS exploiting you and your system
revolution
So if you introduce some new capabilities into the used flag how do you propose to exercise the capability? A new directive?
Post 27 Feb 2015, 09:08
View user's profile Send private message Visit poster's website Reply with quote
baldr



Joined: 19 Mar 2008
Posts: 1651
baldr
redsock,

Probably it worth to investigate why and where reordering gives such a speed-up at first.

Topological sort based on cross-reference may yield adverse results. Function can be referenced in one place and indirectly used somewhere else.

Can you make up an example to experiment with and see the difference?
Post 27 Feb 2015, 11:22
View user's profile Send private message Reply with quote
l_inc



Joined: 23 Oct 2009
Posts: 881
l_inc
redsock
First of all it is very questionable that the function ordering directly affects performance. While cache conflict misses indeed may cause performance degradation in case the cache associativity is just a bit not enough to service very often called functions, such a situation is very unlikely. What's much more likely to cause a significant performance difference is alignment. Especially if your CPU is new enough to have the decoded iCache and the loop stream detector (present since Sandy Bridge, iirc). Read the corresponding chapters in the Intel's optimization manual, and you'll probably won't need anything but to use the align directive.

Secondly, as long as the function ordering measure is very machine dependent, it may both increase as well as decrease performance no matter what ordering rules you apply. For that reason I would not recommend to embed this optimization into a compiler. Another reason to vote against the feature is that an assembler compiler and especially fasm should not alter the layout directly imposed by the source code ordering. Any modifications on the binary layout should be achieved using macros logically decoupled from the assembler.

Thirdly, from my initial view it seems to be possible to create the call graph using the current fasm capabilities. Function reordering according to the call graph within a single compilation is only potentially possible, but definitely impractical. A more or less efficient way is feasible by setting up a feedback from the assembler output into the preprocessor by using multiple compilations, making it a potential use case for this feature. Having said that I have to point out to the baldr's remark regarding indirect calls that might be impossible to resolve at compile time.

_________________
Faith is a superposition of knowledge and fallacy
Post 27 Feb 2015, 14:12
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on YouTube, Twitter.

Website powered by rwasa.