flat assembler
Message board for the users of flat assembler.

Index > Macroinstructions > align

Hugh Aguilar



Joined: 15 Nov 2011
Posts: 62
Location: Arizona
Hugh Aguilar 26 Mar 2013, 04:14
I have this:

macro align_entry
{
align 16
}

This goes in front of all my functions. I would like these features though:

1.) Assemble one large instruction rather than several NOP instructions. This way, if I have one function that falls through to the next function, it will be faster.

2.) If the amount of space before the next 16-byte boundary is >= 8 bytes, then don't do anything, but if it is < 8 bytes, then align to the next 16-byte boundary by assembling a NOP instruction in there. This way, I never waste more than 8 bytes between functions.
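In fasm terms, something like this is what I'm imagining (an untested sketch --- it assumes flat binary output so that $ mod 16 evaluates to an ordinary number, and it uses the multi-byte NOP encodings recommended in the Intel manuals):

```asm
; Sketch only, untested: pad to the next paragraph boundary with a single
; multi-byte NOP, but only when fewer than 8 bytes would be wasted.
macro align_entry
{
  local gap
  gap = (16 - $ mod 16) mod 16           ; bytes left before the boundary
  if gap > 0 & gap < 8
    if gap = 1
      db 90h                             ; NOP
    else if gap = 2
      db 66h,90h                         ; 66 NOP
    else if gap = 3
      db 0Fh,1Fh,00h                     ; NOP DWORD [EAX]
    else if gap = 4
      db 0Fh,1Fh,40h,00h                 ; NOP DWORD [EAX+0]
    else if gap = 5
      db 0Fh,1Fh,44h,00h,00h             ; NOP DWORD [EAX+EAX+0]
    else if gap = 6
      db 66h,0Fh,1Fh,44h,00h,00h         ; 66 NOP DWORD [EAX+EAX+0]
    else
      db 0Fh,1Fh,80h,00h,00h,00h,00h     ; NOP DWORD [EAX+00000000h]
    end if
  end if
}
```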

I'm writing a Forth that uses ITC (indirect-threaded code) and I have hundreds of small functions. I want the VM to fit in 32K however, so I don't want ALIGN_ENTRY to waste a lot of memory.

My understanding is that the x86 loads a paragraph (16 aligned bytes) at a time. The purpose of aligning function entry points on 16-byte boundaries is to minimize how many paragraphs have to be loaded to execute the function. How important is this for speed? There is a trade-off between doing this and fitting the VM in the 32K code cache. Which is more important? Is my idea of never wasting more than 8 bytes (an average of 4 bytes) a good compromise? Also, I'm not aligning at all in front of the functions that don't get executed very much --- double-precision arithmetic, for example.

I can likely figure out how to write this macro myself, in which case I'll post it here --- but if anybody already has something, please post it --- this seems like a pretty common problem that others should have encountered before.

Thanks for your help --- Hugh :)
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20451
Location: In your JS exploiting you and your system
revolution 26 Mar 2013, 04:19
Hugh Aguilar wrote:
1.) Assemble one large instruction rather than several NOP instructions. This way, if I have one function that falls through to the next function, it will be faster.
Search this board for multi-byte nops. There are a few topics already here.
Hugh Aguilar wrote:
How important is this for speed?
Almost completely negligible and probably not worth wasting time on except for the most extremely time critical parts of a program.
Hugh Aguilar 26 Mar 2013, 05:07
revolution wrote:
Hugh Aguilar wrote:
How important is this for speed?
Almost completely negligible and probably not worth wasting time on except for the most extremely time critical parts of a program.


If it is not important to minimize the number of paragraphs loaded to execute a function, then I won't bother with aligning the entry points at all. That will save memory, and I know that keeping the VM within 32K is important.

The Intel optimization manual advises paragraph-aligning all function entry points, so it must have some effect on speed.

The reason why I put it in a macro, rather than just say ALIGN 16 in front of every function, was so I could disable it for the entire program if I found that it didn't help the speed enough to justify all of the wasted memory.
revolution 26 Mar 2013, 06:14
Hugh Aguilar wrote:
The Intel Optimization manual advised to always paragraph-align function entry points, so it must have some effect on speed.
Yes it does have an effect. But only the most time sensitive functions will ever show any difference at runtime. For almost all programs running today data latency is far more important.
bitRAKE



Joined: 21 Jul 2003
Posts: 4073
Location: vpcmpistri
bitRAKE 26 Mar 2013, 07:24
The only need for code alignment comes from the decoder --- which fetches at most one paragraph per cycle. IIRC, this has been the case through several generations of Intel processors; I can't recall the AMD situation, but it's easy enough to search. Agner Fog has a wonderful paper on the microarchitectures.

Basically, the processor can decode 5/6 instructions per cycle if they are in the same paragraph, which means that alignment could make it worse in some cases. :) Where needed, inner loops can be massaged into the buffer past the decoder. The most recent processors even make that superfluous, because they can cache decoded instructions.

I know there is an alignment issue with branches, too --- something like three conditional branches in one paragraph being bad --- but I've never come up against it. I'll look it up when it happens. Read about the BTB (branch target buffer) if you are interested in that.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
AsmGuru62



Joined: 28 Jan 2004
Posts: 1671
Location: Toronto, Canada
AsmGuru62 26 Mar 2013, 13:06
C compilers usually align function entries to 32-byte boundaries.
Yet the manual says: "Align all branch targets on a 16-byte boundary".
Weird.
Hugh Aguilar 27 Mar 2013, 05:23
bitRAKE wrote:
Basically, the processor can decode 5/6 instructions per cycle if they are in the same paragraph, which means that alignment could make it worse in some cases. :) Where needed, inner loops can be massaged into the buffer past the decoder. The most recent processors even make that superfluous, because they can cache decoded instructions.

I think you're saying that it is best in some cases to align the function in such a way that the entry point isn't necessarily on a paragraph boundary, but the top of a loop in the function is. But some processors cache the decoded instructions, so that a loop will be cached and won't have to be decoded again.

Anyway, it looks like I ought not to waste memory by aligning functions to paragraph boundaries, as it is more important to save memory and fit as much code in the 32K code-cache as possible.

I have hundreds of small functions. Perhaps after the whole thing is running, I can jumble up the order of the functions --- never mind putting related functions next to each other for readability --- put them in such an order that their entry points mostly hit paragraph boundaries, but without any wasted memory in between. It would be like solving a jigsaw puzzle. I wonder if a program could be written to solve the puzzle automatically?
bitRAKE 27 Mar 2013, 11:26
Not just loops: any instruction that straddles a paragraph boundary can cost another cycle, and a branch target near the end of a paragraph forces the (up to 15) bytes before it to be loaded as well. If a loop doesn't fit in the dispatch queue {after decoding}, then it needs to be decoded every iteration, and you were keen to notice that compounded penalty.

I've taken the philosophy of size-coding everything unless speed is an issue as well. As to code-locality optimization, that would be an interesting piece of software --- without a doubt a hard problem. Would it just use the code itself, or take run-time behavior (use cases) into account?

revolution 27 Mar 2013, 11:44
bitRAKE wrote:
Not just loops: any instruction that straddles a paragraph boundary can cost another cycle, and a branch target near the end of a paragraph forces the (up to 15) bytes before it to be loaded as well. If a loop doesn't fit in the dispatch queue {after decoding}, then it needs to be decoded every iteration, and you were keen to notice that compounded penalty.
Note that these sorts of general rules and figures change from CPU to CPU. Not all CPUs act this way. Some CPUs will be indifferent to such things (depending upon the exact situation) and some CPUs will give awful runtimes. It all depends upon exactly which CPU/mobo/RAM you use and what your code is actually doing.
bitRAKE 28 Mar 2013, 01:23
True, I'm just speaking about the last few generations of Intel desktop CPUs. The fetch has been 16 bytes for a long time.

Hugh Aguilar 28 Mar 2013, 02:13
bitRAKE wrote:
Not just loops: any instruction that straddles a paragraph boundary can cost another cycle, and a branch target near the end of a paragraph forces the (up to 15) bytes before it to be loaded as well. If a loop doesn't fit in the dispatch queue {after decoding}, then it needs to be decoded every iteration, and you were keen to notice that compounded penalty.


So if an instruction straddles a paragraph boundary, then both paragraphs have to be loaded to decode the instruction, even though the previous paragraph has already been loaded (assuming that this instruction isn't a branch target, but we just fell into it).

bitRAKE wrote:
I've taken the philosophy of size coding everything unless speed is an issue as well.

By "size coding" you just mean to make the code as compact as possible?

bitRAKE wrote:
As to code locality optimization that would be an interesting piece of software. Without a doubt a hard problem. Would it just use the code itself, or take into account run-time behavior (use cases)?

It is somewhat like the travelling-salesman problem. There is a big blow-up in how many configurations need to be tested.

I was assuming that it would just work from the listing file so that it can know how big each function is. It would be a good idea if, in the source-code, functions were bracketed with comments consistently, so the program could find these comments in the listing and know what is a "function."

It might help to give it hints as to which functions are speed-critical, so it can start with these.
bitRAKE 28 Mar 2013, 02:51
Hugh Aguilar wrote:
So if an instruction straddles a paragraph boundary, then both paragraphs have to be loaded to decode the instruction, even though the previous paragraph has already been loaded (assuming that this instruction isn't a branch target, but we just fell into it).
Only a single paragraph is fetched each cycle. So, if a two-byte instruction has its bytes split between two paragraphs, then it will take two cycles to decode --- even if it's a two-byte NOP. Of course, there were other instructions in the paragraphs --- how did we get to this NOP if we didn't jump to it? Either:

A. all previous bytes were executed, or
B. some previous byte was a jump target.

In both cases the first paragraph is cached.

Quote:
By "size coding" you just mean to make the code as compact as possible?
Yeap.

Quote:
It is somewhat like the travelling-salesman problem. There is a big blow-up in how many configurations need to be tested.
Combinatorial explosion is a good term.

Instruction fetching has a good predictor --- it's better to have loops aligned to paragraphs and instructions arranged to fill in decode bubbles, which is highly code-specific.
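For example, something like this (a contrived sketch, not from any real program --- the entry is left wherever it falls, and only the hot loop top gets the paragraph alignment):

```asm
; Contrived sketch: the function entry stays unaligned; only the top of
; the hot loop is put on a paragraph boundary. The jump skips over the
; padding, so it never has to consist of executable NOPs.
copy_bytes:                     ; entry point, wherever it happens to fall
        push    esi edi
        jmp     .top            ; hop over the alignment padding
        align   16
.top:                           ; loop top on a 16-byte boundary
        lodsb                   ; copy ECX bytes from [ESI] to [EDI]
        stosb
        dec     ecx
        jnz     .top
        pop     edi esi
        ret
```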

bitRAKE 29 Mar 2013, 22:12
http://www.theadesilva.com/supnik.mp3

Here is an interesting interview with Bob Supnik. He gets into some philosophy of software history toward the end (30 minutes in).

http://archive.org/details/GETLAMP-Supnik
Hugh Aguilar 02 Apr 2013, 04:41
bitRAKE wrote:
Hugh Aguilar wrote:
It is somewhat like the travelling-salesman problem. There is a big blow-up in how many configurations need to be tested.
Combinatorial explosion is a good term.


I had forgotten the term, but that is it.

I think I could write the program. It would be a pretty fun challenge, and unlike most fun challenges that I write programs for, it would be useful! :)

bitRAKE wrote:
Instruction fetching has a good predictor - better to have loops aligned to paragraphs and instructions aligned to fill in decode bubbles. Which is highly code specific.


Well, I'm writing a Forth ITC system. I have hundreds of small functions, and maybe only 1% of them have loops in them. Iteration is typically done in high-level Forth with BRANCH and 0BRANCH and so forth --- these branch primitives are no different from any other primitive, in that they end in NEXT, which does an indirect JMP (which the processor can't predict, apparently).
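For the curious, the classic x86 ITC NEXT is only a few instructions (this is a textbook sketch, not my actual code --- the register assignments are just illustrative):

```asm
; Textbook ITC NEXT (illustrative register assignments, not my actual VM):
; ESI = Forth instruction pointer into a thread of execution tokens,
; where each token points at a word's code field.
next:   lodsd                   ; EAX = next execution token, ESI advances
        mov     ebx, eax        ; EBX = address of the word's code field
        jmp     dword [ebx]     ; the indirect JMP the BTB struggles with
```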

This is a lot different than a typical assembly-language program, that has relatively few functions, many of which have loops in them.

The program that I described above, that rearranges the functions, would be useful for a Forth interpreter, but not necessarily in a more traditional program. It mostly helps when you have a lot of short functions that are only a few paragraphs long; if you have a few big functions that are dozens of paragraphs long, it doesn't make much difference whether they are paragraph-aligned or not, for either speed or memory savings.
bitRAKE 02 Apr 2013, 06:58
Thank you, that makes a lot of sense now.

You are probably thinking of cacheline alignment as well: minimizing cache thrashing on critical paths --- not just the L1 fetching, but the complete process of moving code through memory.

Hugh Aguilar 04 Apr 2013, 05:38
bitRAKE wrote:
Thank you, that makes a lot of sense now.

You probably are thinking of cacheline alignment as well. Minimizing cache thrashing on critical paths. Not just the L1 fetching, but the complete process of moving code through memory.

By "cacheline alignment," I assume that you mean the 64-byte lines, of which the 32K cache holds 512. Yes, I consider that too. For the core kernel, I just pack as much as possible together, so as to minimize the number of cache lines used. For extra functions, however, I put related functions together in a group that is cacheline-aligned and extends to just below a line boundary. For example, double-precision arithmetic is a group of related extra functions. These are extra in the sense that a lot of programs won't use them at all, but if a program does use any of them, then it is likely to use all of them.

I had originally hoped to just put the entire VM in the 32K. If you consider all of the extra functions however, then it is unlikely to fit in the 32K. Hopefully most programmers won't use all of the extra functions in the same program though. Also, if they do, they won't use them all at the same time --- for example, doing numerical stuff and doing string stuff, I would expect to happen at different times during the course of a program execution.

I haven't programmed the x86 since the mid-1990s, so these modern issues are new to me --- but several guys on CLAX have been giving me advice, and I've been reading the Intel Optimization manual, so I'm learning. :)

Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.