flat assembler
Message board for the users of flat assembler.

flat assembler > Blog > Types of multi-pass assembly

Author
Thread Post new topic Reply to topic
Tomasz Grysztar
Assembly Artist


Joined: 16 Jun 2003
Posts: 7005
Location: Kraków, Poland
As any user of fasm probably knows, it is a multi-pass assembler and the basic function of the multiple passes is to optimize generated code so that uses short forms of the instructions whenever possible (fasm's multi-pass with its unrestricted forward-referencing can do much more, but that is a story for a different day).

What might be less commonly known is that there are actually different kinds of multi-pass assembly and the one that fasm is using is not the same as the one usually described when talking about this topic.

The initial pass of multi-pass assembly is done to get at least a rough idea of the sizes/positions of all instructions. When fasm does it, it uses the optimistic assumption that all instructions should have their shortest possible form. In the next pass, assembler already has a first draft of all the values that may affect sizes of generated instruction, and it may turn out that the optimistic assumption was too optimistic, and some parts of the code must be enlarged. This in turn changes some of the addresses and assembler needs another pass to find out if even more instructions have to be larger than initially assumed. It repeats this process until all the values are stable and only then generated code is valid. Because if during the last pass all the addresses stayed the same as they were in the pass before it, the instructions generated from the values from the previous pass turn out correct.

A more traditional approach was to use pessimistic assumptions instead of fasm's optimistic ones. That is, every instruction is at first generated large and is shrunk only if in the later pass it turns out that the range of addresses allows it to be shorter.

The advantage of pessimistic multi-pass is that it usually can be ended after any of the intermediate passes and all the instructions can be updated with the addresses collected during the last of these passes to create the correct code. The instructions were generated pessimistically, so every one of them should be large enough to hold the correct values.

The optimistic variant has the intermediate passes where some instructions are too small to hold the right values and this code cannot be fixed in such a simple way. The assembler has to continue with further passes until it reaches the solution.

Note that both the pessimistic and optimistic multi-pass may fail to find the solution in specific circumstances, like what I call the oscillator problem. But this does not happen when optimizing things like jumps in x86 architecture, where shrinking one instruction never forces any other instruction to be enlarged or vice versa.

The advantage of optimistic multi-pass is that is able to find solutions that are smaller. This can be demonstrated on a very simple example:
Code:
start:
        db      126 dup 90h
        jmp     start    
When a pessimistic assembler at first generates the larger form of jump, the distance to the "start" label becomes larger than 128 bytes and assembler no longer sees that this jump could be a short one. An optimistic assembler first tries the short jump and notices that it is enough.

But we can do more, it is in fact possible to use fasm to test and compare both approaches to multi-pass assembly.

First, we need some code for testing. I'm going to generate an artificial one, consisting only of jump instructions, with help of fasmg. I put the following code into TEST.ASM:
Code:
format binary as 'inc'

xorshift = 11111111h

struc rand32
        . = xorshift shr 96
        . = ( . xor (. shl 11) xor (. shr 8) xor (xorshift) xor (xorshift shr 19) ) and 0FFFFFFFFh
        xorshift = (. or xorshift shl 32) and 0FFFFFFFF_FFFFFFFF_FFFFFFFFh
end struc

struc rand limit
        while 1
                . rand32
                . = . and (1 shl (bsr (limit) + 1) - 1)
                if . <= limit
                        break
                end if
        end while
end struc

NUMBER_OF_JUMPS = 10000

repeat NUMBER_OF_JUMPS
        db 'position',`%,': '
        R rand 100
        repeat 1, target:(% + R) mod NUMBER_OF_JUMPS + 1
                db 'jmp position',`target,10
        end repeat
end repeat    
and assemble it with fasmg to get TEST.INC that looks like this:
Code:
position1: jmp position53
position2: jmp position20
position3: jmp position97
position4: jmp position56
position5: jmp position6
position6: jmp position83
position7: jmp position42
position8: jmp position60 
...    
The jumps are randomized, but mostly stay within a distance of 100 instructions, this should result in a mix of short and near jumps and a sufficient challenge for an assembler. I set the initial seed for the PRNG to a constant, so that the same test would be generated every time. It can be changed to some other value (even %t) to generate different tests.

Now, to test optimistic multi-pass we could try just assembling it with fasm. But to get a better view of what is happening, we are going to implement the jump instruction as a macro. This is something that should work the same with fasm 1 and fasmg, only the basic macro syntax (braces for fasm 1, "end macro" for fasmg) needs to be changed accordingly. I'm going to use the fasm 1 syntax here:
Code:
macro jmp target
{
        if ~ defined target | ( target-($+2) < 128 & target-($+2) >= -128 )
                db 0EBh,target-($+1)
        else
                db 0E9h
                dd target-($+4)
        end if
}

include 'test.inc'    
In the initial pass a short jump is generated because of the "~ defined target" condition. In the consecutive passes the updated distance to the target is checked and the size of instruction is chosen appropriately.

This generates 40595 bytes in 9 passes.

Now we can alter the macro to make a pessimistic variant:
Code:
macro jmp target
{
        if ~ defined target | target-($+2) >= 128 | target-($+2) < -128
                db 0E9h
                dd target-($+4)
        else
                db 0EBh,target-($+1)
        end if
}

include 'test.inc'    


This gives 40811 bytes in 7 passes. In my tests sometimes one method takes more passes and sometimes the other one, but usually it is the pessimistic one that requires less of them.

Keep in mind that in both cases fasm's label prediction algorithms are applied when the distance to the target is computed and another implementation could give a different results. But fasmg gives the same outcome as fasm 1, they are mostly compatible at this level.

What about terminating the pessimistic pass prematurely to keep the partially optimized code as a good-enough solution? This also can be emulated with a variation of these above macros:
Code:
LAST_PASS = 2

if ~ defined PASS
        PASS = 1
else if PASS >= LAST_PASS
        PASS = LAST_PASS
else
        PASS = PASS + 1
end if


macro jmp target
{
        local SMALL
        if (PASS = LAST_PASS & ~ SMALL) | ( PASS < LAST_PASS & ( ~ defined target | target-($+2) >= 128 | target-($+2) < -128 ) )
                db 0E9h
                dd target-($+4)
                SMALL = 0
        else
                db 0EBh,target-($+1)
                SMALL = 1
        end if
}

include 'test.inc'    
The PASS and LAST_PASS variables play tricks with fasm's engine in order to determine if the last allowed pass has been reached. Note that fasm still does one additional pass after that, because it needs all the values to stay unchanging to declare the solution valid, and the tricky PASS variable kept changing.

Now this variant start pessimistically and keeps optimizing the instructions until the last pass is reached. At that point it keeps the instruction size as they were.

With LAST_PASS = 2 this does not optimize the instructions at all. In the second pass it just fills the fields with right values and the code is ready. (Then comes the third pass because fasm does not trust the PASS variable but this does not change anything in the code anymore.)

With LAST_PASS = 3 generated size is 42389. With 4 passes it is 41132, with 5 - 40886, with 6 - 40820, with 7 - 40811 and at this moment it stabilizes. Further passes give no improvement.

This shows how with pessimistic variant we get smaller and smaller output and we can take any intermediate stage and still have a correct, just less optimized code. It is never as small as with the optimistic multi-pass, though.

One more thing to add is that in the pessimistic multi-pass the last of the passes does not really need to go over all the instructions again. In the previous pass it could have created a table of relocations - offsets of all the fields within code that contain addresses (either relative or absolute), and information what addresses have to be put there. Then instead of the final pass one can just process this relocation table and update all the values in code accordingly. The early assemblers did not even do that, they just stored this complete relocation table in the generated object file and it was the linker that later processed it (this was saving time). So the LAST_PASS = 2 setting in the above sample did in fact simulate a single-pass assembler, because what the second pass was doing the same job as simply applying the relocation table.
Post 11 Dec 2017, 22:08
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16140
Location: Hyperborea
For THUMB code fasmarm is optimistic also, but I changed the behaviour to be "sticky". The first pass will allocate the smaller instruction. If a future pass finds that the larger instruction is needed then that will remain larger for all following passes. It can be thought of as a "sticky" allocation. I found that this significantly reduced the number of times I saw the error "code cannot be generated" while still producing optimal code in almost all cases.

This "sticky" approach becomes especially important when the IT instruction is available. There are a lot of opportunities to choose either a long or a short encoding for almost all instructions, not just branches, so the number of permutations expands greatly and it could take many years to perform enough passes to finally produce a working output.
Post 12 Dec 2017, 00:55
View user's profile Send private message Visit poster's website Reply with quote
Tomasz Grysztar
Assembly Artist


Joined: 16 Jun 2003
Posts: 7005
Location: Kraków, Poland
There are many x86 instructions that have longer and shorter forms, it mostly comes down to the choice between 8-bit or larger displacement or immediate. But normally only the ones that contain a relative offset in such field require multiple passes, and they are mostly branches. But if a programmer uses a relative offset as a data for arithmetic instruction, such instruction is also going to behave like a jump during the optimization:
Code:
        add     ecx, end - start    
When sizes of some instructions change, this alters the relative offsets and may in turn change the length of an instruction like the above one.

I have already mentioned earlier that the reason why x86 optimization does not have the problem of "oscillations" is that under normal circumstances making an instruction smaller cannot cause any instruction to become larger, because when relative offsets decrease the instruction may only get smaller. Similarly, making an instruction larger cannot make any instruction smaller, as the relative offsets only grow. Thanks to this both the pessimistic and optimistic approach have no problem reaching the solution every time.

With fasm's unrestricted forward-referencing you can write some self-dependencies that may create "size oscillation" even with x86 instructions, but this would count as highly unusual circumstance. But when you have an architecture where changing size of an instruction may cause the opposite shift elsewhere (or even in the very same instruction, like in the case of alignment-dependent encodings) then this problem has to be taken into account.
Post 12 Dec 2017, 09:13
View user's profile Send private message Visit poster's website Reply with quote
rugxulo



Joined: 09 Aug 2005
Posts: 2311
Location: Usono (aka, USA)
I hope I'm not intruding or wasting your time here. I'm of limited input.

Back when I recently converted PSR Invaders to various assemblers (from TASM's MASM mode), I noticed that a lot of assemblers didn't fully handle forward references or size optimizations. A86 is one-pass and only optimizes backwards jumps for size. NASM 0.97 didn't do any optimizations at all (strictly two-pass only), presumably to conserve memory, so the workaround was probably to manually tell it exactly what size you wanted (if it truly mattered, which it often doesn't). Later NASM did get -Ox and made it default. YASM always had multi-pass optimizations (but it was buggy for a while). SmallerC guy swears by YASM since it's much faster for him. Even TASM supposedly didn't get multi-pass until 2.0, and the default INVADERS.COM binary forgot to enable it! So it's 9356 vs. 9194 bytes (which is apparently one extra 512-byte sector wasted, heheh). And yet UPX compresses them almost to the same size anyways, go figure.

Obviously older computers and assemblers were just slow (EDIT: see here at bottom). Requiring a linker sounds bad, but a lot of people insist upon modularity, separate libraries, etc. Also, memory was somewhat scarce, but it's always a tradeoff.

Size doesn't matter. I wish it did, but nobody cares. Even in the old days, almost nobody cared. One of the worst culprits is non-smart linkers and bloated routines like printf (which in latest DJGPP is ridiculously large, comparatively), but you can also blame static linking for that.

Actually, size can still matter (with small cpu cache), but modern Intel/AMD are more copious than the old days. Although I never had a Mac, I found an interesting mention on Wikipedia:

PowerPC wrote:

Apple tried to use the 603 in a new laptop design but was unable due to the small 8 KiB level 1 cache. The 68000 emulator in the Mac OS could not fit in 8 KiB and thus slowed the computer drastically. The 603e solved this problem by having a 16 KiB L1 cache which allowed the emulator to run efficiently.


So in some rare cases, size can still matter, I suppose (although I assume even PowerPC nowadays has multi-meg L1/L2/L3 caches). Quoting Wikipedia again (about Wii U circa 2012):

Espresso (microprocessor] wrote:

A total of 3 MB of Level 2 cache in an unusual configuration.
Core 0: 512 KB, core 1: 2 MB, core 2: 512 KB


Obviously this is to cater to the HLL crowd, which is predominant these days. I don't really blame them, HLLs are still useful (sometimes), but printf is too darn bloated! Ugh! (Of course, you aren't forced to use it, but everyone still does, out of carelessness.) Even Modula-2 (InOut.WriteInt) and Oberon (Out.Int) split out common routines for easier smartlinking vs. Pascal's "WRITE(anything)" (which is slightly more convenient to type but only barely).

Okay, so I went off on a tangent. We all want to have our cake and to eat it too. But I don't think it's unreasonable to expect the end user to be explicit about size (especially when it makes little difference, a few kb here or there) for much less work. Optimizations are great, but few algorithms are optimal anyways. Most times "good enough" is acceptable.
Post 20 Dec 2017, 23:15
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 1318
rugxulo wrote:
Size doesn't matter. I wish it did, but nobody cares. Even in the old days, almost nobody cared.
I know, I'm nobody Wink I think printf should always be dynamically linked, then at least its bloat is justified (because it's general purpose for all applications, not just yours).
Post 21 Dec 2017, 12:36
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16140
Location: Hyperborea
Furs wrote:
[I think printf should always be dynamically linked, then at least its bloat is justified (because it's general purpose for all applications, not just yours).
Except that then everyone would have a slightly different version of it and results would be inconsistent and incompatible.

Please, let's not get into the DLL hell of old again. Static linking has two main advantages that I know of 1) consistent results for your app, you don't need to rely upon someone else to have the exact same precise libraries you have, and 2) it makes it much easier to have portable apps that don't need "installing", just start using it without all the install ugliness.

It's only a few kB, don't worry about it. That's why we have GB memories, to actually use it and make things easier.
Post 21 Dec 2017, 13:11
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 1318
But printf is a "stable" function with a stable API and even adding more format specifiers to it isn't going to break anything. Also, in such situation, static linking is still not the most ideal IMO. The app can always just come with all the .dll files it needs (if it's something not supplied in the system, like msvcrt.dll is on Windows so the point is moot but whatever). Advantage is that you can manually place these .dll files so more apps share them or whatever if you want (can't do that with static linking unless you recompile it). But I guess we're getting off topic Smile

(also I love portable apps too and hate those that mess with the system, but .dlls can still be shared, my "portable" program directory has tons of apps and a directory for dlls, which is then hardlinked to each app that uses it or a launch script made to change current directory so the dll loads; the collection of apps is still 100% portable, but much more efficient, it's like using Solid Compression -- of course, such a thing is impossible to do if the app is statically linked, giving user no choice)
Post 21 Dec 2017, 15:16
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16140
Location: Hyperborea
printf implementations are inconsistent. For one bad example, the rounding for floats is different, some even do this deliberately. This is why sharing DLLs creates problems, because of different behaviours. And adding more format specifiers does break things since the behaviour changes. Have you seen those websites where they proudly show you "You saved 9.49999999999999995%"? That is because of different rounding of the printf implementation in the production machine compared to the test machine. So why is it different between the machines? Because the test machine had some other program installed that overwrote the printf DLL with a different one, or it runs a different OS where printf was coded differently, or someone reverted the printf DLL by mistake, or whatever. Rolling Eyes
Post 21 Dec 2017, 15:48
View user's profile Send private message Visit poster's website Reply with quote
Tomasz Grysztar
Assembly Artist


Joined: 16 Jun 2003
Posts: 7005
Location: Kraków, Poland
rugxulo wrote:
So in some rare cases, size can still matter, I suppose (...).
With fasm it is possible to test things like these.

For an experiment, I made fasmg assemble an altered version of itself, where I turned off the imm8 optimization in basic x86 instruction handler (the one that implements ADD, OR, ADC, SBB, AND, SUB, XOR, CMP) by turning off the IF conditions around the line 414 in 80386.INC.

This is not huge change, the Linux version of fasmg that was 51078 bytes originally, assembled to 53326 bytes with this alteration. Windows version grew from 52736 bytes to 54784, but keep in mind the PE size is aligned to 512 byte boundary.

Now when I prepared a simple test that included X64.INC and assembled 100000 instructions, the original (smaller) version assembles it in 25.0 seconds, while the larger version needs 25.4 seconds to do the same. This is a measurable difference! Both versions are exactly the same instructions, just some of them encoded differently. But this variation in size is enough to cause a noticeable effect in overall performance.

Trying to turn off some other optimizations like short displacements in memory addressing or short relative jumps grows the code so significantly that JECXZ instructions in fasmg's source start to fail with "out of range" error, so it is then no longer even possible to test exactly the same instruction sequence. So the fact that larger larger versions turn out even slower could be blamed on that instead.
Post 21 Dec 2017, 15:50
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 1318
Well speed may not be so obvious, but size is always measurable. Whether people find it significant or not is another matter. I personally find it cool and beautiful to see lite compiled code in KBs perform the same functions that people think take tends of MBs instead because they're so used to bloat (at the same speed, if not faster, so it's not about pre-caching data)... I guess hacker culture kind of influenced my ways. Smile
Post 22 Dec 2017, 12:57
View user's profile Send private message Reply with quote
rugxulo



Joined: 09 Aug 2005
Posts: 2311
Location: Usono (aka, USA)
I also found this interesting:

esolangs.org/wiki/MMIX wrote:

Knuth has also defined an assembly syntax and an executable format for MMIX. The executable format doesn't support linking multiple object files (static or dynamic), you have to assemble the whole program together. The executable format is such that the assembler outputs it streaming in a single pass, so forward references and output location changes are preserved in the binary as escape codes and the operating system resolves them at program load time.
Post 22 Dec 2017, 23:06
View user's profile Send private message Visit poster's website Reply with quote
fasmnewbie



Joined: 01 Mar 2011
Posts: 547
Excellent educational material.

If I may add, it would be even more helpful if you could explain in similar amount of details the different phases of assembly process ... pre-processing, assembling, compiling etc... Can serve as a reference like for example, statements precedence, valid symbols, macro evaluations at each phases etc. People been repeating the same questions and the same confusions about these (pre-processing vs assembling etc etc).
Post 25 Feb 2018, 18:27
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2018, Tomasz Grysztar.

Powered by rwasa.