Types of multi-pass assembly

Index > Assembly > Types of multi-pass assembly

Author

Thread

Tomasz Grysztar

Joined: 16 Jun 2003
Posts: 8434
Location: Kraków, Poland

Tomasz Grysztar 11 Dec 2017, 22:08

As any user of fasm probably knows, it is a multi-pass assembler and the basic function of the multiple passes is to optimize generated code so that uses short forms of the instructions whenever possible (fasm's multi-pass with its unrestricted forward-referencing can do much more, but that is a story for a different day).

What might be less commonly known is that there are actually different kinds of multi-pass assembly and the one that fasm is using is not the same as the one usually described when talking about this topic.

The initial pass of multi-pass assembly is done to get at least a rough idea of the sizes/positions of all instructions. When fasm does it, it uses the optimistic assumption that all instructions should have their shortest possible form. In the next pass, assembler already has a first draft of all the values that may affect sizes of generated instruction, and it may turn out that the optimistic assumption was too optimistic, and some parts of the code must be enlarged. This in turn changes some of the addresses and assembler needs another pass to find out if even more instructions have to be larger than initially assumed. It repeats this process until all the values are stable and only then generated code is valid. Because if during the last pass all the addresses stayed the same as they were in the pass before it, the instructions generated from the values from the previous pass turn out correct.

A more traditional approach was to use pessimistic assumptions instead of fasm's optimistic ones. That is, every instruction is at first generated large and is shrunk only if in the later pass it turns out that the range of addresses allows it to be shorter.

An advantage of pessimistic multi-pass is that it usually can be ended after any of the intermediate passes and all the instructions can be updated with the addresses collected during the last of these passes to create the correct code. The instructions were generated pessimistically, so every one of them should be large enough to hold the correct values.

The optimistic variant has the intermediate passes where some instructions are too small to hold the right values and this code cannot be fixed in such a simple way. The assembler has to continue with further passes until it reaches a solution.

Note that both the pessimistic and optimistic multi-pass may fail to find the solution in specific circumstances, like what I call the oscillator problem (a pessimistic assembler would usually fail in a different manner, by completely disallowing some types of use of forward-referenced values). But this does not happen when optimizing things like jumps in x86 architecture, where shrinking one instruction never forces any other instruction to be enlarged or vice versa.

An advantage of optimistic multi-pass is that it is able to find solutions that are smaller. This can be demonstrated with a simple example:

Code:

a:
        jmp     c
b:
        db      124 dup 90h
        jmp     a
        db      90h
c:

When a pessimistic assembler at first generates the larger form of the first jump (not knowing the distance to the target), it also forces the second jump to use the long form (because the distance back to "a" becomes larger than 128 bytes), and this in turn causes the distance from "b" to "c" to become unreachable via a short jump. Following this path the assembler no longer has an option to make the first jump a short one. An optimistic assembler first tries the short jump and then finds out that it is enough.

But we can do more, it is in fact possible to use fasm to test and compare both approaches to multi-pass assembly.

First, we need some code for testing. I'm going to generate an artificial one, consisting only of jump instructions, with help of fasmg. I put the following code into TEST.ASM:

Code:

format binary as 'inc'

xorshift = 11111111h

struc rand32
        . = xorshift shr 96
        . = ( . xor (. shl 11) xor (. shr 8) xor (xorshift) xor (xorshift shr 19) ) and 0FFFFFFFFh
        xorshift = (. or xorshift shl 32) and 0FFFFFFFF_FFFFFFFF_FFFFFFFFh
end struc

struc rand limit
        while 1
                . rand32
                . = . and (1 shl (bsr (limit) + 1) - 1)
                if . <= limit
                        break
                end if
        end while
end struc

NUMBER_OF_JUMPS = 10000

repeat NUMBER_OF_JUMPS
        db 'position',`%,': '
        R rand 100
        repeat 1, target:(% + R) mod NUMBER_OF_JUMPS + 1
                db 'jmp position',`target,10
        end repeat
end repeat

and assemble it with fasmg to get TEST.INC that looks like this:

Code:

position1: jmp position53
position2: jmp position20
position3: jmp position97
position4: jmp position56
position5: jmp position6
position6: jmp position83
position7: jmp position42
position8: jmp position60 
...

The jumps are randomized, but mostly stay within a distance of 100 instructions, this should result in a mix of short and near jumps and a sufficient challenge for an assembler. I set the initial seed for the PRNG to a constant, so that the same test would be generated every time. It can be changed to some other value (even %t) to generate different tests.

Now, to test optimistic multi-pass we could try just assembling it with fasm. But to get a better view of what is happening, we are going to implement the jump instruction as a macro. This is something that should work the same with fasm 1 and fasmg, only the basic macro syntax (braces for fasm 1, "end macro" for fasmg) needs to be changed accordingly. I'm going to use the fasm 1 syntax here:

Code:

macro jmp target
{
        if ~ defined target | ( target-($+2) < 128 & target-($+2) >= -128 )
                db 0EBh,target-($+1)
        else
                db 0E9h
                dw target-($+2)
        end if
}

include 'test.inc'

In the initial pass a short jump is generated because of the "~ defined target" condition. In the consecutive passes the updated distance to the target is checked and the size of instruction is chosen appropriately.

This generates 24772 bytes in 8 passes.

Now we can alter the macro to make a pessimistic variant:

Code:

macro jmp target
{
        if ~ defined target | target-($+2) >= 128 | target-($+2) < -128
                db 0E9h
                dw target-($+2)
        else
                db 0EBh,target-($+1)
        end if
}

include 'test.inc'

This gives 24825 bytes in 7 passes. In my tests sometimes one method takes more passes and sometimes the other one, but usually it is the pessimistic one that requires less of them.

Keep in mind that in both cases fasm's label prediction algorithms are applied when the distance to the target is computed and another implementation could give a different results (see below). But fasmg gives the same outcome as fasm 1, they are mostly compatible at this level.

What about terminating the pessimistic multi-pass prematurely to keep the partially optimized code as a good-enough solution? This also can be emulated with a variation of the above macros:

Code:

NUMBER_OF_PASSES = 2

macro jmp target
{
        local LARGE
        if ~ defined target | target-($+2) >= 128 | target-($+2) < -128 | (PASS = LAST_PASS & LARGE)
                db 0E9h
                dw target-($+2)
                LARGE = 1
        else
                db 0EBh,target-($+1)
                LARGE = 0
        end if
}

include 'test.inc'

LAST_PASS = NUMBER_OF_PASSES - 1

if ~ defined PASS
        PASS = 1
else if PASS >= LAST_PASS
        PASS = LAST_PASS
else
        PASS = PASS + 1
end if

The PASS and LAST_PASS variables play tricks with fasm's engine in order to count and limit the number of passes.

Now this variant starts pessimistically and keeps optimizing the instructions until the penultimate pass is reached. At that point it keeps the instruction sizes as they were and in the final pass the assembler fills them with the correct values.

With NUMBER_OF_PASSES = 2 this does not optimize the instructions at all. In the second pass it just fills the fields with right values and the code is ready.

With NUMBER_OF_PASSES = 3 generated size is 25695. With 4 passes it is 25006, with 5 - 24858, with 6 - 24829, with 7 - 24825 and at this moment it stabilizes. Further passes give no improvement.

This shows how with pessimistic variant we get smaller and smaller output and we can take any intermediate stage and still have a correct, just less optimized code. It is never as small as with the optimistic multi-pass, though.

One more thing to add is that in the pessimistic multi-pass the last of the passes does not really need to go over all the instructions again. In the previous pass it could have created a table of relocations - offsets of all the fields within code that contain addresses (either relative or absolute), and information what addresses have to be put there. Then instead of the final pass one can just process this relocation table and update all the values in code accordingly. The early assemblers did not even do that, they just stored this complete relocation table in the generated object file and it was the linker that later processed it (this was saving time). So the NUMBER_OF_PASSES = 2 setting in the above sample did in fact simulate a single-pass assembler, because what the second pass was doing the same job as simply applying the relocation table.

Last edited by Tomasz Grysztar on 07 Sep 2020, 08:27; edited 2 times in total

11 Dec 2017, 22:08

revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20690
Location: In your JS exploiting you and your system

revolution 12 Dec 2017, 00:55

For THUMB code fasmarm is optimistic also, but I changed the behaviour to be "sticky". The first pass will allocate the smaller instruction. If a future pass finds that the larger instruction is needed then that will remain larger for all following passes. It can be thought of as a "sticky" allocation. I found that this significantly reduced the number of times I saw the error "code cannot be generated" while still producing optimal code in almost all cases.

This "sticky" approach becomes especially important when the IT instruction is available. There are a lot of opportunities to choose either a long or a short encoding for almost all instructions, not just branches, so the number of permutations expands greatly and it could take many years to perform enough passes to finally produce a working output.

12 Dec 2017, 00:55

Tomasz Grysztar

Joined: 16 Jun 2003
Posts: 8434
Location: Kraków, Poland

Tomasz Grysztar 12 Dec 2017, 09:13

There are many x86 instructions that have longer and shorter forms, it mostly comes down to the choice between 8-bit or larger displacement or immediate. But normally only the ones that contain a relative offset in such field require multiple passes, and they are mostly branches. But if a programmer uses a relative offset as a data for arithmetic instruction, such instruction is also going to behave like a jump during the optimization:

Code:

        add     ecx, end - start

When sizes of some instructions change, this alters the relative offsets and may in turn change the length of an instruction like the above one.

I have already mentioned earlier that the reason why x86 optimization does not have the problem of "oscillations" is that under normal circumstances making an instruction smaller cannot cause any instruction to become larger, because when relative offsets decrease the instruction may only get smaller. Similarly, making an instruction larger cannot make any instruction smaller, as the relative offsets only grow. Thanks to this both the pessimistic and optimistic approach have no problem reaching the solution every time.

With fasm's unrestricted forward-referencing you can write some self-dependencies that may create "size oscillation" even with x86 instructions, but this would count as highly unusual circumstance. But when you have an architecture where changing size of an instruction may cause the opposite shift elsewhere (or even in the very same instruction, like in the case of alignment-dependent encodings) then this problem has to be taken into account.

12 Dec 2017, 09:13

rugxulo

Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)

rugxulo 20 Dec 2017, 23:15

I hope I'm not intruding or wasting your time here. I'm of limited input.

Back when I recently converted PSR Invaders to various assemblers (from TASM's MASM mode), I noticed that a lot of assemblers didn't fully handle forward references or size optimizations. A86 is one-pass and only optimizes backwards jumps for size. NASM 0.97 didn't do any optimizations at all (strictly two-pass only), presumably to conserve memory, so the workaround was probably to manually tell it exactly what size you wanted (if it truly mattered, which it often doesn't). Later NASM did get -Ox and made it default. YASM always had multi-pass optimizations (but it was buggy for a while). SmallerC guy swears by YASM since it's much faster for him. Even TASM supposedly didn't get multi-pass until 2.0, and the default INVADERS.COM binary forgot to enable it! So it's 9356 vs. 9194 bytes (which is apparently one extra 512-byte sector wasted, heheh). And yet UPX compresses them almost to the same size anyways, go figure.

Obviously older computers and assemblers were just slow (EDIT: see here at bottom). Requiring a linker sounds bad, but a lot of people insist upon modularity, separate libraries, etc. Also, memory was somewhat scarce, but it's always a tradeoff.

Size doesn't matter. I wish it did, but nobody cares. Even in the old days, almost nobody cared. One of the worst culprits is non-smart linkers and bloated routines like printf (which in latest DJGPP is ridiculously large, comparatively), but you can also blame static linking for that.

Actually, size can still matter (with small cpu cache), but modern Intel/AMD are more copious than the old days. Although I never had a Mac, I found an interesting mention on Wikipedia:

PowerPC wrote:

Apple tried to use the 603 in a new laptop design but was unable due to the small 8 KiB level 1 cache. The 68000 emulator in the Mac OS could not fit in 8 KiB and thus slowed the computer drastically. The 603e solved this problem by having a 16 KiB L1 cache which allowed the emulator to run efficiently.

So in some rare cases, size can still matter, I suppose (although I assume even PowerPC nowadays has multi-meg L1/L2/L3 caches). Quoting Wikipedia again (about Wii U circa 2012):

Espresso (microprocessor] wrote:

A total of 3 MB of Level 2 cache in an unusual configuration.
Core 0: 512 KB, core 1: 2 MB, core 2: 512 KB

Obviously this is to cater to the HLL crowd, which is predominant these days. I don't really blame them, HLLs are still useful (sometimes), but printf is too darn bloated! Ugh! (Of course, you aren't forced to use it, but everyone still does, out of carelessness.) Even Modula-2 (InOut.WriteInt) and Oberon (Out.Int) split out common routines for easier smartlinking vs. Pascal's "WRITE(anything)" (which is slightly more convenient to type but only barely).

Okay, so I went off on a tangent. We all want to have our cake and to eat it too. But I don't think it's unreasonable to expect the end user to be explicit about size (especially when it makes little difference, a few kb here or there) for much less work. Optimizations are great, but few algorithms are optimal anyways. Most times "good enough" is acceptable.

20 Dec 2017, 23:15

Furs

Joined: 04 Mar 2016
Posts: 2652

Furs 21 Dec 2017, 12:36

rugxulo wrote:

Size doesn't matter. I wish it did, but nobody cares. Even in the old days, almost nobody cared.

I know, I'm nobody Wink

I think printf should always be dynamically linked, then at least its bloat is justified (because it's general purpose for all applications, not just yours).

21 Dec 2017, 12:36

revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20690
Location: In your JS exploiting you and your system

revolution 21 Dec 2017, 13:11

Furs wrote:

[I think printf should always be dynamically linked, then at least its bloat is justified (because it's general purpose for all applications, not just yours).

Except that then everyone would have a slightly different version of it and results would be inconsistent and incompatible.

Please, let's not get into the DLL hell of old again. Static linking has two main advantages that I know of 1) consistent results for your app, you don't need to rely upon someone else to have the exact same precise libraries you have, and 2) it makes it much easier to have portable apps that don't need "installing", just start using it without all the install ugliness.

It's only a few kB, don't worry about it. That's why we have GB memories, to actually use it and make things easier.

21 Dec 2017, 13:11

Furs

Joined: 04 Mar 2016
Posts: 2652

Furs 21 Dec 2017, 15:16

But printf is a "stable" function with a stable API and even adding more format specifiers to it isn't going to break anything. Also, in such situation, static linking is still not the most ideal IMO. The app can always just come with all the .dll files it needs (if it's something not supplied in the system, like msvcrt.dll is on Windows so the point is moot but whatever). Advantage is that you can manually place these .dll files so more apps share them or whatever if you want (can't do that with static linking unless you recompile it). But I guess we're getting off topic Smile

(also I love portable apps too and hate those that mess with the system, but .dlls can still be shared, my "portable" program directory has tons of apps and a directory for dlls, which is then hardlinked to each app that uses it or a launch script made to change current directory so the dll loads; the collection of apps is still 100% portable, but much more efficient, it's like using Solid Compression -- of course, such a thing is impossible to do if the app is statically linked, giving user no choice)

21 Dec 2017, 15:16

revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20690
Location: In your JS exploiting you and your system

revolution 21 Dec 2017, 15:48

printf implementations are inconsistent. For one bad example, the rounding for floats is different, some even do this deliberately. This is why sharing DLLs creates problems, because of different behaviours. And adding more format specifiers does break things since the behaviour changes. Have you seen those websites where they proudly show you "You saved 9.49999999999999995%"? That is because of different rounding of the printf implementation in the production machine compared to the test machine. So why is it different between the machines? Because the test machine had some other program installed that overwrote the printf DLL with a different one, or it runs a different OS where printf was coded differently, or someone reverted the printf DLL by mistake, or whatever. Rolling Eyes

21 Dec 2017, 15:48

Tomasz Grysztar

Joined: 16 Jun 2003
Posts: 8434
Location: Kraków, Poland

Tomasz Grysztar 21 Dec 2017, 15:50

rugxulo wrote:

So in some rare cases, size can still matter, I suppose (...).

With fasm it is possible to test things like these.

For an experiment, I made fasmg assemble an altered version of itself, where I turned off the imm8 optimization in basic x86 instruction handler (the one that implements ADD, OR, ADC, SBB, AND, SUB, XOR, CMP) by turning off the IF conditions around the line 414 in 80386.INC.

This is not huge change, the Linux version of fasmg that was 51078 bytes originally, assembled to 53326 bytes with this alteration. Windows version grew from 52736 bytes to 54784, but keep in mind the PE size is aligned to 512 byte boundary.

Now when I prepared a simple test that included X64.INC and assembled 100000 instructions, the original (smaller) version assembles it in 25.0 seconds, while the larger version needs 25.4 seconds to do the same. This is a measurable difference! Both versions are exactly the same instructions, just some of them encoded differently. But this variation in size is enough to cause a noticeable effect in overall performance.

Trying to turn off some other optimizations like short displacements in memory addressing or short relative jumps grows the code so significantly that JECXZ instructions in fasmg's source start to fail with "out of range" error, so it is then no longer even possible to test exactly the same instruction sequence. So the fact that larger larger versions turn out even slower could be blamed on that instead.

21 Dec 2017, 15:50

Furs

Joined: 04 Mar 2016
Posts: 2652

Furs 22 Dec 2017, 12:57

Well speed may not be so obvious, but size is always measurable. Whether people find it significant or not is another matter. I personally find it cool and beautiful to see lite compiled code in KBs perform the same functions that people think take tends of MBs instead because they're so used to bloat (at the same speed, if not faster, so it's not about pre-caching data)... I guess hacker culture kind of influenced my ways. Smile

22 Dec 2017, 12:57

rugxulo

Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)

rugxulo 22 Dec 2017, 23:06

I also found this interesting:

esolangs.org/wiki/MMIX wrote:

Knuth has also defined an assembly syntax and an executable format for MMIX. The executable format doesn't support linking multiple object files (static or dynamic), you have to assemble the whole program together. The executable format is such that the assembler outputs it streaming in a single pass, so forward references and output location changes are preserved in the binary as escape codes and the operating system resolves them at program load time.

22 Dec 2017, 23:06

fasmnewbie

Joined: 01 Mar 2011
Posts: 555

fasmnewbie 25 Feb 2018, 18:27

Excellent educational material.

If I may add, it would be even more helpful if you could explain in similar amount of details the different phases of assembly process ... pre-processing, assembling, compiling etc... Can serve as a reference like for example, statements precedence, valid symbols, macro evaluations at each phases etc. People been repeating the same questions and the same confusions about these (pre-processing vs assembling etc etc).

25 Feb 2018, 18:27

Tomasz Grysztar

Joined: 16 Jun 2003
Posts: 8434
Location: Kraków, Poland

Tomasz Grysztar 28 Apr 2019, 19:49

There is now a continuation of this article, in Pitfalls of optimistic multi-pass assembly.

28 Apr 2019, 19:49

Tomasz Grysztar

Joined: 16 Jun 2003
Posts: 8434
Location: Kraków, Poland

Tomasz Grysztar 29 Apr 2019, 13:50

It is also worth noting that prediction techniques are important in optimistic multi-pass assembly. They can affect not only the number of passes required to reach the solution, but sometimes even the quality of the final result. While developing fasm I initially underestimated the effect of shifting predictions, but I had to change my mind after I saw that they can make a notable difference.

NASM, which also uses an optimistic multi-pass approach, assembles the samples generated above with the same results as fasm. But the algorithms of these assemblers are not completely equivalent, as demonstrable with this slightly modified generator:

Code:

format binary as 'inc'

xorshift = 1338300172

struc rand32
        . = xorshift shr 96
        . = ( . xor (. shl 11) xor (. shr 8) xor (xorshift) xor (xorshift shr 19) ) and 0FFFFFFFFh
        xorshift = (. or xorshift shl 32) and 0FFFFFFFF_FFFFFFFF_FFFFFFFFh
end struc

struc rand limit
        while 1
                . rand32
                . = . and (1 shl (bsr (limit) + 1) - 1)
                if . <= limit
                        break
                end if
        end while
end struc

TOTAL = 20000

repeat TOTAL
        db 'position',`%,': '
        R rand 77
        S rand 77
        repeat 1, first: (TOTAL + % + R - 37) mod TOTAL + 1, second: (TOTAL + % + S - 37) mod TOTAL + 1
                db 'add esi,position',`first,'-position',`second,10
        end repeat
end repeat

Again, fasmg is needed to process this script and generate a 20000-line challenge source that looks like:

Code:

position1: add esi,position19977-position5
position2: add esi,position31-position20000
position3: add esi,position39-position19983
position4: add esi,position24-position27
position5: add esi,position19989-position12
position6: add esi,position40-position18
position7: add esi,position17-position22
position8: add esi,position19999-position38
...

This test assembles to 109604 bytes with fasm (and also with fasmg when the standard x86 macro package is used), but to 109628 with NASM (2.14.03rc2). While they both use an optimistic method, there is a discrepancy in the results because of differences in the details of resolving algorithm (like the formula of making predictions).

29 Apr 2019, 13:50

< Last Thread | Next Thread >

Forum Rules:

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum