flat assembler
Message board for the users of flat assembler.
Types of multi-pass assembly
Tomasz Grysztar 11 Dec 2017, 22:08
As any user of fasm probably knows, it is a multi-pass assembler, and the basic function of the multiple passes is to optimize the generated code so that it uses the short forms of instructions whenever possible (fasm's multi-pass engine with its unrestricted forward-referencing can do much more, but that is a story for a different day).
What might be less commonly known is that there are actually different kinds of multi-pass assembly, and the one that fasm uses is not the same as the one usually described when this topic comes up. The initial pass of multi-pass assembly is done to get at least a rough idea of the sizes and positions of all instructions. When fasm does it, it uses the optimistic assumption that every instruction should have its shortest possible form. In the next pass the assembler already has a first draft of all the values that may affect the sizes of generated instructions, and it may turn out that the optimistic assumption was too optimistic and some parts of the code must be enlarged. This in turn changes some of the addresses, and the assembler needs another pass to find out whether even more instructions have to be larger than initially assumed. It repeats this process until all the values are stable, and only then is the generated code valid: if during the last pass all the addresses stayed the same as in the pass before it, the instructions generated from the values of the previous pass turn out correct.

A more traditional approach was to use pessimistic assumptions instead of fasm's optimistic ones. That is, every instruction is at first generated large and is shrunk only if a later pass shows that the range of addresses allows it to be shorter. An advantage of the pessimistic multi-pass is that it can usually be ended after any of the intermediate passes, and all the instructions can then be updated with the addresses collected during the last of these passes to create correct code. The instructions were generated pessimistically, so every one of them should be large enough to hold the correct values. The optimistic variant has intermediate passes in which some instructions are too small to hold the right values, and such code cannot be fixed in this simple way; the assembler has to continue with further passes until it reaches a solution.
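Both strategies amount to iterating instruction sizes to a fixed point from different starting assumptions. The following is a minimal sketch in Python, a toy model rather than fasm's actual algorithm; the sizes assume the 2-byte short jump and the 3-byte near jump forms used by the macros later in this thread:

```python
# Toy model: a program is a list whose entries are either an int (that many
# literal bytes) or a label name (meaning "jmp label"). labels maps a name
# to the entry index it sits in front of. Not fasm's real implementation.

SHORT, NEAR = 2, 3   # EBh + rel8 vs. E9h + 16-bit displacement

def multipass(program, labels, initial):
    """Iterate jump sizes to a fixed point; return (total size, pass count)."""
    sizes = [initial if isinstance(e, str) else e for e in program]
    for npass in range(1, 1000):
        addr = [0]
        for s in sizes:                 # addresses implied by the current sizes
            addr.append(addr[-1] + s)
        new = []
        for i, e in enumerate(program):
            if isinstance(e, str):      # a jump: re-check its displacement
                disp = addr[labels[e]] - (addr[i] + 2)   # target - ($+2)
                new.append(SHORT if -128 <= disp <= 127 else NEAR)
            else:
                new.append(e)
        if new == sizes:                # all values stable: code is valid
            return sum(sizes), npass
        sizes = new
    return None                         # no fixed point found (oscillation)

# A forward jump over 127 filler bytes: starting optimistically finds the
# 2-byte encoding, while the pessimistic start stays stuck with the 3-byte
# one, because with the near jump in place the displacement looks like 128.
program = ["end", 127]
labels = {"end": 2}      # "end" sits after the last entry
print(multipass(program, labels, SHORT))   # (129, 1)
print(multipass(program, labels, NEAR))    # (130, 1)
```

With 126 filler bytes instead, both starting points reach the same 128-byte result, but the pessimistic start needs one extra pass to shrink the jump.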
Note that both the pessimistic and the optimistic multi-pass may fail to find a solution in specific circumstances, like what I call the oscillator problem (a pessimistic assembler would usually fail in a different manner, by completely disallowing some types of use of forward-referenced values). But this does not happen when optimizing things like jumps on the x86 architecture, where shrinking one instruction never forces any other instruction to be enlarged, or vice versa.

An advantage of the optimistic multi-pass is that it is able to find solutions that are smaller. This can be demonstrated with a simple example:

Code:
a:
	jmp c
b:
	db 124 dup 90h
	jmp a
	db 90h
c:

But we can do more: it is in fact possible to use fasm to test and compare both approaches to multi-pass assembly. First, we need some code for testing. I'm going to generate an artificial sample, consisting only of jump instructions, with the help of fasmg. I put the following code into TEST.ASM:

Code:
format binary as 'inc'

xorshift = 11111111h

struc rand32
	. = xorshift shr 96
	. = ( . xor (. shl 11) xor (. shr 8) xor (xorshift) xor (xorshift shr 19) ) and 0FFFFFFFFh
	xorshift = (. or xorshift shl 32) and 0FFFFFFFF_FFFFFFFF_FFFFFFFFh
end struc

struc rand limit
	while 1
		. rand32
		. = . and (1 shl (bsr (limit) + 1) - 1)
		if . <= limit
			break
		end if
	end while
end struc

NUMBER_OF_JUMPS = 10000

repeat NUMBER_OF_JUMPS
	db 'position',`%,': '
	R rand 100
	repeat 1, target:(% + R) mod NUMBER_OF_JUMPS + 1
		db 'jmp position',`target,10
	end repeat
end repeat

Assembling it produces a TEST.INC file that begins like this:

Code:
position1: jmp position53
position2: jmp position20
position3: jmp position97
position4: jmp position56
position5: jmp position6
position6: jmp position83
position7: jmp position42
position8: jmp position60
...

Now, to test the optimistic multi-pass we could try just assembling it with fasm. But to get a better view of what is happening, we are going to implement the jump instruction as a macro.
This is something that should work the same with fasm 1 and fasmg; only the basic macro syntax (braces for fasm 1, "end macro" for fasmg) needs to be changed accordingly. I'm going to use the fasm 1 syntax here:

Code:
macro jmp target
{
	if ~ defined target | ( target-($+2) < 128 & target-($+2) >= -128 )
		db 0EBh,target-($+1)
	else
		db 0E9h
		dw target-($+2)
	end if
}

include 'test.inc'

This generates 24772 bytes in 8 passes. Now we can alter the macro to make a pessimistic variant:

Code:
macro jmp target
{
	if ~ defined target | target-($+2) >= 128 | target-($+2) < -128
		db 0E9h
		dw target-($+2)
	else
		db 0EBh,target-($+1)
	end if
}

include 'test.inc'

Keep in mind that in both cases fasm's label prediction algorithms are applied when the distance to the target is computed, and another implementation could give different results (see below). But fasmg gives the same outcome as fasm 1; they are mostly compatible at this level.

What about terminating the pessimistic multi-pass prematurely and keeping the partially optimized code as a good-enough solution? This too can be emulated with a variation of the above macros:

Code:
NUMBER_OF_PASSES = 2

macro jmp target
{
	local LARGE
	if ~ defined target | target-($+2) >= 128 | target-($+2) < -128 | (PASS = LAST_PASS & LARGE)
		db 0E9h
		dw target-($+2)
		LARGE = 1
	else
		db 0EBh,target-($+1)
		LARGE = 0
	end if
}

include 'test.inc'

LAST_PASS = NUMBER_OF_PASSES - 1
if ~ defined PASS
	PASS = 1
else if PASS >= LAST_PASS
	PASS = LAST_PASS
else
	PASS = PASS + 1
end if

This variant starts pessimistically and keeps optimizing the instructions until the penultimate pass is reached. At that point it keeps the instruction sizes as they were, and in the final pass the assembler fills them with the correct values. With NUMBER_OF_PASSES = 2 this does not optimize the instructions at all; in the second pass it just fills the fields with the right values and the code is ready. With NUMBER_OF_PASSES = 3 the generated size is 25695.
With 4 passes it is 25006, with 5 passes 24858, with 6 passes 24829, with 7 passes 24825, and at that point it stabilizes; further passes give no improvement. This shows how with the pessimistic variant we get smaller and smaller output, and we can take any intermediate stage and still have correct, just less optimized, code. It is never as small as with the optimistic multi-pass, though.

One more thing to add: in the pessimistic multi-pass the last of the passes does not really need to go over all the instructions again. The previous pass could have created a table of relocations, that is, the offsets of all the fields within the code that contain addresses (either relative or absolute), plus information about which addresses have to be put there. Then, instead of the final pass, one can just process this relocation table and update all the values in the code accordingly. The early assemblers did not even do that: they simply stored this complete relocation table in the generated object file, and it was the linker that later processed it (this saved time). So the NUMBER_OF_PASSES = 2 setting in the above sample in fact simulated a single-pass assembler, because the second pass was doing the same job as simply applying the relocation table.

Last edited by Tomasz Grysztar on 07 Sep 2020, 08:27; edited 2 times in total
Tomasz Grysztar 12 Dec 2017, 09:13
There are many x86 instructions that have longer and shorter forms; it mostly comes down to the choice between an 8-bit and a larger displacement or immediate. But normally only the ones that contain a relative offset in such a field require multiple passes, and these are mostly branches. However, if a programmer uses a relative offset as data for an arithmetic instruction, that instruction is also going to behave like a jump during the optimization:
Code:
add ecx, end - start

I have already mentioned earlier that the reason why x86 optimization does not have the problem of "oscillations" is that under normal circumstances making an instruction smaller cannot cause any other instruction to become larger: when relative offsets decrease, instructions may only get smaller. Similarly, making an instruction larger cannot make any other instruction smaller, as the relative offsets only grow. Thanks to this, both the pessimistic and the optimistic approach have no problem reaching a solution every time. With fasm's unrestricted forward-referencing you can write self-dependencies that create a "size oscillation" even with x86 instructions, but this would count as a highly unusual circumstance. But when you have an architecture where changing the size of an instruction may cause the opposite shift elsewhere (or even in the very same instruction, as in the case of alignment-dependent encodings), then this problem has to be taken into account.
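A contrived self-dependency of the kind mentioned above can be sketched like this. The layout is hypothetical, invented for illustration: a jump followed by a filler whose length shrinks by two bytes whenever the jump grows by one, so no size assignment is ever stable.

```python
# Hypothetical "oscillator": jmp a, then a filler of (132 - 2*jump_size)
# bytes, then the label a. Shrinking the jump pushes a out of rel8 range;
# growing the jump pulls a back in, so the size flips on every pass.
SHORT, NEAR = 2, 3   # EBh + rel8 vs. E9h + 16-bit displacement

size = SHORT          # optimistic starting assumption
history = []
for _ in range(6):
    history.append(size)
    filler = 132 - 2 * size          # self-dependent filler length
    target = size + filler           # address of label a
    disp = target - (0 + 2)          # short-jump displacement: target - ($+2)
    new = SHORT if -128 <= disp <= 127 else NEAR
    if new == size:
        break                        # a fixed point (never reached here)
    size = new

print(history)   # [2, 3, 2, 3, 2, 3] - the sizes oscillate forever
```

With the jump short, the displacement computes as 128 (too far); with it near, the filler shrinks and the displacement computes as 127 (fits), so the assembler can never settle on either form.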
rugxulo 20 Dec 2017, 23:15
I hope I'm not intruding or wasting your time here; I have only limited input to offer.
Back when I recently converted PSR Invaders to various assemblers (from TASM's MASM mode), I noticed that a lot of assemblers didn't fully handle forward references or size optimizations. A86 is one-pass and only optimizes backwards jumps for size. NASM 0.97 didn't do any optimizations at all (strictly two-pass only), presumably to conserve memory, so the workaround was probably to manually tell it exactly what size you wanted (if it truly mattered, which it often doesn't). Later NASM did get -Ox and made it the default. YASM always had multi-pass optimizations (but it was buggy for a while). The SmallerC guy swears by YASM since it's much faster for him. Even TASM supposedly didn't get multi-pass until 2.0, and the default INVADERS.COM binary forgot to enable it! So it's 9356 vs. 9194 bytes (which is apparently one extra 512-byte sector wasted, heheh). And yet UPX compresses them almost to the same size anyways, go figure.

Obviously older computers and assemblers were just slow (EDIT: see here at bottom). Requiring a linker sounds bad, but a lot of people insist upon modularity, separate libraries, etc. Also, memory was somewhat scarce, but it's always a tradeoff.

Size doesn't matter. I wish it did, but nobody cares. Even in the old days, almost nobody cared. One of the worst culprits is non-smart linkers and bloated routines like printf (which in the latest DJGPP is ridiculously large, comparatively), but you can also blame static linking for that. Actually, size can still matter (with small cpu cache), but modern Intel/AMD caches are more copious than in the old days. Although I never had a Mac, I found an interesting mention on Wikipedia: PowerPC wrote:
So in some rare cases, size can still matter, I suppose (although I assume even PowerPC nowadays has multi-meg L1/L2/L3 caches). Quoting Wikipedia again (about Wii U circa 2012): Espresso (microprocessor) wrote:
Obviously this is to cater to the HLL crowd, which is predominant these days. I don't really blame them; HLLs are still useful (sometimes), but printf is too darn bloated! Ugh! (Of course, you aren't forced to use it, but everyone still does, out of carelessness.) Even Modula-2 (InOut.WriteInt) and Oberon (Out.Int) split out common routines for easier smartlinking, versus Pascal's "WRITE(anything)" (which is slightly more convenient to type, but only barely).

Okay, so I went off on a tangent. We all want to have our cake and eat it too. But I don't think it's unreasonable to expect the end user to be explicit about size (especially when it makes little difference, a few kB here or there) in exchange for much less work. Optimizations are great, but few algorithms are optimal anyway. Most times "good enough" is acceptable.
Furs 21 Dec 2017, 12:36
rugxulo wrote: Size doesn't matter. I wish it did, but nobody cares. Even in the old days, almost nobody cared.
revolution 21 Dec 2017, 13:11
Furs wrote: I think printf should always be dynamically linked, then at least its bloat is justified (because it's general purpose for all applications, not just yours).

Please, let's not get into the DLL hell of old again. Static linking has two main advantages that I know of: 1) consistent results for your app, since you don't need to rely upon someone else having exactly the same libraries you have; and 2) it makes it much easier to have portable apps that don't need "installing", you just start using them without all the install ugliness. It's only a few kB, don't worry about it. That's why we have GBs of memory: to actually use them and make things easier.
Furs 21 Dec 2017, 15:16
But printf is a "stable" function with a stable API, and even adding more format specifiers to it isn't going to break anything. Also, in such a situation, static linking is still not ideal IMO. The app can always just come with all the .dll files it needs (if it's something not supplied with the system, unlike msvcrt.dll on Windows, so the point is moot there, but whatever). The advantage is that you can manually place these .dll files so that more apps share them if you want (you can't do that with static linking unless you recompile). But I guess we're getting off topic.
(Also, I love portable apps too and hate those that mess with the system, but .dlls can still be shared: my "portable" program directory has tons of apps and a directory for dlls, which is then hardlinked into each app that uses it, or a launch script is made to change the current directory so the dll loads. The collection of apps is still 100% portable, but much more efficient; it's like using solid compression. Of course, such a thing is impossible if the app is statically linked, which gives the user no choice.)
revolution 21 Dec 2017, 15:48
printf implementations are inconsistent. As one bad example, the rounding of floats differs between them; some even do this deliberately. This is why sharing DLLs creates problems: different behaviours. And adding more format specifiers does break things, since the behaviour changes. Have you seen those websites that proudly show you "You saved 9.49999999999999995%"? That is because the printf implementation on the production machine rounds differently from the one on the test machine. And why is it different between the machines? Because the test machine had some other program installed that overwrote the printf DLL with a different one, or it runs a different OS where printf was coded differently, or someone reverted the printf DLL by mistake, or whatever.
Tomasz Grysztar 21 Dec 2017, 15:50
rugxulo wrote: So in some rare cases, size can still matter, I suppose (...).

As an experiment, I made fasmg assemble an altered version of itself, where I turned off the imm8 optimization in the basic x86 instruction handler (the one that implements ADD, OR, ADC, SBB, AND, SUB, XOR and CMP) by disabling the IF conditions around line 414 in 80386.INC. This is not a huge change: the Linux version of fasmg that was originally 51078 bytes assembled to 53326 bytes with this alteration. The Windows version grew from 52736 bytes to 54784, but keep in mind that the PE size is aligned to a 512-byte boundary.

When I then prepared a simple test that included X64.INC and assembled 100000 instructions, the original (smaller) version assembled it in 25.0 seconds, while the larger version needed 25.4 seconds to do the same. This is a measurable difference! Both versions consist of exactly the same instructions, just with some of them encoded differently, yet this variation in size is enough to cause a noticeable effect on overall performance.

Trying to turn off some other optimizations, like short displacements in memory addressing or short relative jumps, grows the code so significantly that JECXZ instructions in fasmg's source start to fail with an "out of range" error, so it is then no longer possible to test exactly the same instruction sequence, and the fact that such larger versions turn out even slower could be blamed on that instead.
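For reference, the imm8 optimization discussed here boils down to a choice between two standard x86 encodings; here is a sketch for one instruction (the opcodes and ModRM byte are the documented x86 ones, the helper function itself is made up for illustration):

```python
# Encoding ADD ESI, imm in 32-bit mode: opcode 83h /0 with a sign-extended
# 8-bit immediate when the value fits in a signed byte, otherwise 81h /0
# with a full 32-bit immediate. The ModRM byte C6h = mod 11, reg /0 (ADD),
# rm 110 (ESI).
def encode_add_esi(imm):
    modrm = 0xC6
    if -128 <= imm <= 127:
        return bytes([0x83, modrm]) + imm.to_bytes(1, "little", signed=True)
    return bytes([0x81, modrm]) + imm.to_bytes(4, "little", signed=True)

print(encode_add_esi(5).hex())      # 83c605        (3 bytes)
print(encode_add_esi(1000).hex())   # 81c6e8030000  (6 bytes)
```

Disabling the optimization means always taking the second branch: every small immediate costs three extra bytes, which is roughly where the 51078-to-53326-byte growth comes from.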
Furs 22 Dec 2017, 12:57
Well, speed may not be so obvious, but size is always measurable. Whether people find it significant or not is another matter. I personally find it cool and beautiful to see lean compiled code in KBs perform the same functions that people think take tens of MBs, because they're so used to bloat (at the same speed, if not faster, so it's not about pre-caching data)... I guess hacker culture kind of influenced my ways.
rugxulo 22 Dec 2017, 23:06
I also found this interesting:
esolangs.org/wiki/MMIX wrote:
fasmnewbie 25 Feb 2018, 18:27
Excellent educational material.
If I may add, it would be even more helpful if you could explain, in a similar amount of detail, the different phases of the assembly process: pre-processing, assembling, compiling, etc. It could serve as a reference covering, for example, statement precedence, valid symbols, and macro evaluation at each phase. People have been repeating the same questions and the same confusions about these (pre-processing vs assembling, etc.).
Tomasz Grysztar 28 Apr 2019, 19:49
There is now a continuation of this article, in Pitfalls of optimistic multi-pass assembly.
Tomasz Grysztar 29 Apr 2019, 13:50
It is also worth noting that prediction techniques are important in optimistic multi-pass assembly. They can affect not only the number of passes required to reach the solution, but sometimes even the quality of the final result. While developing fasm I initially underestimated the effect of shifting predictions, but I had to change my mind after I saw that they can make a notable difference.
NASM, which also uses an optimistic multi-pass approach, assembles the samples generated above with the same results as fasm. But the algorithms of these assemblers are not completely equivalent, as can be demonstrated with this slightly modified generator:

Code:
format binary as 'inc'

xorshift = 1338300172

struc rand32
	. = xorshift shr 96
	. = ( . xor (. shl 11) xor (. shr 8) xor (xorshift) xor (xorshift shr 19) ) and 0FFFFFFFFh
	xorshift = (. or xorshift shl 32) and 0FFFFFFFF_FFFFFFFF_FFFFFFFFh
end struc

struc rand limit
	while 1
		. rand32
		. = . and (1 shl (bsr (limit) + 1) - 1)
		if . <= limit
			break
		end if
	end while
end struc

TOTAL = 20000

repeat TOTAL
	db 'position',`%,': '
	R rand 77
	S rand 77
	repeat 1, first: (TOTAL + % + R - 37) mod TOTAL + 1, second: (TOTAL + % + S - 37) mod TOTAL + 1
		db 'add esi,position',`first,'-position',`second,10
	end repeat
end repeat

This generates a file that begins like this:

Code:
position1: add esi,position19977-position5
position2: add esi,position31-position20000
position3: add esi,position39-position19983
position4: add esi,position24-position27
position5: add esi,position19989-position12
position6: add esi,position40-position18
position7: add esi,position17-position22
position8: add esi,position19999-position38
...
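As an aside, the rand32 step used by both generators can be transcribed directly into Python to follow its arithmetic (the same shifts and masks as the macro; note that since the state is kept to 96 bits, the initial `shr 96` term is always zero, so the `shl 11` and `shr 8` terms of this xorshift variant end up vacuous):

```python
# Direct transcription of the fasmg rand32 struc above (not a recommended
# PRNG, just the same arithmetic the generators perform).
MASK32 = 0xFFFFFFFF
MASK96 = 0xFFFFFFFF_FFFFFFFF_FFFFFFFF

def rand32(state):
    """One generator step; returns (32-bit value, new 96-bit state)."""
    v = state >> 96   # always 0, since the state is masked to 96 bits
    v = (v ^ (v << 11) ^ (v >> 8) ^ state ^ (state >> 19)) & MASK32
    state = (v | (state << 32)) & MASK96
    return v, state

v, state = rand32(0x11111111)   # the seed used by the first generator
print(hex(v))                   # 0x11111333
```

The `rand limit` struc then simply masks this value down to the smallest power-of-two range covering `limit` and retries until the result is small enough.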
Copyright © 1999-2024, Tomasz Grysztar.