flat assembler
Message board for the users of flat assembler.

Some minor fasm deficiencies.

l_inc
Joined: 23 Oct 2009
Posts: 881
Good day. From time to time I notice some problems with fasm and collect the corresponding information in a text file for a later report. Now the time to report has come, so here are my notes:

1. I know that in some cases the semantics of a size modifier is to produce a longer instruction encoding, and in other cases it is to choose the corresponding addressing mode. Here are some examples of that behaviour being seemingly inconsistent:
Code:
use32
add eax, [dword 0] ;choosing addressing mode. OK.
add eax, [word 0]  ;choosing addressing mode. OK.
use64
add rax, [qword 0] ;WTF? This does not exist, but still compiles. Not OK.
lea rax, [qword 0] ;same problem. Not OK.
mov rbx, [qword 0] ;same problem. Not OK.

use32
;Is there a way to generate 0xFFC0 at all in 32 bit mode?
inc dword eax ;not documented. However expected long opcode analogous to e.g. push    


2. In some cases fasm fails to compile correct instructions because of a failed optimization attempt.
Code:
use64
org 0
;Obviously there's an attempt to generate a rip-relative address,
;but it fails. The instruction however still exists. Thus failed
;optimization should not result in failed compilation.
mov rax, [$123456789] ;does not compile without qword. Not OK.

;This one is even worse. There are 2 ways to encode the instruction,
;but fasm does not find any of those.
org 100000000h
lea edx,[0]        ;fails to compile. Not OK.

;absolute addressing enforced:
org 100000000h
lea edx,[dword 0]  ;OK. But not the expected encoding for the previous one.

;eip-relative addressing enforced:
org 100000000h
lea edx,[eip - @F and 0xFFFFFFFF + 0] ;kinda perverse way. This is the expected encoding for lea edx,[0]
@@:    
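
The arithmetic behind the last variant can be sketched in Python (a model of the address calculation, not fasm output). It assumes that `and` binds tighter than `+`/`-` in fasm expressions, that `@F` is the address right after the instruction, and that with a 32-bit address size the sum of the next-instruction address and the displacement wraps modulo 2^32. The function name and the `length` parameter are made up for illustration; the instruction length cancels out.

```python
ORG = 0x100000000
MASK32 = 0xFFFFFFFF

def eip_relative_target(length):
    """Address reached by [eip - (@F and 0xFFFFFFFF) + 0] for an instruction of `length` bytes."""
    next_addr = ORG + length              # value of @F
    disp = -(next_addr & MASK32) + 0      # displacement added to eip
    # 32-bit address size: the effective address is truncated to 32 bits.
    return (next_addr + disp) & MASK32

for length in (6, 7, 8):
    print(length, hex(eip_relative_target(length)))  # always 0x0
```

Under these assumptions the result is address 0 for any instruction length, which is why this form is described above as the expected encoding for lea edx,[0].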


3. This one is rather a documentation lack. The order of preprocessing is undefined when symbols with identical names are used for different purposes. The most critical case seems to be this one:
Code:
struc x .
{
        local .,.
                display `.,13,10
}
a x b    

What would one expect to be displayed here? Empirically it seems like the struct name a is overridden by the argument b, which is in turn overridden by the first local symbol . and then by the second local symbol . . Additional identically named local symbols will override the previously defined. But I didn't find any documentation on that.

4. Try to build symbols for the following one (fasm asm.asm -s asm.fas). Btw. the option -s also does not seem to be documented in the pdf.
Code:
"""aaaaaaaa    


5. I don't know if it's a documentation lack, but the behaviour of dotted numeric constants is unclear. I actually didn't expect the following one to fail:
Code:
display .world
hello:
.world = 'a'    

The documentation states that numeric constants are "very much like labels", but does not really define the borders of this "very much like". For example, "global" numeric constants do not become prefixes for dotted labels or numeric constants. Why then are dotted numeric constants prefixed by "global" labels?

6. The registry key numeric constants for the 64 bit configuration (kernel64.inc) are defined incorrectly. Generous MS allowed those incorrect values to work as well, but actually those should be defined like this:
Code:
HKEY_CLASSES_ROOT     = 80000000h or -80000000h
HKEY_CURRENT_USER     = 80000001h or -80000000h
...    

Even though the incorrect constants work, code using them becomes larger than needed.
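
The size argument can be checked with plain arithmetic. A minimal sketch in Python (not fasm), assuming the include currently defines the keys as plain 80000000h-style values (the current definitions are not quoted here) and assuming the usual x86-64 rule that a 64-bit operand can only take an immediate that is a sign-extended 32-bit value unless a full 64-bit immediate is used. The helper name is hypothetical.

```python
MASK64 = (1 << 64) - 1

def fits_sign_extended_imm32(value):
    """True if the 64-bit pattern of `value` equals a sign-extended 32-bit immediate."""
    value &= MASK64
    low = value & 0xFFFFFFFF
    # Sign-extend the low 32 bits to 64 bits and compare with the full value.
    extended = low | (0xFFFFFFFF00000000 if low & 0x80000000 else 0)
    return extended == value

current = 0x80000000                              # assumed current definition
proposed = (0x80000000 | -0x80000000) & MASK64    # 80000000h or -80000000h

print(hex(proposed))                        # 0xffffffff80000000
print(fits_sign_extended_imm32(current))    # False
print(fits_sign_extended_imm32(proposed))   # True
```

Only the proposed value is representable as a sign-extended 32-bit immediate, which is consistent with the claim that the other form costs extra bytes.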

7. Current code generation mode should not affect the load/store directives:
Code:
org $100000000
use32
dd 'xxxx'
load xxxx dword from $-4 ;fails to compile    

_________________
Faith is a superposition of knowledge and fallacy
Post 03 Jul 2013, 14:24
Tomasz Grysztar
Joined: 16 Jun 2003
Posts: 7756
Location: Kraków, Poland
l_inc wrote:
1. I know that in some cases the semantics of a size modifier is to produce a longer instruction encoding, and in other cases it is to choose the corresponding addressing mode. Here are some examples of that behaviour being seemingly inconsistent:
Code:
use32
add eax, [dword 0] ;choosing addressing mode. OK.
add eax, [word 0]  ;choosing addressing mode. OK.
use64
add rax, [qword 0] ;WTF? This does not exist, but still compiles. Not OK.
lea rax, [qword 0] ;same problem. Not OK.
mov rbx, [qword 0] ;same problem. Not OK.
    

In long mode it is choosing addressing mode, the same. Please compare the results of these two instructions in long mode:
Code:
        lea     rax,[dword -1]          ; RAX := 0x00000000FFFFFFFF
        lea     rax,[qword -1]          ; RAX := 0xFFFFFFFFFFFFFFFF    
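
A rough model of why the two results differ, written in Python rather than fasm. It assumes that `dword` selects a 32-bit address size (the effective address is truncated to 32 bits and then zero-extended into RAX) and that `qword` selects a 64-bit address size with the displacement sign-extended. The function name is made up for illustration.

```python
def effective_address(displacement, address_bits):
    """Wrap a displacement to the given address size; the result is zero-extended."""
    return displacement & ((1 << address_bits) - 1)

print(hex(effective_address(-1, 32)))  # 0xffffffff
print(hex(effective_address(-1, 64)))  # 0xffffffffffffffff
```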

l_inc wrote:
Code:
use32
;Is there a way to generate 0xFFC0 at all in 32 bit mode?
inc dword eax ;not documented. However expected long opcode analogous to e.g. push    
fasm's assembly language focuses primarily on what the instruction does, not how it is encoded. The idea was that the assembly language should be an abstraction from machine code, allowing one to focus on what operations should be performed, while leaving the process of finding a good (possibly the best) encoding up to the assembler.

There is however one relic in fasm's syntax that breaks this rule - when there is a size prefix before the immediate value, it enforces the long encoding of that immediate (which otherwise would get optimized, so possibly shortened). Keeping this feature does not cause many problems, primarily because the size operator, if needed, is usually applied to the main operand, not the immediate, so the case when there is a size prefix before the immediate can be utilized for such a special purpose without causing any harm (as long as you are aware that such a feature exists). However I'm not happy with this feature, as it breaks the general rule of abstraction that I chose to follow strictly in other places. I planned to have a very different way of dealing with this in fasm 2, but that's a different story...

So you can enforce a long immediate encoding by putting a (most probably superfluous) size prefix next to that value, but it applies only to this one specific case. This feature was there to help with some self-modifying code where you'd like to make sure that the immediate value that you want to modify is full-size, to hold any possible value you'd like to put there. It does not apply to instructions like "inc dword eax", as there is no immediate there.
Post 03 Jul 2013, 15:41
Tomasz Grysztar
l_inc wrote:
2. In some cases fasm fails to compile correct instructions because of a failed optimization attempt.

This is most probably related to this discussion: http://board.flatassembler.net/topic.php?t=5313
I haven't yet analyzed the specific cases you listed, though. Perhaps there are some additional bugs hidden there.
Post 03 Jul 2013, 15:45
Tomasz Grysztar
l_inc wrote:
5. I don't know if it's a documentation lack, but the behaviour of dotted numeric constants is unclear. I actually didn't expect the following one to fail:
Code:
display .world
hello:
.world = 'a'    

The documentation states that numeric constants are "very much like labels", but does not really define the borders of this "very much like". For example, "global" numeric constants do not become prefixes for dotted labels or numeric constants. Why then are dotted numeric constants prefixed by "global" labels?
Originally the NASM-like local spaces were affected in the same way by both labels and numeric variables. This was later changed due to users' request, so that the numeric variable would not create a new locals prefix - but all the rest stayed the same - the prefix is applied to ANY symbol that starts with a dot and is used anywhere in the parsed source code. It does not matter in what context the symbol is used, whether it is getting defined or used, or not used at all. If the documentation is lacking in this case (and it may be, because this feature has been changed a bit during the development), I will try to make it more clear.
Post 03 Jul 2013, 15:51
Tomasz Grysztar
l_inc wrote:
3. This one is rather a documentation lack. The order of preprocessing is undefined when symbols with identical names are used for different purposes. The most critical case seems to be this one:
Code:
struc x .
{
        local .,.
                display `.,13,10
}
a x b    

What would one expect to be displayed here? Empirically it seems like the struct name a is overridden by the argument b, which is in turn overridden by the first local symbol . and then by the second local symbol . . Additional identically named local symbols will override the previously defined. But I didn't find any documentation on that.
Details like this are documented in section 2.3.7 of the official manual, and some additional notes may be found in the Understanding fasm article (well, there would be still more if I ever finished it). I will check whether it lacks some of the details you requested here.
Post 03 Jul 2013, 15:55
l_inc
Tomasz Grysztar
Quote:
In long mode it is choosing addressing mode, the same. Please compare the results of these two instructions in long mode

Seems like my bad. Sorry. You are right.

Quote:
fasm's assembly language focuses primarily on what the instruction does, not how it is encoded.

I know your position about controlling the instruction encoding. But you really can't claim, that code pieces of different sizes are functionally equivalent. I always thought, that this was the reason you allowed to control at least the instruction length (like with immediates).
What if an instruction is on a page boundary and the next page would be not executable? What about all those multibyte nop's specifically introduced by Intel for code alignment? Disregarding their sizes those are all functionally equivalent. Would you want to optimize those out? Or why wouldn't you optimize xchg rbx,rbx into a nop?
I mean code size optimization is a good thing, but for an assembler compiler it absolutely has to be controllable.

Quote:
However I'm not happy with this feature, as it breaks the general rule of abstraction that I chose to follow strictly in other places.

This statement of yours reminds me of a discussion I had some time ago. The short summary of that discussion: "%t breaks the SSSO principle violently". I don't mean it's a bug. %t is a good feature. But this observation applies to a discussion we had here.

Quote:
There is however one relic in fasm's syntax that breaks this rule

I just can't disregard this statement. This "relic" is actually of very high importance.

Quote:
This is most probably related to this discussion

I don't think so. Compilation problems described there are related to forward referencing labels. In my examples, fasm can't compile just normal instructions.

Quote:
This was later changed due to users' request, so that the numeric variable would not create a new locals prefix

Fair enough. I fully support this decision. It just doesn't seem to be documented.

Quote:
Details like this are documented in section 2.3.7 of the official manual

Yes. I know. "Like this". But I didn't find "exactly this" one. :)

Post 03 Jul 2013, 16:49
Tomasz Grysztar
l_inc wrote:
What if an instruction is on a page boundary and the next page would be not executable? What about all those multibyte nop's specifically introduced by Intel for code alignment? Disregarding their sizes those are all functionally equivalent. Would you want to optimize those out?
Any such details that concern not what the code does, but how it is laid out in memory, should be appropriately expressed in terms of abstracted language - like with ALIGN (I chose to have it macro-extended, because there are too many variations on how one would like to have his code alignment filled). At this level of abstraction you never should really need to know the exact encoding of instructions - you provide the guidelines for assembler and it tries to find the best solution.

l_inc wrote:
Or why wouldn't you optimize xchg rbx,rbx into a nop?
That's actually an interesting question: where to stop with optimization? This was discussed extensively here: http://board.flatassembler.net/topic.php?t=1238
I myself changed my mind a few times about it.

l_inc wrote:
Compilation problems described there are related to forward referencing labels. In my examples, fasm can't compile just normal instructions.
I meant that it may be related to the fact that fasm cannot guarantee to find the solution in each and every case. But still, there may be something hidden here that can be fixed, I need to investigate.
Post 03 Jul 2013, 17:21
l_inc
Tomasz Grysztar
Quote:
you provide the guidelines for assembler and it tries to find the best solution

Please, don't overabstract. Finding best solutions is for high level language compilers. Assembler's main purpose is to provide the highest level of control over the output, while giving as much as possible of coding simplification features. Control is superior, convenience is important, but inferior. For situations where convenience is of higher importance, there are plenty of high level languages and highly sophisticated optimizers for them.

Post 03 Jul 2013, 18:20
Tomasz Grysztar
l_inc wrote:
Assembler's main purpose is to provide the highest level of control over the output, while giving as much as possible of coding simplification features.
That may be the definition that you chose, but for me the assembly was primarily the abstraction that I mentioned. The code resolving idea (which is based on the programmer defining some abstracted boundaries for the solution that assembler has to generate) was at the core of fasm since the beginning.

If there is something worth having control over, usually there is some abstract reason for it, which can be extracted and formulated in a more general way. It is much better to say "I want to have that instruction aligned to paragraph boundary" than to say "I want this instruction to be five bytes long" - the first one shows the true reason behind the request, while the second one may obscure the actual purpose.
Post 03 Jul 2013, 21:46
l_inc
Tomasz Grysztar
Quote:
That may be the definition that you chose, but for me the assembly was primarily the abstraction that I mentioned.

OK. But would you then clarify your understanding of the difference between an assembler and high level languages? Because abstracting from what some code looks like to what some code does is applicable to any high level language: you don't need to know the actual instructions if you just somehow specify (e.g., draw a picture), what you want those to do.

Quote:
The code resolving idea [...] was at the core of fasm since the beginning.

Yes. This is the convenience part. But it does not and must not prevent from being able to specify an exact (and maybe suboptimal from your point of view) solution in case one needs that exact solution for whatever unexpressible reasons. And this is the control part. Size operators are an example for that.

Quote:
It is much better to say "I want to have that instruction aligned to paragraph boundary" than to say "I want this instruction to be five bytes long" - the first one shows the true reason behind the request

Yes. For that reason people came up with high level languages. And the abstraction process is definitely not at its final point and not even near it. Because it's much better to literally say "I want the computer to understand my language" than to type all those mysterious processor instructions which make the computer understand human speech.

Post 04 Jul 2013, 13:12
randall
Joined: 03 Dec 2011
Posts: 155
Location: Poland
Quote:

OK. But would you then clarify your understanding of the difference between an assembler and high level languages?


The difference is that you use native machine *commands*. Commands, not encodings. I agree with Tomasz, programmer should express the intent using native machine commands (mnemonics) and assembler should choose the most optimal encodings.
Post 04 Jul 2013, 14:31
l_inc
randall
Quote:
The difference is that you use native machine *commands*. Commands, not encodings.

Machines natively work with instruction encodings. What you talk about are mnemonics, which are already a step towards higher level concepts: same mnemonic can have different functionality in different contexts; different mnemonics can have same functionality.
Besides, definition of the word "mnemonic" is flexible. Are push and push dword the same mnemonic?
Quote:
programmer should express the intent with native machine commands (mnemonics) and assembler should choose the most optimal encodings

Would you accept an assembler to compile a nop instead of xchg rbx,rbx? If not, why do you accept push dword 1 to be compiled into a short form push 1 without giving you any possibility to specify the longer form?

Post 04 Jul 2013, 14:49
Tomasz Grysztar
l_inc wrote:
Besides, definition of the word "mnemonic" is flexible. Are push and push dword the same mnemonic?
I don't know any other definition of mnemonic than the one that states that it is that first word of assembly command, namely "push" in both these cases.

l_inc wrote:
Would you accept an assembler to compile a nop instead of xchg rbx,rbx?
In the discussion about optimization of the TEST instruction you can see how the opinions on this topic evolved. From my point of view such optimization is acceptable when the operation performed on registers, memory, and flags is identical - since this is really not much different from the case when assembler has many variants of how to encode, say, MOV instruction. BUT, I decided to generally not implement optimizations that would cause the instruction in disassembly to look completely different from the one in source - because they may cause confusion, and are a bit controversial.

l_inc wrote:
If not, why do you accept push dword 1 to be compiled into a short form push 1 without giving you any possibility to specify the longer form?
The meaning of "dword" operator there is primarily the size of value that gets stored on the stack. This is the most important aspect, because when writing in assembly, the main focus is on what the instruction should do (that is: push 32-bit value and not the 16-bit one), leaving the choice of encoding to assembler.
With that being said, fasm still allows this syntax variant to be used to enforce the long immediate (and that's why I also had to add mnemonics like PUSHD). I added it solely for the purposes of SMC. But I'm not really satisfied with this solution, and for fasm 2 I planned to separate the assembly instruction syntax from the additional hints for the encoder, which would be specified as annotations beside the main instruction. This would make both worlds happy, I hope.
Post 04 Jul 2013, 17:31
l_inc
Tomasz Grysztar
Quote:
I don't know any other definition of mnemonic than the one that states that it is that first word of assembly command

It actually does not matter, if you consider it as a single or multiple tokens. Those become an encoded part of the instruction anyway. E.g. the processor documentation does not specify different mnemonics for (differently behaving) instructions CC and CD 03. Therefore you had to extend the conventional syntax making the '3' a part of the first token. The point is that if you claim to be willing to abstract from the encoding, you actually could do it the opposite way: int3 for CD 03 and int 3 for CC. Therefore it does not make sense to say like "this first token is a mnemonic and everything else is not": you can always make it a part of the mnemonic by either changing the syntax or extending the definition of the word "mnemonic".

Quote:
BUT, I decided to generally not implement optimizations that would cause the instruction in disassembly to look completely different from the one in source

That's a very weak argumentation, because there are different syntaxes and every developer of a yet another disassembler invents something new. Even disregarding AT&T syntax disassembly, you could again consider the int3 example or, if it's not enough "completely different" for you, consider the opposite case: same instruction and different mnemonics. Like if I write xchg eax,eax I will see something completely different in a disassembler, right?

Quote:
I added it solely for the purposes of SMC.

Yes, you said that before. But SMC is one use case, that kinda justifies the size modifiers. You experienced such a need for SMC and you added this. Don't you assume, there could be other use cases that require control for size?
One example would be if I want to recompile some program by disassembling it into fasm syntax, change something and then, without any additional overhead, compile it again with fasm, so that the program works. If you don't allow for size control, then some instructions of the original program could become shorter and the whole program will change, which in turn may make it unusable.

Quote:
I planned to separate the assembly instruction syntax from the additional hints for the encoder, which would be specified as annotations beside the main instruction.

Sounds really nice. I'm a little confused about the word "hint"? Does it mean, that I could provide a hint and the encoder would still ignore it for the sake of optimization?

Post 04 Jul 2013, 20:56
Tomasz Grysztar
l_inc wrote:
E.g. the processor documentation does not specify different mnemonics for (differently behaving) instructions CC and CD 03. Therefore you had to extend the conventional syntax making the '3' a part of the first token. The point is that if you claim to be willing to abstract from the encoding, you actually could do it the opposite way: int3 for CD 03 and int 3 for CC.
I created two different mnemonics, because they are actually two different instructions - they differ a bit in what they do, this is not simply a matter of choice between longer or shorter encoding. Still, the mnemonic is always just a first word (it is reflected in fasm's internal architecture, too), everything that follows is an "operand" of some kind.

l_inc wrote:
Sounds really nice. I'm a little confused about the word "hint"? Does it mean, that I could provide a hint and the encoder would still ignore it for the sake of optimization?
No, I only thought about the case when you specified a hint "encode immediate as 32-bit value", but instruction had no immediate at all - then encoder would ignore the "hint". But I did not have in mind ignoring the annotations for the sake of optimization - the annotations would have the priority.
Post 04 Jul 2013, 21:10
l_inc
Tomasz Grysztar
Quote:
I created two different mnemonics, because they are actually two different instructions - they differ a bit in what they do, this is not simply a matter of choice between longer or shorter encoding.

Well... the exact purpose of the remark "(differently behaving)" in my previous post was to avoid occurrence of this explanation. :)
Quote:
I only thought about the case when you specified a hint "encode immediate as 32-bit value", but instruction had no immediate at all

It seems like you have some cases in mind, where the presence of an immediate is not obvious for the programmer. Otherwise this situation is rather a subject to fail compilation and to report an error.

Post 04 Jul 2013, 21:15
Tomasz Grysztar
l_inc wrote:
It seems like you have some cases in mind, where the presence of an immediate is not obvious for the programmer. Otherwise this situation is rather a subject to fail compilation and to report an error.
The idea was, that the annotations would never create an error by themselves. The parts that would have no meaning with a given instruction would simply be ignored. And then you could also apply the same annotation to a whole block of instructions, and the encoding rules would get applied where possible, and ignored where meaningless.
Post 05 Jul 2013, 08:33
l_inc
Tomasz Grysztar
Some side remarks on the idea:

Maybe it would make sense to introduce strong and weak annotations (maybe weak for block annotations and strong for single instruction annotations). At least I never liked a compiler to silently ignore what I write. Additionally block syntax always has larger coding overhead (when applied to a single instruction/line) than single instruction or single line effect syntax.

Post 05 Jul 2013, 10:29
l_inc
Here's one more bug (2. 4. and 7. seem to be clearly compiler bugs).

8. Stores from a different addressing space into reserved data must not be thrown away from the output:
Code:
rb 1
space::
org 0
store byte 'A' at space:$$    

Post 07 Jul 2013, 22:26
Tomasz Grysztar
l_inc wrote:

Code:
use64
org 0
;Obviously there's an attempt to generate a rip-relative address,
;but it fails. The instruction however still exists. Thus failed
;optimization should not result in failed compilation.
mov rax, [$123456789] ;does not compile without qword. Not OK.    
This one is actually working as intended. It is defined that fasm always generates RIP-relative addressing unless absolute addressing is enforced (with a size operator). This is important, so that when writing regular code you can expect that everything is RIP-relative and thus relocatable.

What could be improved here is the error message - perhaps fasm should say that it cannot generate a RIP-relative address and hint that the instruction may compile if absolute addressing is enforced.
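
A minimal sketch of the range check involved, in Python rather than fasm. It assumes the RIP-relative form of mov rax,[...] is 7 bytes long (REX prefix, opcode, ModRM, 32-bit displacement) and that the displacement is measured from the end of the instruction. The function name is hypothetical and does not correspond to anything in fasm's source.

```python
def rip_relative_disp(target, instr_addr, instr_len):
    """Return the RIP-relative displacement, or None if it does not fit in signed 32 bits."""
    disp = target - (instr_addr + instr_len)
    return disp if -(1 << 31) <= disp < (1 << 31) else None

# org 0 / mov rax,[$123456789]: the target is too far away for a 32-bit displacement.
print(rip_relative_disp(0x123456789, 0, 7))   # None
# A nearby target is reachable.
print(rip_relative_disp(0x1000, 0, 7))        # 4089
```

A `None` result here corresponds to the case where only the absolute form, enforced with a size operator, can express the address.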
Post 09 Jul 2013, 09:38