flat assembler
Message board for the users of flat assembler.

some questions about fas file format

buzzkill



Joined: 15 Mar 2009
Posts: 111
Location: the nether lands
buzzkill 20 Mar 2009, 12:54
After reading the docs on the fas file format, I decided to assemble the simplest program possible and have a look at the resulting .fas file in a hex editor :)

hello2.asm
Code:
format ELF
section '.text' executable
public _start
_start:
      mov   eax, 1
      mov   ebx, 0
      int   80h    


If you assemble this (with the -s option, of course) and load the resulting .fas file in your hex editor, you can follow along :)

Questions about this .fas file:
- The field "Length of section names table" (last field in the header) has value 8. I think that should be 4 right? (filesize 0x0230 - offset of section names table 0x022D + 1 = 4)
- In the preprocessed source area, it looks like an 8th line has been processed (from offset 0x0154 to 0x0164), however there are only 7 lines in the src file. If this is an 8th line, where does it come from, and if it's something else, what is it?
- In the assembly dump, in the contents of line 1 (offsets 0x0165 thru 0x0180), at offset 0x017E (Type of code) is the value 0x10, which would indicate 16-bit code. Since (on a 32-bit distro) there is only 32-bit code, I wonder if this value is correct?

General questions:
- WRT "the tokenized contents of line" (in table 3 of the docs), is there a maximum length for this structure? Since the number of tokens per line is limited, I would expect the size of the resulting structure to have a maximum as well?
- Could someone point me to some docs about calculating/creating the "extended SIB" fields? In this .fas file, they're all zero, so my hex editor isn't much help, and I found some docs online, but they seem to use a different SIB format than fasm. Maybe the difference is in the word "extended"?

BTW Tomasz, in the docs of "Table 2 Symbol structure" there are two tables 3.1 and 3.2 mentioned, but there is only one table 3.
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8356
Location: Kraków, Poland
Tomasz Grysztar 20 Mar 2009, 15:02
buzzkill wrote:
- The field "Length of section names table" (last field in the header) has value 8. I think that should be 4 right? (filesize 0x0230 - offset of section names table 0x022D + 1 = 4)

Seems like a bug, I'll look into it later, when I'm back home.
buzzkill wrote:
- In the preprocessed source area, it looks like an 8th line has been processed (from offset 0x0154 to 0x0164), however there are only 7 lines in the src file. If this is an 8th line, where does it come from, and if it's something else, what is it?

Possibly you have an LF character at the end of the 7th line, which makes an 8th, empty line.
buzzkill wrote:
- In the assembly dump, in the contents of line 1 (offsets 0x0165 thru 0x0180), at offset 0x017E (Type of code) is the value 0x10, which would indicate 16-bit code. Since (on a 32-bit distro) there is only 32-bit code, I wonder if this value is correct?

The default code setting at the beginning of assembly is 16-bit; the "format ELF" directive switches to 32-bit (or 64-bit for ELF64). So the line containing the "format" directive itself is executed in 16-bit context.

buzzkill wrote:
- WRT "the tokenized contents of line" (in table 3 of the docs), is there a maximum length for this structure? Since the number of tokens per line is limited, I would expect the size of the resulting structure to have a maximum as well?

No, the number of tokens per line is not limited, and neither is the length of the line structure. Try this test:
Code:
; test.asm
db "dd 1"
repeat 700000
 db "+1"
end repeat
db 10    

Assemble "fasm test.asm test2.asm", and then "fasm test2.asm".
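For anyone following along without fasm at hand, the bytes that test.asm writes into test2.asm can be reproduced with a few lines of Python. This is only a sketch of the same idea; the function name build_long_line is made up for illustration and is not part of any fasm tool.

```python
def build_long_line(repeats: int = 700000) -> bytes:
    """Return 'dd 1' followed by `repeats` copies of '+1' and one LF,
    i.e. the same bytes that the test.asm snippet above emits."""
    return b"dd 1" + b"+1" * repeats + b"\n"


line = build_long_line()
# 4 bytes for 'dd 1', 2 bytes per '+1', 1 byte for the LF
print(len(line))  # 1400005
```

A single source line of well over a megabyte is the point of the test: a reader of the tokenized line structure cannot rely on a fixed-size buffer.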

buzzkill wrote:
BTW Tomasz, in the docs of "Table 2 Symbol structure" there are two tables 3.1 and 3.2 mentioned, but there is only one table 3.

Oh, right, these two tables have been merged into one. The reference in table 2 needs to be corrected.

buzzkill wrote:
- Could someone point me to some docs about calculating/creating the "extended SIB" fields? In this .fas file, they're all zero, so my hex editor isn't much help, and I found some docs online, but they seem to use a different SIB format than fasm. Maybe the difference is in the word "extended"?

Table 2 already explains this:
Quote:
the first two bytes are register codes and the second two bytes are corresponding scales

The register codes are listed in table 2.3.
So, for example, the sequence of four bytes 43h, 40h, 1, 8 means EBX+EAX*8.
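The byte layout quoted above (two register codes, then the two corresponding scales) can be sketched as a small decoder. The register table below holds only the two codes confirmed by the example in this post; the rest would have to come from table 2.3 of the fas documentation. Treating code 0 as "no register in this slot" is an assumption, based on the all-zero fields seen in hello2.fas, and the function name is hypothetical.

```python
# Register codes confirmed by the example above; table 2.3 of the fas
# documentation lists the rest. Unknown codes are printed as raw hex.
KNOWN_REGISTERS = {0x40: "EAX", 0x43: "EBX"}


def decode_extended_sib(raw: bytes) -> str:
    """Decode a 4-byte extended SIB: two register codes, then two scales."""
    if len(raw) != 4:
        raise ValueError("an extended SIB is exactly four bytes")
    reg1, reg2, scale1, scale2 = raw
    terms = []
    for reg, scale in ((reg1, scale1), (reg2, scale2)):
        if reg == 0:  # assumption: code 0 means the slot is unused
            continue
        name = KNOWN_REGISTERS.get(reg, f"reg_{reg:02X}h")
        terms.append(name if scale == 1 else f"{name}*{scale}")
    return "+".join(terms) if terms else "(no registers)"


print(decode_extended_sib(bytes([0x43, 0x40, 1, 8])))  # EBX+EAX*8
```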



buzzkill 20 Mar 2009, 15:48
Tomasz Grysztar wrote:
Possibly you have an LF character at the end of the 7th line, which makes an 8th, empty line.

Yes, there is an LF at the end of the 7th line (as there should be). I take it fasm assumes this to mean that an 8th line will follow. Normally, with text processing programs, if the input file has 7 lines, there will be 7 lines processed and 7 lines output (assuming you generate 1 line of output for every 1 line of input). No big deal though, there will just be an extra "line" then.

Tomasz Grysztar wrote:
The default code setting at the beginning of assembly is 16-bit; the "format ELF" directive switches to 32-bit (or 64-bit for ELF64). So the line containing the "format" directive itself is executed in 16-bit context.

OK, I thought it would be something like that, and now I'm sure :)

Tomasz Grysztar wrote:
No, the number of tokens per line is not limited, and neither is the length of the line structure. Try this test:
Code:
; test.asm
db "dd 1"
repeat 700000
 db "+1"
end repeat
db 10    

Assemble "fasm test.asm test2.asm", and then "fasm test2.asm".

Ah, of course, I had only thought of instruction lines, not data definition lines. How silly of me :oops: (I was thinking about how to process this token structure in a program, and was wondering how much memory to malloc for it. So unfortunately no easy answer there...)

Tomasz Grysztar wrote:
Table 2 already explains this:
Quote:
the first two bytes are register codes and the second two bytes are corresponding scales

The register codes are listed in table 2.3.
So, for example, the sequence of four bytes 43h, 40h, 1, 8 means EBX+EAX*8.

I had seen table 2.3, but didn't really put it together I guess :oops: Your example makes it clear though (maybe an idea to put an example in the docs?) So byte 3 is the scale for the register in byte 1, and byte 4 is the scale for the register in byte 2. I guess these SIBs only come into play for the second operand of an instruction then? (something like "lea edi,[esp+ecx*4+8]" would generate a SIB for the second operand that's 0x44 0x41 0x04 0x08, and no SIB for the first operand).



BTW, one more thing: I noticed that empty lines or lines that contain only a comment don't get stripped/ignored, ie they wind up in the "preprocessed source" section. Why is this? There can't be tokens to be found in these lines, so why not just ignore them? (I thought it was common for preprocessors to strip such lines).



Tomasz Grysztar 20 Mar 2009, 16:04
buzzkill wrote:
I had seen table 2.3, but didn't really put it together I guess :oops: Your example makes it clear though (maybe an idea to put an example in the docs?) So byte 3 is the scale for the register in byte 1, and byte 4 is the scale for the register in byte 2. I guess these SIBs only come into play for the second operand of an instruction then? (something like "lea edi,[esp+ecx*4+8]" would generate a SIB for the second operand that's 0x44 0x41 0x04 0x08, and no SIB for the first operand).

First, what do you mean by "second operand"? You can also have an instruction like "mov [esp+ecx*4+8],edi". And second, those extended SIBs are used for address values, since every instruction is assembled at some assumed address (which you can change with the ORG directive). For example:
Code:
virtual at esp+ecx*8
  nop ; 1 byte
  lea eax,[$]
end virtual    

Here the LEA instruction will get assembled at the assumed address ESP+ECX*8+1. You will rarely see extended SIBs in such a context, because most code is assembled with simple numerical addresses (and that should be obvious). The second place where you can find extended SIBs in a .fas file is the symbols table, where any symbol can have the value of an address, possibly with some registers added. For example:
Code:
label alpha dword at esi+edi    

is going to generate the "alpha" symbol with the ESI+EDI extended SIB.

buzzkill wrote:
BTW, one more thing: I noticed that empty lines or lines that contain only a comment don't get stripped/ignored, ie they wind up in the "preprocessed source" section. Why is this? There can't be tokens to be found in these lines, so why not just ignore them? (I thought it was common for preprocessors to strip such lines).

Yes, those lines are ignored - but at the next stage, by the parser. The preprocessor keeps all the lines, just for completeness (that is actually useful when reading the preprocessed source dump later). Lines that contain only preprocessor directives or macro invocations are also ignored on entry to the next stage.



Tomasz Grysztar 20 Mar 2009, 16:20
The bug with the section list size is fixed with the new release (uploading right now).

buzzkill wrote:
Normally, with text processing programs, if the input file has 7 lines, there will be 7 lines processed and 7 lines output (assuming you generate 1 line of output for every 1 line of input). No big deal though, there will just be an extra "line" then.

But when you have an LF after each of those 7 lines, you in fact have 8 lines of text - the last one is simply empty. But you can move your caret into this last line in text editors (at least the one I use).



buzzkill 20 Mar 2009, 16:36
Tomasz Grysztar wrote:
The bug with the section list size is fixed with the new release (uploading right now).

Thanks, I'll be downloading it shortly :)

Tomasz Grysztar wrote:
But you can move your caret into this last line in text editors.

Not in The One True Editor (vim), you can't :) But I also checked it in nano, and there you're allowed to go to line 8 of a 7 line text file. But let's not start an editor war :) (although maybe it's an idea for a poll, what editor do asm coders use, and do they have useful scripts/macros that we can copy :) )


And I think I finally get the SIBs: they are related to addresses, and they only come into play when an instruction or label is assembled at a "special" address (instead of just "the next" address). (I can't explain it as well as you, but still I think I got it :) )



Tomasz Grysztar 20 Mar 2009, 16:45
buzzkill wrote:
Not in The One True Editor (vim), you can't :) But I also checked it in nano, and there you're allowed to go to line 8 of a 7 line text file. But let's not start an editor war :) (although maybe it's an idea for a poll, what editor do asm coders use, and do they have useful scripts/macros that we can copy :) )

Well... The One True Editor for fasm is asmedit (see the SOURCE\IDE subdirectory in the DOS and Windows distributions). ;) Unfortunately, there are only DOS and Windows ports of it (fasmd and fasmw) available right now. I'm planning to make a fasmx (X Windows port), too, but I cannot tell you for sure when I'm going to start this project.

Anyway - that doesn't really need to be related to editors, I just gave them as an example. For me each LF (or CR-LF, or CR, depending on OS) starts a new line. So this:
Code:
db 31h,0Ah,32h,0Ah,33h,0Ah,34h,0Ah,35h,0Ah,36h,0Ah,37h    
defines the 7-line text, while this:
Code:
db 31h,0Ah,32h,0Ah,33h,0Ah,34h,0Ah,35h,0Ah,36h,0Ah,37h,0Ah    
defines the 8-line one.
All my tools are using this interpretation, so please be prepared for it. ;)
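The counting rule described here can be written down in one line. The following is a minimal sketch, assuming LF or CR-LF line endings (a bare CR convention would need its own handling); count_lines is a hypothetical helper name, not something from fasm.

```python
def count_lines(data: bytes) -> int:
    """Count lines under the rule that every LF starts a new line,
    so the result is the number of LF bytes plus one."""
    return data.count(b"\n") + 1


seven = b"1\n2\n3\n4\n5\n6\n7"   # same bytes as the first db line above
eight = seven + b"\n"            # trailing LF, as in the second db line

print(count_lines(seven))  # 7
print(count_lines(eight))  # 8
```

Under this rule an empty file still counts as one (empty) line, which matches the extra record seen at the end of the preprocessed source of hello2.asm.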



buzzkill 20 Mar 2009, 23:42
Two small questions about the strings table (since I'm trying to write a program to read this):
- The input filename and output filename are always the first two entries in the string table, and after those come the sections and external symbols, right?
- Is there any way to tell when the "sections" part of the string table ends and the "external symbols" part begins? (Maybe everything that starts with a '.' is a section, the rest is an external symbol?)

Edit: never mind, I can just use the "Offset/Length of section names table" fields. I should learn to read...



buzzkill 22 Mar 2009, 02:00
Hmm, it seems my above comment was valid after all... It turns out that in the strings table, sections and symbols are intermixed, ie. suppose you have an external symbol 'printf' defined in your '.text' section, then the strings table looks like this:

input filename
output filename
.text
printf
.data

and not like I thought, first the sections and then the symbols. So my original question stands: how can you tell sections and symbols apart inside the strings table? Maybe still "if it starts with a dot, it's a section"?
BTW, the ELF specs allow you to define sections that don't start with a dot...

Edit: once again I think I'm on the wrong path here, the symbol table should contain all offsets of symbols in the strings table I think. I really should stop coding/reading specs before 4am...



buzzkill 23 Mar 2009, 15:39
In table 2.1 of the fas docs, what exactly is "the prediction" (as in e.g. "The prediction was needed when checking whether the symbol was used.")? Also, what exactly does "The optimization adjustment is applied to the value of this symbol." mean?



Tomasz Grysztar 23 Mar 2009, 16:10
Those flags are related to some internal workings of fasm, they are not really that important from the external point of view. I can give a detailed explanation later, if you really wish to know this. :)



buzzkill 23 Mar 2009, 17:45
No, that's OK, I was just wondering what those things meant :) I'll just limit myself to bits 0-3 for the moment :)



buzzkill 23 Mar 2009, 20:50
Yet another question: in table 2, the "Value of symbol.", how is this calculated? Because I have eg a src line that says "number dd 3", and in the symbol table, the symbol 'number' gets value 0x3B, so it's obviously not just the assigned numerical value, ie 3 in this case.
Also, when defining strings, like eg "string db 'my very very long string',10,0", you couldn't fit all that into a qword, so obviously there must be some way of reducing the string to a value that fits into a qword.

There's nothing about this in the docs, but I'm sure you can enlighten me, Tomasz :)
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 23 Mar 2009, 21:15
Tomasz probably will give the full answer later but in the meantime: perhaps in the case of "number dd 3" the "value" is the symbol address and when you do "number = 3" then its "value" will effectively be 3?



buzzkill 23 Mar 2009, 21:37
I think that's it, LocoDelAssembly, thanks! I just counted, and all the offsets of my variables in the .data section match up with the "symbol value". I guess I was thrown off by the name; if it had been called "symbol offset" or "symbol address" or something, I think I would have thought of this myself. Anyway, thanks for the help.



buzzkill 25 Mar 2009, 15:30
Question about the "Symbol structure" (table 2): what is an anonymous symbol? I haven't been able to find one in my own (simple) programs' fas files unfortunately.
Also, about the second-last field of the Symbol structure, when would the offset be into the strings table (ie high bit set)? In my own programs' fas files, all symbol names are offsets into the preprocessed source (ie high bit cleared).



Tomasz Grysztar 25 Mar 2009, 16:26
buzzkill wrote:
Question about the "Symbol structure" (table 2): what is an anonymous symbol? I haven't been able to find one in my own (simple) programs' fas files unfortunately.

It's @@ label. See the last paragraph of section 1.2.3 in manual.

buzzkill wrote:
Also, about the second-last field of the Symbol structure, when would the offset be into the strings table (ie high bit set)? In my own programs' fas files, all symbol names are offsets into the preprocessed source (ie high bit cleared).

This happens when the label has a constructed name, which doesn't occur directly in the source. NASM-style locals are an example:
Code:
some:
 .a: ; this defines label with name "some.a"    
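For a program that reads the symbols table, the two cases discussed above could be handled roughly as follows. This is a sketch under stated assumptions, not the documented layout: it assumes the name field is a 32-bit value, that the two byte strings passed in are the preprocessed source area and the strings table already sliced out of the file (so offsets are relative to their starts), and that names in the strings table are zero-terminated. The function name is hypothetical.

```python
def read_symbol_name(field: int, preprocessed: bytes, strings: bytes) -> str:
    """Resolve a symbol-name field to the name it points at."""
    if field & 0x80000000:
        # High bit set: the remaining 31 bits are an offset into the
        # strings table (used for constructed names such as "some.a").
        start = field & 0x7FFFFFFF
        end = strings.index(b"\x00", start)  # assumption: zero-terminated
        return strings[start:end].decode("latin-1")
    # High bit clear: offset into the preprocessed source, where the name
    # is a pascal-style string (one length byte, then the characters).
    length = preprocessed[field]
    return preprocessed[field + 1 : field + 1 + length].decode("latin-1")


# Tiny hand-made buffers, not taken from a real .fas file.
preprocessed = b"\x00\x05alpha"
strings = b"in.asm\x00some.a\x00"
print(read_symbol_name(1, preprocessed, strings))               # alpha
print(read_symbol_name(0x80000000 | 7, preprocessed, strings))  # some.a
```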



buzzkill 25 Mar 2009, 16:53
Thanks for the prompt reply, Tomasz, that clears it up for me. BTW, since the name of a symbol is stored in a pascal-style string in the preprocessed source, I assume that means the maximum identifier (symbol) length is 255? I don't think that's mentioned in the manual anywhere.
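The 255 limit follows directly from the one-byte length prefix, as this small sketch shows. The encoder below is illustrative only (pascal_string is a made-up name); it is not how fasm itself builds the preprocessed source.

```python
def pascal_string(name: bytes) -> bytes:
    """Encode `name` as one length byte followed by the characters."""
    if len(name) > 255:
        raise ValueError("a one-byte length cannot describe more than 255 bytes")
    return bytes([len(name)]) + name


print(pascal_string(b"_start"))  # b'\x06_start'
```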



Tomasz Grysztar 25 Mar 2009, 17:34
You're right, the manual doesn't mention it. I treat it more like a limitation of the implementation, not the language. ;) Though with .fas this limitation became fortified.

BTW, you might also want to take a look at the never-finished guide to fasm's internals.



buzzkill 25 Mar 2009, 18:04
Thanks for that link Tomasz, I'm definitely going to read that guide as well.

OT: I'm having an annoying problem with the forum: Often I don't get an email notification about a new post, like eg your last one here (but I did receive an email for your previous one before that). When I check the thread, I have to click "watch this topic for replies" again, even though I have "always receive mail notification" on in my profile, and I never click on "stop watching thread". Am I doing something wrong or...?


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.