flat assembler
Message board for the users of flat assembler.
Supporting different targets - HOW TO
shutdownall 04 Feb 2013, 18:39
I think the first step is to understand how the FASM package is organized in its directory structure. I refer to the version for Windows.
The main structure:
Code:
<EXAMPLES>
<INCLUDE>
<SOURCE>
<TOOLS>
In the <EXAMPLES> folder you will find a lot of out-of-the-box examples delivered with FASM. They are quite useful for development with FASM but not for the development of a new version.
The <INCLUDE> folder contains the include files to be used with Windows. This directory is very important when recompiling the flatassembler program. If you ever move the installation to another directory, the path to the include files has to be updated, or (easier) FASMW.INI (if you use FASMW) can simply be deleted; it will be recreated the next time FASMW (or another version) is started.
The <TOOLS> directory contains some additional tools based on the symbolic file format. The symbol file contains very detailed information about the source, collected during compiling (assembling); the listing tool and the symbols tool, for example, are based on it. All tools need the symbol file of a source, which can be created with CTRL-F8 in the IDE or with some command line options of the command line version. I will come back to the listing later.
<SOURCE> is the main directory. It holds the general source valid for all versions, plus target-specific directories for DOS (command line version), IDE for DOS or WIN, LIBC, LINUX and a WIN32 command line version. To recompile any version, just choose the FASM source for the wanted target. In the following examples I will refer to FASMW.ASM in the <IDE/FASMW> directory. If you want to build a new or changed version, the source files in the <SOURCE> directory have to be changed. This affects all the different versions, because they all use the <SOURCE> directory for the main assembling process.
shutdownall 04 Feb 2013, 19:06
Here is the list of the main source files:
Code:
ASSEMBLE.INC
AVX.INC
ERRORS.INC
EXPRCALC.INC
EXPRPARS.INC
FORMATS.INC
MESSAGES.INC
PARSER.INC
PREPROCE.INC
SYMBDUMP.INC
TABLES.INC
VARIABLE.INC
VERSION.INC
X86_64.INC
The assembling process is done in 4 steps:
Code:
* preprocessing <PREPROCE.INC>
* parsing <PARSER.INC>
* assembling <ASSEMBLE.INC>
* formatting <FORMATS.INC>
These steps are called directly one after another, and each one lives in the file named after it. This is the corresponding section in the source file FASM.INC (located in the IDE directory), which simply calls these 4 functions in sequence:
Code:
invoke SendMessage,[hwnd_progress],PBM_SETPOS,0,0
invoke SetThreadPriority,[hthread],[compiler_priority]
invoke GetTickCount
mov [start_time],eax
mov [preprocessing_done],0
call preprocessor
invoke SendMessage,[hwnd_progress],PBM_SETPOS,1,0
or [preprocessing_done],-1
call parser
invoke SendMessage,[hwnd_progress],PBM_SETPOS,2,0
call assembler
invoke SendMessage,[hwnd_progress],PBM_SETPOS,3,0
call formatter
invoke SendMessage,[hwnd_progress],PBM_SETPOS,4,0
call show_display_buffer
invoke GetTickCount
sub eax,[start_time]
mov [total_time],eax
The main files controlling the assembly process in detail (the instruction set) are:
TABLES.INC
X86_64.INC
AVX.INC
TABLES.INC contains all allowed symbols, directives and instructions for the specified target. X86_64.INC contains the functions that output the binary code of the instructions; the entries in the tables refer to these functions. I think AVX.INC is just a newer source file for the AVX instructions which did not fit into X86_64.INC. I will come back to this duo/trio of source files later. They are the most important ones to change when implementing a new instruction set.
The files EXPRCALC.INC and EXPRPARS.INC are core files for parsing and evaluating expressions. Their structures and functions are very helpful when writing your own code-generating functions for your own table.
ERRORS.INC and MESSAGES.INC generate the messages shown while compiling the source, such as error messages pointing to errors in the source. VARIABLE.INC contains the variables most used during compiling and can be extended with your own variables as needed. VERSION.INC contains the version string. SYMBDUMP.INC creates the symbol file, I think (not 100% sure), and is called from FORMATS.INC.
So this is a first overview. In the next section I will concentrate more on TABLES.INC and X86_64.INC. The other files are still helpful for defining functions or for tracing data from one step to the next.
shutdownall 04 Feb 2013, 22:48
So the first action is to add a new instruction.
A useful additional instruction is a trace instruction that puts out some data, either via the output file or via the listing. I will explain later how to add the listing, how to combine the listing with creating the symbol file, and how to open it in the IDE in a new tab.
In TABLES.INC we add the new instruction trace. Instructions are kept in a table structure with separate sections. trace has 5 characters, so it has to be added in the section instructions_5. It is important to keep alphabetical order inside every section, so trace has to be added at the right position. The name is followed by an additional information byte; as we need no additional information, we add 0. Then we add a new instruction handler, which is simply called trace_instruction. This should look like this:
Code:
db 'trace',0h
dw trace_instruction-instruction_handler
So the instruction trace is defined. Next we have to add the function trace_instruction for processing this instruction. This can be done in X86_64.INC.
Code:
trace_instruction:
        push    esi
        mov     esi,tracebuffer
        mov     ecx,1000h
        rep     movsb                   ; copy the whole trace buffer to the output at edi
        pop     esi
        jmp     instruction_assembled
This handler writes an internal buffer with trace data into the output file whenever the trace instruction is entered in the source. At this time only the registers esi and edi are important. When the function is entered, the information byte is already loaded into register al, and esi points to any additional parameters or symbols. If no parameters or symbols are used or allowed, esi can be left untouched. The register edi points to the output buffer (output file), and the coded instruction can be written directly with stos byte [edi]. Here we simply write the complete tracebuffer, with its size of 1000h, into the output file. If trace is the first command in the source file, the output file can simply be viewed with a hex editor for further processing.
For inspecting internal data I added a simple function which writes data into the buffer. It can be called several times, each time with the size of the data. A pointer marks the position inside the buffer for subsequent writes.
Code:
trace_data:
        push    esi edi ecx
        mov     edi,tracebuffer
        mov     ecx,[tracepos]
        add     edi,ecx                 ; edi = next free position in the buffer
        add     ecx,eax
        mov     [tracepos],ecx          ; advance the position by the size in eax
        mov     ecx,eax
        rep     movsb                   ; copy eax bytes from esi (no overflow check)
        pop     ecx edi esi
        ret
Finally, in VARIABLE.INC we add the two variables tracebuffer and tracepos:
Code:
tracebuffer db 1000h dup(0) ; for debug purpose
tracepos dd 0
So that was quite easy, and it's a helpful instruction for inspecting data too. Before calling trace_data, esi should be loaded with the address of the data to inspect and eax with the size of the data to copy into the tracebuffer.
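To make the bookkeeping of tracebuffer and tracepos easier to follow, here is a small Python model of the same idea. It is only a sketch and not part of fasm: the class name TraceBuffer and the overflow check are my own additions (the assembly routine above does not check for a full buffer).

```python
TRACE_SIZE = 0x1000  # same size as tracebuffer in the post


class TraceBuffer:
    """Python model of tracebuffer/tracepos (hypothetical, not fasm code)."""

    def __init__(self, size=TRACE_SIZE):
        self.buffer = bytearray(size)  # models: tracebuffer db 1000h dup(0)
        self.pos = 0                   # models: tracepos dd 0

    def trace_data(self, data: bytes) -> None:
        """Append data at the current position, like trace_data with
        esi = address of data and eax = len(data)."""
        end = self.pos + len(data)
        if end > len(self.buffer):
            # Assumption: the original routine has no such check.
            raise OverflowError("trace buffer full")
        self.buffer[self.pos:end] = data
        self.pos = end

    def dump(self) -> bytes:
        """What the trace instruction would copy into the output file."""
        return bytes(self.buffer)
```

Calling trace_data twice places the second chunk directly behind the first, which is exactly what tracepos is for.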
shutdownall 06 Feb 2013, 20:48
So now I go into more detail on the heart of flatassembler - the TABLES.INC file containing all allowed instructions, directives and symbols.
The instruction table is the main part of it:
Code:
instructions_2:
 db 'bt',4
 dw bt_instruction-instruction_handler
 db 'if',0
 dw if_directive-instruction_handler
 db 'in',0
 dw in_instruction-instruction_handler
 ...
instructions_3:
 ...
instructions_16:
 db 'vaeskeygenassist',0DFh
 dw avx_single_source_128bit_instruction_3a_imm8-instruction_handler
instructions_end:
The instruction table is divided into sections by the length of the instruction name. It starts with 2-character names like bt, if, in and ends with 16-character names. All instructions in a section need to be in alphabetical order. Each entry starts with the instruction name (string), followed by an additional information byte (which could be the hex code of the instruction) and two bytes for the address of the instruction handler function. Similar instructions can share the same handler and are distinguished by the "information byte". For example, there are simple instructions coded with just a 1 or 2 byte sequence: clc (clear carry flag), cld (clear direction flag), cli (clear interrupt flag), hlt (halt), pushf/popf (push/pop flag register) and so on. So the instructions are grouped into types of similar instructions with the same allowed operands / symbols and similar codings, for example instructions with the mod+r/m byte. The functions processing the instructions are placed in X86_64.INC, or the newer ones in AVX.INC.
Code:
instructions:
 dw instructions_2-instructions,(instructions_3-instructions_2)/(2+3)
 dw instructions_3-instructions,(instructions_4-instructions_3)/(3+3)
 dw instructions_4-instructions,(instructions_5-instructions_4)/(4+3)
 dw instructions_5-instructions,(instructions_6-instructions_5)/(5+3)
 dw instructions_6-instructions,(instructions_7-instructions_6)/(6+3)
 dw instructions_7-instructions,(instructions_8-instructions_7)/(7+3)
 dw instructions_8-instructions,(instructions_9-instructions_8)/(8+3)
 dw instructions_9-instructions,(instructions_10-instructions_9)/(9+3)
 dw instructions_10-instructions,(instructions_11-instructions_10)/(10+3)
 dw instructions_11-instructions,(instructions_12-instructions_11)/(11+3)
 dw instructions_12-instructions,(instructions_13-instructions_12)/(12+3)
 dw instructions_13-instructions,(instructions_14-instructions_13)/(13+3)
 dw instructions_14-instructions,(instructions_15-instructions_14)/(14+3)
 dw instructions_15-instructions,(instructions_16-instructions_15)/(15+3)
 dw instructions_16-instructions,(instructions_end-instructions_16)/(16+3)
So this is the main table. For every name length it states where the section starts and how many instructions it contains: the section size is divided by the record size, which is the name length plus 3 bytes (information byte plus two-byte handler offset). If the new target doesn't need such long instruction names, the table can be shortened, for example if names have at most 8 characters. There is only one function in PARSER.INC which looks up the instructions:
Code:
get_instruction:
 push esi
 mov ebp,ecx
 call lower_case
 mov ecx,ebp
 cmp cl,16
 ja no_instruction
 sub cl,2
 jc no_instruction
 movzx ebx,word [z80instructions+ecx*4]
 add ebx,z80instructions
 movzx edx,word [z80instructions+ecx*4+2]
scan_instructions:
 ...
The name z80instructions in this excerpt comes from my Z80 version; it plays the role of the instructions table shown above. Depending on the instruction set, the values 16 (max size) and 2 (min size) can be adapted here. It is also possible to support different targets / instruction sets in one version by simply copying this structure.
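The lookup described above can be modelled in a few lines of Python. This is only a sketch under assumptions. The body of scan_instructions is not shown in the post; the required alphabetical order suggests a binary search, so that is what the model uses. The bt/if/in records mirror the table above, while the clc and cld rows are illustrative (0F8h and 0FCh are their x86 opcodes, but pairing them with simple_instruction here is an assumption).

```python
from bisect import bisect_left

# Hypothetical miniature table: name length -> alphabetically sorted
# (name, information byte, handler name) records.
INSTRUCTIONS = {
    2: [("bt", 4, "bt_instruction"),
        ("if", 0, "if_directive"),
        ("in", 0, "in_instruction")],
    3: [("clc", 0xF8, "simple_instruction"),   # assumed pairing
        ("cld", 0xFC, "simple_instruction")],  # assumed pairing
}
MIN_LEN, MAX_LEN = 2, 16  # the values checked in get_instruction


def get_instruction(name: str):
    """Return (information byte, handler) or None, modelling get_instruction."""
    key = name.lower()                      # models: call lower_case
    if not MIN_LEN <= len(key) <= MAX_LEN:  # models: cmp cl,16 / sub cl,2
        return None
    section = INSTRUCTIONS.get(len(key), [])
    names = [record[0] for record in section]
    i = bisect_left(names, key)             # binary search needs sorted names
    if i < len(names) and names[i] == key:
        _, info, handler = section[i]
        return info, handler
    return None
```

The model shows why a misplaced entry breaks things: a binary search over an unsorted section can miss a name that is actually present.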
There are more tables inside. The next important one is the symbol table, which contains symbols that are not to be treated as instructions: register names, format parameters like elf, coff, pe, size parameters like byte, word, dword and so on:
Code:
symbols_2:
 db 'ah',10h,04h
 db 'al',10h,10h
 db 'ax',10h,20h
 db 'bh',10h,07h
 db 'bl',10h,13h
 db 'bp',10h,25h
 db 'bx',10h,23h
 db 'ch',10h,05h
 db 'cl',10h,11h
 db 'cs',10h,62h
 db 'cx',10h,21h
 db 'dh',10h,06h
 db 'di',10h,27h
 db 'dl',10h,12h
 db 'ds',10h,64h
 db 'dx',10h,22h
 ...
symbols_3:
 ...
symbols_11:
 db 'discardable',19h,25
 db 'interpreter',1Eh,3
 db 'notpageable',19h,27
symbols_end:
Here we have two information bytes for every symbol: one is like a symbol class (e.g. 10h for register) and the other is the individual coding (e.g. the register code). So the functions for instruction processing can concentrate on allowed symbol values rather than checking string contents. This is already done by flatassembler's parser and makes processing easier.
Code:
symbols:
 dw symbols_1-symbols,(symbols_2-symbols_1)/(1+2)
 dw symbols_2-symbols,(symbols_3-symbols_2)/(2+2)
 dw symbols_3-symbols,(symbols_4-symbols_3)/(3+2)
 dw symbols_4-symbols,(symbols_5-symbols_4)/(4+2)
 dw symbols_5-symbols,(symbols_6-symbols_5)/(5+2)
 dw symbols_6-symbols,(symbols_7-symbols_6)/(6+2)
 dw symbols_7-symbols,(symbols_8-symbols_7)/(7+2)
 dw symbols_8-symbols,(symbols_9-symbols_8)/(8+2)
 dw symbols_9-symbols,(symbols_10-symbols_9)/(9+2)
 dw symbols_10-symbols,(symbols_11-symbols_10)/(10+2)
 dw symbols_11-symbols,(symbols_end-symbols_11)/(11+2)
The get_symbol function is similar to get_instruction and scans the table:
Code:
get_symbol:
 push esi
 mov ebp,ecx
 call lower_case
 mov ecx,ebp
 cmp cl,11
 ja no_symbol
 sub cl,1
 jc no_symbol
 movzx ebx,word [symbols+ecx*4]
 add ebx,symbols
 movzx edx,word [symbols+ecx*4+2]
scan_symbols:
The symbol table can easily be extended like the instruction table, or doubled to support different targets in one version.
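The packed record layout behind the (2+2) divisor can be illustrated with a short Python model. It is a sketch only: the four records are copied from the symbols_2 listing above, while the function names and the plain linear scan are my own simplifications (the body of scan_symbols is not shown in the post).

```python
# Four symbols_2 records from the post, packed back to back:
# 2 name bytes, then the class byte, then the value byte.
SYMBOLS_2 = b"ah\x10\x04" b"al\x10\x10" b"ax\x10\x20" b"bh\x10\x07"
NAME_LEN = 2
RECORD_SIZE = NAME_LEN + 2  # matches the (2+2) divisor in the symbols table


def record_count(section: bytes, record_size: int) -> int:
    """Number of records, as computed by (symbols_3-symbols_2)/(2+2)."""
    return len(section) // record_size


def get_symbol(name: str):
    """Return (class, value) for a 2-character symbol, or None."""
    key = name.lower().encode("ascii")
    if len(key) != NAME_LEN:
        return None
    for i in range(record_count(SYMBOLS_2, RECORD_SIZE)):
        record = SYMBOLS_2[i * RECORD_SIZE:(i + 1) * RECORD_SIZE]
        if record[:NAME_LEN] == key:
            return record[NAME_LEN], record[NAME_LEN + 1]
    return None
```

A handler that receives (10h, 20h) knows it has the register ax without ever comparing strings, which is the point made above.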
It is also possible to extend the additional information bytes as needed ... and of course to adapt the corresponding functions. For example, for the Z80 instructions I had to introduce one-byte symbols for register names like A,B,C,D,E,H,L,I,R or flags like P,M,Z.
There are some more tables which can be left untouched, as they are only used internally and do not affect any target. But they can be extended as needed as well. These are:
symbol_characters - main one-char symbols
operators
single_operand_operators - don't know the difference to operators ?
directive_operators
address_sizes
data_directives
preprocessor_directives - processed only by the preprocessor
macro_directives - processed only by the preprocessor
shutdownall 06 Feb 2013, 21:48
Once the instructions are defined, they are processed in the assembling "pass" of flatassembler, and the corresponding function from the table is called to handle the instruction.
So first take a look at the simplest instruction handler, called simple_instruction:
Code:
simple_instruction:
 stos byte [edi]
 jmp instruction_assembled
We have two important registers for handling the instruction: esi, pointing to the parsed source (for example to additional symbols like registers), and edi, which is used to write the binary output of the instruction. Every function called from assemble_line in ASSEMBLE.INC has register al loaded with the information byte, and register esi points to any additional symbols or to the end of the instruction. The end of instruction is marked with 0Fh. So in the example the information byte is written to the output file and the instruction is finished. That's it, and that is why it is called "simple_instruction". The function "instruction_assembled" can be found in ASSEMBLE.INC. It does no more than check whether all parameters / symbols have been processed by the function, and then assembles the next instruction - OR - otherwise stops the assembling process with the error message "extra characters on line". It is recommended to end all instructions simply this way.
Take a look at another function, simple_extended_instruction. It is just as simple, with a one-byte code but an additional preceding byte (a two byte "simple" instruction):
Code:
simple_extended_instruction:
 mov ah,al
 mov al,0Fh
 stos word [edi]
 jmp instruction_assembled
So the cpuid instruction is coded as 0Fh, 0A2h. There are also prefix instructions, which are placed in front of any instruction; they simply put a prefix byte into the output and continue processing instructions:
Code:
prefix_instruction:
 stos byte [edi]
 or [prefixed_instruction],-1
 jmp continue_line
Typical examples are the lock prefix or the repeat prefixes for repeating instructions (rep, repe, ...). So tomorrow I will explain a more complex instruction with parameters / symbols.
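The byte output of these three handlers can be checked with a small Python model. It is only a sketch: the output buffer is a bytearray standing in for the memory at edi, and the information bytes used in the usage note (0A2h for cpuid, 0F8h for clc, 0F0h for lock) are the x86 encodings; that the tables store exactly these values is an assumption.

```python
def simple_instruction(info: int, out: bytearray) -> None:
    """stos byte [edi]: emit the information byte as the whole instruction."""
    out.append(info)


def simple_extended_instruction(info: int, out: bytearray) -> None:
    """mov ah,al / mov al,0Fh / stos word [edi].

    The word is stored little-endian, so al (0Fh) lands in the output
    first and ah (the information byte) second.
    """
    ax = (info << 8) | 0x0F
    out.extend(ax.to_bytes(2, "little"))


def prefix_instruction(info: int, out: bytearray) -> None:
    """Emit the prefix byte; the real handler then continues with the
    rest of the line instead of finishing the instruction."""
    out.append(info)
```

For example, calling prefix_instruction(0xF0, out) followed by simple_extended_instruction(0xA2, out) leaves the bytes F0 0F A2 in out. The sequence is only meant to show the byte order, not a meaningful instruction.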
shutdownall 10 Feb 2013, 22:48
So today is tomorrow.
Let's look at a more complex instruction with parameters. I take the mov instruction as an example. There are the following move instructions (not complete):
Code:
mov eax,ebx (register/register)
mov eax,12345678h (register/immediate)
mov eax,dword [some_data] (register/memory location)
mov dword [some_data],eax (memory location/register)
The first one, with two registers, is coded:
Code:
00h - (information byte, already loaded into al register)
10h,40h - symbol (register eax)
2ch - symbol (',')
10h,43h - symbol (register ebx)
0fh - (end of instruction)
The second one, with register / immediate, is coded:
Code:
00h - (information byte, already loaded into al register)
10h,40h - symbol (register eax)
2ch - symbol (',')
'(',04h,78h,56h,34h,12h,')' - numerical value (12345678h)
0fh - (end of instruction)
The third one, with register / memory location, is coded:
Code:
00h - (information byte, already loaded into al register)
10h,40h - symbol (register eax)
2ch - symbol (',')
11h,04h - size operator (dword=04)
'[',11h,8ah,8bh,8ch,8dh,']' - memory location - (label/dynamic symbol)
0fh - (end of instruction)
The fourth one, with memory location / register, is coded:
Code:
00h - (information byte, already loaded into al register)
11h,04h - size operator (dword=04)
'[',11h,8ah,8bh,8ch,8dh,']' - memory location - (label/dynamic symbol)
2ch - symbol (',')
10h,40h - symbol (register eax)
0fh - (end of instruction)
So the important thing is the end marker 0fh, which marks a complete instruction. When the instruction function is entered, esi points to the first byte after the information byte. When the function has finished, esi has to point to the end marker before leaving with "jmp instruction_assembled" - otherwise the error message "extra characters on line" will appear. These are only basic examples; the data provided by the parser can be much more complex. I will explain more details later.
For the moment it is easier to just take care of the different parameters and codings in general. All instruction data is provided by the parser, which translates all symbols like registers to the values in TABLES.INC, codes all immediates as shown above, and keeps some dynamic symbols like labels, which may be resolved in a later pass. The dynamic symbols (=labels) are added to an internal table, and only a hash of the symbol name (=label name) is written into the parsed source. Such labels are marked with 11h followed by a four byte sequence identifying this symbol or label for future use. New symbols are added, existing symbols are only referenced. This is important to know, because it does not make sense to access this internal table directly.
Now let's see how an instruction function is realized (only partly):
Code:
mov_instruction:
 mov [base_code],88h
 lods byte [esi]
 call get_size_operator
 cmp al,10h
 je mov_reg
 cmp al,'['
 jne invalid_operand
mov_mem:
 call get_address
 push edx ebx ecx
 lods byte [esi]
 cmp al,','
 jne invalid_operand
 lods byte [esi]
 call get_size_operator
 cmp al,'('
 je mov_mem_imm
 cmp al,10h
 jne invalid_operand
mov_mem_reg:
 lods byte [esi]
 cmp al,60h
 jb mov_mem_general_reg
 cmp al,70h
 jb mov_mem_sreg
mov_mem_general_reg:
So the first byte after the mov instruction is read with lods byte [esi]. Then it is checked whether a size operator like byte, word, dword is specified. These are defined in TABLES.INC in this table:
Code:
address_sizes:
 db 4,'byte',1
 db 5,'dword',4
 db 5,'qword',8
 db 4,'word',2
 db 0
A size operator is marked with the starting byte 11h, followed by a byte for the size in bytes (byte=01, word=02, dword=04, qword=08). If a size operator is specified, the called function get_size_operator reads it and advances esi to the next parameter/symbol.
So the function checks whether the first parameter (after an optional size operator) is a register (symbol beginning with 10h) or a memory location (beginning with '[') and jumps to the subfunction for register or for memory location accordingly. The parsing continues in the same way and includes a syntax check (after the register or memory location there must be a ',' symbol) and a parameter check. There are some more useful functions in addition to get_size_operator, which will be explained in more detail in the next lesson.
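To see how such a token sequence is walked, here is a Python sketch of a decoder for the simplified codings listed above. It is a model under assumptions, not the parser's real format. It only understands the token shapes from the four mov examples, and the tuple names ("reg", "imm" and so on) are invented for illustration. Following the correction given in the reply further down, 0Fh is treated as the start of the next line and 0 as the end of source; both simply end the operand list here.

```python
def decode_operands(tokens: bytes):
    """Decode the operand tokens that follow the information byte."""
    ops, i = [], 0
    while i < len(tokens) and tokens[i] not in (0x0F, 0x00):
        t = tokens[i]
        if t == 0x10:                    # register: 10h + register code
            ops.append(("reg", tokens[i + 1]))
            i += 2
        elif t == 0x11:                  # size operator: 11h + size in bytes
            ops.append(("size", tokens[i + 1]))
            i += 2
        elif t == ord(","):
            ops.append(("comma",))
            i += 1
        elif t == ord("("):              # '(' size, value bytes, ')'
            size = tokens[i + 1]
            value = int.from_bytes(tokens[i + 2:i + 2 + size], "little")
            if tokens[i + 2 + size] != ord(")"):
                raise ValueError("missing ')'")
            ops.append(("imm", value))
            i += size + 3
        elif t == ord("["):              # '[' 11h, 4-byte identifier, ']'
            if tokens[i + 1] != 0x11 or tokens[i + 6] != ord("]"):
                raise ValueError("unsupported address form")
            ops.append(("mem", tokens[i + 2:i + 6].hex()))
            i += 7
        else:
            raise ValueError(f"unknown token {t:#x}")
    return ops
```

Feeding it the bytes of the second example (10h,40h, ',', '(',04h,78h,56h,34h,12h,')', 0Fh) gives a register, a comma and the immediate 12345678h. A real handler should of course use get_size_operator, get_address and the value functions instead of decoding by hand.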
Tomasz Grysztar 11 Feb 2013, 07:27
The 0Fh symbol is not an "end of line" marker, it is a token that starts a new line (and it contains a pointer into the preprocessed text). If the line is the last one in the source, it will not be followed by 0Fh, but only by a 0 token (which is the "end of source" marker). Please be careful with that.
Also, what you call an "information byte" is a part of the instruction token, while the following symbols are just separate tokens. I once posted a concise list of all the symbols that fasm was using, and this thread is now listed in the Important/interesting threads topic.
shutdownall 11 Feb 2013, 14:06
Thank you for the information.
This would have been very helpful one year ago, but it's okay. As I posted in the first statement here, the information from me may be useful but may also be incorrect in some details. Maybe someone feels invited to support a different target for flatassembler - that was the goal. Yes, I noticed that the last line is ended with 00h rather than 0fh. Anyway, your comments are appreciated, as are those from other developers, and please correct me any time I have misunderstood something. There may be more mistakes, but I think this tutorial could be more helpful than no tutorial at all. Thanks for all your support so far.
shutdownall 11 Feb 2013, 14:22
I will post a list of more documents related to customizing flatassembler - as many as I can find.
The information is quite scattered here on the board.
shutdownall 16 Feb 2013, 22:43
So one type of instruction works with immediates. If you plan your own version of flatassembler, you will surely have to process immediates, depending on the instructions. As shown above, immediates are coded internally in a simple way:
Example of an immediate:
Code:
00h - (information byte, already loaded into al register)
10h,40h - symbol (register eax)
2ch - symbol (',')
'(',04h,78h,56h,34h,12h,')' - numerical value (12345678h)
0fh - (end of instruction)
Here the immediate is a simple numeric value 12345678h. The pitfall you can run into is decoding this numerical value yourself, by checking the size (first byte after '(') and just loading the following value. Calculations can be very complex, and it is more comfortable and useful to use the internal calculation functions. Here are examples of complex calculations handled easily by flatassembler: Quote:
The corresponding functions are available in EXPRCALC.INC. The most useful function for processing immediates is get_value. It reads the first byte itself with lods byte [esi], so on entry esi has to point to the start of the immediate, which is indicated by '(' (or to a size operator in front of it). Typically get_value is called right after finding the ',' separator. The function checks whether a size operator exists; if not, it processes the immediate, which can be of any complexity, and returns the calculated value in edx:eax. Here is a small piece of the code of get_value. The heart is the powerful calculate_expression function.
Code:
get_value:
 mov [operand_size],0
 lods byte [esi]
 call get_size_operator
 cmp al,'('
 jne invalid_value
 mov al,[operand_size]
 cmp al,1
 je value_byte
 cmp al,2
 je value_word
 cmp al,4
 je value_dword
 cmp al,6
 je value_pword
 cmp al,8
 je value_qword
 or al,al
 jnz invalid_value
 mov [value_size],al
 call calculate_value
 mov eax,[edi]
 mov edx,[edi+4]
 ret
calculate_value:
 call calculate_expression
 cmp word [edi+8],0
 jne invalid_value
 mov eax,[edi+16]
 mov [symbol_identifier],eax
 mov al,[edi+13]
 mov [value_sign],al
 mov al,[edi+12]
 mov [value_type],al
 ret
The register edi serves as scratch space for the calculation. The already generated code, which was written with stos byte [edi] (or word or dword), is not touched; the calculation uses the not yet used part of the buffer at edi, which is simply overwritten by the following instructions later. If you know which size of data is allowed in the context with registers, you can simply use the subfunctions, which check the size automatically and throw an error message if the immediate is bigger than allowed.
Code:
get_byte_value -> returned in al
get_word_value -> returned in ax
get_dword_value -> returned in eax
get_pword_value -> returned in edx:eax
get_qword_value -> returned in edx:eax
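The register convention of these functions can be made concrete with a short Python sketch. The first function shows how a 64-bit result is split into edx:eax (high and low 32 bits), which is simply what that register pair means. The second one is a hypothetical range check: that a value fitting either as a signed or as an unsigned number of the given size is accepted is my assumption, not something taken from EXPRCALC.INC.

```python
def split_edx_eax(value: int):
    """Return (edx, eax) for a value of up to 64 bits."""
    value &= 0xFFFFFFFFFFFFFFFF
    return value >> 32, value & 0xFFFFFFFF


def check_size(value: int, size: int) -> int:
    """Hypothetical model of a get_<size>_value range check.

    size is in bytes (1, 2, 4, 6 or 8, the sizes tested in get_value).
    Assumption: the value may fit as signed or as unsigned.
    """
    bits = size * 8
    if not -(1 << (bits - 1)) <= value < (1 << bits):
        raise ValueError("value out of range")
    return value & ((1 << bits) - 1)  # two's complement bit pattern
```

With this model, check_size(-1, 1) yields 0FFh, while check_size(256, 1) raises an error in the same spirit as the assembler refusing an immediate that is bigger than allowed.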
cwpjr 20 Feb 2013, 01:20
Your service to the community is welcome!
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.