[fasmg.x86] the long road ahead

Mike Gonta

Joined: 26 Dec 2010
Posts: 246

Mike Gonta 29 Apr 2017, 13:29

As a proof of concept, I've added the (relatively easy) x86 nop instruction to the fasmg core so that it assembles
that instruction's opcode (0x90).
I added this line to the symbols in tables.inc:

Code:

  db 3,'nop',SYMCLASS_INSTRUCTION,VALTYPE_NATIVE_COMMAND,VAL_INTERNAL,1
  dd x86_nop

which is the same as the 'db' (define_data) entry, but with an x86_nop handler:

Code:

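x86_nop:
  pusha                       ; save all general-purpose registers
  mov ecx, 1                  ; one byte of output
  call initialize_output
  mov byte [edi], 0x90        ; store the nop opcode
  popa                        ; restore the saved registers
  jmp instruction_assembled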

_________________
Mike Gonta
look and see - many look but few see

https://mikegonta.com

Last edited by Mike Gonta on 29 Apr 2017, 22:10; edited 2 times in total

revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20904
Location: In your JS exploiting you and your system

revolution 29 Apr 2017, 13:32

But x86 nop is more than just "db 0x90", it can also have arguments to give longer no-ops. Plus it is processor specific. Other CPUs can also use the nop opcode but have entirely different binary outputs.

Mike Gonta 29 Apr 2017, 14:11

revolution wrote:

But x86 nop is more than just "db 0x90", it can also have arguments to give longer no-ops. Plus it is processor
specific. Other CPUs can also use the nop opcode but have entirely different binary outputs.

Of course, but this would be for an x86 specific fasmg version which would use an api that Tomasz hinted at here.

Tomasz Grysztar

Joined: 16 Jun 2003
Posts: 8513
Location: Kraków, Poland

Tomasz Grysztar 29 Apr 2017, 14:20

You do not need to preserve all the registers, in fact the only register that matters when jumping to "instruction_assembled" is ESI.
You may find some info about the instruction handler interface in DIRECTIVES.INC around line 290.
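
Following that advice, the proof-of-concept handler can be reduced to something like the minimal sketch below. It assumes, as the original handler does, that "initialize_output" leaves EDI pointing at the reserved output space, that ECX gives the number of bytes to reserve, and that ESI is left untouched by the call; no other register is saved.

Code:

x86_nop:
        mov     ecx,1                   ; assumed: number of output bytes to reserve
        call    initialize_output
        mov     byte [edi],90h          ; store the NOP opcode
        jmp     instruction_assembled   ; ESI still points into the preprocessed line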

Mike Gonta 29 Apr 2017, 15:01

Tomasz Grysztar wrote:

You do not need to preserve all the registers, in fact the only register that matters when jumping to "instruction_assembled" is ESI.
You may find some info about the instruction handler interface in DIRECTIVES.INC around line 290.

Thank you.
I'll start digging for the rest of the api.

http://mikegonta.com/fasmg.x86

Tomasz Grysztar 02 May 2018, 09:34

This is probably the best place to write down some notes that could become a guide on how to create an assembler on top of the fasmg engine. So let me expand on the example you made.

In the instruction tables used by fasm 1 there was a place for an additional parameter passed to the handler, so one common handler could be used for multiple instructions that shared the same rules for operands and encoding. In the case of fasmg we also have a field that could be used for this purpose:

Code:

  db 3,'nop',SYMCLASS_INSTRUCTION,VALTYPE_NATIVE_COMMAND,VAL_INTERNAL,90h
  dd simple_instruction

  db 4,'int3',SYMCLASS_INSTRUCTION,VALTYPE_NATIVE_COMMAND,VAL_INTERNAL,0CCh
  dd simple_instruction

In fasm 1 the parameter was passed in AL; in the case of fasmg we need to dig it out ourselves:

Code:

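simple_instruction:
        mov     al,[edx+ValueDefinition.attribute]      ; opcode byte taken from the table entry
        push    eax                                     ; initialize_output does not preserve EAX
        mov     ecx,1
        call    initialize_output
        pop     eax
        stosb                                           ; emit the opcode
        jmp     instruction_assembled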

I labeled the handler with the same name that was used in fasm 1 for this purpose; there are many analogies between them.
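
With a shared handler like this, supporting further one-byte instructions without operands should only take more table entries. The lines below are an illustrative sketch and are not taken from fasmg: the opcode values are the standard x86 encodings, and the entries are assumed to follow exactly the same layout as the two shown above.

Code:

  ; illustrative entries, assumed to use the same layout as 'nop' and 'int3'
  db 3,'hlt',SYMCLASS_INSTRUCTION,VALTYPE_NATIVE_COMMAND,VAL_INTERNAL,0F4h
  dd simple_instruction

  db 3,'cli',SYMCLASS_INSTRUCTION,VALTYPE_NATIVE_COMMAND,VAL_INTERNAL,0FAh
  dd simple_instruction

  db 3,'sti',SYMCLASS_INSTRUCTION,VALTYPE_NATIVE_COMMAND,VAL_INTERNAL,0FBh
  dd simple_instruction

  db 3,'cld',SYMCLASS_INSTRUCTION,VALTYPE_NATIVE_COMMAND,VAL_INTERNAL,0FCh
  dd simple_instruction

Because the handler reads only a single byte of the attribute into AL, this scheme covers only single-byte opcodes; anything longer would need a different handler.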

You should pay attention to the definitions of all the interfaces. The fact that EDX holds a pointer to ValueDefinition is stated in the definition of the "instruction handler" interface (I already mentioned it in a post above):

Code:

; instruction handler
; in:
;  esi = pointer into preprocessed line
;  ecx = number of whitespace tokens between previous symbol and current position
;  edx - ValueDefinition of instruction
;  ebx - SymbolTree_Leaf of instruction
;  edi - SymbolTree_Root of instruction
;  when [SymbolTree_Leaf.class] = SYMCLASS_STRUCTURE:
;   [label_leaf] - SymbolTree_Leaf of structure label
;   [label_branch] - SymbolTree_Foliage of structure label
; out:
;  when done, handler should jump to instruction_assembled with esi containing a pointer moved past the processed part of line,
;  or jump directly to assembly_line when the rest of line should be ignored
; note:
;  when esi is equal to [line_end], pointer is at the end of line and there is no data available at this address

And when you look at the definition of the "initialize_output" function in OUTPUT.INC you should notice that it does not preserve any registers other than ESI. This is why the handler needs to take care of the opcode value held in AL and preserve it on its own.

We may proceed to a bit more interesting example, again analogous to one of the classic instruction handlers of fasm 1:

Code:

int_instruction:
        call    get_constant_value
        test    al,al
        jz      missing_argument
        cmp     al,2Eh
        je      invalid_argument        ; error on floating point argument
        call    keep_value
        mov     ecx,2
        call    initialize_output
        mov     al,0CDh
        stosb
        call    get_kept_value
        mov     ecx,1
        call    fit_value
        jc      value_out_of_range      ; error when number does not fit in a byte
        jmp     instruction_assembled

The errors are handled by jumping to some pre-defined routines of fasmg:

Code:

invalid_argument:
        mov     edx,_invalid_argument
        call    register_error
        jmp     assembly_line

But there is also the option of calling "register_error" from within the handler, which can then continue processing the instruction, perhaps to find out if there are more errors, etc.

Technically it is not necessary to call "keep_value" here, because "initialize_output" is not among the parsing or expression evaluating functions, so simply preserving the EDX pointer would be enough in this case. But I believe it makes the example clearer and it might be a good habit to use this function anyway.
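
For comparison, here is a sketch of the same handler without "keep_value", preserving EDX on the stack across "initialize_output" instead. It rests on an assumption suggested by the remark above: that "get_kept_value" would merely have brought back into EDX the pointer returned by "get_constant_value".

Code:

int_instruction:
        call    get_constant_value
        test    al,al
        jz      missing_argument
        cmp     al,2Eh
        je      invalid_argument        ; error on floating point argument
        push    edx                     ; initialize_output preserves only ESI
        mov     ecx,2
        call    initialize_output
        mov     al,0CDh
        stosb
        pop     edx                     ; assumed to be all that fit_value needs besides ECX and EDI
        mov     ecx,1
        call    fit_value
        jc      value_out_of_range      ; error when number does not fit in a byte
        jmp     instruction_assembled

This saves two calls, but it stops being safe as soon as any parsing or expression evaluating function is called between obtaining the value and using it.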

And another similar handler, this time for the ENTER instruction, to show how to process comma-separated arguments:

Code:

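enter_instruction:
        mov     ecx,1+2+1               ; opcode + 16-bit operand + 8-bit operand
        call    initialize_output
        mov     al,0C8h
        stosb
        mov     [output_pointer],edi    ; remember the position right after the opcode
        call    get_constant_value      ; first operand
        test    al,al
        jz      missing_argument
        cmp     al,2Eh
        je      invalid_argument        ; error on floating point argument
        mov     edi,[output_pointer]
        mov     ecx,2
        call    fit_value               ; store the first operand as two bytes
        jc      value_out_of_range
        call    get_constituent_value   ; expect the separator
        jc      missing_argument
        cmp     al,','
        jne     invalid_argument
        call    get_constant_value      ; second operand
        test    al,al
        jz      missing_argument
        cmp     al,2Eh
        je      invalid_argument
        mov     edi,[output_pointer]
        add     edi,2                   ; skip the two bytes of the first operand
        mov     ecx,1
        call    fit_value               ; store the second operand as one byte
        jc      value_out_of_range
        jmp     instruction_assembled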

Handling of special symbols like registers (or expressions containing registers) requires a bit more preparation. Basically we have to define some built-in ELEMENT-type symbols. I'm going to write more on this on another occasion.

Tomasz Grysztar 03 May 2018, 17:37

In TABLES.INC we can define not only instructions, but any other types of symbols that we need to have built-in. Let's add some entries that define a couple of registers as ELEMENT-type symbols:

Code:

  db 2,'ax',SYMCLASS_EXPRESSION,VALTYPE_ELEMENT,VAL_INTERNAL,'R'
  dd 0200h

  db 2,'cx',SYMCLASS_EXPRESSION,VALTYPE_ELEMENT,VAL_INTERNAL,'R'
  dd 0201h

  ; ...

  db 3,'eax',SYMCLASS_EXPRESSION,VALTYPE_ELEMENT,VAL_INTERNAL,'R'
  dd 0400h

  db 3,'ecx',SYMCLASS_EXPRESSION,VALTYPE_ELEMENT,VAL_INTERNAL,'R'
  dd 0401h

For the purpose of this demonstration we put an "R" code into the "attribute" field as a way of distinguishing a register symbol, with the register size and number packed into the "value" field.
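
Following the same convention (size in bytes in the second byte of "value", the usual x86 register number in the lowest byte), a few more entries might look like the sketch below. These lines are illustrative only, and using 01h as the size code of the 8-bit registers is an assumption that simply extends the pattern above.

Code:

  ; illustrative entries: value = (size in bytes) shl 8 + register number
  db 2,'al',SYMCLASS_EXPRESSION,VALTYPE_ELEMENT,VAL_INTERNAL,'R'
  dd 0100h

  db 2,'dx',SYMCLASS_EXPRESSION,VALTYPE_ELEMENT,VAL_INTERNAL,'R'
  dd 0202h

  db 2,'bx',SYMCLASS_EXPRESSION,VALTYPE_ELEMENT,VAL_INTERNAL,'R'
  dd 0203h

  db 3,'edx',SYMCLASS_EXPRESSION,VALTYPE_ELEMENT,VAL_INTERNAL,'R'
  dd 0402h

  db 3,'ebx',SYMCLASS_EXPRESSION,VALTYPE_ELEMENT,VAL_INTERNAL,'R'
  dd 0403h

With this packing a handler can load "value" into EAX and find the size in AH and the register number in AL.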

A simple way to handle a register defined this way as an operand to an instruction is to use "get_constituent_value" and then do multiple checks to find out what kind of symbol was found:

Code:

bswap_instruction:
        call    get_constituent_value
        jc      missing_argument 
        cmp     al,1Ah
        jne     invalid_argument        ; no symbol identifier
        test    edx,edx
        jz      invalid_argument        ; undefined symbol
        cmp     [edx+ValueDefinition.type],VALTYPE_ELEMENT
        jne     invalid_argument
        cmp     [edx+ValueDefinition.attribute],'R'
        jne     invalid_argument
        mov     eax,[edx+ValueDefinition.value]
        cmp     ah,4
        jne     invalid_argument
        push    eax
        mov     ecx,2
        call    initialize_output 
        mov     al,0Fh
        stosb
        pop     eax
        add     al,0C8h
        stosb
        jmp     instruction_assembled

For simplicity this handler assumes "use32" context and thus it allows only registers with size "4".

The above would work exactly the same if we used some other symbol type instead of VALTYPE_ELEMENT. In fact we could define a custom type just for registers, with a constant like VALTYPE_REGISTER to identify it. However, the ELEMENT type allows a register defined this way to be used in arithmetic expressions, which is crucial for things like x86 addressing.

We may modify the example BSWAP handler so that instead of looking for a plain symbol it processes a complete expression and checks if the result is a simple register. This approach is going to allow syntax like:

Code:

        bswap   (eax)
        bswap   ecx+0

While some may see this as an undesirable side-effect, this approach allows for some interesting uses where the register operand is actually the result of a computation on labels. But one can still use the previous approach to handle the "pure" register operands and use the expression method only for addresses within the square brackets - this is how fasm 1 does it.

Now the handler that looks for the register in expression result looks like this:

Code:

bswap_instruction:
        call    get_expression_value
        jc      missing_argument
        cmp     byte [edi+ExpressionTerm.attributes],EXPR_NUMBER
        jne     invalid_argument        ; linear polynomials always have all terms marked as EXPR_NUMBER
        call    get_term_value
        cmp     dword [edx],0
        jne     invalid_argument        ; constant term has to be zero
        add     edi,sizeof.ExpressionTerm
        cmp     [edi+ExpressionTerm.attributes],0
        je      invalid_argument        ; variable term needs to be present
        call    get_term_value
        mov     ecx,1
        call    fit_value               ; this overwrites first byte of ExpressionTerm.attributes, but we do not need it anymore
        jc      invalid_argument
        cmp     byte [edi],1            ; the variable term must not be multiplied by any value other than 1
        jne     invalid_argument
        mov     ebx,[edi+ExpressionTerm.metadata]       ; this points to SymbolTree_Leaf of symbol used as a variable
        add     edi,sizeof.ExpressionTerm
        cmp     [edi+ExpressionTerm.attributes],0
        jne     invalid_argument        ; expect no more variable terms
        mov     edx,[ebx+SymbolTree_Leaf.definition]
        test    edx,edx
        jz      invalid_argument
        cmp     [edx+ValueDefinition.attribute],'R'
        jne     invalid_argument
        mov     eax,[edx+ValueDefinition.value]
        cmp     ah,4
        jne     invalid_argument
        push    eax
        mov     ecx,2
        call    initialize_output 
        mov     al,0Fh
        stosb
        pop     eax
        add     al,0C8h
        stosb
        jmp     instruction_assembled

This is definitely more complex, but the majority of this code could be moved to a routine like "take_register" to be used by all the instruction handlers.
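
A rough sketch of such a refactoring is shown below. The name "take_register" comes from the remark above; everything else about the routine is an assumption made for illustration: the label "take_register_failed", the convention of returning the "value" field in EAX with CF clear on success, and CF set on any failure. The checks themselves are copied from the handler above.

Code:

; hypothetical helper, not part of fasmg
; out: cf clear and eax = [ValueDefinition.value] when the operand is a plain register
;      cf set otherwise
take_register:
        call    get_expression_value
        jc      take_register_failed
        cmp     byte [edi+ExpressionTerm.attributes],EXPR_NUMBER
        jne     take_register_failed
        call    get_term_value
        cmp     dword [edx],0
        jne     take_register_failed    ; constant term has to be zero
        add     edi,sizeof.ExpressionTerm
        cmp     [edi+ExpressionTerm.attributes],0
        je      take_register_failed    ; variable term needs to be present
        call    get_term_value
        mov     ecx,1
        call    fit_value
        jc      take_register_failed
        cmp     byte [edi],1
        jne     take_register_failed    ; multiplier must be exactly 1
        mov     ebx,[edi+ExpressionTerm.metadata]
        add     edi,sizeof.ExpressionTerm
        cmp     [edi+ExpressionTerm.attributes],0
        jne     take_register_failed    ; expect no more variable terms
        mov     edx,[ebx+SymbolTree_Leaf.definition]
        test    edx,edx
        jz      take_register_failed
        cmp     [edx+ValueDefinition.attribute],'R'
        jne     take_register_failed
        mov     eax,[edx+ValueDefinition.value]
        clc
        retn
    take_register_failed:
        stc
        retn

bswap_instruction:
        call    take_register
        jc      invalid_argument
        cmp     ah,4
        jne     invalid_argument        ; only 32-bit registers, as before
        push    eax
        mov     ecx,2
        call    initialize_output
        mov     al,0Fh
        stosb
        pop     eax
        add     al,0C8h
        stosb
        jmp     instruction_assembled

One trade-off of this particular convention is that a missing operand is no longer told apart from an invalid one; a fuller version could return an error code or pointer instead of just the carry flag.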

When handling addresses in square brackets, the expression would need to be processed in a similar way, iterating through the terms and checking if they correspond to valid addressing registers. In the case of addresses an important aspect is that they may contain some non-register variable terms, like a base symbol for a relocatable section.

(One of my ideas for an x86 encoder was that if, after removing the processed register terms, the expression result still contained some variable ones, it would put this remaining polynomial into a temporary variable and schedule a call to a macro like DWORD to process it. This would then allow relocatable formats to be implemented with macros, actually the same macros that already exist for fasmg.)

As for variants like a segment prefix inside square brackets, again the approach may vary. If we decided to allow a segment register to be specified through an expression, the implementation would need two "get_expression_value" calls with "get_constituent_value" in between to handle the ":" separator. Otherwise "peek_at_constituent_value" could be used to find out if the first symbol of an expression is a segment register, proceeding straight to "get_expression_value" if it is not.

PS. One more thing: the built-in ELEMENT symbols defined like this have no visible metadata that could be extracted with the METADATAOF operator. This is because when extracting metadata, fasmg looks at the "value_length" field, which is zero for all the symbols generated from TABLES.INC. This is what allowed us to put customized data into the "value" field; otherwise it would need to be a valid pointer to a value of said length and in the same format as values of VALTYPE_NUMERIC symbols.

If we wanted to make these symbols expose some metadata, we would need to either alter how the tables are processed in the "assembly_init" routine, or - perhaps better - add another routine to create internal symbols immediately after the call to "assembly_init" and before the actual assembly. If you'd like I can prepare some examples of how to add any fully customized symbol into the symbol tree during the initialization.
