[SOLVED] State machine defining macros?

Index > Macroinstructions > [SOLVED] State machine defining macros?

Author

Thread

rshadr

Joined: 02 Mar 2025
Posts: 4
Location: EU West

rshadr 02 Mar 2025, 21:43

Hello, I've recently gotten into FASM because NASM wasn't able to do what I need. Specifically, I'm trying to implement the HTML Tokenizer in assembly using macros. The gist of it is that it is table-driven and you define it using a special syntax, e.g. (real-world example):

Code:

define_state data,DATA_STATE_INDEX
;; R12: Parser *parser (transient argument)
;; RDI: uint32_t codepoint
  switch
    '&' => {
      mov qword [r12 + Parser.tokenizer_return_state], DATA_STATE_INDEX
      mov qword [r12 + Parser.tokenizer_state], CHAR_REF_STATE_INDEX
      xor eax,eax
      ret
    }

   '<' => {
     mov qword [r12 + Parser.tokenizer_state], TAG_OPEN_STATE_INDEX
     xor eax,eax
     ret
   }

  0x00 => {
    push rdi
    call Parser_error
    pop rdi
    call Parser_emitCharacter
    xor eax,eax
    ret
  }

  EOF => {
    jmp Parser_emitEof
  }

  anything_else => {
    call  Parser_emitCharacter
    xor eax,eax
    ret
  }
  end switch
end define_state

define_state needs to output the following:

A bunch of labels pointing to the assembly fragments contained by each switch arm
A qword (pointer) array of length 128 (ASCII range) that will hold labels to handlers for the respective ASCII character. This array will be placed in a big array containing all the ASCII
handler arrays for all states at correct index.
A qword (pointer) holding the label for the case when the character == (uint32_t)(~0x00). This one will be put in a global table of EOF handlers.
A qword (pointer) holding the catch-all clause (mandatory). Also in a global table.

The approach I'm following doesn't assemble a "handler function" per se, it is rather, as you might have guessed, the "caller" that selects the right piece of code. This means I can't just use macros that output an assembly fragment representing the "handler function"

This switch statement would analyze each arm separately in the following manner:
1. If case value was already seen, error
2. If case value is a character in the range [0x00, 0x7F]:
2.1. Emit fragment with label. Update the ASCII table entry for character.
3. If case value is the symbol "EOF":
3.1. Emit fragment with label. Update the EOF handler.
4. If case value is the symbol "anything_else":
4.1 Emit fragment with label. Update anything_else handler.

If no anything_else case was given, error.
If no EOF case was given, set EOF handler to anything_else's handler
For each ASCII character left unspecified, set its entry to anything_else's

I hope this question is not too random or out of place, but I would really appreciate some tips on where to get started. After digging into the fasmg (using fasm2 for assembling) manuals, I was wondering whether this even possible to do, so please let me know if I'm wasting my time.

Last edited by rshadr on 16 Mar 2025, 09:39; edited 1 time in total

02 Mar 2025, 21:43

revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20700
Location: In your JS exploiting you and your system

revolution 02 Mar 2025, 22:22

It isn't clear to me what your question is.

What do you need help with?

02 Mar 2025, 22:22

bitRAKE

Joined: 21 Jul 2003
Posts: 4260
Location: vpcmpistri

bitRAKE 03 Mar 2025, 01:27

Let's get crazy for a minute and see how far fasm2 can go - it'll boggle your mind:

Being lazy crazy - we don't want to edit the reference document beyond just adding our assembly code. This will make debugging and migration easier as the spec changes:

Code:

state   13.2.5.1 Data state
        ; Consume the next input character:

        U+0026 AMPERSAND (&)
        ; Set the return state to the data state. Switch to the character reference state.
        mov [r12 + Parser.tokenizer_return_state], DATA_STATE_INDEX
        mov [r12 + Parser.tokenizer_state], CHAR_REF_STATE_INDEX
        xor eax,eax
        ret

        U+003C LESS-THAN SIGN (<)
        ; Switch to the tag open state.
        mov [r12 + Parser.tokenizer_state], TAG_OPEN_STATE_INDEX
        xor eax,eax
        ret

        U+0000 NULL
        ; This is an unexpected-null-character parse error. Emit the current input character as a character token.
        push rdi
        call Parser_error
        pop rdi
        call Parser_emitCharacter
        xor eax,eax
        ret

        EOF
        ; Emit an end-of-file token.
        jmp Parser_emitEof

        Anything else
        ; Emit the current input character as a character token.
        call  Parser_emitCharacter
        xor eax,eax
        ret
end state

... there is no way that would assemble - right?

Wrong.

Code:

macro state? anchor*&
        ; create a unique name - to prefix cases - based on anchor value
        local prefix
        repeat 1,N:`anchor
                display `anchor,10
                prefix equ ?N
        end repeat

        calminstruction ? line&
                local bitmask,CODE,NAME,var,val
                initsym bitmask, 0 ; none defined

                match =end? =state?,line
                jyes done
                match =EOF,line
                jyes eof
                match =Anything =else,line
                jyes any
                match =U=+CODE NAME,line
                jyes new

                ; pass unknowns to assembly process
                assemble line
                exit

        noae:   err 'state: missing ''Anything else'''
                exit

        dupe:   err 'state: duplicate state'
                exit

        new:    stringify NAME
                display 9
                display NAME
                display 10
                arrange var,0x#CODE ; process CODE as hexadecimal
                compute val,var

                check 1 and (bitmask shr val)
                jyes dupe
                compute bitmask,bitmask or (1 shl val)

                arrange var,prefix.val:
                assemble var
                exit

        eof:    arrange var,prefix.=EOF:
                assemble var
                exit

        any:    arrange var,prefix.=ANY:
                assemble var
                exit

        done:   arrange var,prefix.=ANY
                check defined var
                jno noae

                arrange var,prefix.=EOF
                check defined var
                jyes have_eof
                ; without EOF case, routed to ANY case
                arrange var,=define prefix.=EOF prefix.=ANY
                assemble var
        have_eof:
                ; store the bitmask for use externally
                arrange var,prefix.=BITMAP
                publish var:,bitmask ; constant

                arrange line, =purge ?
                assemble line
        end calminstruction
end macro


; mock-up for assembly:
use64
DATA_STATE_INDEX        := 1
CHAR_REF_STATE_INDEX    := 2
TAG_OPEN_STATE_INDEX    := 3

include 'macro/struct.inc'
struct Parser
        tokenizer_return_state  dq ?
        tokenizer_state         dq ?
ends

Parser_emitCharacter:
Parser_emitEof:
Parser_error:

state   13.2.5.1 Data state
        ; Consume the next input character:

        U+0026 AMPERSAND (&)
        ; Set the return state to the data state. Switch to the character reference state.
        mov [r12 + Parser.tokenizer_return_state], DATA_STATE_INDEX
        mov [r12 + Parser.tokenizer_state], CHAR_REF_STATE_INDEX
        xor eax,eax
        ret

        U+003C LESS-THAN SIGN (<)
        ; Switch to the tag open state.
        mov [r12 + Parser.tokenizer_state], TAG_OPEN_STATE_INDEX
        xor eax,eax
        ret

        U+0000 NULL
        ; This is an unexpected-null-character parse error. Emit the current input character as a character token.
        push rdi
        call Parser_error
        pop rdi
        call Parser_emitCharacter
        xor eax,eax
        ret

        EOF
        ; Emit an end-of-file token.
        jmp Parser_emitEof

        Anything else
        ; Emit the current input character as a character token.
        call  Parser_emitCharacter
        xor eax,eax
        ret
end state

So, yeah - totally possible.

... and any kind of table can be generated outside the state definition (in a number of ways):

Code:

; check which states are used, brute force method:
macro check_defined anchor*&
        local prefix
        repeat 1,N:`anchor
                display 10,`anchor,': '
                repeat 128, i:0
                        if defined ?N.i
                                display `i,' '
                        end if
                end repeat
                display 10
        end repeat
end macro

check_defined   13.2.5.1 Data state

Edit: fixed a number of errors - there is probably more - only can find them in the doing.

03 Mar 2025, 01:27

bitRAKE

Joined: 21 Jul 2003
Posts: 4260
Location: vpcmpistri

bitRAKE 03 Mar 2025, 19:21

I didn't really like the heavy-handed implementation above, and using state namespaces solves so many future problems, imho. This is the complete HTML state space with all the character classes. It relies on the symbol resolution within fasmg to determine name collisions instead of manually doing it. Additionally, no state constraints are implemented - instead they should be implemented at table generation time, imho.

Description:		Download
Filename:	html_test.zip
Filesize:	8.94 KB
Downloaded:	174 Time(s)

_________________
¯\(°_o)/¯ AI may [not] have aided with the above reply.

Last edited by bitRAKE on 03 Mar 2025, 20:58; edited 1 time in total

03 Mar 2025, 19:21

rshadr

Joined: 02 Mar 2025
Posts: 4
Location: EU West

rshadr 03 Mar 2025, 19:44

Hello bitRAKE, thank you so much for the help! I was actually implementing your first solution into my project when I just saw you added new code a few minutes ago. I greatly appreciate that you took the time for this!

03 Mar 2025, 19:44

bitRAKE

Joined: 21 Jul 2003
Posts: 4260
Location: vpcmpistri

bitRAKE 03 Mar 2025, 20:25

You're welcome.

The present implementation makes the code very intuitive, imho:

Code:

U+0009 CHARACTER TABULATION (tab)
U+000A LINE FEED (LF)
U+000C FORM FEED (FF)
U+0020 SPACE
; If the current end tag token is an appropriate end tag token, then switch to the before attribute name state. Otherwise, treat it as per the "anything else" entry below.
        mov rax, [END_TAG_TOKEN]
        test [rax + Tag_Token.Flags], TT_APPROPRIATE
        jz ANY ; state local "anything else"
        .state GET_STATE 13.2.5.32 Before attribute name state
        mov [r12 + Parser.tokenizer_state], .state
        xor eax,eax
        retn

Local (dot) names are supported for the present case (arm); and other virtual (non-character) cases within the state can be easily jumped to. Very complex code could be isolated within a macro (if needed), but most the operations seem to be small fragments.

03 Mar 2025, 20:25

bitRAKE

Joined: 21 Jul 2003
Posts: 4260
Location: vpcmpistri

bitRAKE 03 Mar 2025, 22:38

If a name collision does happen you'll get an error message like:

Code:

Processed: ?88:
Error: definition of '769893432772280972386694579221112722235131985007023920842530320626198235977383555889.88' in conflict with already defined symbol.

Understandably, you're probably like: "wtf am I suppose to do with that?"

Just build with "-v 2" option and fasm2 will say exactly where the error is:

Code:

html_test.asm [21]:
        include 'html_states.inc'
html_states.inc [247]:
        U+0058 LATIN CAPITAL LETTER X
? [70] (CALM)
assemble :var [1]:
        ?88:
Processed: ?88:
Error: definition of '769893432772280972386694579221112722235131985007023920842530320626198235977383555889.88' in conflict with already defined symbol.

(88 is 'X' and the big integer is the state name.)

I can't anticipate other questions a "beginner" might have.

Edit: Or feed the error into scr.g:

Code:

; state_case_reverse
match state=.case,input
        display \
        10,'state:',9,  string state,\
        10,'case:',9,   case
else
        err 'unexpected usage.',10,'try:',9,'fasmg scr.g -I "define input 2645608968345021733469237830984.88"',10
end match

lol

03 Mar 2025, 22:38

rshadr

Joined: 02 Mar 2025
Posts: 4
Location: EU West

rshadr 09 Mar 2025, 21:20

Hello again, I've got some renewed questions.

Concering table generation, here's what I'm trying to do:

Code:

;; tokenizer_states.asm

;; [...] as defined in previous posts. Slightly updated naming

section '.rodata'

_Tokenizer_ascii_matrix:
;; Data state (DATA_STATE)
dq _Tokenizer_handle.data.0 ; 0x00
dq _Tokenizer_handle.data.any ; 0x01
; [...]
dq _Tokenizer_hande.data.60 ; 0x3C
; [...] fill ASCII range

;; RCDATA state (RCDATA_STATE)
; [...] idem

; [...] same for other states

_Tokenizer_eof_table:
dq _Tokenizer_handle.data.eof ; DATA_STATE
dq _Tokenizer_handle.rcdata.eof ;RCDATA_STATE
; [...] idem for remaining states

_Tokenizer_any_table:
dq _Tokenizer_handle.data.any ; DATA_STATE
dq _Tokenizer_handle.rcdata.any ; RCDATA_STATE
; [...] etc.

Of course, I don't want to be writing this by hand, otherwise I would not be using fasm Razz

. Out of the approaches I've considered for this, these are the problems I currently face:

How to assemble a (relocated) qword pointer at a specific address in the ELF output?
How to check if a label is defined from within a CALM instruction? (I know this works in macros, but they probably won't be much use here)

For (1):
Imagine the syntax for the table generator at file end was something like:

Code:

gen_tables
  DATA_STATE : data
  RCDATA_STATE : rcdata
 ; [...]
end gen_tables

Walking the lines, it would have force the pointer assembly for the constructed label name (e.g. "_Tokenizer_handle.data.0", "_Tokenizer_handle.markupDeclarationOpen.any") at a given address. For instance, the ASCII matrix looks like this:

times NUM_STATES (times 128 QWORD PTR)

...meaning that the "location problem" is nested twice - once for the state itself and for ASCII entries! The entire point of this associative syntax is to allow unordered declarations, storing the pointers based on the *_STATE value.

09 Mar 2025, 21:20

bitRAKE

Joined: 21 Jul 2003
Posts: 4260
Location: vpcmpistri

bitRAKE 10 Mar 2025, 00:21

I think about if differently - first I write the dispatch code and then I adapt the tables to the code. The complete tokenizer is probably under 64K - meaning only WORD offsets are needed for the tables. Perhaps we think about eliminating the tables later - there is very little branching at each state.

Why is an "ANY" table needed? Instead, store the "ANY" function where function is not defined. (Also, there is a case for which state.case == ANY for all states.)

In CALM it is very similar, perhaps something like:

Code:

        arrange labl, state=.=?val
        check defined labl
        jyes have
        arrange labl,state=.=ANY
have:   arrange labl,=db labl
        assemble labl

... macro is fine, too. See the test macro.

I imagine a table of state pointers (to case table), and then each state has a table (of cases). Is this the nesting you imply?

Code:

StateTable:
        iterate STATE, ...
                repeat 1,N:`STATE
                        dq Case_Table.N
                end repeat
        end iterate

        iterate STATE, ...
                repeat 1,N:`STATE
                        if defined N.EOF
                                dq N.EOF
                        else
                                dq N.ANY
                        end if
                Case_Table.N:
                        ; generate state table ...
                        repeat 128,i:0
                                if defined N.i
                                        dq N.i
                                else
                                        dq N.ANY
                                end if
                        end repeat
                end repeat
        end iterate

Which leads to the following dispatcher:

Code:

        mov rax, [<current_state>]
        lea rcx, [StateTable]
        ; we've mapped the state RAX to integers here
        mov rcx, [rcx + rax*8] ; this is the case table
        mov rax, [<current_character>] ; -1 for EOF
        call [rcx + rax*8]

... focusing on the dispatcher makes it easier to know what data we need. If you need a different dispatcher/data then post the dispatcher and we can work on how to create the data for it.

_________________
¯\(°_o)/¯ AI may [not] have aided with the above reply.

10 Mar 2025, 00:21

bitRAKE

Joined: 21 Jul 2003
Posts: 4260
Location: vpcmpistri

bitRAKE 13 Mar 2025, 20:23

Thinking towards the extreme - what is the most challenging dispatch I can imagine?

Let's explain what is going on here and why: The over-arching goal is to create a tiny table with cache-awareness, because dynamic dispatch can't be predicted by the processor. So, we cram every state into 128 bytes and eight target pointers. Because each state table is constant (in size) the code starts with multiplying the state to get the address offset to the correct table of cases.

Since the character dispatch range only covers [0,122]; we can clamp codepoints above to 123 and that becomes our ANY case. The reduced character usage range leaves four dispatch slots for other uses - we only have EOF currently. The EOF signal could come from anywhere, but I've integrated it into the current character. The clamping code is setup to preserve that signal, but it forces the tables to be offset for the negative value.

Code:

        ; need character slots for ANY and EOF
        assert STATES.maximum < 126

        ; cacheline aware case tables
        assert sizeof StateTable and 3Fh = 0

        .offset = 1 ; offset for EOF case

        imul eax, dword [<current state>], sizeof StateTable
        lea rcx, [StateTable]
        lea rcx, [rcx + rax + .offset]

        ; clamp character to [0,123], state.123 = ANY for all
        mov edx, 1 + STATES.maximum
        movsx rax, dword [<current character>] ; UTF-32 | EOF(-1)
        cmp rax, rdx
        cmovg rax, rdx ; signed so that EOF preserved

        ; [EOF][00..z][ANY]...
        movzx eax, byte [rcx + rax]

        ; state branches are after byte indices
        call [rcx + (128 - .offset) + rax*8]

... how to create the data tables for this complex dispatcher?

_________________
¯\(°_o)/¯ AI may [not] have aided with the above reply.

13 Mar 2025, 20:23

rshadr

Joined: 02 Mar 2025
Posts: 4
Location: EU West

rshadr 13 Mar 2025, 21:53

Here's the state of things right now (GitHub for reference):

Code:

section '.rodata'
_k_h5a_Tokenizer_common_handler_table:
  ;; Heterogenous table
  dq _k_h5a_Tokenizer_ascii_matrix ;2d
  dq _k_h5a_Tokenizer_unicode_table ;1d
  dq _k_h5a_Tokenizer_eof_table ;1d

; [...]

section '.text' executable

_h5a_Tokenizer_main:
  ;; R12 (s/lost): H5aParser *
  ;; -> EAX: status
  push rbx
  push r10

  .charLoop:
    lea   rbx, [_k_h5a_Tokenizer_state_flags] ;byte array
    mov   rax, qword [r12 + H5aParser.input_stream.user_data]
    xlatb

    test  al, STATE_BIT_SPC_ACTION
    LIKELY jz .charLoop.postSpcAction

    .charLoop.spcAction:
    ;; XXX: do spcAction

    .charLoop.postSpcAction:
      test  al, STATE_BIT_NO_GETCHAR
      LIKELY   jz .charLoop.readChar
      nop
      UNLIKELY jmp .charLoop

    .charLoop.readChar:
      mov   rdi, qword [r12 + H5aParser.input_stream.user_data]
      call  near qword [r12 + H5aParser.input_stream.get_char_cb]
      mov   r10, rax ;keep for later
    
    .charLoop.hashChar:
      ;; hash result (AL):
      ;;  0x00 : ASCII codepoint   (< 0x007F)
      ;;  0x01 : Unicode codepoint (> 0x007F and != ~0x00)
      ;;  0x02 : EOF               (== ~0x00)
      xor    rax,rax
      xor    rdi,rdi

      test   r10d, (not 0x7F)
      setnz  al

      mov    ecx, r10d
      not    ecx
      test   ecx,ecx
      setz   dil

      add    al, dil

    .charLoop.dispatchCommon:
      lea   rbx, [_k_h5a_Tokenizer_common_handler_table]
      mov   rbx, qword [rbx + rax * 8]

      test  rax,rax ;unicode/EOF?
      UNLIKELY jnz .charLoop.unicodeOrEofLoop

    .charLoop.asciiLoop:
      mov  rax, qword [r12 + H5aParser.tokenizer.state]
      shl  rax, (bsr 128 * 8)
      lea  rax, [rbx + rax] ;load state's LUT
      lea  rax, [rax + r10 * 8] ;load handler
      mov  rdi, r10
      call near [rax]

      test  eax, RESULT_BIT_AGAIN
      jnz   .exit
      test  eax, RESULT_BIT_LEAVE
      jnz   .charLoop.asciiLoop

      jmp .charLoop

    .charLoop.unicodeOrEofLoop:
      mov  rax, qword [r12 + H5aParser.tokenizer.state]
      shl  rax, (bsr 2 * 8)
      lea  rax, [rbx + r10 * 8] ;load handler
      call near [rax]
      
      test  eax, RESULT_BIT_AGAIN
      jnz   .exit
      test  eax, RESULT_BIT_LEAVE
      jnz   .charLoop.unicodeOrEofLoop

      jmp .charLoop

  .exit:
    pop r10
    pop rbx
    pop r12 ;caller
    xor eax,eax
    ret

Some explanations:
_k_h5a_Tokenizer_common_handler_table holds 3 pointers:

ascii_matrix : a NUM_STATES x 128 table of pointers. Each entry points to the start of a function fragment handling the case, with the unicode character in EDI/RDI
unicode_table : NUM_STATES long table of pointers in case character is out of the ASCII range. See dispatcher code.
eof_table : action in case an EOF character was reached.

They are built up like this:

For each ASCII character code (0x00 ... 0x7F):
If there is a specified handler, store it at current pos in ASCII table
Otherwise, store the "Anything else" fragment address instead
This comprises 128 * 8 bytes of ascii_matrix. The region begins at &ascii_matrix + state_index * 128 * 8

If there is an EOF case specified, store the frag address at &eof_table + state_index * 8
Store the "Anything else" frag address at &unicode_table + state_index * 8 (unconditionally).

The tokenizer states only ever directly address ASCII characters or the symbolic EOF character. The dispatcher uses a hash function to determine a character class for the current codepoint (see .charLoop.hashChar) and advances accordingly. In any cases, the common_table entry is loaded for the character class. Unicode/EOF situations are handled the same way because their tables work the same. The only difference lies in the fragments themselves: for unicode cases, they receive a character argument in EDI/RSI, while EOF handlers don't care. This should not be relevant to the rest of the code.

Now comes the tricky part: string patterns

Code:

define_state markupDeclarationOpen
  [[Exactly "--"]]
    xor al,al
    ;[[fallthrough]]
    ret
  [[Case-insensitively "DOCTYPE"]]
    xor al,al
    ret
  [[Exactly "[CDATA["]]
    xor al,al
    ret
  [[Anything else]]
    xor al,al
    ret
end define_state

Spec reference here. This skeleton showcases the string matchers, their names are self-explanatory. If we look at the dispatcher tables, we notice the fragments take in a character argument either way: even for EOF cases, the symbolic character first had to be read through user callback. Hence, we want string patterns to happen before we attempt to do any single-character handling. It also happens that some states require unconditional steps before matching the character (spec example). This is why there's STATE_BIT_SPC_ACTION. The STATE_BIT bit enum is specified on a per-state basis in a byte array which is accessed at the very top level of the tokenizer loop. Notice how it is not checked when potentially reconsuming/reprocessing the character in different states. We trust the spec that there if, say, "A" is to be reconsumed in "banana state", "banana state" will accept single-character patterns in the first place.

Anyway, STATE_BIT_SPC_ACTION specifies that there is a special long fragment that will take care of peculiar things. This long fragment is essentially a cascade of conditional blocks that each represent a string patterns action body when matched and return just like regular states. tokenizer_states.g on GitHub implements this syntax and seems to work for generating this "cascade". The head of the cascade will be reserved for unconditional actions, meaning there will not be a "ret" instruction required:

Code:

define_state numericCharacterReference
  [[::before]]
   ; welcome to Cascading State Sheets 
    movzx qword [r12 + H5aParser.char_ref], 0x00
  [[U+0078 LATIN SMALL LETTER X]]
  [[U+0058 LATIN CAPITAL LETTER X]]
    ;; XXX: append to temp buffer
   movzx qword [r12 + H5aParser.tokenizer.state], HEXADECIMAL_CHARACTER_REFERENCE_START_STATE
   xor al,al
   ret

  [[Anything else]]
    movzx qword [r12 + H5aParser.tokenizer.state], DECIMAL_CHARACTER_REFERENCE_START_STATE
    mov al, RESULT_RECONSUME
    ret
end define_state

... so yeah. I have no idea if this is the "best" (or even any logical) way to do build such a dispatcher, but I shall try to use the method you described in the 2nd-last post with "check defined". Those are some really interesting other points you brought up, it's cool to see a different perspective on this.

P.S.: can't run the code, I've avoided writing anything substantial before I get the macro sorcery right.

Edit: [[::before]] is not supported yet.

13 Mar 2025, 21:53

< Last Thread | Next Thread >

Forum Rules:

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum