flat assembler
Message board for the users of flat assembler.
![]() |
Author |
|
rshadr 02 Mar 2025, 21:43
Hello, I've recently gotten into FASM because NASM wasn't able to do what I need. Specifically, I'm trying to implement the HTML Tokenizer in assembly using macros. The gist of it is that it is table-driven and you define it using a special syntax, e.g. (real-world example):
Code: define_state data,DATA_STATE_INDEX ;; R12: Parser *parser (transient argument) ;; RDI: uint32_t codepoint switch '&' => { mov qword [r12 + Parser.tokenizer_return_state], DATA_STATE_INDEX mov qword [r12 + Parser.tokenizer_state], CHAR_REF_STATE_INDEX xor eax,eax ret } '<' => { mov qword [r12 + Parser.tokenizer_state], TAG_OPEN_STATE_INDEX xor eax,eax ret } 0x00 => { push rdi call Parser_error pop rdi call Parser_emitCharacter xor eax,eax ret } EOF => { jmp Parser_emitEof } anything_else => { call Parser_emitCharacter xor eax,eax ret } end switch end define_state define_state needs to output the following:
The approach I'm following doesn't assemble a "handler function" per se, it is rather, as you might have guessed, the "caller" that selects the right piece of code. This means I can't just use macros that output an assembly fragment representing the "handler function" This switch statement would analyze each arm separately in the following manner: 1. If case value was already seen, error 2. If case value is a character in the range [0x00, 0x7F]: 2.1. Emit fragment with label. Update the ASCII table entry for character. 3. If case value is the symbol "EOF": 3.1. Emit fragment with label. Update the EOF handler. 4. If case value is the symbol "anything_else": 4.1 Emit fragment with label. Update anything_else handler. If no anything_else case was given, error. If no EOF case was given, set EOF handler to anything_else's handler For each ASCII character left unspecified, set its entry to anything_else's I hope this question is not too random or out of place, but I would really appreciate some tips on where to get started. After digging into the fasmg (using fasm2 for assembling) manuals, I was wondering whether this even possible to do, so please let me know if I'm wasting my time. Last edited by rshadr on 16 Mar 2025, 09:39; edited 1 time in total |
|||
![]() |
|
revolution 02 Mar 2025, 22:22
It isn't clear to me what your question is.
What do you need help with? |
|||
![]() |
|
bitRAKE 03 Mar 2025, 19:21
I didn't really like the heavy-handed implementation above, and using state namespaces solves so many future problems, imho. This is the complete HTML state space with all the character classes. It relies on the symbol resolution within fasmg to determine name collisions instead of manually doing it. Additionally, no state constraints are implemented - instead they should be implemented at table generation time, imho.
Last edited by bitRAKE on 03 Mar 2025, 20:58; edited 1 time in total |
|||||||||||
![]() |
|
rshadr 03 Mar 2025, 19:44
Hello bitRAKE, thank you so much for the help! I was actually implementing your first solution into my project when I just saw you added new code a few minutes ago. I greatly appreciate that you took the time for this!
|
|||
![]() |
|
bitRAKE 03 Mar 2025, 20:25
You're welcome.
The present implementation makes the code very intuitive, imho: Code: U+0009 CHARACTER TABULATION (tab) U+000A LINE FEED (LF) U+000C FORM FEED (FF) U+0020 SPACE ; If the current end tag token is an appropriate end tag token, then switch to the before attribute name state. Otherwise, treat it as per the "anything else" entry below. mov rax, [END_TAG_TOKEN] test [rax + Tag_Token.Flags], TT_APPROPRIATE jz ANY ; state local "anything else" .state GET_STATE 13.2.5.32 Before attribute name state mov [r12 + Parser.tokenizer_state], .state xor eax,eax retn |
|||
![]() |
|
bitRAKE 03 Mar 2025, 22:38
If a name collision does happen you'll get an error message like:
Code: Processed: ?88: Error: definition of '769893432772280972386694579221112722235131985007023920842530320626198235977383555889.88' in conflict with already defined symbol. Just build with "-v 2" option and fasm2 will say exactly where the error is: Code: html_test.asm [21]: include 'html_states.inc' html_states.inc [247]: U+0058 LATIN CAPITAL LETTER X ? [70] (CALM) assemble :var [1]: ?88: Processed: ?88: Error: definition of '769893432772280972386694579221112722235131985007023920842530320626198235977383555889.88' in conflict with already defined symbol. I can't anticipate other questions a "beginner" might have. Edit: Or feed the error into scr.g: Code: ; state_case_reverse match state=.case,input display \ 10,'state:',9, string state,\ 10,'case:',9, case else err 'unexpected usage.',10,'try:',9,'fasmg scr.g -I "define input 2645608968345021733469237830984.88"',10 end match ![]() |
|||
![]() |
|
rshadr 09 Mar 2025, 21:20
Hello again, I've got some renewed questions.
Concering table generation, here's what I'm trying to do: Code: ;; tokenizer_states.asm ;; [...] as defined in previous posts. Slightly updated naming section '.rodata' _Tokenizer_ascii_matrix: ;; Data state (DATA_STATE) dq _Tokenizer_handle.data.0 ; 0x00 dq _Tokenizer_handle.data.any ; 0x01 ; [...] dq _Tokenizer_hande.data.60 ; 0x3C ; [...] fill ASCII range ;; RCDATA state (RCDATA_STATE) ; [...] idem ; [...] same for other states _Tokenizer_eof_table: dq _Tokenizer_handle.data.eof ; DATA_STATE dq _Tokenizer_handle.rcdata.eof ;RCDATA_STATE ; [...] idem for remaining states _Tokenizer_any_table: dq _Tokenizer_handle.data.any ; DATA_STATE dq _Tokenizer_handle.rcdata.any ; RCDATA_STATE ; [...] etc. Of course, I don't want to be writing this by hand, otherwise I would not be using fasm ![]()
For (1): Imagine the syntax for the table generator at file end was something like: Code: gen_tables DATA_STATE : data RCDATA_STATE : rcdata ; [...] end gen_tables Walking the lines, it would have force the pointer assembly for the constructed label name (e.g. "_Tokenizer_handle.data.0", "_Tokenizer_handle.markupDeclarationOpen.any") at a given address. For instance, the ASCII matrix looks like this: times NUM_STATES (times 128 QWORD PTR) ...meaning that the "location problem" is nested twice - once for the state itself and for ASCII entries! The entire point of this associative syntax is to allow unordered declarations, storing the pointers based on the *_STATE value. |
|||
![]() |
|
bitRAKE 10 Mar 2025, 00:21
I think about if differently - first I write the dispatch code and then I adapt the tables to the code. The complete tokenizer is probably under 64K - meaning only WORD offsets are needed for the tables. Perhaps we think about eliminating the tables later - there is very little branching at each state.
Why is an "ANY" table needed? Instead, store the "ANY" function where function is not defined. (Also, there is a case for which state.case == ANY for all states.) In CALM it is very similar, perhaps something like: Code: arrange labl, state=.=?val check defined labl jyes have arrange labl,state=.=ANY have: arrange labl,=db labl assemble labl I imagine a table of state pointers (to case table), and then each state has a table (of cases). Is this the nesting you imply? Code: StateTable: iterate STATE, ... repeat 1,N:`STATE dq Case_Table.N end repeat end iterate iterate STATE, ... repeat 1,N:`STATE if defined N.EOF dq N.EOF else dq N.ANY end if Case_Table.N: ; generate state table ... repeat 128,i:0 if defined N.i dq N.i else dq N.ANY end if end repeat end repeat end iterate Code: mov rax, [<current_state>] lea rcx, [StateTable] ; we've mapped the state RAX to integers here mov rcx, [rcx + rax*8] ; this is the case table mov rax, [<current_character>] ; -1 for EOF call [rcx + rax*8] |
|||
![]() |
|
bitRAKE 13 Mar 2025, 20:23
Thinking towards the extreme - what is the most challenging dispatch I can imagine?
Let's explain what is going on here and why: The over-arching goal is to create a tiny table with cache-awareness, because dynamic dispatch can't be predicted by the processor. So, we cram every state into 128 bytes and eight target pointers. Because each state table is constant (in size) the code starts with multiplying the state to get the address offset to the correct table of cases. Since the character dispatch range only covers [0,122]; we can clamp codepoints above to 123 and that becomes our ANY case. The reduced character usage range leaves four dispatch slots for other uses - we only have EOF currently. The EOF signal could come from anywhere, but I've integrated it into the current character. The clamping code is setup to preserve that signal, but it forces the tables to be offset for the negative value. Code: ; need character slots for ANY and EOF assert STATES.maximum < 126 ; cacheline aware case tables assert sizeof StateTable and 3Fh = 0 .offset = 1 ; offset for EOF case imul eax, dword [<current state>], sizeof StateTable lea rcx, [StateTable] lea rcx, [rcx + rax + .offset] ; clamp character to [0,123], state.123 = ANY for all mov edx, 1 + STATES.maximum movsx rax, dword [<current character>] ; UTF-32 | EOF(-1) cmp rax, rdx cmovg rax, rdx ; signed so that EOF preserved ; [EOF][00..z][ANY]... movzx eax, byte [rcx + rax] ; state branches are after byte indices call [rcx + (128 - .offset) + rax*8] |
|||
![]() |
|
rshadr 13 Mar 2025, 21:53
Here's the state of things right now (GitHub for reference):
Code: section '.rodata' _k_h5a_Tokenizer_common_handler_table: ;; Heterogenous table dq _k_h5a_Tokenizer_ascii_matrix ;2d dq _k_h5a_Tokenizer_unicode_table ;1d dq _k_h5a_Tokenizer_eof_table ;1d ; [...] section '.text' executable _h5a_Tokenizer_main: ;; R12 (s/lost): H5aParser * ;; -> EAX: status push rbx push r10 .charLoop: lea rbx, [_k_h5a_Tokenizer_state_flags] ;byte array mov rax, qword [r12 + H5aParser.input_stream.user_data] xlatb test al, STATE_BIT_SPC_ACTION LIKELY jz .charLoop.postSpcAction .charLoop.spcAction: ;; XXX: do spcAction .charLoop.postSpcAction: test al, STATE_BIT_NO_GETCHAR LIKELY jz .charLoop.readChar nop UNLIKELY jmp .charLoop .charLoop.readChar: mov rdi, qword [r12 + H5aParser.input_stream.user_data] call near qword [r12 + H5aParser.input_stream.get_char_cb] mov r10, rax ;keep for later .charLoop.hashChar: ;; hash result (AL): ;; 0x00 : ASCII codepoint (< 0x007F) ;; 0x01 : Unicode codepoint (> 0x007F and != ~0x00) ;; 0x02 : EOF (== ~0x00) xor rax,rax xor rdi,rdi test r10d, (not 0x7F) setnz al mov ecx, r10d not ecx test ecx,ecx setz dil add al, dil .charLoop.dispatchCommon: lea rbx, [_k_h5a_Tokenizer_common_handler_table] mov rbx, qword [rbx + rax * 8] test rax,rax ;unicode/EOF? UNLIKELY jnz .charLoop.unicodeOrEofLoop .charLoop.asciiLoop: mov rax, qword [r12 + H5aParser.tokenizer.state] shl rax, (bsr 128 * 8) lea rax, [rbx + rax] ;load state's LUT lea rax, [rax + r10 * 8] ;load handler mov rdi, r10 call near [rax] test eax, RESULT_BIT_AGAIN jnz .exit test eax, RESULT_BIT_LEAVE jnz .charLoop.asciiLoop jmp .charLoop .charLoop.unicodeOrEofLoop: mov rax, qword [r12 + H5aParser.tokenizer.state] shl rax, (bsr 2 * 8) lea rax, [rbx + r10 * 8] ;load handler call near [rax] test eax, RESULT_BIT_AGAIN jnz .exit test eax, RESULT_BIT_LEAVE jnz .charLoop.unicodeOrEofLoop jmp .charLoop .exit: pop r10 pop rbx pop r12 ;caller xor eax,eax ret Some explanations: _k_h5a_Tokenizer_common_handler_table holds 3 pointers:
They are built up like this: For each ASCII character code (0x00 ... 0x7F): If there is a specified handler, store it at current pos in ASCII table Otherwise, store the "Anything else" fragment address instead This comprises 128 * 8 bytes of ascii_matrix. The region begins at &ascii_matrix + state_index * 128 * 8 If there is an EOF case specified, store the frag address at &eof_table + state_index * 8 Store the "Anything else" frag address at &unicode_table + state_index * 8 (unconditionally). The tokenizer states only ever directly address ASCII characters or the symbolic EOF character. The dispatcher uses a hash function to determine a character class for the current codepoint (see .charLoop.hashChar) and advances accordingly. In any cases, the common_table entry is loaded for the character class. Unicode/EOF situations are handled the same way because their tables work the same. The only difference lies in the fragments themselves: for unicode cases, they receive a character argument in EDI/RSI, while EOF handlers don't care. This should not be relevant to the rest of the code. Now comes the tricky part: string patterns Code: define_state markupDeclarationOpen [[Exactly "--"]] xor al,al ;[[fallthrough]] ret [[Case-insensitively "DOCTYPE"]] xor al,al ret [[Exactly "[CDATA["]] xor al,al ret [[Anything else]] xor al,al ret end define_state Spec reference here. This skeleton showcases the string matchers, their names are self-explanatory. If we look at the dispatcher tables, we notice the fragments take in a character argument either way: even for EOF cases, the symbolic character first had to be read through user callback. Hence, we want string patterns to happen before we attempt to do any single-character handling. It also happens that some states require unconditional steps before matching the character (spec example). This is why there's STATE_BIT_SPC_ACTION. The STATE_BIT bit enum is specified on a per-state basis in a byte array which is accessed at the very top level of the tokenizer loop. Notice how it is not checked when potentially reconsuming/reprocessing the character in different states. We trust the spec that there if, say, "A" is to be reconsumed in "banana state", "banana state" will accept single-character patterns in the first place. Anyway, STATE_BIT_SPC_ACTION specifies that there is a special long fragment that will take care of peculiar things. This long fragment is essentially a cascade of conditional blocks that each represent a string patterns action body when matched and return just like regular states. tokenizer_states.g on GitHub implements this syntax and seems to work for generating this "cascade". The head of the cascade will be reserved for unconditional actions, meaning there will not be a "ret" instruction required: Code: define_state numericCharacterReference [[::before]] ; welcome to Cascading State Sheets ... so yeah. I have no idea if this is the "best" (or even any logical) way to do build such a dispatcher, but I shall try to use the method you described in the 2nd-last post with "check defined". Those are some really interesting other points you brought up, it's cool to see a different perspective on this. P.S.: can't run the code, I've avoided writing anything substantial before I get the macro sorcery right. Edit: [[::before]] is not supported yet. |
|||
![]() |
|
< Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.