flat assembler
Message board for the users of flat assembler.
C lexer in fasmg
MaoKo 14 Jun 2021, 15:34
Hi. I've written a small lexer generator in fasmg: it builds a lexer table from a regex specification that is also written in fasmg. I can't post it yet because I use an instruction that is not available in the "official" fasmg, so it needs to be patched. The tool generates an uncompressed raw DFA. Some knowledge of how a regex is converted to a DFA, and of how a DFA works, is very useful for understanding the tool. The format of the generated table is as follows:
Code:
_name_module:: ; define virtual block
; [TOKEN]
.EXAMPLE_1 = 0 ; define tokens to be processed by the parser or other tools
.IDENTIFIER = 1
.COMMENT = 2
.ERROR = 3
.EOF = 4
; note: ERROR and EOF are always generated by the tool, no need to define them
._state_table:
db 0,0 ..., 0 ; dead state
db 1,2, ...,0 ; state 1 is the first state. The whole line defines 256 bytes,
              ; one for each possible input byte. This is the disadvantage of an uncompressed DFA
._string_table:
; Some debug table just for display
._accept_table:
db 2, .IDENTIFIER ; mark which state is final and which token it recognizes
db 0 ; always finish with 0
._skip_table:
db .COMMENT ; mark some tokens that can be skipped
db 0 ; always finish with 0

I will include the DFA for the C language, if you want to test it yourself. Here is a simple table-driven lexer module in fasmg.

Code:
virtual at 0H
    include "token_c.inc"
end virtual

virtual at 0H
    _raw:: file "file.c" ; put your file here
    _size = ($ - $$)
end virtual

macro _restore_state? _start?*, _last?*, _backup?*, _token?*, _count?*, _stream?*
    local _size, _symbol
    if (_start = 0H) ; reach DEAD state
        if (_last = 0H)
            display "lexing error", 00AH
        end if
        load _offset:4H from token_c: (token_c._string_table + (_token shl 2H))
        load _size:byte from token_c:_offset
        load _symbol:_size from token_c: (_offset+1H)
        display _symbol, 00AH
        if ((_token = token_c.IDENTIFIER) | (_token = token_c.STRING) | (_token = token_c.COMMENT))
            load _lexeme: (_backup - _first) from _raw:_first
            display 01BH, "[31m", string(_lexeme), 00AH, 01BH, "[0m"
        end if
        _start = 1H
        _token = 0H
        _last = 0H
        _stream = _backup
        _first = _stream
    end if
end macro

_accept? := 0H
_index = 0H
while (not 0H)
    load _test:byte from token_c: (token_c._accept_table+(_index shl 1H))
    if (_test = 0H)
        break
    end if
    load _token:byte from token_c: ((token_c._accept_table+(_index shl 1H))+1H)
    rept 1H _:_test
        _accept?._? = _token
    end rept
    _index = _index + 1H
end while

_source_c = 0H
_backup = ""
_count = 0H
_start = 1H
_last = 0H
_token = 0H
_first = _source_c
while (not 0H)
    rept 1H _:_start
        if (definite _accept?._?)
            _last = _start
            _backup = _source_c
            _token = _accept?._?
        end if
    end rept
    if (_source_c >= _size)
        break
    end if
    load _current:byte from _raw:_source_c
    load _start:byte from token_c: (token_c._state_table+((_start shl 8H) + _current))
    _source_c = _source_c + 1H
    _restore_state _start, _last, _backup, _token, _count, _source_c
end while
_start = 0H
_restore_state _start, _last, _backup, _token, _count, _source_c

It can handle 5K~6K lines of C in 3-4 seconds in the current fasmg, which IMO is very good. But of course you can use the table in pure assembler for better performance. Once you understand how to use the table, it is very easy to adapt it to a specific language.
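For readers who want to see the table layout at work outside fasmg, here is a minimal Python sketch of the same idea: a state table with 256 bytes per state (state 0 dead, state 1 the start state), an accept table of (state, token) pairs terminated by 0, and a skip table terminated by 0. The token numbers, the tiny hand-built table and the helper names (`make_row`, `read_accept`, `read_skip`, `lex`) are assumptions made up for illustration; this is not output of the real tool. Unlike the fasmg driver above, the sketch also consults the skip table.

```python
# Hypothetical token numbers; the real tool assigns its own.
IDENTIFIER, NUMBER, SPACE, ERROR = 1, 2, 3, 4

def make_row(transitions):
    """Build one 256-byte row; bytes not listed go to the dead state 0."""
    row = bytearray(256)
    for chars, target in transitions:
        for ch in chars:
            row[ord(ch)] = target
    return bytes(row)

LETTERS = "abcdefghijklmnopqrstuvwxyz_"
DIGITS = "0123456789"

# Hand-built uncompressed DFA: one 256-byte row per state.
STATE_TABLE = b"".join([
    make_row([]),                                       # state 0: dead state
    make_row([(LETTERS, 2), (DIGITS, 3), (" \n", 4)]),  # state 1: start state
    make_row([(LETTERS + DIGITS, 2)]),                  # state 2: inside identifier
    make_row([(DIGITS, 3)]),                            # state 3: inside number
    make_row([(" \n", 4)]),                             # state 4: inside whitespace
])
ACCEPT_TABLE = bytes([2, IDENTIFIER, 3, NUMBER, 4, SPACE, 0])  # ends with 0
SKIP_TABLE = bytes([SPACE, 0])                                 # ends with 0

def read_accept(table):
    """Decode (state, token) pairs until the terminating 0."""
    accept = {}
    i = 0
    while table[i] != 0:
        accept[table[i]] = table[i + 1]
        i += 2
    return accept

def read_skip(table):
    """Decode the list of skippable tokens until the terminating 0."""
    return set(table[:table.index(0)])

def lex(data):
    """Longest-match lexing: run until the dead state, then back up."""
    accept = read_accept(ACCEPT_TABLE)
    skip = read_skip(SKIP_TABLE)
    tokens = []
    pos = 0
    while pos < len(data):
        state, cursor = 1, pos
        last_token, last_end = None, pos
        while cursor < len(data) and state != 0:
            # Same indexing as the fasmg driver: (state shl 8) + input byte.
            state = STATE_TABLE[(state << 8) + data[cursor]]
            cursor += 1
            if state in accept:
                last_token, last_end = accept[state], cursor
        if last_token is None:
            tokens.append((ERROR, data[pos:pos + 1]))  # no accepting state seen
            pos += 1
        else:
            if last_token not in skip:
                tokens.append((last_token, data[pos:last_end]))
            pos = last_end  # restart from the last accepting position
    return tokens
```

Calling `lex(b"foo 42 bar_9")` walks the input once per token and drops the whitespace tokens because they are listed in the skip table. Reporting a one-byte ERROR token and moving on is a choice made for this sketch only.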
MaoKo 06 Dec 2021, 15:30
This is the full package for generating this:
https://github.com/MaoKo/gFGFX

The newest version of fasmg includes a `call` instruction, which can call a calminstruction incredibly faster than the old asmcmd. This allowed me to build this project. Thank you, Privalov. With this, you can generate lexing tables for Pascal, C, a custom language, etc. with the help of regex and automata. You can also use the regex library for other purposes:

Code:
include "xcalm.alm"
include "bitset.inc"
include "automata.inc"
include "regex.inc"

_regex?._declare? _vector, _constant?._false?
_regex?._compile? _vector, "(x|y|z)mm([0-9]|(1[0-9])|(2[0-9])|30|31)"

calminstruction _check_vector? _regex?*, _register?*
    local _return
    _init const __return, _return
    stringify _register
    call _regex?._match?, __return, _regex, _register
    cmp _return, _constant?._false?
    jyes _check_vector_finish
    _asmcmd =display "Match"
_check_vector_finish:
end calminstruction

_check_vector _vector, xmm32

The regex library normally supports union (|), concatenation, closure (*,+,?), dot, strings, character classes ([0-9], also POSIX ranges such as [[:alpha:][:upper:]]), operators between character classes ({-},{+},{&}), counted sequences (a{3}, a{4,6}), regex options like (?i:abc) and (?s:.*), and comments (?#COMMENT). Regex aliases are also supported: for example, you can define identifier as "[[:alpha:][:digit:]_]+" and then use it in an "arith" regex like "{identifier}={identifier}[+\-*/]{identifier}".

Without calminstruction it would be impossible to build such tools. I will still try to optimize it further. For example, it took 40 seconds to generate the DFA for the C language (token_c.asm). I know that other tools exist and that they are very fast. Principally I wanted to demonstrate a kind of POC for fasmg. If you have ideas for new features, don't hesitate to tell me. I hope I can build a yacc-like tool as well, to get a complete toolset for lexing and parsing. This would make it possible to create small languages inside assembly source and compile (generate) them into assembly at the same time as the regular assembly instructions.
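To try the two patterns quoted above without the fasmg library, here is a small Python sketch using the standard `re` module. It assumes whole-string matching (whether `_regex?._match?` anchors the same way is not stated in the post), it rewrites the POSIX class alias as `[A-Za-z0-9_]+` because Python's `re` has no `[[:alpha:]]` syntax, and the helper names `is_vector_register` and `expand_aliases` are made up for illustration.

```python
import re

# Same pattern as in the fasmg example: xmm0..xmm31, ymm0..ymm31, zmm0..zmm31.
VECTOR = re.compile(r"(x|y|z)mm([0-9]|(1[0-9])|(2[0-9])|30|31)")

def is_vector_register(name):
    """True when the whole name matches the register pattern."""
    return VECTOR.fullmatch(name) is not None

# Assumed translation of the alias "[[:alpha:][:digit:]_]+" to Python syntax.
ALIASES = {"identifier": r"[A-Za-z0-9_]+"}

def expand_aliases(pattern, aliases):
    """Replace each {name} reference with the aliased regex in a group."""
    return re.sub(
        r"\{([A-Za-z_]+)\}",
        lambda m: "(?:" + aliases[m.group(1)] + ")",
        pattern,
    )

# The "arith" regex from the post, built from the identifier alias.
ARITH = re.compile(
    expand_aliases(r"{identifier}={identifier}[+\-*/]{identifier}", ALIASES)
)
```

With this pattern `is_vector_register("xmm31")` is true while `is_vector_register("xmm32")` is false, since only 0 to 31 are listed as valid register numbers. The alias expansion is plain textual substitution, which is enough for this pattern but would need care if a regex used counted sequences with letters inside the braces.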
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.