Hi. I've written a small lexer generator in fasmg: it builds a lexer from a regex specification that is also written in fasmg. I can't post it yet because it uses an instruction that is not available in the "official" fasmg, so fasmg needs to be patched first. The tool generates an uncompressed raw DFA. Some knowledge of how a regex is converted to a DFA, and of how a DFA works, is very useful for understanding the tool. The format of the generated table is as follows:
_name_module:: ; define virtual block


.EXAMPLE_1 = 0 ; define the tokens to be processed by the parser or other tools
.ERROR = 3
.EOF = 4 ; note: ERROR and EOF are always generated by the tool, no need to define them

db 0,0 ..., 0 ; state 0 is the dead state
db 1,2, ...,0 ; state 1 is the start state. Each row defines 256 bytes, one next state for each possible input byte. This is the disadvantage of an uncompressed DFA

._string_table: ; debug table, used only for display
db 2, .IDENTIFIER ; marks which state is final and which token it recognizes
db 0 ; always terminated by 0

db .COMMENT ; marks a token that can be skipped
db 0 ; always terminated by 0
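
For readers more at home in C, the layout above can be read roughly as follows. This is only a sketch under assumptions: the helper names (`dfa_step`, `dfa_accept`, `build_toy_table`) and the three-state toy table are made up for illustration and are not output of the tool. It shows the two lookups the format implies: one 256-byte row per state, and (state, token) pairs terminated by 0.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names, chosen for this sketch only. */
enum { DEAD_STATE = 0, START_STATE = 1, TOK_IDENTIFIER = 1 };

/* One transition: every state owns a 256-byte row, so the row of
   state s starts at offset s * 256 and is indexed by the input byte. */
static uint8_t dfa_step(const uint8_t *state_table, uint8_t state, uint8_t byte)
{
    return state_table[((size_t)state << 8) + byte];
}

/* Walk the (state, token) pairs until the terminating 0.
   Returns the token for a final state, or -1 if the state is not final. */
static int dfa_accept(const uint8_t *accept_table, uint8_t state)
{
    for (size_t i = 0; accept_table[i] != 0; i += 2)
        if (accept_table[i] == state)
            return accept_table[i + 1];
    return -1;
}

/* Toy table with 3 states: 0 = dead, 1 = start, 2 = inside [a-z]+.
   Any byte that has no transition leads to the dead state. */
static void build_toy_table(uint8_t table[3 * 256])
{
    memset(table, DEAD_STATE, 3 * 256);
    for (int c = 'a'; c <= 'z'; c++) {
        table[(1 << 8) + c] = 2;
        table[(2 << 8) + c] = 2;
    }
}
```

With three states the toy table already takes 768 bytes, which illustrates why the uncompressed form grows quickly for a full C lexer.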

I will include the DFA for the C language, in case you want to test it yourself.
Here is a simple table-driven lexer module in fasmg.
virtual at 0H
  include "token_c.inc"
end virtual
virtual at 0H
_raw:: file "file.c" ; put your file here
_size = ($ - $$)
end virtual
macro _restore_state? _start?*, _last?*, _backup?*, _token?*, _count?*, _stream?*
    local _size, _symbol
    if (_start = 0H) ; reached the DEAD state: emit the longest match seen so far
        if (_last = 0H) ; no final state was visited since the token started
            display "lexing error", 00AH
        end if
        ; _string_table is read as one 4-byte offset per token, pointing to a length-prefixed name
        load _offset:4H from token_c: (token_c._string_table + (_token shl 2H))
        load _size:byte from token_c:_offset
        load _symbol:_size from token_c: (_offset+1H)
        display _symbol, 00AH
        if ((_token = token_c.IDENTIFIER) | (_token = token_c.STRING) | (_token = token_c.COMMENT))
            ; show the matched text in red (ANSI escape codes)
            load _lexeme: (_backup - _first) from _raw:_first
            display 01BH, "[31m", string(_lexeme), 00AH, 01BH, "[0m"
        end if
        ; restart the DFA just after the token that was emitted
        _start = 1H
        _token = 0H
        _last = 0H
        _stream = _backup
        _first = _stream
    end if
end macro

; build a state -> token map (_accept?.<state>) from the (state, token) pairs
_accept? := 0H
_index = 0H
while (not 0H)
    load _test:byte from token_c: (token_c._accept_table+(_index shl 1H))
    if (_test = 0H) ; the terminating 0 ends the table
        break
    end if
    load _token:byte from token_c: ((token_c._accept_table+(_index shl 1H))+1H)
    rept 1H _:_test
        _accept?._? = _token
    end rept
    _index = _index + 1H
end while

_source_c = 0H
_backup = 0H
_count = 0H
_start = 1H
_last = 0H
_token = 0H
_first = _source_c

while (not 0H)
    rept 1H _:_start
        if (definite _accept?._?)
            _last = _start
            _backup = _source_c
            _token = _accept?._?
        end if
    end rept
    if (_source_c >= _size) ; end of input
        break
    end if
    load _current:byte from _raw:_source_c
    load _start:byte from token_c: (token_c._state_table+((_start shl 8H) + _current))
    _source_c = _source_c + 1H
    _restore_state _start, _last, _backup, _token, _count, _source_c
end while
_start = 0H
_restore_state _start, _last, _backup, _token, _count, _source_c

It can handle 5,000 to 6,000 lines of C in 3 to 4 seconds with the current fasmg, which IMO is very good. Of course, you can also use the table from pure assembly for better performance. Once you understand how to use the table, it is very easy to adapt it to a specific language.
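
As an example of such an adaptation, here is a rough C sketch of the same longest-match loop as the fasmg driver above: remember the last final state and its position, and when the DFA reaches the dead state, emit the token and restart from the saved position. Everything named here (`Lexeme`, `lex`, `build_toy_dfa`, the token values, the four-state toy DFA) is an assumption made for the sketch. A real port would load the generated state and accept tables instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { DEAD = 0, START = 1 };
enum { TOK_WORD = 1, TOK_SPACE = 2 };   /* hypothetical token values */

typedef struct { int token; size_t begin, end; } Lexeme;

/* Toy DFA: state 2 matches [a-z]+, state 3 matches runs of spaces.
   accept[s] is the token recognized in state s, or -1 if s is not final. */
static void build_toy_dfa(uint8_t table[4 * 256], int accept[256])
{
    memset(table, DEAD, 4 * 256);
    for (int s = 0; s < 256; s++) accept[s] = -1;
    for (int c = 'a'; c <= 'z'; c++) {
        table[(1 << 8) + c] = 2;
        table[(2 << 8) + c] = 2;
    }
    table[(1 << 8) + ' '] = 3;
    table[(3 << 8) + ' '] = 3;
    accept[2] = TOK_WORD;
    accept[3] = TOK_SPACE;
}

/* Longest-match scan. Returns the number of lexemes written to out,
   or -1 on a lexing error (or when out is too small). */
static int lex(const uint8_t *table, const int *accept,
               const char *src, size_t size, Lexeme *out, size_t max_out)
{
    size_t first = 0, n = 0;
    while (first < size) {
        uint8_t state = START;
        size_t pos = first, backup = first;
        int token = -1;
        while (state != DEAD) {
            if (accept[state] >= 0) {       /* final state: remember it */
                token = accept[state];
                backup = pos;
            }
            if (pos >= size) break;         /* end of input */
            state = table[((size_t)state << 8) + (uint8_t)src[pos]];
            pos++;
        }
        /* no final state seen, or an empty match that would never advance */
        if (token < 0 || backup == first || n >= max_out) return -1;
        out[n].token = token;
        out[n].begin = first;
        out[n].end = backup;
        n++;
        first = backup;                     /* restart after the emitted token */
    }
    return (int)n;
}
```

Unlike the fasmg driver, this sketch stops at the first lexing error instead of printing a message and continuing. That is a design choice of the sketch, not something taken from the tool.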

Post 14 Jun 2021, 15:34
