C lexer in fasmg
MaoKo
Hi. I've written a small tool, a lexer generator in fasmg, that works from a regex specification also written in fasmg. I can't post it yet because it uses an instruction that is not available in "official" fasmg, so fasmg needs to be patched first. The tool generates an uncompressed raw DFA. Some knowledge of how a regex is converted to a DFA, and of how a DFA works, is very useful for understanding the tool. The format of the generated table is as follows:
Code:
_name_module:: ; define virtual block

; [TOKEN]

.EXAMPLE_1 = 0 ; token ids, to be processed by the parser or other tools
.IDENTIFIER = 1
.COMMENT = 2
.ERROR = 3
.EOF = 4 ; note: ERROR and EOF are always generated by the tool, no need to define them

._state_table:
db 0,0 ..., 0 ; state 0 is the dead state
db 1,2, ...,0 ; state 1 is the start state. Each row defines 256 bytes, one for each possible input byte. This is the disadvantage of an uncompressed DFA

._string_table: ; debug table, used only for display
._accept_table:
db 2, .IDENTIFIER ; marks which states are final and which token each one recognizes
db 0 ; always terminated by 0

._skip_table:
db .COMMENT ; marks tokens that can be skipped
db 0 ; always terminated by 0
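
To make the layout above concrete, here is a rough Python sketch (not the generator itself, which is not posted). It builds a tiny table by hand in the same shape: one 256-byte row per state, row 0 as the dead state, state 1 as the start state, and an accept table of (state, token) pairs terminated by 0. The token id and the `[a-z]+` rule are assumptions made up for illustration; they are not taken from token_c.inc.

```python
# Hypothetical token id, following the numbering shown in the format above.
IDENTIFIER = 1


def build_tables():
    """Build a 3-state uncompressed DFA that recognises [a-z]+ as IDENTIFIER."""
    state_table = bytearray(3 * 256)       # row 0 stays all zero: the dead state
    for byte in range(ord("a"), ord("z") + 1):
        state_table[1 * 256 + byte] = 2    # start state 1 goes to state 2 on a letter
        state_table[2 * 256 + byte] = 2    # state 2 loops on letters
    # (state, token) pairs, terminated by a single 0 byte
    accept_table = bytes([2, IDENTIFIER, 0])
    return bytes(state_table), accept_table


def decode_accept_table(accept_table):
    """Turn the 0-terminated pair list into a {state: token} mapping."""
    accept = {}
    index = 0
    while accept_table[index] != 0:
        accept[accept_table[index]] = accept_table[index + 1]
        index += 2
    return accept


def next_state(state_table, state, byte):
    """One DFA step: each state owns a 256-byte row, so the index is state*256 + byte."""
    return state_table[(state << 8) + byte]
```

The cost mentioned in the comment is visible here: even this three-state automaton takes 768 bytes, because every state stores an entry for all 256 possible input bytes.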
    


I have attached the DFA for the C language, in case you want to test it yourself.
Here is a simple table-driven lexer module in fasmg.
Code:
virtual at 0H
  include "token_c.inc"
end virtual

virtual at 0H
_raw:: file "file.c" ; put your file here
_size = ($ - $$)
end virtual

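; Called after every transition. Once the DFA reaches the dead state, display the
; last accepted token and restart lexing just after its lexeme.
macro _restore_state? _start?*, _last?*, _backup?*, _token?*, _count?*, _stream?*
    local _size, _symbol
    if (_start = 0H) ; reach DEAD state
        if (_last = 0H) ; no accepting state was seen
            display "lexing error", 00AH
        end if
        ; token name: a 4-byte offset in _string_table, pointing to a length-prefixed string
        load _offset:4H from token_c: (token_c._string_table + (_token shl 2H))
        load _size:byte from token_c:_offset
        load _symbol:_size from token_c: (_offset+1H)
        display _symbol, 00AH
        if ((_token = token_c.IDENTIFIER) | (_token = token_c.STRING) | (_token = token_c.COMMENT))
            load _lexeme: (_backup - _first) from _raw:_first
            display 01BH, "[31m", string(_lexeme), 00AH, 01BH, "[0m"
        end if
        ; back to the start state, resuming at the end of the accepted lexeme
        _start = 1H
        _token = 0H
        _last = 0H
        _stream = _backup
        _first = _stream
    end if
end macro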

; Build _accept?.<state> = token from the (state, token) pairs of _accept_table
_accept? := 0H
_index = 0H
while (not 0H)
    load _test:byte from token_c: (token_c._accept_table+(_index shl 1H))
    if (_test = 0H)
        break
    end if
    load _token:byte from token_c: ((token_c._accept_table+(_index shl 1H))+1H)
    rept 1H _:_test
        _accept?._? = _token
    end rept
    _index = _index + 1H
end while

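_source_c = 0H ; current offset into _raw
_backup = 0H ; offset just past the last accepted lexeme
_count = 0H
_start = 1H ; current DFA state, state 1 is the start state
_last = 0H ; last accepting state seen (0 = none)
_token = 0H ; token of the last accepting state
_first = _source_c ; offset where the current lexeme begins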

while (not 0H)
    rept 1H _:_start
        if (definite _accept?._?)
            _last = _start
            _backup = _source_c
            _token = _accept?._?
        end if
    end rept
    if (_source_c >= _size)
        break
    end if
    load _current:byte from _raw:_source_c
    load _start:byte from token_c: (token_c._state_table+((_start shl 8H) + _current))
    _source_c = _source_c + 1H
    _restore_state _start, _last, _backup, _token, _count, _source_c
end while
; end of input: force the dead state to flush the last token
_start = 0H
_restore_state _start, _last, _backup, _token, _count, _source_c

It can handle 5K~6K lines of C in 3-4 seconds with the current fasmg, which IMO is very good. Of course, you can use the table from pure assembly for better performance. Once you understand how to use the table, it is very easy to adapt it to a specific language.
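
As a rough illustration of adapting the table to another language, here is a minimal Python sketch of the same longest-match loop: follow transitions from state 1, remember the last accepting state and the position after it, and when the dead state (0) is reached, emit that token and restart from the remembered position. The tiny table, the token ids and the `SPACE` token are hypothetical and built by hand, not read from token_c.inc. The error handling (emit `ERROR` for one byte and move on) is an assumption of this sketch and differs from the fasmg driver above, which only displays a message.

```python
# Hypothetical token ids for this sketch only.
IDENTIFIER, SPACE, ERROR = 1, 2, 3


def build_demo_dfa():
    """Uncompressed DFA: state 2 matches [a-z]+, state 3 matches runs of spaces."""
    table = bytearray(4 * 256)            # row 0 is the dead state
    for byte in range(ord("a"), ord("z") + 1):
        table[1 * 256 + byte] = 2
        table[2 * 256 + byte] = 2
    table[1 * 256 + ord(" ")] = 3
    table[3 * 256 + ord(" ")] = 3
    accept = {2: IDENTIFIER, 3: SPACE}     # decoded form of the accept table
    return bytes(table), accept


def tokenize(data, state_table, accept, skip=frozenset()):
    """Longest-match scan over `data` (bytes), returning (token, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(data):
        state, cursor = 1, pos             # restart in the start state
        last_token, backup = None, pos
        while state != 0:
            if state in accept:            # remember the longest match so far
                last_token, backup = accept[state], cursor
            if cursor >= len(data):
                break
            state = state_table[(state << 8) + data[cursor]]
            cursor += 1
        if last_token is None or backup == pos:
            # Nothing matched: report one bad byte and keep going (sketch's choice).
            tokens.append((ERROR, data[pos:pos + 1]))
            pos += 1
        else:
            if last_token not in skip:     # same role as the skip table
                tokens.append((last_token, data[pos:backup]))
            pos = backup                   # resume right after the lexeme
    return tokens
```

For example, with `table, accept = build_demo_dfa()`, calling `tokenize(b"ab cd", table, accept, skip={SPACE})` yields the two identifiers `b"ab"` and `b"cd"`, with the space consumed but not reported.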


Attachment: token_c.inc (C DFA, 130.34 KB)

Post 14 Jun 2021, 15:34