flat assembler
Message board for the users of flat assembler.

C lexer in fasmg

MaoKo 14 Jun 2021, 15:34
Hi. I've written a small lexer generator in fasmg: it builds a lexer from a regex specification that is itself written in fasmg. I can't post it yet because it uses an instruction that is not available in the "official" fasmg, so fasmg needs to be patched first. The tool generates an uncompressed raw DFA. Some knowledge of how a regex is converted to a DFA, and of how a DFA works, is very useful for understanding the tool. The format of the generated table is as follows:
Code:
_name_module:: ; define the virtual block

; [TOKEN]

.EXAMPLE_1 = 0 ; token numbers, to be processed by a parser or other tools
.IDENTIFIER = 1
.COMMENT = 2
.ERROR = 3
.EOF = 4 ; note: ERROR and EOF are always generated by the tool, there is no need to define them

._state_table:
db 0,0 ..., 0 ; dead state
db 1,2, ...,0 ; state 1 is the start state. Each row holds 256 bytes, one for each possible input byte. This is the disadvantage of an uncompressed DFA

._string_table: ; debug table, used only for display
._accept_table:
db 2, .IDENTIFIER ; marks which states are final and which token each one recognizes
db 0 ; always terminated by 0

._skip_table:
db .COMMENT ; marks the tokens that can be skipped
db 0 ; always terminated by 0
    


I will include the DFA for the C language, in case you want to test it yourself.
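To make the table layout above concrete, here is a minimal Python sketch that builds a toy table of the same shape and reads it back. It is not output of the tool: the toy DFA for [a-z]+, the helper names (make_row, next_state, read_accept_table, read_skip_table) and the choice of Python are all assumptions for illustration. Only the layout itself (256 bytes per state, dead state 0, start state 1, zero-terminated accept and skip tables) is taken from the listing.

```python
# Hypothetical token numbers, mirroring the listing above.
IDENTIFIER, COMMENT, ERROR, EOF = 1, 2, 3, 4


def make_row(transitions):
    """Build one 256-byte row; transitions maps an input byte to the next state."""
    row = bytearray(256)  # every byte not listed leads to the dead state 0
    for byte, target in transitions.items():
        row[byte] = target
    return bytes(row)


# Toy DFA for [a-z]+ : state 0 is dead, state 1 is the start, state 2 accepts.
letters = {b: 2 for b in range(ord("a"), ord("z") + 1)}
state_table = make_row({}) + make_row(letters) + make_row(letters)

accept_table = bytes([2, IDENTIFIER, 0])  # (state, token) pairs, closed by a 0
skip_table = bytes([COMMENT, 0])          # token numbers, closed by a 0


def next_state(table, state, byte):
    """One transition: the row is selected by the state, the column by the byte."""
    return table[(state << 8) + byte]


def read_accept_table(raw):
    """Turn the (state, token) pairs into a dict, stopping at the 0 terminator."""
    accepting = {}
    i = 0
    while raw[i] != 0:
        accepting[raw[i]] = raw[i + 1]
        i += 2
    return accepting


def read_skip_table(raw):
    """Collect the skippable token numbers that precede the 0 terminator."""
    return set(raw[:raw.index(0)])
```

With this layout a transition is a single indexed load, which is why the table is simple to use but large: every state costs 256 bytes.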
Here is a simple lexer driver module in fasmg.
Code:
virtual at 0H
  include "token_c.inc" ; generated tables, read later through the token_c area
end virtual

virtual at 0H
_raw:: file "file.c" ; put your file here
_size = ($ - $$) ; size of the source file in bytes
end virtual

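; Called after every input byte: once the DFA reaches the dead state,
; display the last accepted token and rewind the input to the end of its lexeme.
macro _restore_state? _start?*, _last?*, _backup?*, _token?*, _count?*, _stream?*
    local _size, _symbol
    if (_start = 0H) ; reach DEAD state
        if (_last = 0H) ; no accepting state was seen for this lexeme
            display "lexing error", 00AH
        end if
        ; fetch the token name (a length-prefixed string) from the debug string table
        load _offset:4H from token_c: (token_c._string_table + (_token shl 2H))
        load _size:byte from token_c:_offset
        load _symbol:_size from token_c: (_offset+1H)
        display _symbol, 00AH
        if ((_token = token_c.IDENTIFIER) | (_token = token_c.STRING) | (_token = token_c.COMMENT))
            ; also show the lexeme itself, in red
            load _lexeme: (_backup - _first) from _raw:_first
            display 01BH, "[31m", string(_lexeme), 00AH, 01BH, "[0m"
        end if
        ; restart the DFA right after the lexeme that was just displayed
        _start = 1H
        _token = 0H
        _last = 0H
        _stream = _backup
        _first = _stream
    end if
end macro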

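; Build a lookup from the (state, token) pairs of the accept table:
; _accept?.<state> is defined only for accepting states and holds the token.
_accept? := 0H
_index = 0H
while (not 0H)
    load _test:byte from token_c: (token_c._accept_table+(_index shl 1H))
    if (_test = 0H) ; a 0 state terminates the table
        break
    end if
    load _token:byte from token_c: ((token_c._accept_table+(_index shl 1H))+1H)
    ; rept turns the state number into a symbol name
    rept 1H _:_test
        _accept?._? = _token
    end rept
    _index = _index + 1H
end while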

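_source_c = 0H ; current offset in the input
_backup = 0H ; offset just after the last accepted lexeme
_count = 0H
_start = 1H ; current DFA state (1 is the start state)
_last = 0H ; last accepting state seen (0 means none)
_token = 0H ; token of the last accepting state
_first = _source_c ; offset where the current lexeme begins

while (not 0H)
    ; if the current state is accepting, remember it and the input position
    rept 1H _:_start
        if (definite _accept?._?)
            _last = _start
            _backup = _source_c
            _token = _accept?._?
        end if
    end rept
    if (_source_c >= _size) ; end of input
        break
    end if
    load _current:byte from _raw:_source_c
    ; transition: the row is selected by the state, the column by the input byte
    load _start:byte from token_c: (token_c._state_table+((_start shl 8H) + _current))
    _source_c = _source_c + 1H
    _restore_state _start, _last, _backup, _token, _count, _source_c
end while
_start = 0H ; force the dead state to flush the last token
_restore_state _start, _last, _backup, _token, _count, _source_c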
    

It can handle 5K~6K lines of C in 3-4 seconds with the current fasmg, which IMO is very good. Of course, you can use the table from pure assembly for better performance. Once you understand how to use the table, it is very easy to adapt the driver to a specific language.
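For anyone adapting the driver to another language, the loop above boils down to longest-match scanning with a backup position. Here is a rough Python sketch of the same idea. The toy DFA, the token numbers and the function names are hypothetical, errors are raised rather than displayed, and the skip table is applied in the way it is presumably intended (the fasmg driver above does not consult it).

```python
IDENT, NUMBER, DOT, SPACE = 1, 2, 3, 4  # hypothetical token numbers


def make_row(transitions):
    """One 256-byte row; bytes that are not listed lead to the dead state 0."""
    row = bytearray(256)
    for byte, target in transitions.items():
        row[byte] = target
    return bytes(row)


def span(first, last, target):
    """Map every byte in the inclusive range first..last to the target state."""
    return {b: target for b in range(ord(first), ord(last) + 1)}


ROWS = [
    {},                                                       # 0: dead state
    {**span("a", "z", 2), **span("0", "9", 3),
     ord("."): 6, ord(" "): 7},                               # 1: start state
    span("a", "z", 2),                                        # 2: identifier
    {**span("0", "9", 3), ord("."): 4},                       # 3: integer
    span("0", "9", 5),                                        # 4: "12." (not accepting)
    span("0", "9", 5),                                        # 5: fraction digits
    {},                                                       # 6: single dot
    {ord(" "): 7},                                            # 7: run of spaces
]
STATE_TABLE = b"".join(make_row(row) for row in ROWS)
ACCEPTING = {2: IDENT, 3: NUMBER, 5: NUMBER, 6: DOT, 7: SPACE}
SKIP = {SPACE}


def tokenize(data, state_table=STATE_TABLE, accepting=ACCEPTING, skip=SKIP):
    """Longest-match lexing: run the DFA until it dies, then back up to the
    last position at which an accepting state was seen."""
    tokens = []
    pos = 0
    while pos < len(data):
        state, cursor = 1, pos
        last_token, last_end = None, pos
        while cursor < len(data):
            state = state_table[(state << 8) + data[cursor]]
            if state == 0:          # dead state: stop scanning this lexeme
                break
            cursor += 1
            if state in accepting:  # remember the longest match so far
                last_token, last_end = accepting[state], cursor
        if last_token is None:
            raise ValueError(f"lexing error at offset {pos}")
        if last_token not in skip:
            tokens.append((last_token, data[pos:last_end]))
        pos = last_end              # resume right after the accepted lexeme
    return tokens
```

On the input b"1.x" the scanner first runs through "1." hoping for a fraction, dies on "x", and backs up to emit NUMBER "1", then DOT, then IDENT "x". That rewind is the role `_backup` plays in the fasmg driver.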


Attachment: token_c.inc (C DFA, 130.34 KB)

MaoKo 06 Dec 2021, 15:30
Here is the full package for generating these tables:
https://github.com/MaoKo/gFGFX
The newest version of fasmg includes a `call` instruction, which can invoke a calminstruction far faster than the old asmcmd.
This allowed me to build this project. Thank you, Privalov.
With this, you can generate lexing tables for Pascal, C, a custom language, etc. with the help of regexes and automata. You can also use the regex library for other purposes:
Code:
include "xcalm.alm"
include "bitset.inc"
include "automata.inc"
include "regex.inc"

_regex?._declare? _vector, _constant?._false?
_regex?._compile? _vector, "(x|y|z)mm([0-9]|(1[0-9])|(2[0-9])|30|31)"

calminstruction _check_vector? _regex?*, _register?*
  local _return
  _init const __return, _return
  stringify _register
  call _regex?._match?, __return, _regex, _register
  cmp _return, _constant?._false?
  jyes _check_vector_finish
  _asmcmd =display "Match"
 _check_vector_finish:
end calminstruction

_check_vector _vector, xmm32
    


The regex library currently supports union (|), concatenation, closure (*, +, ?), dot, strings, character classes ([0-9], as well as POSIX classes such as [[:alpha:][:upper:]]), operators between character classes ({-}, {+}, {&}), bounded repetition (a{3}, a{4,6}), regex options like (?i:abc) and (?s:.*), and comments (?#COMMENT).
Regex aliases are also supported: for example, you can define identifier as "[[:alpha:][:digit:]_]+" and then use it in an "arith" regex such as "{identifier}={identifier}[+\-*/]{identifier}".
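To give a feel for what these patterns accept, here is a small sketch using Python's standard re module rather than the gFGFX library. Several things here are assumptions: matching is treated as whole-string matching, the POSIX classes are spelled out as [A-Za-z0-9_] because Python's re has no [[:alpha:]] syntax, and expand_aliases is a hypothetical helper that only imitates alias substitution textually.

```python
import re

# The register pattern from the example above, unchanged.
VECTOR = r"(x|y|z)mm([0-9]|(1[0-9])|(2[0-9])|30|31)"


def is_vector_register(name):
    """True when the whole name matches the pattern (xmm0..xmm31, ymm.., zmm..)."""
    return re.fullmatch(VECTOR, name) is not None


def expand_aliases(pattern, aliases):
    """Replace each {name} that is a known alias by its definition, wrapped in a
    non-capturing group. Repetition counts such as {3} are left untouched."""
    def substitute(match):
        name = match.group(1)
        if name in aliases:
            return "(?:" + aliases[name] + ")"
        return match.group(0)
    return re.sub(r"\{([A-Za-z_][A-Za-z0-9_]*)\}", substitute, pattern)


# "[[:alpha:][:digit:]_]+" rewritten with explicit ranges for Python.
ALIASES = {"identifier": "[A-Za-z0-9_]+"}
ARITH = expand_aliases(r"{identifier}={identifier}[+\-*/]{identifier}", ALIASES)


def is_arith(text):
    """True when the whole text has the shape identifier=identifier<op>identifier."""
    return re.fullmatch(ARITH, text) is not None
```

Under these assumptions, is_vector_register accepts "ymm31" and rejects "xmm32", and is_arith accepts "total=price*count" but rejects "total=price".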
Without calminstruction it would be impossible to build such tools. I will still try to optimize it further. For example, it currently takes 40 seconds to generate the DFA for the C language (token_c.asm).
I know that other tools exist and that they are very fast. Mainly, I wanted to demonstrate a kind of POC for fasmg.

If you have ideas for new features, don't hesitate to tell me. I hope to build a yacc-like tool as well, to make a complete toolkit for lexing and parsing. This would make it possible to create small languages inside assembly source and compile (generate) them into assembly at the same time as the regular assembly instructions.