flat assembler
Message board for the users of flat assembler.

Index > Main > regex and glob

Author
Thread Post new topic Reply to topic
sylware



Joined: 23 Oct 2020
Posts: 462
Location: Marseille/France
sylware 18 Jul 2022, 20:14
Is anybody aware of x64 assembly implementations of
  • extended posix regex engine
  • glob matching engine
Post 18 Jul 2022, 20:14
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4075
Location: vpcmpistri
bitRAKE 18 Jul 2022, 22:39
https://github.com/bitRAKE/fasmg_playground/blob/master/string/RegEx__Simple.g
You are probably looking for something more complex.
Look how small the code is though. Razz
(converts easily to UTF-16)
Post 18 Jul 2022, 22:39
View user's profile Send private message Visit poster's website Reply with quote
sylware



Joined: 23 Oct 2020
Posts: 462
Location: Marseille/France
sylware 19 Jul 2022, 02:21
yeah, actually I am looking for full POSIX compliance (and maybe some benign extensions), let's say, for the regexp engine I am thinking about an assembly sed.
Post 19 Jul 2022, 02:21
View user's profile Send private message Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 435
Location: Australia
redsock 19 Jul 2022, 19:46
I have written a complete regex engine for my UTF32 requirements, though I didn't do it in assembly language Smile The best place to start IMO is https://swtch.com/~rsc/regexp/regexp1.html. Russ Cox also wrote Google's RE2 (https://github.com/google/re2) implementation, and there is a lot of fascinating history that he covers in those 4 series on regular expressions.

Also a mandatory reference while you're at it is burntsushi's rust regexp crate https://github.com/BurntSushi/regexp, if you haven't already used ripgrep, highly recommend.

I look forward to seeing your assembly language version!

_________________
2 Ton Digital - https://2ton.com.au/
Post 19 Jul 2022, 19:46
View user's profile Send private message Reply with quote
sylware



Joined: 23 Oct 2020
Posts: 462
Location: Marseille/France
sylware 19 Jul 2022, 21:20
Smile

I was looking for an already existing engine.

I am currently "coding" a huge thing, dunno when I'll be finished.

But if one day I am writting one, I would probably "hand compile" bellard one.
Post 19 Jul 2022, 21:20
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4075
Location: vpcmpistri
bitRAKE 20 Jul 2022, 12:15
The minimum regex VM I could come up with:
Code:
;       RAX: next
;       RSI: VM byte code
;       RDI: string pointer

split:
        push rdi rsi                    ; store VM state
        lodsb                           ; skip branch offset
        mov al, next - split
        call rax
        pop rsi rdi                     ; restore VM state
jump:
        movsx rcx,byte [rsi]            ; load relative offset
        lea rsi,[rsi+rcx*2+1]
next:
        lodsb
        jmp rax

char:   cmpsb
        jz next
        retn

match:
        lodsb                           ; skip extra byte
        ; use match result
        retn    
... all the instructions are two bytes. Opcodes are the low byte of instruction address (assume code is positioned correctly, split would be at address 0x?..00). Large patterns would need a lot of jump instructions. Plenty of room to add more instructions though. Razz

The VM program to match 'a+b+' would be:
Code:
db 0x15, 'a'    ; char 'a'
db 0x00, -2     ; split
db 0x15, 'b'    ; char 'b'
db 0x00, -2     ; split
db 0x19, 0x12   ; match, next    
There are some tricks to make note of. For example, if we have a multi-byte compare instruction:
Code:
        lodsb
        movzx ecx,al
        repz cmpsb
        jz next
        retn    
... to match odd lengths, we would need to pad VM byte code with next instruction to skip a byte - keeping the instruction stream even. (So, the single byte match doesn't need LODSB to skip the extra byte.)

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Post 20 Jul 2022, 12:15
View user's profile Send private message Visit poster's website Reply with quote
Picnic



Joined: 05 May 2007
Posts: 1404
Location: Piraeus, Greece
Picnic 21 Jul 2022, 05:55
bitRAKE wrote:
https://github.com/bitRAKE/fasmg_playground/blob/master/string/RegEx__Simple.g
You are probably looking for something more complex.
Look how small the code is though. Razz
(converts easily to UTF-16)


Hi bitRAKE,

It looks very handy and fits my simple needs, I would like to experiment more.

Few usage examples would be useful for the ignorant. As a wise man said: bitRAKE need to write a test for regex Wink
Post 21 Jul 2022, 05:55
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4075
Location: vpcmpistri
bitRAKE 21 Jul 2022, 14:10
Picnic wrote:
Few usage examples would be useful for the ignorant. As a wise man said: bitRAKE need to write a test for regex Wink
... yes, that would be me.

Russ Cox's articles link to some test cases. So, I should be able to get around to delving deeper into this.

If you don't mind working from the debugger, a simple test program would be:

[code updated below]

Using my current build system. I wouldn't be surprised if there are errors. One thing lacking is the length of the sub-string match - which seems really useful.

Edit: I did find an error and added the end address of match returned in RDX, and updated test to reflect that change:
Code:
format PE64 CONSOLE 6.2 at 0x1234_56780000
include 'umbrella.inc.g'

include 'glob.inc'
glob_testing: entry $
        push rsi rdi

        iterate <PATTERN,               CASE,           EXPECTED, LENGTH>,\
                '^',                    '',                     0,      0,\
                '^a',                   'ba',                   -1,     -1,\
                '^.*c',                 'abcd',                 0,      3,\
                '$',                    '',                     0,      0,\
                'a$',                   'ab',                   -1,     -1,\
                'b.*$',                 'abcd',                 1,      3,\
                'a*$',                  'aaaa',                 0,      4,\
                'y.*$',                 'a any b',              4,      3,\
                '.b',                   'aab',                  1,      2,\
                \; non-greedy matching returns least possible length:
                'a*',                   'aaaa',                 0,      0,\
                '.*',                   'any',                  0,      0,\
                'y.*',                  'a any b',              4,      1,\

                collect CONST.1
                        glob_testing.pattern.% db PATTERN,0
                        glob_testing.case.% db CASE,0
                end collect

                lea rsi,[.pattern.%]
                lea rdi,[.case.%]
                push rdi
                call RegEx__Glob
                pop rax
                if EXPECTED < 0 ; should not match, fail if it does
                        jz .fail
                else ; should match, verify location & length of match
                        jnz .fail
                        sub edx,edi
                        cmp edx,LENGTH
                        jnz .fail
                        sub edi,eax
                        cmp edi,EXPECTED
                        jnz .fail
                end if
        end iterate

        pop rdi rsi
        retn

.fail:
        int3 ; should be in debugger, RAX is failed test case    
I'm a little more confident it could be closer to correct. '+' can be added with four instructions.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Post 21 Jul 2022, 14:10
View user's profile Send private message Visit poster's website Reply with quote
Picnic



Joined: 05 May 2007
Posts: 1404
Location: Piraeus, Greece
Picnic 22 Jul 2022, 07:05
bitRAKE wrote:
If you don't mind working from the debugger


Not at all. Very helpful example, the code was updated, an error was corrected, it's more than expected. Thank you bitRAKE.
I will try to test embed a simple regex command in my interpreter. I will let you know if I have any questions.
Post 22 Jul 2022, 07:05
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4075
Location: vpcmpistri
bitRAKE 23 Jul 2022, 00:05
Picnic, if you read the source article you'll see it's more of a teaching example, and is fairly feature limited. In many ways expanding it would require a complete rewrite.


Speaking about implementation of the more general regex:

Even though I don't like the POSIX regex completely, several features are desirable to me: captures, options, and character groups. Much of the syntax is just a simpler form of general syntax: like (){m,n} being able to replace ()? ()* ()+. The parser reduces this complexity.

Initially, I thought the VM was silly - beyond abstracting away some complexity it didn't seem terribly useful. Then the idea of multiple VMs running the same code - completely changes things, as each VM has different performance characteristics and feature sets. These are small VMs - could fix half a dozen in 4K.

Which makes my first approximation above naïve. Not that it's impossible to construct VMs in that form. It's more about the added constraint of all the instructions having the same entry offset. Something to revisit after the parser is working. I have ideas for many of needed features in general VM. If those entry points fit in 256 bytes then the others could just be a reduction of that.

For example, the captures create additional state to preserve. The trick with using ENTER 0,31 as a copy to stack is sufficient. People might balk about the performance though, but that is more of an Intel problem.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Post 23 Jul 2022, 00:05
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20453
Location: In your JS exploiting you and your system
revolution 23 Jul 2022, 01:59
bitRAKE wrote:
The trick with using ENTER 0,31 as a copy to stack is sufficient. People might balk about the performance though, but that is more of an Intel problem.
If you have performance issues and it is because of using regex, then you have done something terribly wrong.

I like the idea of regex, but there is no way I want to use it where performance matters.
Post 23 Jul 2022, 01:59
View user's profile Send private message Visit poster's website Reply with quote
Picnic



Joined: 05 May 2007
Posts: 1404
Location: Piraeus, Greece
Picnic 23 Jul 2022, 10:43
bitRAKE wrote:
Picnic, if you read the source article you'll see it's more of a teaching example, and is fairly feature limited. In many ways expanding it would require a complete rewrite.


Yes it is, but still, its a tiny example, fun to play.
Some years ago i saw mrpink's expression engine re4asm so i have a general idea about how a regex implementasion is.
Post 23 Jul 2022, 10:43
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4075
Location: vpcmpistri
bitRAKE 23 Jul 2022, 13:29
Very nice work by mrpink - I hadn't seen that before. POSIX ERE, documentation, very thorough error checking, and processor flexible (nothing heavier than a MOVZX).

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Post 23 Jul 2022, 13:29
View user's profile Send private message Visit poster's website Reply with quote
sylware



Joined: 23 Oct 2020
Posts: 462
Location: Marseille/France
sylware 23 Jul 2022, 13:33
re4asm would be a good basis for a complete support of POSIX ERE!

I'll try to keep that in mind.
Post 23 Jul 2022, 13:33
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.