flat assembler
Message board for the users of flat assembler.
Index
> Main > regex and glob |
Author |
|
sylware 18 Jul 2022, 20:14
Is anybody aware of x64 assembly implementations of
|
|||
18 Jul 2022, 20:14 |
|
bitRAKE 18 Jul 2022, 22:39
https://github.com/bitRAKE/fasmg_playground/blob/master/string/RegEx__Simple.g
You are probably looking for something more complex. Look how small the code is though. (converts easily to UTF-16) |
|||
18 Jul 2022, 22:39 |
|
redsock 19 Jul 2022, 19:46
I have written a complete regex engine for my UTF32 requirements, though I didn't do it in assembly language The best place to start IMO is https://swtch.com/~rsc/regexp/regexp1.html. Russ Cox also wrote Google's RE2 (https://github.com/google/re2) implementation, and there is a lot of fascinating history that he covers in those 4 series on regular expressions.
Also a mandatory reference while you're at it is burntsushi's rust regexp crate https://github.com/BurntSushi/regexp, if you haven't already used ripgrep, highly recommend. I look forward to seeing your assembly language version! |
|||
19 Jul 2022, 19:46 |
|
sylware 19 Jul 2022, 21:20
I was looking for an already existing engine. I am currently "coding" a huge thing, dunno when I'll be finished. But if one day I am writting one, I would probably "hand compile" bellard one. |
|||
19 Jul 2022, 21:20 |
|
bitRAKE 20 Jul 2022, 12:15
The minimum regex VM I could come up with:
Code: ; RAX: next ; RSI: VM byte code ; RDI: string pointer split: push rdi rsi ; store VM state lodsb ; skip branch offset mov al, next - split call rax pop rsi rdi ; restore VM state jump: movsx rcx,byte [rsi] ; load relative offset lea rsi,[rsi+rcx*2+1] next: lodsb jmp rax char: cmpsb jz next retn match: lodsb ; skip extra byte ; use match result retn The VM program to match 'a+b+' would be: Code: db 0x15, 'a' ; char 'a' db 0x00, -2 ; split db 0x15, 'b' ; char 'b' db 0x00, -2 ; split db 0x19, 0x12 ; match, next Code: lodsb
movzx ecx,al
repz cmpsb
jz next
retn _________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup |
|||
20 Jul 2022, 12:15 |
|
Picnic 21 Jul 2022, 05:55
bitRAKE wrote: https://github.com/bitRAKE/fasmg_playground/blob/master/string/RegEx__Simple.g Hi bitRAKE, It looks very handy and fits my simple needs, I would like to experiment more. Few usage examples would be useful for the ignorant. As a wise man said: bitRAKE need to write a test for regex |
|||
21 Jul 2022, 05:55 |
|
bitRAKE 21 Jul 2022, 14:10
Picnic wrote: Few usage examples would be useful for the ignorant. As a wise man said: bitRAKE need to write a test for regex Russ Cox's articles link to some test cases. So, I should be able to get around to delving deeper into this. If you don't mind working from the debugger, a simple test program would be: [code updated below] Using my current build system. I wouldn't be surprised if there are errors. One thing lacking is the length of the sub-string match - which seems really useful. Edit: I did find an error and added the end address of match returned in RDX, and updated test to reflect that change: Code: format PE64 CONSOLE 6.2 at 0x1234_56780000 include 'umbrella.inc.g' include 'glob.inc' glob_testing: entry $ push rsi rdi iterate <PATTERN, CASE, EXPECTED, LENGTH>,\ '^', '', 0, 0,\ '^a', 'ba', -1, -1,\ '^.*c', 'abcd', 0, 3,\ '$', '', 0, 0,\ 'a$', 'ab', -1, -1,\ 'b.*$', 'abcd', 1, 3,\ 'a*$', 'aaaa', 0, 4,\ 'y.*$', 'a any b', 4, 3,\ '.b', 'aab', 1, 2,\ \; non-greedy matching returns least possible length: 'a*', 'aaaa', 0, 0,\ '.*', 'any', 0, 0,\ 'y.*', 'a any b', 4, 1,\ collect CONST.1 glob_testing.pattern.% db PATTERN,0 glob_testing.case.% db CASE,0 end collect lea rsi,[.pattern.%] lea rdi,[.case.%] push rdi call RegEx__Glob pop rax if EXPECTED < 0 ; should not match, fail if it does jz .fail else ; should match, verify location & length of match jnz .fail sub edx,edi cmp edx,LENGTH jnz .fail sub edi,eax cmp edi,EXPECTED jnz .fail end if end iterate pop rdi rsi retn .fail: int3 ; should be in debugger, RAX is failed test case _________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup |
|||
21 Jul 2022, 14:10 |
|
Picnic 22 Jul 2022, 07:05
bitRAKE wrote: If you don't mind working from the debugger Not at all. Very helpful example, the code was updated, an error was corrected, it's more than expected. Thank you bitRAKE. I will try to test embed a simple regex command in my interpreter. I will let you know if I have any questions. |
|||
22 Jul 2022, 07:05 |
|
bitRAKE 23 Jul 2022, 00:05
Picnic, if you read the source article you'll see it's more of a teaching example, and is fairly feature limited. In many ways expanding it would require a complete rewrite.
Speaking about implementation of the more general regex: Even though I don't like the POSIX regex completely, several features are desirable to me: captures, options, and character groups. Much of the syntax is just a simpler form of general syntax: like (){m,n} being able to replace ()? ()* ()+. The parser reduces this complexity. Initially, I thought the VM was silly - beyond abstracting away some complexity it didn't seem terribly useful. Then the idea of multiple VMs running the same code - completely changes things, as each VM has different performance characteristics and feature sets. These are small VMs - could fix half a dozen in 4K. Which makes my first approximation above naïve. Not that it's impossible to construct VMs in that form. It's more about the added constraint of all the instructions having the same entry offset. Something to revisit after the parser is working. I have ideas for many of needed features in general VM. If those entry points fit in 256 bytes then the others could just be a reduction of that. For example, the captures create additional state to preserve. The trick with using ENTER 0,31 as a copy to stack is sufficient. People might balk about the performance though, but that is more of an Intel problem. _________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup |
|||
23 Jul 2022, 00:05 |
|
revolution 23 Jul 2022, 01:59
bitRAKE wrote: The trick with using ENTER 0,31 as a copy to stack is sufficient. People might balk about the performance though, but that is more of an Intel problem. I like the idea of regex, but there is no way I want to use it where performance matters. |
|||
23 Jul 2022, 01:59 |
|
Picnic 23 Jul 2022, 10:43
bitRAKE wrote: Picnic, if you read the source article you'll see it's more of a teaching example, and is fairly feature limited. In many ways expanding it would require a complete rewrite. Yes it is, but still, its a tiny example, fun to play. Some years ago i saw mrpink's expression engine re4asm so i have a general idea about how a regex implementasion is. |
|||
23 Jul 2022, 10:43 |
|
bitRAKE 23 Jul 2022, 13:29
Very nice work by mrpink - I hadn't seen that before. POSIX ERE, documentation, very thorough error checking, and processor flexible (nothing heavier than a MOVZX).
_________________ ¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup |
|||
23 Jul 2022, 13:29 |
|
sylware 23 Jul 2022, 13:33
re4asm would be a good basis for a complete support of POSIX ERE!
I'll try to keep that in mind. |
|||
23 Jul 2022, 13:33 |
|
< Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.