
re4asm -- a tiny simple regular expression library
==================================================

re4asm is a small regular expression library for searching ASCII text
for patterns. re4asm uses Simple Regular Expressions (SRE), which are a
proper subset of the Extended Regular Expressions (ERE) defined by POSIX.
ERE is the syntax used by the command grep -E (also known as egrep).

The functions provided by re4asm are the same as the POSIX functions:
	- regcomp
	- regexec
	- regerror
	- regfree
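
Because the function names match POSIX, the usual call sequence can be
illustrated in C. The sketch below runs against the regex.h of the system
C library, not against re4asm itself; the helper name ere_search is
hypothetical, and the way re4asm's routines are called from assembly may
differ from what is shown here. REG_NOSUB is passed because sub-match
addressing is not supported by re4asm.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper (not part of re4asm): compile an ERE, search text,
 * and release the compiled pattern.
 * Returns 1 on match, 0 on no match, -1 if the pattern does not compile. */
int ere_search(const char *pattern, const char *text)
{
    regex_t re;
    char msg[128];
    int rc;

    assert(pattern != NULL && text != NULL);

    /* regcomp: translate the pattern into its compiled form */
    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0) {
        /* regerror: turn the error code into a readable message */
        regerror(rc, &re, msg, sizeof msg);
        fprintf(stderr, "regcomp: %s\n", msg);
        return -1;
    }

    /* regexec: returns 0 if the pattern occurs anywhere in text */
    rc = regexec(&re, text, 0, NULL, 0);

    /* regfree: release whatever regcomp allocated */
    regfree(&re);

    return rc == 0 ? 1 : 0;
}
```

For example, ere_search("(a|b)*abb", "xxababbxx") reports a match, while
ere_search("^abc$", "abcd") does not.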

Features
========

(1) The compiled regex has constant size (4 kB)
(2) No dependencies


Limitations
===========

For a technical reason (CPU register width), re4asm supports only patterns
with a character width of at most 32.
The character width of a pattern is the number of character positions, i.e.,
characters, dots and character sets.
The pattern "(a|b)*abb" has character width 5 and the pattern
"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?" has character width 7.

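The counting rule above can be sketched in C to check a pattern before
handing it to the library. The function sre_char_width is a hypothetical
helper and not part of re4asm. It makes several assumptions: a backslash
inside "[" "]" is an ordinary character (as in POSIX), every ")" is an
operator, and repetition counts in braces are skipped without counting.
The text above does not say how re4asm accounts for repetition counts, so
results for such patterns are unverified.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* p[i] is '['; return the index just past the closing ']'. */
static size_t skip_bracket(const char *p, size_t i)
{
    i++;                              /* skip '[' */
    if (p[i] == '^') i++;             /* negation marker */
    if (p[i] == ']') i++;             /* ']' in first position is literal */
    while (p[i] != '\0' && p[i] != ']') {
        if (p[i] == '[' && (p[i + 1] == ':' || p[i + 1] == '.' || p[i + 1] == '=')) {
            /* nested [:class:], [.symbol.] or [=equiv=] */
            char kind = p[i + 1];
            i += 2;
            while (p[i] != '\0' && !(p[i] == kind && p[i + 1] == ']')) i++;
            if (p[i] != '\0') i += 2;
        } else {
            i++;
        }
    }
    if (p[i] == ']') i++;
    return i;
}

/* Hypothetical helper: count characters, dots and character sets. */
size_t sre_char_width(const char *p)
{
    size_t width = 0, i = 0;
    int n;

    assert(p != NULL);
    while (p[i] != '\0') {
        char c = p[i];
        if (c == '[') {                         /* one character set */
            i = skip_bracket(p, i);
            width++;
        } else if (c == '\\' && p[i + 1] != '\0') {
            i++;                                /* escape: one position */
            if (p[i] == 'x') {                  /* \x2A style, up to 2 hex digits */
                i++;
                for (n = 0; n < 2 && isxdigit((unsigned char)p[i]); n++) i++;
            } else if (p[i] >= '0' && p[i] <= '7') {   /* \123 style */
                for (n = 0; n < 3 && p[i] >= '0' && p[i] <= '7'; n++) i++;
            } else {
                i++;
            }
            width++;
        } else if (c == '{') {                  /* repetition count: skipped */
            while (p[i] != '\0' && p[i] != '}') i++;
            if (p[i] == '}') i++;
        } else if (strchr("*?+|()^$", c) != NULL) {
            i++;                                /* operator or anchor */
        } else {
            i++;                                /* ordinary character or dot */
            width++;
        }
    }
    return width;
}
```

Under these assumptions the function returns 5 for "(a|b)*abb" and 7 for
the floating-point pattern above; a result greater than 32 would indicate
a pattern that exceeds the limit.
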
As already mentioned, SREs are a subset of EREs. The grammar of EREs
has been restricted as follows:

- the "^", if used as an anchor, must appear at the start
  (as the very first character) of a simple regular expression
  valid	 : ^abc|bcd|^cde
  invalid: a*^bcd

- the "$", if used as an anchor, must appear at the end
  (as the very last character) of a simple regular expression
  valid	 : abc$|bcd$|cde
  invalid: bc$d?

- anchors ("^" and "$") are not supported within parentheses
  valid	 : ^(abc|bcd)|(cde|fgh)$
  invalid: (^bc$)

- all other occurrences of "^" and "$" (except within "[" "]")
  must be escaped ("\^" and "\$")

- standalone anchors are not supported
  valid	 : ^(abc|bcd)|cde$
  invalid: abc|^|cde or alk|dlsafj|$ or ^ 

- "^$" is supported because it seems to be very popular

- empty pair(s) of parentheses "()" are not supported

- a repetition count of 0, e.g. "A{0}", is not supported

- sub-match-addressing is not supported

- collating sequences [[..]] are recognized but not supported

- equivalence classes [[==]] are recognized but not supported


re4asm patterns (see also doc/sre.y)
====================================

'x'	   	match the character 'x'

'.'	   	any character (byte) except newline (CRLF, \x0D\x0A on Windows)

'[xyz]'    	a "character class"; in this case, the pattern matches either an
		'x', a 'y', or a 'z'

'[xyz-]'   	a "character class"; in this case, the pattern matches either an
		'x', a 'y', a 'z', or a '-', which must be in the last position

'[]xyz]'   	a "character class"; in this case, the pattern matches either an
		'x', a 'y', a 'z', or a ']', which must be in the first position

'[abj-oZ]' 	a "character class" with a range in it, matches an 'a', a 'b', any
		letter from 'j' through 'o', or a 'Z'

'[^A-Z]'   	a "negated character class", i.e., any character but those in the
		class.  In this case, any character EXCEPT an uppercase letter.

'[^]A-Z]'  	a "negated character class", i.e., any character but those in the
		class.  The ']' must immediately follow the '^' operator.

'R*'	   	zero or more R's, where R is any regular expression

'R?'	   	zero or one R (that is, "an optional R")

'R+'	   	one or more R's, where R is any regular expression

'\X'	   	if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C
		interpretation of \X.  Otherwise, a literal 'X'.
		(used to escape operators such as '*' or the wildcard symbol '.')

'\123'	   	the character with octal value 123

'\x2A'	   	the character with hexadecimal value '2A'

'(R)'	   	match an R; parentheses are used to override (operator) precedence

'RS'	   	the regular expression R followed by the regular expression S

'R|S'	   	either the regular expression R or the regular expression S

'^R'	   	match an R but only at the start of a line

'R$'	   	match an R but only at the end of a line


Notes
=====

Within a character class the ".", "?", "*", "+", "|", "[", "(" and ")"
operators (or special symbols) lose their usual meaning and are treated
as ordinary characters.

A ")" is treated as an ordinary character if no matching "(" exists, e.g.,
in the pattern "xy)z" the ")" is an ordinary character and the pattern
matches "xy)z".
Generally it is a good idea to escape it anyway.


Files
=====

The implementation consists of the following files:
	
	ctypedef.inc:
		In this file, the constants for C character classes like
		UPPER, LOWER, ALNUM, ..., are defined.
		This file is usually included at the top of your project or
		at least ahead of re4asm.inc. 
		
	re4asm.inc:
		This file contains all the code and is usually included into
		the code section.
		
	ctypemap.inc:
		Character classification lookup table which is used together
		with ctypedef.inc and should be included into the data section.
		
	classmap.inc:
		Regex character class definitions which should be included
		into the data section.
		
	regerr.inc:
		Error messages for the regerror() function which should be
		included into the data section.
		

Usage
=====

To get an idea of how to use this package take a look at the ./linux 
and ./windows folders.


References
==========

The implementation of regcomp and regexec is derived from the description
given in the following paper (Sections 1,2,3,4, pp. 1-12):

Gonzalo Navarro and Mathieu Raffinot.
New Techniques for Regular Expression Searching.
Algorithmica 41(2):89-116, 2004.


The definition of POSIX ERE can be found at:

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html

Man pages:

regex(3)
regex(7)

Misc:

http://www2.research.att.com/~gsf/testregex/
