re4asm reference manual

The regcomp() function

#include "regex.h"

int regcomp(regex_t *preg, const char *regex, int cflags);

The regcomp() function compiles the regex string pointed to by regex to an internal representation and stores the result in the pattern buffer structure pointed to by preg.

The cflags argument is the bitwise inclusive OR of zero or more of the following flags (defined in the header "regex.h"):

REG_EXTENDED
Use POSIX Extended Regular Expression (ERE) compatible syntax when compiling regex.
This flag is always set since re4asm only supports ERE.
REG_ICASE
Ignore case. Subsequent searches with the regexec function using this pattern buffer will be case insensitive.
REG_NOSUB
Do not report submatches. Subsequent searches with the regexec function will only report whether a match was found or not and will not fill the match array.
REG_NEWLINE
Normally the newline character is treated as an ordinary character. When this flag is used, the newline character ('\n', ASCII code 10) is treated specially as follows:
  1. The match-any-character operator (dot "." outside a bracket expression) does not match a newline.
  2. A non-matching list ([^...]) not containing a newline does not match a newline.
  3. The match-beginning-of-line operator ^ matches the empty string immediately after a newline as well as the empty string at the beginning of the string (but see the REG_NOTBOL regexec() flag below).
  4. The match-end-of-line operator $ matches the empty string immediately before a newline as well as the empty string at the end of the string (but see the REG_NOTEOL regexec() flag below).
REG_SETMINLEN
Compute the minimum length of a match and store it in the minlen field of preg (see the description of regex_t below).

After a successful call to regcomp it is possible to use the preg pattern buffer for searching for matches in strings (see below).
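
The flags above can be combined in a single regcomp() call. The following is a minimal sketch, not part of re4asm itself: the helper name contains_icase is hypothetical, and the block includes the system <regex.h> so that it builds against any POSIX regex library; when building against re4asm, include "regex.h" as shown in the synopsis.

```c
#include <regex.h>   /* with re4asm: #include "regex.h" */
#include <stddef.h>

/* Hypothetical helper: returns 1 if string contains a match for pattern,
 * 0 if it does not, and -1 if the pattern fails to compile.
 * Letter case is ignored. */
int contains_icase(const char *pattern, const char *string)
{
    regex_t preg;
    int rc;

    /* REG_NOSUB: only a yes/no answer is needed, so no match array. */
    rc = regcomp(&preg, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB);
    if (rc != 0)
        return -1;

    rc = regexec(&preg, string, 0, NULL, 0);
    regfree(&preg);   /* does nothing in re4asm, needed by other libraries */

    return rc == 0 ? 1 : 0;
}
```

Because REG_NOSUB is set, nmatch is passed as 0 and pmatch as NULL.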

The regex_t structure has the following fields that the application can read:

size_t re_nsub
Number of parenthesized subexpressions in regex.
size_t minlen
Minimum length of a match. The data in this field is valid only if the REG_SETMINLEN flag was passed to regcomp.
If string is shorter than minlen, it cannot contain a match and can safely be skipped.
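
A string shorter than minlen cannot contain a match, so a caller may test the length before calling regexec(). The sketch below is only an illustration: the helper name worth_searching is hypothetical, and it assumes that REG_SETMINLEN is a preprocessor macro and that preg was compiled with that flag. With a regex library that lacks the flag, the helper never skips.

```c
#include <regex.h>   /* with re4asm: #include "regex.h" */
#include <string.h>

/* Hypothetical helper: returns 0 if string is certainly too short to
 * match, 1 otherwise.
 * Assumption: REG_SETMINLEN is a macro and preg was compiled with it. */
int worth_searching(const regex_t *preg, const char *string)
{
#ifdef REG_SETMINLEN
    return strlen(string) >= preg->minlen;
#else
    /* No minlen field in this regex library: never skip. */
    (void)preg;
    (void)string;
    return 1;
#endif
}
```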

The regcomp function returns zero if the compilation was successful, or one of the following (POSIX) error codes if there was an error:

REG_BADPAT
Invalid regexp.
REG_ECOLLATE
Invalid collating element referenced.
re4asm returns this whenever equivalence classes or multicharacter collating elements are used in bracket expressions (they are not supported).
REG_ECTYPE
Unknown character class name in [[:name:]].
REG_EESCAPE
The last character of regex was a backslash (\).
REG_ESUBREG
Invalid back reference; number in \digit invalid.
REG_EBRACK
[] imbalance.
REG_EPAREN
\(\) or () imbalance.
REG_EBRACE
\{\} or {} imbalance.
REG_BADBR
{} content invalid: not a number, more than two numbers, first larger than second, or number too large.
REG_ERANGE
Invalid character range, e.g. ending point is earlier in the collating order than the starting point.
REG_ESPACE
Out of memory, or an internal limit exceeded.
REG_BADRPT
Invalid use of repetition operators: two or more repetition operators have been chained in an undefined way.

The regcomp function also returns the following (non-POSIX) error codes if there was an error:

REG_INTERNAL
Internal error (aka "bug").
REG_STATES
NFA state limit exceeded. The current implementation of re4asm supports only patterns that have a character width of less than 33.
REG_EMPTYPAT
regex is the empty string or evaluates to an empty regular expression.
REG_EMPTYSET
A bracket expression [] evaluates to an empty set.
REG_INVCHAR
regex contains a non-ASCII character, i.e., a character with an ASCII code greater than 127.
REG_UNIONOP
Misplaced | operator.
REG_NESTLEVEL
Nesting level limit for parentheses () exceeded.
REG_EMPTYPARENS
Empty pair of parentheses ().
REG_ANCHINPAREN
Anchor (^ or $) within parentheses ().
REG_ANCHINEXPR
Anchor (^ or $) within regular expression, i.e., not first or last character of regular expression.
REG_ANCHALONE
Stand-alone anchor, i.e., ^ not followed by anything, or $ not preceded by anything.
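
A caller would typically check the return value of regcomp() before using the pattern buffer. The following sketch shows one way to do so; the helper name compile_or_report is hypothetical, and only POSIX error codes are named so that the block builds against any POSIX <regex.h>. The re4asm-specific codes listed above fall through to the default branch here and could be given cases of their own when building against re4asm.

```c
#include <regex.h>   /* with re4asm: #include "regex.h" */
#include <stdio.h>

/* Hypothetical helper: compiles pattern into preg, prints a diagnostic
 * to stderr on failure, and returns the code reported by regcomp(). */
int compile_or_report(regex_t *preg, const char *pattern)
{
    int rc = regcomp(preg, pattern, REG_EXTENDED);

    switch (rc) {
    case 0:
        break;
    case REG_EPAREN:
        fprintf(stderr, "unbalanced parentheses in \"%s\"\n", pattern);
        break;
    case REG_BADBR:
        fprintf(stderr, "invalid {} content in \"%s\"\n", pattern);
        break;
    case REG_ESPACE:
        fprintf(stderr, "out of memory or internal limit exceeded\n");
        break;
    default:
        fprintf(stderr, "cannot compile \"%s\" (code %d)\n", pattern, rc);
        break;
    }
    return rc;
}
```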

The regexec() function

#include "regex.h"

int regexec(const regex_t *preg, const char *string, size_t nmatch,
            regmatch_t pmatch[], int eflags);

The regexec() function matches the null-terminated string against the compiled regexp preg, which must have been initialized by a previous call to the regcomp function. The eflags argument is the bitwise inclusive OR of zero or more of the following flags:

REG_NOTBOL
When this flag is used, the match-beginning-of-line operator ^ does not match the empty string at the beginning of string. If REG_NEWLINE was used when compiling preg, the empty string immediately after a newline character will still be matched.
REG_NOTEOL
When this flag is used, the match-end-of-line operator $ does not match the empty string at the end of string. If REG_NEWLINE was used when compiling preg, the empty string immediately before a newline character will still be matched.

These flags are useful when different portions of a string are passed to regexec and the beginning or end of the partial string should not be interpreted as the beginning or end of a line.
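
As an illustration of the partial-string case, the sketch below passes REG_NOTBOL when the portion handed to regexec() does not begin at the start of a line. The helper name match_chunk and its is_line_start parameter are hypothetical; the block uses the system <regex.h> so that it builds against any POSIX regex library.

```c
#include <regex.h>   /* with re4asm: #include "regex.h" */
#include <stddef.h>

/* Hypothetical helper: returns 1 if the compiled pattern matches chunk,
 * otherwise 0. Pass is_line_start = 0 when chunk begins in the middle
 * of a line, so that ^ does not match at the start of chunk. */
int match_chunk(const regex_t *preg, const char *chunk, int is_line_start)
{
    int eflags = is_line_start ? 0 : REG_NOTBOL;

    return regexec(preg, chunk, 0, NULL, eflags) == 0;
}
```

For the pattern "^abc" and the text "xyzabc", the portion starting at offset 3 is "abc". It matches when treated as a line start and does not match when REG_NOTBOL is passed.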

If REG_NOSUB was used when compiling preg, nmatch is zero, or pmatch is NULL, then the pmatch argument is ignored.
Otherwise, the start and end offsets of the entire match are stored in the first element of pmatch.
Since re4asm does not support submatch addressing, setting nmatch > 1 and passing more than one regmatch_t structure in pmatch serves no purpose.

The regmatch_t structure contains at least the following fields:

regoff_t rm_so
Offset from start of string to start of substring.
regoff_t rm_eo
Offset from start of string to the first character after the substring.

The length of a match can be computed by subtracting rm_so from rm_eo.
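
The following sketch shows how the offset and length of a match could be obtained from a single regmatch_t element. The helper name find_first is hypothetical; the block uses the system <regex.h> so that it builds against any POSIX regex library.

```c
#include <regex.h>   /* with re4asm: #include "regex.h" */
#include <stddef.h>

/* Hypothetical helper: finds the first match of pattern in string.
 * On success it returns 0 and stores the offset and length of the match;
 * otherwise it returns the regcomp() or regexec() error code and leaves
 * *offset and *length untouched. */
int find_first(const char *pattern, const char *string,
               size_t *offset, size_t *length)
{
    regex_t preg;
    regmatch_t pmatch[1];   /* one element: the entire match */
    int rc;

    rc = regcomp(&preg, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;

    rc = regexec(&preg, string, 1, pmatch, 0);
    if (rc == 0) {
        *offset = (size_t)pmatch[0].rm_so;
        *length = (size_t)(pmatch[0].rm_eo - pmatch[0].rm_so);
    }
    regfree(&preg);
    return rc;
}
```

With a POSIX-conforming matcher, searching for "b+" in "aabbbcc" stores offset 2 and length 3.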

The regexec() function returns zero if a match was found, otherwise it returns REG_NOMATCH to indicate no match.

The regerror() function

#include "regex.h"

size_t regerror(int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size);

The regerror() function is used to turn the error codes that can be returned by both regcomp and regexec into error message strings.

regerror() takes the error code errcode, the pattern buffer preg, a pointer to a character buffer errbuf, and the size of that buffer errbuf_size. It returns the size of the errbuf required to contain the null-terminated error message string. If both errbuf and errbuf_size are nonzero, errbuf is filled in with the first errbuf_size - 1 characters of the error message and a terminating null.
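
Since regerror() returns the required buffer size, it can be called twice: once without a buffer to learn the size, and once to fill a buffer of exactly that size. The sketch below follows that pattern; the helper name error_message is hypothetical, and the block uses the system <regex.h> so that it builds against any POSIX regex library.

```c
#include <regex.h>   /* with re4asm: #include "regex.h" */
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'd error message for errcode,
 * or NULL if no buffer could be allocated. The caller frees the result. */
char *error_message(int errcode, const regex_t *preg)
{
    /* First call: no buffer, only ask how many bytes are required. */
    size_t needed = regerror(errcode, preg, NULL, 0);
    char *buf;

    if (needed == 0)
        return NULL;

    buf = malloc(needed);
    if (buf != NULL)
        regerror(errcode, preg, buf, needed);   /* second call: fill buf */
    return buf;
}
```

A fixed-size buffer on the stack works as well, in which case the message may be truncated to errbuf_size - 1 characters.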

The regfree() function

#include "regex.h"

void regfree(regex_t *preg);

The regfree() function is (usually) used to free the memory allocated by regcomp.
In re4asm the size of regex_t is constant and neither regcomp() nor regexec() allocates memory for it.
It is therefore not necessary to call regfree(). In re4asm it is a stub function that returns immediately and exists only for completeness and compatibility.