flat assembler
Message board for the users of flat assembler.

Index > Compiler Internals > The MATCH algorithm

Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8419
Location: Kraków, Poland
Tomasz Grysztar 23 May 2025, 15:05
As I've been working on extensions to the MATCH command in CALM, pushing the boundaries of what is possible under the constraints of my implementation, I realized that I never explained publicly what the assumptions were, and why I'm only ever extending MATCH abilities in very specific ways. If we had another fasmcon, it could be an interesting little talk.

The story begins with fasm 1.61.10, where MATCH was first implemented, and from the very beginning it used the algorithm that I'm now going to explain.

Let's clarify the basics first: fasm's MATCH operates on tokenized text. It sees the tokens as indivisible units and can only compare whether they are identical or not. For the algorithm it doesn't really matter what the tokens actually are.

MATCH takes a pattern, which is a sequence of wildcard names and required tokens, and tries to match any given tokenized text against it. For example, the pattern:
Code:
a+b
has two wildcards, "a" and "b", and each of them can be matched with any token sequence, while "+" is a required token, and must be matched exactly.

The rule that I came up with when first implementing MATCH was that every consecutive wildcard matches some non-empty text, but as little text as possible. So if the above pattern were matched against the text "1+2+3", "a" would be matched with "1", while "b" would be forced to match the remaining "2+3". Other than that, a wildcard could be matched with absolutely anything.
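This greedy-minimal rule can be illustrated outside of fasm. Here is a minimal sketch in Python (not fasm's actual code), assuming each character of the text is one token and that the pattern is hard-coded as "a+b":

```python
# A sketch of the "as little text as possible" rule for the pattern "a+b",
# assuming each character of the input is one token.
def match_a_plus_b(tokens):
    # Wildcard "a" must take at least one token; extend it only as far as
    # needed to find the required "+" token.
    for split in range(1, len(tokens)):
        if tokens[split] == "+":
            a = tokens[:split]
            b = tokens[split + 1:]
            if b:                      # "b" must also be non-empty
                return a, b
    return None

a, b = match_a_plus_b(list("1+2+3"))
print("".join(a), "".join(b))   # a gets "1", b gets the remaining "2+3"
```

Note that "a" stops at the first "+" it finds; the remainder, including any further "+" tokens, is simply swallowed by "b".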

This assumption allowed me to implement MATCH in quite an efficient way. You see, when we match pattern elements consecutively and we reach a new wildcard, we can assume that what we have so far is not going to change. Any required tokens that we are going to look for must be somewhere forward, because what was matched earlier could not be matched with any shorter text. And whatever else comes our way can be consumed by the latest wildcard.

So, for example, if we need to match a "+", we can keep looking for it, and throw all the intervening tokens into the bag that is our last wildcard. And if there are two wildcards one after another, we can give the first one the bare minimum, a single token, and then let the other one consume whatever we need to get out of the way.

This allowed fasm to avoid exploring many options, and my algorithm needed to remember just a single point in the pattern, the newest wildcard, to fall back to when there is a mismatch. It would then let this latest wildcard consume another token and see if that helps the matching march forward, at least until another wildcard is reached.

My implementation avoided increasing the exponent of the algorithm's complexity with each added wildcard, so the execution time never got out of control, and I was very happy with it. Therefore, whenever I considered any extensions to the MATCH design, I needed them to preserve the basic assumptions.

The first new feature came with fasmg's case-insensitive required tokens. This did not affect anything: the required portion either matches or it does not, and wildcards consume everything else without complaint.

Another fasmg addition was the ability to require whitespace, but that was allowed by the fact that fasmg (unlike fasm) makes special tokens out of whitespace, so they can be treated just like any others. When not required, the whitespace tokens are discarded, and they do not count towards a wildcard's minimum contents.

Then, with CALM, came optional wildcards. They differ from the classic ones in that they can accept empty text. This still does not break the assumptions: they initially match zero tokens instead of consuming a single one, but after that they behave exactly the same as regular ones. The minimal matched text becomes even shorter, so we are not going to miss any possible match of required tokens, and the last wildcard can still consume everything we throw its way.

Things became more interesting when I decided to somehow allow matching only the values with properly balanced opening and closing brackets. My thought process was like this: because the matching algorithm does not care what the tokens actually are, we can re-define what is considered a token for the entire run of the algorithm, and then all the assumptions should still work. Something starting with "(" is going to become one large pseudo-token up to, and including, the balancing ")". If we decide that all the wildcards can consume text only in such chunks, the only problem is that the literal portions of the pattern also need to be composed of such pseudo-tokens, otherwise the wildcard might get its pseudo-token boundaries misaligned. For simplicity (of both the definition and the implementation) I decided to strictly disallow use of the chosen brackets in the pattern.
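The pseudo-token re-definition can be sketched as a pre-pass over the text. A hypothetical illustration in Python, again assuming single-character base tokens (this is not fasm's implementation):

```python
# A sketch of re-defining "token" for the whole run of the algorithm:
# anything starting with "(" becomes one large pseudo-token up to, and
# including, the balancing ")". Base tokens here are single characters.
def pseudo_tokenize(text):
    out, i = [], 0
    while i < len(text):
        if text[i] == "(":
            depth, j = 1, i + 1
            while j < len(text) and depth:
                if text[j] == "(":
                    depth += 1
                elif text[j] == ")":
                    depth -= 1
                j += 1
            if depth:
                raise ValueError("unbalanced brackets")
            out.append(text[i:j])      # one indivisible pseudo-token
            i = j
        else:
            out.append(text[i])
            i += 1
    return out

print(pseudo_tokenize("f(a+(b))*2"))   # ['f', '(a+(b))', '*', '2']
```

Because wildcards consume pseudo-tokens whole, a wildcard can never end up holding an unbalanced fragment like "(a+".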

And finally came the specialized wildcards. I added the ability for a wildcard to match only a very specific entity, like a symbol identifier or a maximal recognizable expression. How does that work? Doesn't it all fall apart if we redefine what is a token not globally, but only for an individual wildcard?

The trick is, I'm not treating these special wildcards as wildcards. They are like required token sequences, except that they are a bit more lenient in what sequences they accept, just like case-insensitive required tokens. And to make this work, a specialized wildcard needs to see only one possible matching option in the text it is given. An expression matcher cannot possibly match a shorter expression when a longer one is available, because then it would have multiple options to explore, and my algorithm does not allow that. A specialized wildcard gets only a single opportunity to tell that it matches the text (and how many tokens it matches); there are no retries. And if it does not match, we go back to the last "true" wildcard, which can consume anything, and let it eat another token, and so on, as usual.
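The single-shot nature of a specialized wildcard can be sketched as a function that, given the remaining text, reports exactly one possible match length or rejects. A hypothetical maximal-munch identifier matcher in Python (names and token model are my own, for illustration only):

```python
# A sketch of a specialized wildcard as a single-shot matcher, assuming
# single-character tokens. Given the text and a starting position, it
# reports exactly one possible match length, or None to reject; there
# are no retries, so it always takes the maximal identifier run.
def match_identifier(tokens, t):
    n = 0
    while t + n < len(tokens) and (tokens[t + n].isalnum() or tokens[t + n] == "_"):
        n += 1
    return n if n else None

print(match_identifier(list("foo+1"), 0))   # 3: it matches "foo", never just "f"
```

If this matcher rejects, control returns to the newest true wildcard, which eats one more token and lets the matching retry from there.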

This creates additional constraints on the design: since a specialized wildcard is not allowed to have multiple options, it cannot be made optional. It also requires the true wildcards to accept any tokens that get rejected, so changing what is considered a token, as in the case of bracket balancing, is also ruled out.

And that's how I managed to add options that at first seemed irreconcilable with my "latest wildcard flag planting" approach. I hope this helps to understand the restrictions.

Let me try to summarize the assumptions:

  • The text is divided into tokens, according to some pre-selected rule, and these tokens can never be split by the algorithm, only consumed in entirety.
  • The pattern consists of freely-matched wildcards (capable of consuming any number of tokens) and strictly-matched required sub-patterns.
  • The freely-matched wildcards always consume as little text as possible, one complete token at a time.
  • When a required sub-pattern is tested against any text, it either rejects it, or matches an unequivocal number of whole tokens. There can be no other length of text that this sub-pattern could possibly match at this point.
The algorithm then proceeds as follows:

  • Start at the beginning of pattern, with no flag planted.
  • If the current element of the pattern is a freely-matched wildcard, plant the flag (remember the position in text, and this wildcard's position in the pattern). Let it consume as many tokens as it initially requires (zero or one) and fail the matching if there are not enough tokens available. Proceed to the next element.
  • If the current element is a required sub-pattern (in CALM MATCH this is either a sequence of literal matches or a specialized wildcard), attempt to match it with the text. If it rejects the text, go back to the planted flag and let the freely-matched wildcard consume one more token. If there was no flag, fail the matching.
  • Continue handling the above two cases until the end of pattern is reached.
  • When a new flag is planted, it replaces the previous one. You never need to go back beyond it, as long as all the assumptions are satisfied.
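The whole flag-planting loop can be sketched compactly. A hypothetical Python rendition, assuming single-character tokens, with "?" in the pattern standing for a freely-matched wildcard (minimum one token) and any other pattern character for a literal required token; this is an illustration of the algorithm above, not fasm's implementation:

```python
# A sketch of the flag-planting algorithm: a single flag remembers the
# newest wildcard, and on any mismatch that wildcard eats one more token.
def match(pattern, tokens):
    bindings = []           # text consumed by each wildcard, in order
    p = t = 0               # positions in pattern and text
    flag = None             # (pattern index, text index, binding index)
    while True:
        if p == len(pattern):
            if t == len(tokens):
                return bindings
            # trailing text left over: retry from the planted flag below
        elif pattern[p] == "?":
            # Plant the flag and give the wildcard the bare minimum.
            if t < len(tokens):
                bindings.append([tokens[t]])
                flag = (p, t, len(bindings) - 1)
                p, t = p + 1, t + 1
                continue
        elif t < len(tokens) and pattern[p] == tokens[t]:
            p, t = p + 1, t + 1       # literal required token matched
            continue
        # Mismatch: fall back to the newest flag and let that wildcard
        # consume one more token; with no flag, the matching fails.
        if flag is None:
            return None
        p, t, b = flag
        extra = len(bindings[b]) + 1
        if t + extra > len(tokens):
            return None
        del bindings[b + 1:]          # discard anything matched after the flag
        bindings[b] = list(tokens[t:t + extra])
        p, t = p + 1, t + extra

print(match("?+?", list("1+2+3")))   # [['1'], ['2', '+', '3']]
```

Note that only the newest flag is ever kept: when the second wildcard is planted, the first one is never revisited, which is exactly what keeps the running time from growing exponentially with the number of wildcards.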
Jessé



Joined: 03 May 2025
Posts: 49
Jessé 25 May 2025, 00:29
Very nice explanation about my favorite (so far) fasmg directive.
This thing is incredible.
It took me some time to properly understand what it does (I used to think it was something like 'if'; now I know it is much, much more), but when I finally realized it, I thought: wow, this is amazing! Brilliant!
Roman



Joined: 21 Apr 2012
Posts: 1945
Roman 25 May 2025, 04:31
Sometimes MATCH can be used to fix or mask some fasm collisions and inner contradictions.

Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.