Macros for parsing and data retrieval by Massimiliano Tomassoli 2009 (v. 1.0) ------------------- Regular Expressions ------------------- A language is just a set of sentences. This set can be defined in many ways. The obvious way is to list all its elements. Another way to use a regular expression. Note that regular expressions can represent only a limited family of languages, but we will soon get rid of that limitation. A sentence of a language is a sequence of symbols chosen from a given alphabet. Due to Fasm's fixed tokenization, our alphabet will be composed of tokens. But for now let's stick to the usual theory. Let's see a few regular languages over the alphabet {a,b,c,...,z} and their respective regular expressions. 1) L1 = {a, aa, ab} re1 = a|a~a|a~b | is the "or" operator and ~ the "and" operator. Since ~ has the highest precedence, no brackets are needed and we may also write re1 = a|aa|ab What it means should be obvious: re1 accepts sentences that a) are equal to "a" or b) are equal to "aa" or c) are equal to "ab" Regular expressions are not unique. Here's another one equivalent to the first: re1b = a(blank|a|b) Let's define "blank" as the neutral element of the operator ~, i.e. ~blank = and blank~ = (as in 5*1 = 1*5 = 5). re1b accepts sentences that starts with "a" followed (only) by a) nothing or b) "a" or c) "b". Of course, the brackets are special characters which don't belong to our alphabet so we can use them safely. 2) L2 = {a, aa, aaa, aaaa, aaaaa, ...} Yes: L2 is an infinite language. We cannot describe it just with | and ~. We need the operator * defined this way: * = blank||||... It might seem like a joke, but when sub is a complex subexpression things get interesting. re2 = a~(a*) = aa* Another option is re2b = a+ where + is defined as follows: + = * The operators seen so far listed in increasing order of precedence are: | ~ *,+ (same precedence) 3) Let's add " " (space) to our alphabet. L3 = {push eax, push ebx, push ecx, push edx, push eax eax, push eax ebx, ... push edx edx, push eax eax eax, push eax eax ebx, ... push edx edx edx} re3 = push( e(a|b|c|d|)x)+ Note that "push (" and "push(" are two different things: spaces are honored in our alphabet. I admit that the definition of L3 is a little bit ambiguous for what concerns spaces, but these are just minor details. Also note that re3 is a formal definition of (my idea of) L3 while the first one is not because "..." is ambiguous. --------------- Parsing in Fasm --------------- There are three important "variables": curExpr it's equal to the expression to be parsed; readSuccess it's 1 if the last parsing was successful, 0 otherwise; outData it's equal to the output of the parsing. Parsing commands "consume" curExpr and write their output to outData. I will also say that a command "returns" something. If readSuccess is 1 when a parsing command CMD is invoked, CMD starts to parse curExpr. If CMD fails, it sets readSuccess to 0 and outData to blank (with "outData equ"), otherwise it consumes curExpr and update outData. If readSuccess is 0 when CMD is invoked, CMD modifies NEITHER readSuccess NOR curExpr. outData might or might not be modified, therefore do NOT make particular assumptions about it. No parsing command will ever set readSuccess to 1 if it was 0 when the command was issued. There are many commands (they're macros): readCCurly readOpenAngle readClosedAngle readAtom name readEmpty readEnd readSymbol readQuoted readVar varName assertEqual command, [names] assertDiff command, [names] readUntilPossible command readOneThenUntilPossible command andRead [commands] orRead [commands] alsoRead [commands] thruRead [commands] optRead command saveOutData dest, command appendOutData dest, command expandDef defName Please note that any kind of parsing carried out by these macros cannot override in any way Fasm behaviour. For instance, "ebx" will be parsed as a single symbol, no matter what you do. On the other hand, "eax,3" will always be parsed as three symbols. Let's start with something easy. readSuccess equ 1 curExpr equ push eax readSymbol As I told before, if readSuccess is 0, readAtom do nothing, so the first line is needed. The expression to be parsed is "push eax". The instruction readAtom reads a symbol, extracts it from curExpr and returns it in outData. In the example above, after readSymbol, curExpr = eax outData = push readSuccess = 1 readSymbol fails only if called when curExpr is blank. The piece of code readSuccess equ 1 curExpr equ push eax readSymbol readSymbol does exactly what you think: it reads two symbol, extract them from curExpr and put the last one (the first is overwritten) in outData. readSuccess won't be modified by readSymbol. In other words, after that piece of code, curExpr = ; blank outData = eax readSuccess = 1 Now consider a similar piece of code: readSuccess equ 1 curExpr equ push eax andRead readSymbol, readSymbol Macro andRead invokes the commands passed as arguments one by one but only while readSuccess is 1. For instance, if the first ReadSymbol fails, the second one is not invoked. To apreciate the difference, let's consider the following piece of code: readSuccess equ 1 curExpr equ eax ebx readAtom eax readAtom ecx From now on I'll say that a command "accepts" an expression if it successfully parses it. readAtom only accepts the expression . In particular, "readAtom eax" only accepts "eax". This is what happens: readSuccess equ 1 curExpr equ eax ebx readAtom eax { curExpr equ ebx outData equ eax } readAtom ecx { outData equ readSuccess equ 0 } Now consider this different piece of code: readSuccess equ 1 curExpr equ eax ebx readAtom eax readAtom ebx What happens is this: readSuccess equ 1 curExpr equ eax ebx readAtom eax { curExpr equ ebx outData equ eax } readAtom ebx { curExpr equ outData equ ebx } The main problem is that now outData is equal to "ebx" and not "eax ebx" as one would expect. The right solution is to use andRead: readSuccess equ 1 curExpr equ eax ebx andRead readAtom eax, readAtom ebx This gives us the expected result: curExpr equ outData equ eax ebx andRead does another very important thing: if it fails it doesn't modify curExpr at all. That is, andRead becomes an atomic command and, as such, complies with the simple rules of parsing commands. Let's be pedantic: "readAtom eax" is a command, "readAtom eax readAtom ebx" are two commands, "andRead readAtom eax, readAtom ebx" is a SINGLE (compound) command. ------------------------------- Regular Expression with symbols ------------------------------- Since FASM does tokenization for us, we have to use an infinite alphabet made of symbols! But there are things we can't do. For instance, we can't accepts all and only the languages made of words which starts with the letter "a". The word "abacus" is read as a single symbol so there is no way we can recognize it as a word that starts with the letter "a". All we know is that "abacus" is different from "chopper" and equal to "abacus" (obviously). But we have no problems with special symbols like ( ) [ ] { } < > , + * ~, etc... To tell you the truth I did have some problem with } > and commas... ----------------------------------- Commands with a (brief) description ----------------------------------- All commands, if successful, remove the subexpression accepted from curExpr and returns some output in outData (they usually returns what they extracted from curExpr). If they fail, they set outData to the empty string (with "outData equ"). readCCurly Accepts and returns the closed curly brace '}'. readOpenAngle Accepts and returns the open angle bracket '<'. readClosedAngle Accepts and returns the closed angle bracket '>'. readAtom name Accepts and returns the symbol name. For instance readAtom apple accepts and returns "apple" readEmpty Does nothing (accepts the empty string). readEnd Like the other commands, if readSuccess is 0 it does nothing. Otherwise, it sets outData to the empty string and, if curExpr is NOT blank, also sets readSuccess to 0. readSymbol Accepts and returns any (non empty) single symbol. readQuoted Accepts and returns quoted strings (e.g. "this is a quoted string"). readVar varName This is an exception. It doesn't modify curExpr. If the variable var@varName is undefined, it does nothing; otherwise, it set outData to the value of var@varName. assertEqual command, [names] Invokes command and, if the value of outData is NOT in (the list of) names, restores curExpr, blanks outData and sets readSuccess to 0. assertDiff command, [names] Invokes command and, if the value of outData is in (the list of) names, restores curExpr, blanks outData and sets readSuccess to 0. readUntilPossible command Invoke command until it fails, then set readSuccess to 1 and returns the concatenation of the strings returned by all the successful invocation of command. readOneThenUntilPossible command It's equivalent to andRead command, readUntilPossible command andRead [commands] Invokes the commands one by one and succeeds only if every single command succeeds. When a command fails, subsequent commands are NOT invoked (short-circuit evaluation). If it succeeds it returns the concatenation of the strings returned by single commands. If it fails, it restore curExpr, blanks outData and sets readSuccess to 0. orRead [commands] Invokes the commands one by one until there are no more commands or one command succeeds. orRead succeeds only if one command succeeds (subsequent commands are ignored). If it succeeds it returns the string returned by the successful command. If it fails, it restore curExpr, blanks outData and sets readSuccess to 0. alsoRead [commands] Invokes all the commands whether readSuccess is 1 or not. It does the same even if readSuccess is initially 0. thruRead [commands] It's like andRead but, if successful, returns the string returned by the last command. optRead command If readSuccess is 1, invokes command and set readSuccess to 1. saveOutData dest, command Invokes command and, if successful, saves the string returned by it in the variable var@dest. appendOutData dest, command It's like saveOutData but the string, enclosed within < >, is appended to the content of var@dest. expandDef defName This command expands a rule so that recursion is possible during parsing. I'll talk about this later. ------------- Some examples ------------- Let's consider the languages L1, L2 and L3. 1) L1 = {a, aa, ab} re1 = a|a~a|a~b cmd1 = orRead readAtom a, readAtom aa, readAtom ab Note that cmd1 doesn't fail if curExpr isn't a word in L1. It only fails if curExpr doesn't start with a word in L1. We should also check that curExpr is been totally consumed: cmd1 = andRead , readEnd As I've already said, Fasm parses "aa" as a single symbol, then there is no explicit AND between a and a: "aa" is just an atom. But if we wanted to emulate the classic behaviour, we could separate alphabet letters with spaces. For instance, curExpr equ a b a c d e 2) L2 = {a, aa, aaa, aaaa, aaaaa, ...} re2 = a~(a*) = aa* re2b = a+ We can't do this with Fasm, so let's consider a variation of L2: L2b = {a, a a, a a a, a a a a, a a a a a, ...} cmd2 = andRead readAtom a, readUntilPossible readAtom a cmd2b = readOneThenUntilPossible readAtom a Note that in cmd2 andRead takes two arguments. 3) Let's add " " (space) to our alphabet. L3 = {push eax, push ebx, push ecx, push edx, push eax eax, push eax ebx, ... push edx edx, push eax eax eax, push eax eax ebx, ... push edx edx edx} re3 = push( e(a|b|c|d|)x)+ Here we can't parse registers character-wise but that's not much of a problem: cmd3 = andRead \ readAtom push, \ > Here's a technical note for you. The commands andRead and orRead accept a list of comma-separated commands. When a command has an argument that contains some commas, we have to enclose that argument within angle brackets (otherwise it will be interpreted as a list of arguments). If that command -- let's call it CMD -- is an argument to another command we have to enclose CMD within angle brackets as well. For instance, andRead readAtom eax, readAtom ebx is OK, but readUntilPossible andRead readAtom eax, readAtom ebx is not. Even if readUntilPossible took a list of arguments, it would be incorrect anyway, because "andRead readAtom eax, readAtom ebx" is a SINGLE command. Therefore we must write readUntilPossible If this last expression is an argument to, say, orRead we need another pair of angle brackets: andRead readAtom eax, \ > because commands which takes a list of arguments doesn't like angular brackets in the middle of single arguments either. ------------------ Algebraic commands ------------------ It's clear that things get awkward very quickly as expressions become longer and longer. The solution is to use algebraic commands. Compare: 1) L1 = {a, aa, ab} re1 = a|a~a|a~b cmd1 = orRead readAtom a, readAtom aa, readAtom ab alg1 = a|aa|ab 2) L2b = {a, a a, a a a, a a a a, a a a a a, ...} cmd2 = andRead readAtom a, readUntilPossible readAtom a alg2 = a a* cmd2b = readOneThenUntilPossible readAtom a alg2b = a+ 3) Let's add " " (space) to our alphabet. L3 = {push eax, push ebx, push ecx, push edx, push eax eax, push eax ebx, ... push edx edx, push eax eax eax, push eax eax ebx, ... push edx edx edx} re3 = push( e(a|b|c|d|)x)+ cmd3 = andRead \ readAtom push, \ > alg3 = push(eax|ebx|ecx|edx)+ Here are all the binary operators listed in increasing order of precedence together with their definition: operator = expr1 = expr2 = ... = exprn means thruRead , , ..., operator , expr1, expr2, ..., exprn means alsoRead , , ..., operator | expr1 | expr2 | ... | exprn means orRead , , ..., operator ~ expr1 ~ expr2 ~ ... ~ exprn or expr1 expr2 ... exprn mean andRead , , ..., Unary operators are applied starting from the left. For instance, a op1 op2 op3 op4 corresponds to op4(op3(op2(op1(a)))) Unary operators always take precedence over binary operators. ---------- Important! ---------- Please note that some unary operators are quite special, in fact they take two operands! I call binary only those operators which can take two arbitrary sub-expressions. Special unary operators take only one subexpression and an auxiliary parameter that cannot be a sub-expression. Here are the unary operators and their definition: operator : expr:varName means saveOutData varName, operator > expr:varName means appendOutData varName, operator _in_ expr _in_ {name1, name2, ..., namen} means assertEqual , , , ..., operator _notin_ expr _notin_ {name1, name2, ..., namen} means assertDiff , , , ..., operator * expr* means readUntilPossible operator + expr+ means readOneThenUntilPossible operator & &varName means readVar varName There are also some special unary or ternary operators (it depends on your point of view): operator ( ) (expr) means operator [ ] [expr] means optRead operator { } {nameDef} means expandDef nameDef operator < > means cmd There are some details I omitted to simplify the discussion but it's time to talk about them. ---------------------------- Special symbols and escaping ---------------------------- What happens if we want to parse the string mov [eax] ? We can't just write the algebraic parsing expression mov [eax] because [ ] is a special algebraic operator. The correct expression is mov /[eax/] where / is the escape character. To parse the symbol / itself, use // What about list {eax, ebx} ? You can't write list /{eax/,ebx/} because that closed curly brace interferes with my macros. The problem is not in the algebraic expression but in the commands which are generated: they won't work. For instance, readAtom } will never work. Do you remember the command readCCurly? You'll need to use the algebraic expression list /{eax/,ebx _CC_ where _CC_ is a special symbol for readCCurly (CC means Closed Curly). Of course, you can escape _CC_ by writing /_CC_ Here are all the available special symbols: _E_ becomes readEmpty _END_ becomes readEnd _S_ becomes readSymbol _Q_ becomes readQuoted _CC_ becomes readCCurly /< becomes readOpenAngle /> becomes readClosedAngle The last two are just escaped symbols, but they corresponds to special commands (readAtom won't work for them). Please note that all this does NOT apply to non-expression arguments. The commands assertEqual and assertDiff expect a list of names as second argument. This list cannot contain the three symbols } < > You must use, respectively, _CC_ _OA_ _CA_ Also note that you can't use complex names such as _CC_ something else Those special symbols are only recognized when used alone. In the future I might eliminate those restrictions. For now, I don't think they are much of a problem. ------------------------------------ Translation of algebraic expressions ------------------------------------ Algebraic expressions need to be translated into plain commands like the one we used before. This is done by the macro translateRules. Here is the format: translateRules ruleName1, ruleName2, ..., ruleNamen translateRules will create n variables containing the translations. Instead of giving you more details, we'll look at an example and I'll discuss it. @rule1 equ push~eax @rule2 equ push~ebx translateRules rule1, rule2 In this case translateRules does the following: cmd@rule1 equ andRead readAtom push, readAtom eax cmd@rule2 equ andRead readAtom push, readAtom ebx Now you can use the two commands as follows: readSuccess equ 1 curExpr equ push eax cmd@rule1 Note that if a rule is called , we must put its algebraic definition into the symbol @ and its translation will be put into the symbol cmd@ --------- Variables --------- The commands saveOutData and appendOutData can be used to save return values to variables. Real variable names always begin with the prefix "var@", but are referred to without it. For instance, if you call a variable "addr", its real name will be "var@addr". This is what we said before: saveOutData dest, command Invokes command and, if successful, saves the string returned by it in the variable dest. appendOutData dest, command It's like saveOutData but the string is not simply saved but appended to the content of dest. Now I'll give you the missing details. If you use the two commands above (or the two corresponding : and > algebraic operators), you need to set toSaveVarNames to the list of variables you want to use. Why is all this needed? Let's assume we want to parse a complex expression like mov eax, [eax+ebx*4 + 65h] As the expression is being parsed, many returned values are saved to our variables. What happens if the parsing fails? We have to restore the variables to their previous value! Here's an easy example: var@reg = curExpr equ mov eax, eddx @rule1 equ mov (eax|edx)>reg /, (eax|edx)>reg This is the first time we've seen > (i.e. appendOutData) in a real situation. Here's cmd@rule1 after the translation: andRead readAtom mov, >, >, > This is what happens if we invoke cmd@rule1: ; Note that var@reg = andRead { globalOutData = readAtom mov --> OK curExpr = eax, eddx outData = mov globalOutData = mov saveOutData reg, { orRead { readAtom eax --> OK curExpr = , eddx outData = eax readAtom edx --> SKIPPED } var@reg = , } globalOutData = mov eax readAtom <,> --> OK curExpr = eddx outData = , globalOutData = mov eax, saveOutData reg, { orRead { readAtom eax --> failed outData = readAtom edx --> failed outData = } readSuccess = 0 } globalOutData = curExpr = mov eax, eddx var@reg = } If andRead didn't restore var@reg to its previous value, it would now be equal to , and that's not what we want. For instance consider the following pseudo-code: orRead , , Let's suppose that: 1) andRead1 modifies var@reg and then fails. 2) andRead2 modifies var@reg and then fails. 3) andRead3 modifies var@reg and succeeds. Without proper saving and restoration, the value in var@reg would be WRONG, because it would also contain partial results written by the first two andRead. In a nutshell, you must put the name of the variables you're going to use in toSaveVarNames otherwise your variables won't be restored when needed. Luckily, matchExpr already does this for us. Its prototype is macro matchExpr mainRuleName, outVarNames where mainRuleName is the startingRule (more about this later) and outVarNames is the list of the names of the variables you want to use. Here are two examples: 1) @rule1 equ push~eax translateRules rule1 curExpr equ push eax matchExpr rule1, match =1 , readSuccess { display "match1!",13,10 } 2) @rule1 equ (add|sub):instr _S_:destOp /, _S_:srcOp translateRules rule1 curExpr equ add eax, 654h matchExpr rule1, If you want, you may call clearVars to "undefine" variables. For instance, clearVars var1, var2, var3 will undefine the variables var@var1 var@var2 var@var3 i.e. it will use "restore" until these three last names become undefined. ------------------- Rules and recursion ------------------- As we have already said, if a rule is called, say, MyRule it must be defined as follows: @MyRule equ where is an arbitrary algebraic expression. DO NOT FORGET THE CHARACTER @! IT'S REALLY IMPORTANT! You refer to that rule by its name MyRule but the algebraic expression is @MyRule and the command expression generated by translateRules is cmd@MyRule Even when you don't use translateRules or matchExpr you should stick to those directions otherwise your code might not work, especially if you use recursion. The command expandDef has the following prototype: expandDef defName where defName is the name of a definition, i.e. a rule (I should have changed some names for consistency, but I got used to them). What expandDef does is quite simple: it executes the command defName refers to. For instance, expandDef MyRule will executes the command cmd@MyRule In algebraic expressions you may use {MyRule} This is translated into expandDef MyRule expandDef is quite simple but it makes our parsing macros enormously powerful. Now we can define grammars and go well beyond what simple regular expressions can offer. ------------- User commands ------------- When you write plain commands manually it's easy to insert your own macros. For instance, you might write: macro cmd@myCommand { ... } cmd@rule1 equ andRead readAtom a, cmd@myCommand When you use algebraic expressions, you may use the operator < > as follows: macro cmd@myCommand { ... } @rule1 equ a Note that we omit cmd@ when we refer to the command. ---------------------------------- Putting it all together: examples! ---------------------------------- Instead of describing every single command in every possible detail, we'll see a long list of examples and I'll point out what I think is important. Brace yourself because the last examples are quite complex. --------- Example 1 --------- @rule1 equ push~eax translateRules rule1 curExpr equ push eax matchExpr rule1, match =1 , readSuccess { display "match1!",13,10 } This is very easy, isn't it? Note that @rule1 equ push eax would work as well. --------- Example 2 --------- @rule1 equ (mov|push)(eax|ebx) translateRules rule1 irp expr, push eax, push ebx, mov eax, mov ebx { curExpr equ expr matchExpr rule1, \match =1 , readSuccess \{ display "match2!",13,10 \} } In this example we (successfully) parse the four expressions push eax push ebx mov eax mov ebx one by one. --------- Example 3 --------- @rule1 equ (mov eax /, _S_) | (push~ebx) translateRules rule1 curExpr equ mov eax, 654h matchExpr rule1, match =1 , readSuccess { display "match3!",13,10 } Here we use _S_ to read a generic symbol and we escape the operator comma (,) with the escape operator /. The operator ~, as always, can be omitted. --------- Example 4 --------- ; Escaping. @rule1 equ mov /~ // /~ eax translateRules rule1 curExpr equ mov ~ /~eax, 654h matchExpr rule1, match =1 , readSuccess { display "match4!",13,10 } This is another example of escaping. Note that ~ is never interpreted as an operator in @rule1. --------- Example 5 --------- ; Data retrieval. @rule1 equ (add|sub):instr ~ _S_:destOp ~ /, ~ _S_:srcOp translateRules rule1 curExpr equ add eax, 654h matchExpr rule1, match =1 , readSuccess { display "match5: " \match a|b|c , var@instr|var@destOp|var@srcOp \{ display \`a," ",\`b,",",\`c,13,10 \} } clearVars instr, destOp, srcOp In this example we parse the string "add eax, 654h" like before, but this time we also save "add" in var@instr, "eax" in var@destOp and "654h" in var@srcOp. Note that matchExpr need to be passed the three variable names as a single arguments. Like we said before, clearVars clears the three variables, i.e. var@instr, var@destOp and var@srcOp are undefined symbols after that line. I've already said this briefly in the description of the commands: while saveOutData (:) overwrites variables with the exact output, appendOutData (>) append the output enclosed within < > and includes a comma if needed. For example, if var@myVar is firstValue and the next output is secondValue, saveOutData will set var@myVar to secondValue while appendOutData will set var@myVar to firstValue, If one variable is written to only by appendOutData, it will have the following form: , , , ..., --------- Example 6 --------- ; Extracts the symbols "tree", "oak", "pine", "/" and "," from the ; expression preserving the order in which they appear. @rule1 equ ((_S_ _notin_ {tree,oak,pine,/,<,>})* _S_>wordList)* translateRules rule1 curExpr equ armchair tree relax sun ,. | &/8 \ 563 sunday pine dog,oak,,,fridge... matchExpr rule1, wordList match =1 wList, readSuccess var@wordList { display "match6: " \irp w, wList \{ display \`w," " \} display 13,10 } clearVars wordList rule1 does, more or less, the following: while possible while next symbol is different from "tree", "oak", ..., or "," skip it skip the next symbol and append it (within < >) to var@wordList --------- Example 7 --------- ; Extracts the symbols "tree", "oak", "pine", "/" and "," from the ; expression preserving the order in which they appear (second version). @rule1 equ ((tree|oak|pine|/ /|/,)>wordList* _S_)* translateRules rule1 curExpr equ armchair tree relax sun ,. | &/8 \ 563 sunday pine dog,oak,,,fridge... matchExpr rule1, wordList match =1 wList, readSuccess var@wordList { display "match7: " \irp w, wList \{ display \`w," " \} display 13,10 } clearVars wordList This example is identical to previous one but for @rule1, which is equivalent but syntactically different. Note that this rule is more general because the current implementation of assertDiff (_not_in_) doesn't support general subexpressions. rule1 does, more or less, the following: while possible while the next symbol is "tree", "oak", ..., or "," skip it and append it (within < >) to var@wordList skip the next symbol ---------- Example 7b ---------- ; Extracts the symbols "tree", "pine", "/", "," and any word that begins with ; "sub-" from the expression preserving the order in which they appear ; (second version). @rule1 equ ((tree|sub/-_S_|pine|/ /|/,)>wordList* _S_)* translateRules rule1 curExpr equ armchair tree sub-aqua relax sun ,sub-zero. \ | &/8 563 sunday pine dog,oak,,,fridge... matchExpr rule1, wordList match =1 wList, readSuccess var@wordList { display "match17: " \irp w, wList \{ printStr w display " " \} display 13,10 } clearVars wordList It's clear that assertDiff (_not_in_) doesn't support the command readSymbol (_S_). --------- Example 8 --------- @someInstr equ mov|add|sub|xor|test|cmp @someRegs16 equ ax|bx|cx|dx @someRegs32 equ eax|ebx|ecx|edx @someRegs equ {someRegs16}|{someRegs32} @rule1 equ {someInstr}>__instr {someRegs}>__dest/, {someRegs}>__src translateRules someInstr, someRegs16, someRegs32, someRegs, rule1 curExpr equ sub eax, ecx matchExpr rule1, <__instr, __dest, __src> display "match8: " match =1 , readSuccess { display "OK!",13,10 stripAngBrackets var@__instr, var@__dest, var@__src var@__instr var@__dest, var@__src } match =0 , readSuccess { display "ERROR!",13,10 } clearVars __instr, __dest, __src Here we use expandDef ({ }) to refer to other rules. As you can see, the use of expandDef makes the code much more readable. Why do I use > instead of : ? That's a long story, but to make it short let's just say that there was no ':' where I wrote this example. Operator ':' makes more sense here because every variable is written to only once. Moreover, since > encloses values within angle brackets (< >), we must use something like stripAngBrackets, which does exactly what it says. The third line of code in the first match tell Fasm to assemble "sub eax, ecx". --------- Example 9 --------- ; Let's do some simple recursion. @rule1 equ NOTOK [{rule1}] ; [x] means that x is optional translateRules rule1 curExpr equ NOTOK NOTOK NOTOK NOTOK NOTOK OK! matchExpr rule1, match =1 unmExpr, readSuccess curExpr { display "match9: ", `unmExpr,13,10 } This is our first example of recursive definition or rule. We know that cmd@rule is andRead readAtom NOTOK, optRead expandDef rule1 What happens is that expandDef rule1 when executed, executes the command andRead readAtom NOTOK, optRead expandDef rule1 so @rule1 is equivalent to @rule1 equ NOTOK+ Note that optRead is needed otherwise matchExpr will always fail. Indeed there is no finite string which contains infinitely many occurrences of "NOTOK"! Now the fun begins. ---------- Example 10 ---------- ; Let's do some simple recursion! (PHASE 1) @atom equ a|b|c @binOpSum equ {atom}((/+|-){atom})* @mainRule equ {binOpSum} translateRules atom, binOpSum, mainRule curExpr equ a-b+c-a+b OK! matchExpr mainRule, match =1 unmExpr, readSuccess curExpr { display "match10: ", `unmExpr,13,10 } The rules are simple to read: an atom is "a", "b" or "c"; an "additive expression" is an atom or a sequence of more atoms separated by + or -; the main expression is an "additive expression". Here we translate more rules, but start from one rule (ovbiously). Note that when the parsing ends cursExpr is equal to "OK!". That is exactly what that code will display if everything's OK. Now note that something like @algExpr equ /({algExpr}/) | {algExpr}/+{algExpr} | x|y|z won't work because if /({algExpr}/) fails, {algExpr}/+{algExpr} is processed and therefore its first part ({algExpr}) is expanded. The result is @algExpr equ /({algExpr}/) | {algExpr}/+{algExpr} | x|y|z Now /({algExpr}/) fails again and here we are again... infinite loop! ---------- Example 11 ---------- ; Let's do some simple recursion! (PHASE 2) @atom equ a|b|c|d|e|f @binOpSum equ {binOpProd}((/+|-){binOpProd})* @binOpProd equ {atom}((/*|//){atom})* ; */ (C++ editor workaround!) @mainRule equ {binOpSum} translateRules atom, binOpSum, binOpProd, mainRule curExpr equ a*b*c + c*d + e*f OK! matchExpr mainRule, match =1 unmExpr, readSuccess curExpr { display "match11: ", `unmExpr,13,10 } These rules tell us that: an atom is "a", "b", "c", "d", "e" or "f"; an "additive expression" is a "multiplicative expression" or a sequence of more "multiplicative expressions" separated by + or -; a "multiplicative expression" is an atom or a sequence of more atoms separated by * or /; the main expression is an "additive expression". Why are "additive expressions" defined as above? Because that way multiplication (and division) has precedence over addition (and subtraction). If you're not sure, spend some time convincing yourself of this! ---------- Example 12 ---------- ; Let's do some simple recursion! (PHASE 3) @atom equ a|b|c|d|e|f @operand equ /({binOpSum}/) | {atom} ; handles sub-expressions! @binOpSum equ {binOpProd}((/+|-){binOpProd})* @binOpProd equ {operand}((/*|//){operand})* @mainRule equ {binOpSum} translateRules atom, operand, binOpSum, binOpProd, mainRule ;curExpr equ (a*b*c + d +) c*(d+e)*f OK! ; <-- wrong expression! curExpr equ (a*b*c + d) + c*(d+e)*f OK! ; <-- correct: this works! matchExpr mainRule, match =1 unmExpr, readSuccess curExpr { display "match12: ", `unmExpr,13,10 } Here we simply added the concept of sub-expression by inserting another "indirection": now * and / operators work on operands rather than directly on atoms. Operands are expressions enclosed within round brackets or atoms. I admit that I might have chosen better names for my rules... Note that the commented curExpr is wrong; indeed the second closed bracket is misplaced. As expected, if you use that string, the parsing will fail. ---------- Example 13 ---------- ; Let's do some simple recursion! (PHASE 4) @atom equ a|b|c|d|e|f @operand equ /({binOpSum}/) | {atom} ; handles sub-expressions! @unaryOps equ {operand} !* @binOpSum equ {binOpProd}((/+|-){binOpProd})* @binOpProd equ {unaryOps}((/*|//){unaryOps})* @mainRule equ {binOpSum} ;translateRules atom, operand, unaryOps, binOpSum, binOpProd, mainRule ;curExpr equ (a*b*c + d +) c*(d+e)*f OK! ; <-- wrong expression! curExpr equ (a*b*c + d ! ! !) + c*(d+e)*f OK! ; <-- correct: this works! ; IMPORTANT: ; ! is not a special symbol to FASM, so a! will be recognized as a single ; symbol. ; However, (a !) will work just fine because ) is a special symbol and ; thus !) are always read as separate symbols. matchExpr mainRule, match =1 unmExpr, readSuccess curExpr { display "match13: ", `unmExpr,13,10 } We added another level of "indirection": * and / operators work not on operands but on "unary operator expressions". A "unary operator expression" is an operand followed by zero, one or more unary operators. In this case, we just added the unary operator factorial (!). Make sure to read the important note in the code. ---------- Example 14 ---------- ; Let's do some simple recursion! (PHASE 5) @atom equ a|b|c|d|e|f @operand equ /({binOpSum}/) | {atom} ; handles sub-expressions! @unaryOps equ {operand} !* @binOpSum equ {binOpProd}((/+|-){binOpProd})* @binOpProd equ {defaultOp}((/*|//){defaultOp})* @defaultOp equ {unaryOps}({unaryOps})* @mainRule equ {binOpSum} translateRules atom, operand, unaryOps, binOpSum, binOpProd, \ defaultOp, mainRule ;curExpr equ (a*b*c + d +) c*(d+e)*f OK! ; <-- wrong expression! curExpr equ (a*b*c + d e + d ! ! !) + c(d+e)f OK! ; <-- correct! ; IMPORTANT: ; ! is not a special symbol to FASM, so a! will be recognized as a single ; symbol. ; However, (a !) will work just fine because ) is a special symbol and ; thus !) are always read separate symbols. matchExpr mainRule, match =1 unmExpr, readSuccess curExpr { display "match14: ", `unmExpr,13,10 } What we did is quite simple: we made * the default operator, that is, * can be omitted (like ~ in our algebraic expressions). In example 13 binOpProd is defined as an expression composed of "unary operator expressions" connected by * or /. Now binOpProd is defined as an expression of "default operator expressions" connected by * or /. A "default operator expression" (@defaultOp) is an expression composed of "unary operator expressions" juxtaposed i.e. connected by an "invisible" or, better, understood operator, i.e. our default operator *. ---------- Example 15 ---------- macro cmd@initLocals { var@curData equ var@curOp equ var@opd2 equ } macro cmd@saveLocals { savedCurData equ var@curData savedCurOp equ var@curOp savedOpd2 equ var@opd2 } macro cmd@updCurData { var@curData equ var@curOp var@curData var@opd2 } macro cmd@restoreLocals { var@curData equ savedCurData var@curOp equ savedCurOp var@opd2 equ savedOpd2 restore savedCurData restore savedCurOp restore savedOpd2 } ; Let's do some simple recursion! (PHASE 6 (PHASE 1 REVISITED)) @atom equ a|b|c @binOpSum equ , \ ({atom}:curData \ ( (/+|-):curOp {atom}:opd2 )* \ = &curData), \ @mainRule equ , {binOpSum} translateRules atom, binOpSum, mainRule curExpr equ a+c-b-a OK! matchExpr mainRule, match =1 unmExpr, readSuccess curExpr { display "match15: outData = " printStrN outData } We want to take an algebraic expression and rewrite it in Polish notation. Polish notation is very simple: a becomes a a bin_op b becomes bin_op a b a unary_op becomes unary_op a So, for instance, 5+6 becomes +5 6 while 5+6*7 becomes +5*6 7 Think of it as a tree: + 5 * 6 7 and now walk the tree in pre-order, i.e. root-left-right: +5*6 7 as expected. The expression (1+2)3! + 4 can be represented by the following tree: + * 4 + ! 1 2 3 Therefore it can be written as +*+1 2!3 4 Now remember that @binOpSum was @binOpSum equ {atom}((/+|-){atom})* in phase 1. Here we want to save what we read, so we decide to use three variables: curData it contains the partial output curOp it contains the last seen operator opd2 it contains the last seen right operand We end up with @binOpSum equ {atom}:curData((/+|-):curOp {atom}:opd2)* There's a problem: we know that binOpSum is to be called recursively in the complete solution. What happens to our three variables? In the final solution, that two {atom} will be substituted for two {binOpProd} which will probably invoke binOpSum at some level of the recursion. When this happen, our three variables are overwritten. What would happen if a recursive function used only global variables instead of local ones? Here the same thing would happen. One solution is to save the values of the "local variables" and then restore them. We decide that whoever modifies the values must save and restore the old values. The operator alsoRead (,) executes a sequence of commands one after another. It doesn't do anything special. We use < > to call our own commands: @binOpSum equ , \ {atom}:curData((/+|-):curOp {atom}:opd2)*, \ What we do is this: 1) we save the "local variables" 2) we use the "local variables" 3) we restore the "local variables" This way we don't interfere with the "caller" which is probably just using the very same "local variables". Note that cmd@restoreLocals uses the directive "restore" on the global variables we use to save our fake local variables. Why don't we use "restore" on the local variables itself? The problem is that we don't know how many times our local variables are written to. If the expression is operand1 * operand2 we modify curData twice, but if the expression is operand1 * operand2 * ... * operandn we modify it n times! But we know we modify the global variables savedCurData savedCurOp savedOpd2 only once, i.e. when we call saveLocals. Now there's another problem. Our @binOpSum returns what its second line returns because our macros saveLocals and restoreLocals don't modify outData. But aren't we supposed to read operand1 op operand2 and return op operand1 operand2 ? That is how we (recursively) defined the Polish notation, after all. We could proceed in many ways but what I chose is this: curData = left_operand while we come across an operator and a right operand (coupled) curData = operator curData right_operand This means that a+b+c becomes ++a b c and, in general, a+b+c+...+z becomes ++...+a b c ... z This is perfectly in accordance with our definition of Polish notation. How do we do that? That {atom}:curData writes the left operand to curData. That's exactly what we wanted. Now we modify ((/+|-):curOp {atom}:opd2)* in this way: ((/+|-):curOp {atom}:opd2 )* What we did was just add a call to updCurData, an our own macro. Here's the macro: macro cmd@updCurData { var@curData equ var@curOp var@curData var@opd2 display "updCurData",13,10 } What it does should be perfectly obvious. It modifies curData exactly like we said in the pseudo code above: "curData = operator curData right_operand". Like we said, our @binOpSum returns what its second line returns: it returns what it read! But we want to return curData! One solution is to use thruRead. Its brief description says: thruRead [commands] It's like andRead but, if successful, returns the string returned by the last command. This means that if we have an expression MAGIC_EXPR that returns the value of curData and we write ((/+|-):curOp {atom}:opd2 )* = MAGIC_EXPR we are done. Lucky for us, readVariable curData is that expression. In algebraic form it becomes &curData This is our final rule: @binOpSum equ , \ ({atom}:curData \ ( (/+|-):curOp {atom}:opd2 )* \ = &curData), \ Note that we add two round brackets because ',' (i.e. alsoRead) has precedence over '=' (i.e. thruRead). ------------------------------------ Example 16 (putting it all together) ------------------------------------ macro cmd@initLocals { var@curData equ var@curOp equ var@opd2 equ } macro cmd@saveLocals { savedCurData equ var@curData savedCurOp equ var@curOp savedOpd2 equ var@opd2 } macro cmd@updCurData { var@curData equ var@curOp var@curData var@opd2 } macro cmd@updCurDataProd { var@curData equ * var@curData var@opd2 } macro cmd@updCurDataFact { var@curData equ ! var@curData } macro cmd@restoreLocals { var@curData equ savedCurData var@curOp equ savedCurOp var@opd2 equ savedOpd2 restore savedCurData restore savedCurOp restore savedOpd2 } ; Let's do some simple recursion! (PHASE 7) @atom equ a|b|c|d|e|f @subExpr equ , \ (/({binOpSum}:curData/) = &curData), \ @operand equ {subExpr} | {atom} @unaryOps equ , \ ({operand}:curData (! )* = &curData), \ @binOpSum equ , \ ({binOpProd}:curData \ ( (/+|-):curOp {binOpProd}:opd2 )* \ = &curData), \ @binOpProd equ , \ ({defaultOp}:curData \ ( (/*|//):curOp {defaultOp}:opd2 )* \ = &curData), \ @defaultOp equ , \ ({unaryOps}:curData \ ( {unaryOps}:opd2 )* \ = &curData), \ @mainRule equ {binOpSum} translateRules atom, subExpr, operand, unaryOps, binOpSum, \ binOpProd, defaultOp, mainRule ;printStrN cmd@defaultOp ;curExpr equ (a*b*c + d +) c*(d+e)*f OK! ; <-- wrong expression! curExpr equ (a*b*c + d e + d ! ! !) + c(d+e)f OK! ; <-- correct! ; IMPORTANT: ; ! is not a special symbol to FASM, so a! will be recognized as a single ; symbol. ; However, (a !) will work just fine because ) is a special symbol and ; thus !) are always read separate symbols. matchExpr mainRule, match =1 unmExpr, readSuccess curExpr { display "match16:",13,10 display " outData = " printStrN outData display " curExpr = " printStrN curExpr } If you *did* understand the last example, this one should be straightforward. The only thing to note is that we don't need to save all the three variables every time. For instance, we only need to save var@curData in @unaryOps. I use saveLocals and restoreLocals for simplicity. That's all for now!