Parser reference manual

1. Introduction

The parse_string() utility parses strings as follows: first, the string is
broken up into two types of smaller strings, called "tokens" and "whitespace".
Next, the resulting array of tokens (whitespace is ignored) is parsed as a
sentence in the language defined by a formal grammar.  For any valid parsings,
LPC functions are called as appropriate, which may transform the array of
tokens, or indicate that a particular parsing of the string is yet to be
considered invalid.  Finally, the resultant array of tokens, if any, is
returned by the function.

Regular expressions are used to define tokens and whitespace, and the language
of those tokens is defined by a context-free grammar, which can be enhanced
with LPC function calls.  Token/whitespace definitions and language definition
together are called "the grammar".

Internally, two partial automatons are generated for each grammar, one to
match tokens and one to parse a sentence of tokens.  Only those parts of the
automatons are generated that are required to parse the sentences seen so far;
this effect is cumulative for successive calls in each object, ensuring that
the automatons are not needlessly generated more than once.  Each object can
handle only one grammar at a time.  If a grammar differs from the previous
one used by the same object, any existing generated automatons are wiped out
and constructed anew.


2. Grammar

A grammar consists of token rules and production rules, which may be given in
any order.  Each grammar must contain at least one token rule and one
production rule.


2.1 Token rules

A normal token rule has the following format:

    token = /regexp/

where "token" is an identifier name as in LPC, and "regexp" a regular
expression, with the following syntax:

    c	    a single character except `.', `[', `\', `*', `+', `(', `)', `/'
    .	    any single character
    \c	    a single character c, including `.', `[', `\', `*', `+', `(', `)',
	    `/'
    [set]   a single character in the given set, which is constructed of
	    single characters such as `a' and/or character ranges such as
	    `a-z'.  `\' may be used to escape the characters `]', `^', `-', `\'
    [^set]  a single character not in the given set

and, with regular expressions "a" and "b":

    a*      zero or more occurrences of a	(highest precedence)
    a?      zero or one occurrences of a
    a+      one or more occurrences of a
    ab      the concatenation of a and b
    a|b     a or b
    (a)     a					(lowest precedence)

For any regular expression, the longest possible token will be matched.  The
name "whitespace" is reserved for defining a special token, which is simply
skipped.  More than one rule may be specified for each token, including
whitespace.

If a string matches more than token, the token for which the rule appears first
in the grammar is selected.  If a string does not match any token, it is
rejected and parsing fails.

A special token rule may be defined which matches anything not matched by
other rules:

    token = nomatch

where "nomatch" is literal text.  Nomatch rules can be very inefficient, and
their use should be avoided if possible.


2.2 Production rules

A production rule has the following format:

    symbol : rule

where "symbol" is an identifier name as in LPC for the production rule, and
"rule" consists of zero or more token symbols, production rule symbols, or
string constants delimited by single quotes.

A string constant in a production rule matches the given sequence of characters
directly, without reference to a token rule.  Since a token rule is implicitly
generated for such string constants, the string constant is also said to be a
token.  String constants take precedence over regular expressions, and any
token that matches both a string constant and a regular expression will always
match the string constant, only.  `\' may be used in a string constant to
escape the single quote.

More than one production rule may be given for the same symbol.  All
production rules taken together form a context-free grammar, for which the
start symbol is defined in the first rule.  Production rules that are
unreachable from the start symbol are ignored.  Any symbol used in a
production rule reachable from the start symbol must itself define a token
rule or a production rule.

Optionally, a rule can be followed by an LPC function name:

    symbol : rule ? function

where "function" is the name of an LPC function to be called in the current
object if the specified rule has been matched.


3. Parsing

Normally, parse_string() returns a flat array of token strings (if the input
string could be parsed as a sentence in the language defined by the grammar)
or nil (if the input string could not be parsed).  However, the return value
can be modified to an array containing arbitrary other values, including
sub-arrays, by specifying LPC functions in the grammar that will be called
while parsing.  Moreover, if the grammar is ambiguous, sub-arrays for the
alternative may be automatically included in the result.


3.1 LPC functions

An LPC function called for a production rule will have a single argument,
an array with the parse tree that matches the production rule.  LPC functions
are called in bottom-up order, so any function calls for sub-parse trees in
the current parse tree will already have been made.

The LPC function can return its argument unchanged, or modify its argument
before returning it, or return a completely different value.  Whatever the
return value is, it is taken to be an array matching the current rule; if
it is not an array at all, the current parse tree (and all parse trees that
depend on it) is discarded altogether.

LPC functions are typically used to replace tokens in the parse tree by
their semantic value, or to add structure to a parse tree, which otherwise
would be represented as a flat array.


3.2 Ambiguous grammars

A grammar that allows more than one parse tree for the same input string is
said to be "ambiguous".  The string parser will sort alternative parse trees
by the order of the topmost different rule in the grammar, and by default
ignore all of them but the first one.

An optional third argument to parse_string() can be used to retain an
additional number of alternative parse trees, which will be merged with the
first one; the argument specifies the number of extra alternatives to keep at
each point where sub-parse trees are handled by different rules in the grammar.
Instead of incorporating the first alternative immediately in the array
returned, a sub-array containing all of the alternatives as separate elements
is included, sorted by the order of their topmost rule in the grammar.

LPC functions called for rules can be used to invalidate alternative parse
trees.  Invalidated alternatives will be removed from the overall result.


3.3 Writing grammars

Grammars can be parsed most efficiently if they are not ambiguous.  Generally,
ambiguous grammars can be rewritten to be non-ambiguous.  For example,
consider the following (partial) grammar, which is ambiguous:

    Expr: number
    Expr: Expr '+' Expr

It can be rewritten to be either left-recursive,

    Expr: number
    Expr: Expr '+' number

or right-recursive:

    Expr: number
    Expr: number '+' Expr

Neither rewritten grammar is ambiguous, and both explicitly specify the order
in which expressions are to be evaluated: from left to right, or from right to
left, respectively.  If the order doesn't matter, left recursion should be
preferred, as this is parsed more efficiently.