The SID users' guide
- i. Introduction
- 1. Grammars
- 2. Overview
- 3. The sid grammar file
- 4. The C information file
- 4.1. Lexical conventions
- 4.2. The prefixes section
- 4.3. The persistent section
- 4.4. The maps section
- 4.5. The header section
- 4.6. The assignments section
- 4.7. The parameter assignments section
- 4.8. The result assignments section
- 4.9. The terminal result extraction section
- 4.10. The action definition section
- 4.11. The trailer section
- 5. Features
- A. Understanding error messages
- B. Advice on writing parsers with sid
© , , , The TenDRA Project.
© , DERA.
© , DRA.
First published .
Revision History
kate | Include libexds as a partial archive, for convenience of building. My rationalle here is that SID and TLD are likely to be the only consumers for libexds, and so requiring its installation as a library seems excessive. However, it may still be installed separately, should anybody want that. | |
kate | Removed error file support, and hence the build-time dependency on Perl. | |
kate | Terminals (and only terminals) may now be optionally quoted. This acts as a convenience for terminals with mundane names, such as punctation symbols. For example, one might write | |
kate | Types may now be ignored in a similar way to actions and terminals, by prefixing a Ignored types may be present in ignored terminals and ignored actions, but are dissallowed elsewhere. | |
kate | Added BNF output as The output simply dumps the grammar contents, disregarding all the entries which do not pertain to grammatical constructs. The empty set is represented as This is our first non-C language. | |
kate | Introduced a new expansion within C actions, Since it is output using the same mechanism as references to terminals elsewhere in the generated parser, it is subject to the same expansion rules. More notably, this includes | |
kate | Introduced a new language-specific option, This is unrelated to This defaults to | |
kate | Action declarations may now be optionally ignored in much the same way as terminals may be, by prefixing a %productions% !<an-ignored-action>; This is particularly useful for several grammars which share the same actions, where not all of the grammars make use of the actions provided. | |
kate | Added a new language-specific option: Omitting these is desirable if the user wishes to define her own values for terminals, rather than using the SID-supplied macros. This is especially useful for sharing several grammars with the same actions, where the terminal values are to be centralised between the various parsers. | |
kévin | Added persistant variable support for C actions. | |
kate | Tagged version 1.10. The first stand-alone release of sid seperately to the TenDRA compiler. It is a code clean-up release; features are unchanged from 1.9. | |
kate | Restructured for initial stand-alone release. The only user-visible change here is that -? is replaced with -h. An example of using sid with lex has been added. This is the first release of sid stand-alone under tendra.org. | |
asmodai | Moved out SID to a standalone tool. | |
asmodai | Converted to a new build system. | |
DERA | SID 1.9; TenDRA 4.1.2 release. | |
smf | Added support for an unreachable code macro (instead of a comment) in the C output languages. | |
smf | Froze version 1.8. | |
smf | Added assignment as an alternative to identity binding. | |
smf | Added initialisers for non local variables. | |
smf | Added support for non local variables within rules. Froze version 1.7. | |
smf | Added support for language specific options. Added options to control prototyping, use of numeric identifiers and casting of substituted parameters for C output language. | |
smf | Added support for explicit call by reference and action parameter mutation information (rather than the inconsistent function call semantics of earlier versions). | |
smf | Changed the format of the C definition file to provide a parameter assignment operation (this should have been in from the start for consistent semantics). Also, the prefix section, maps section and all three assignment operation sections are now optional. | |
smf | Improved error messages. Made dump file be updated with information from the phase that failed in the case of an error (this is not always useful though - the grammar may be in a useless state after a failed transform; left recursion elimination is particularly bad for this). | |
smf | Added a stricter ANSI C language that uses numeric identifiers to keep the identifier length below 32 characters. Froze version 1.6. | |
smf | Changed the syntax of the input files so that the grammar looks more like a conventional programming language. Removed basics (replacing them with terminal symbols). Added a header file to the C language output. | |
smf | Added "--show-errors" option to show the error table status, and a "--help" option to show the command line syntax. Froze version 1.5. | |
smf | Changed command line syntax to be compatible with the TDF linker. | |
smf | Froze version 1.4. | |
smf | Added | |
smf | Added support for anonymous rules and in alternative exception handlers. Fixed cycle detection routine used in needed function computation. Froze version 1.3. | |
smf | Changed syntax of code blocks in action file. SID now inlines all actions, basics, assignments etc. and performs the substitutions. Substitutions allow for identifier and label substitution, as well as exception raising. | |
smf | Put name identifiers in a separate namespace. Stopped it being an error if the "-version" switch is used with no files specified. | |
smf | Added a test mode language. Froze version 1.2. | |
smf | Added exception handling instead of old error reporting mechanism. | |
smf | Added option of inlining rules that only contain basics. Changed output routines, so that parameters and results of inlined functions are substituted, rather than created and assigned to. | |
smf | Changed identifier prefixes. Made grammar printing mark inlinable rules with a '+'. | |
smf | Added predicate support. Improved factoring to allow alternatives with the same dataflow, but different names to be factored. Allowed types to be defined that were only the result types of terminals. Stopped rules being declared as entry points to the grammar multiple times. | |
smf | Fixed the cycle detection routines used in tail recursion detection. Froze version 1.1. | |
smf | Fixed the cycle detection routines used in left recursion detection. Fixed bug in outputting code for see through alternatives. Sorted error lists to make comparisons easier. Modified grammar printing to mark tail calls. | |
smf | SID 1.0; Initial version of the new sid. Adds dataflow and marginally improved error handling to the original version of sid. Also fixes some bugs in the algorithms used in the earlier version of sid. Froze version 1.0. This was the first version of sid to support attribute grammars. The output language was C. |
i. Introduction
sid turns specifications of languages into programs that recognise those languages. One of the aims of sid was to separate the specification of the language to be recognised from the language that the recogniser program is written in. For this reason, input to sid is split into two components: output language independent information, and output language dependent information.
At present, sid will only output programs in C (either ISO or pre-ISO), but it is designed so that adding new output languages should be fairly simple. Output of the grammar to BNF is also provided. There is one other pseudo-language: the test language. This is used for testing grammars and the transforms, but will not output a parser.
This document describes how to use the sid parser generator. A little background on grammars is given, and how to write grammars suitable for use with SID. The interfaces for the generated code which SID produces are specified. A tour of SID's more interesting features is given, and some guiding advice for general use.
1. Grammars
1.1. Parsing
There are two phases in traditional language recognition that are relevant to sid: the first is lexical analysis (breaking the input up into terminal symbols); the second is syntax analysis or parsing (ensuring that the terminal symbols in the input are in the correct order).
sid currently does very little to help with lexical analysis. It is possible to use sid to produce the lexical analyser, but sid provides no real support for it at present. For now, the programmer is expected to write the lexical analyser, or use another tool to produce it.
The lexical analyser should break the input up into a series of terminals. Each of these terminals is allocated a number. In sid, these numbers range from zero to the maximum terminal number (specified in the grammar description). The terminals may also have data associated with them (e.g. the value of a number), known as the attributes of the terminal, or the results of the terminal.
sid generates the parser. The parser is a program that reads the sequence of terminals from the lexical analyser, and ensures that they form a valid input in the language being recognised. If the input is not valid, then the parser will fail (sid provides mechanisms to allow the parser to recover from errors).
1.2. Context free grammars
This section provides a brief introduction to grammars. It is not intended to be an in-depth guide to grammars, more an introduction which defines some terminology.
sid works with a subset of what are known as context free grammars. A context free grammar consists of a set of input symbols (known as terminals), a set of rules (descriptions of what are legal forms in the language, also known as non-terminals), and an entry point (the rule from which the grammar starts).
A rule is defined as a series of alternatives (throughout this document the definition of a rule is called a production - this may conflict with some other uses of the term). Each alternative consists of a sequence of items. An item can be a terminal or a rule. As an example (using the sid notation, which now looks unlike the conventional syntax for grammars), a comma separated list of integers could be specified as:
list-of-integers = { integer ; || integer ; comma ; list-of-integers ; } ;
This production contains two alternatives: the first matches the terminal integer
; the second matches the sequence of terminals integer
followed by comma
, followed by another list of integers.
There is much documentation available on context free grammars (and other types of formal grammar). The reader is advised to find an alternative source for more information.
1.3. sid grammars
sid grammars are based upon a subset of context free grammars, known as LL(1) grammars. The main property of such grammars is that the parser can always tell which alternative it is going to parse next by looking at the current terminal symbol. sid does a number of transforms to turn grammars that are not in this form into it (although it cannot turn all possible grammars into this form). It also provides facilities that allow the user to alter the control flow of the parser.
sid makes the following changes to the context free grammars described above:
There may be more than one entry point to the grammar.
As well as being a rule or a terminal, an item may be an action, a predicate or an identity. An action is just a user supplied function. A predicate is a cross between a basic and an action (it is a user defined function but it may alter the control flow). An identity is like an assignment in a conventional programming language.
Rules may take parameters and return results (as may actions; terminals may only return results). These may be passed between items using names.
Each rule can have an exception handler associated with it. Exception handlers are used for error recovery when the input being parsed does not match the grammar.
Rules may be anonymous.
Rules may have non-local names associated with them. These names are in scope for that rule and any rules that are defined inside it. The value of each non-local name is saved on entry to the rule, and restored when the rule is exited.
2. Overview
sid takes the input grammar and applies several transformations to it, before it produces the parser. These transformations allow the language descriptions to be written in a slightly more natural form than would otherwise be necessary.
2.1. Left recursion elimination
The first transformation is to eliminate any left recursive productions, replacing them with right recursive ones. This transforms a set of rules of the form:
Ai = Aj Bji, Ci
into:
Ai = Cj Xji Xji = Bjk Xki, Yji
where Yji
is the identity function if i
equals j
, and an error otherwise. In order for this transformation to work, the following conditions must hold:
-
The parameter type for all
Ai
must be the same. -
The result type for all
Ai
must be the same. -
The exception handlers for all
Ai
must be the same. -
The parameters to each left recursive call to some
Aj
must be exactly the formal parameters to the calling ruleAi
. -
Any non-local name referenced by any rule must be in scope for all rules.
-
A rule may not define non-local variables unless it is the only entry point into the cycle (i.e. there is only one named rule in the cycle).
sid will report an error if these conditions are not met.
2.2. Factoring
The second major transformation is to factor the grammar. It is possible that this phase could go on for ever, so there is a limit to the number of rules that can be generated. This limit may be changed (see the invocation section). In practice it is rare for this to happen.
The factoring phase tries to increase the number of language specifications that sid can produce a parser for, by factoring out common initial items in alternatives, e.g.:
X = { a ; b ; || a ; c ; } ;
would be transformed into something like:
X = { a ; X1 ; } ; X1 = { b ; || c ; } ;
It will also expand calls to rules at the beginning of alternatives if this is necessary to allow the parser to be produced (this is the phase that the limit is needed for). The expansions are done in the following cases:
-
If the rule is see-through (i.e. there is an expansion of the rule that does not contain any terminals or predicates) and its first set contains terminals in the first set of the rest of the alternative.
-
If the rule has a predicate in its first set.
-
If the rule has terminals in its first set that are also in the first set of another alternative that does not begin with the same rule.
If the rule being expanded contains an exception handler, and it is not identical to the exception handler in the enclosing rule, then an error will occur. Similarly if the rule being expanded defines any non-local variables then an error will occur.
2.3. Optimisations
After these two transformations, sid performs some optimisations on the grammar, such as reducing multiple copies of identical rules, removing tail recursion, inlining all basic rules, inlining single alternative rules, inlining rules used only once, and inlining everything that can be inlined. All of the inlining is optional.
After these optimisations, sid checks the language description to ensure that it is possible to produce a parser for it, and if so it outputs the parser.
3. The sid grammar file
- 3.1. Lexical conventions
- 3.2. The type declaration section
- 3.3. The terminal declaration section
- 3.4. The rule definition section
- 3.5. The grammar entry points section
The sid grammar file should always be the first input file specified on the command line. It provides an output language independent description of the language to be recognised. The file is split up into a number of different sections, each of which begins with a section header. All sections must be included, although it is possible to leave most of them empty. The following sections exist at present: type declaration, terminal declaration, rule definition, and grammar entry points. They must appear in that order. The sections are detailed below, after the lexical conventions.
3.1. Lexical conventions
Almost all non-printable characters (but definitely space, tab, newline and form-feed) are considered to be white space, and are ignored other than to separate other tokens. In addition, two comment forms are recognised: the C comment syntax, where comments begin with /*
, and end with */
, although unlike C these comments nest properly; and the C++ line comment syntax, where comments begin with //
, and are terminated at the end of the line. Comments are treated in the same way as white space characters.
Section headers are enclosed in percent characters, and are case insensitive. The following section headers are currently recognised:
%types% %terminals% %productions% %entry%
Identifiers must begin with a letter, an underscore or a hyphen. This may be followed by zero or more letters, digits, underscores and hyphens. Identifiers are case sensitive. The following are all legal identifiers:
expression mult-expr plus_expr expr-1
Identifiers are split into two namespaces: local names have one space; types, actions, rules, non-local names and terminals share the other namespace, so it is not possible to have an identifier that is a type as well as being a rule for example.
The following symbols are also used:
& ; = [ : ] ! , || $ ? { } ( ) < > -> :: ##
All other characters are illegal.
3.2. The type declaration section
The first section is the type declaration section. This begins with the types section header, followed by the names of all of the types used by the grammar. Each type name should be terminated by a semicolon. An example of the type declaration section follows:
%types% NumberT ; StringT ;
This declares two types: NumberT
and StringT
. There is no requirement for the type names to resemble names in the target language (in fact this should be avoided, as it is possible to use many different target languages). All types used in the grammar must be declared here. Similarly, all types declared here must be used in the grammar.
Types may be prefixed with !
if they are not to be used in the grammar. Such ignored types may however be present in ignored terminals and actions. This is intended to facilitate the use of types which are specific to actions only used by one grammar, where the actions are shared between several grammars.
3.3. The terminal declaration section
After the type declaration section comes the terminal declaration section. This section declares the terminals that will be used by the grammar. A terminal is a recogniser for a symbol in the input alphabet of the language to be recognised. It is possible to declare terminals that are not used by the grammar.
The section begins with its section header, followed by the declarations of the terminals. Each terminal declaration begins with the name of the terminal being defined, followed by its type, and terminated by a semicolon. If the terminal is not used in the grammar, the declaration should be preceded by a !
symbol.
A type (for terminals, rules and actions) is written as a colon, followed by the parameter tuple, followed by the ->
symbol, followed by the result tuple. If the type is zero-tuple to zero-tuple, then the type may be omitted. A tuple consists of a comma separated sequence of name-type pairs (with the name and type being separated by a colon), surrounded by parentheses. For parameter tuples, the type may be suffixed by a &
symbol, indicating that call by reference should be used (the default is call by copy). For declarations, the names should be omitted. For terminals, the parameter type must be the zero-tuple.
The simplest type of terminal declaration is as follows:
terminal1 ;
This means the same as:
terminal1 : () -> () ;
An example of a more complex terminal declaration is:
terminal2 : () -> ( :StringT ) ;
If these terminals were not to be used in the grammar, they should be declared as:
!terminal1 ; !terminal2 : () -> ( :StringT ) ;
Terminals may optionally be quoted; this permits the use of non-alphanumeric characters, which is helpful if they don't have convenient names. It can also help make the grammar a little clearer to read, if the associated lexer is not to hand. For example:
!"->" ; "*" ; "{" : () -> ( :BraceT ) ;
As with other symbols, these are transformed in the generated parser according to the respective language-specific rules for producing legal identifiers in that language.
3.4. The rule definition section
The rule definition section follows the terminal declaration section. It begins with the section header, followed by the definitions and declarations of all of the rules used in the grammar, and the declarations of all of the actions used in the grammar.
Rule declarations look identical to terminal declarations, e.g.:
rule1 : ( :NumberT ) -> ( :NumberListT ) ; rule2 ;
Action declarations are similar, although the names are surrounded by angle brackets, e.g.:
<action1> : ( :StringT & ) -> () ; <action2> ;
Action declarations may be prefixed with !
if they are not to be used in the grammar. This is particularly useful for several grammars which share the same actions, where not all of the grammars make use of the actions provided.
A declaration (or a definition) may be prefixed with the ::
symbol. This symbol forces the definition into the outermost scope. Scopes are described later on.
A rule definition (called a production) looks something like the following:
add-expr : () -> ( v : NumberT ) = { v1 = mul-expr ; plus ; v2 = expr ; v = <add> ( v1, v2 ) ; || v1 = mul-expr ; minus ; v2 = expr ; v = <subtract> ( v1, v2 ) ; ## v = <ex> ; } ;
The production begins with the rule name, followed by the parameter and result names and types (in this case, the rule name is add-expr
, there are no parameters, and there is one result name v
of type NumberT
). This may optionally be followed by local declarations (there are none here - they are described later).
The left hand side of the rule is followed by the =
symbol, a list of alternatives surrounded by curly braces, and is terminated by a semicolon. The alternatives are separated by the ||
symbol, and the last alternative may be separated from its predecessor (there must be one) using the ##
symbol; if this is the case, then this alternative is the exception handler for the production (otherwise it has no exception handler).
An alternative may match the empty string, using the symbol $
and the terminator symbol ;
, i.e.:
$ ;
unless it is an exception handler alternative (in which case it must do something), or a sequence of items. The empty string is only valid if the production has no results. If you want to match an empty string in a production that has a result, it is necessary to use an action (or identity) to provide a result.
An item is an identity, or a call to a (possibly anonymous) rule, a terminal, an action or a predicate. An identity looks like an assignment in conventional programming languages:
( a, b, c ) = ( d, e, f ) ;
Each tuple must contain the same number of names. In the case of a one-tuple, the parentheses may be dropped, e.g.:
a = d ;
Note that this is a binding operation, not an assignment. Each name on the left hand side must be a new name. It is illegal to redeclare a name that is already in scope. It is possible to assign to a name which is already in scope by prefixing that name with the &
symbol, e.g.:
( a, &b, c ) = ( d, e, f ) ;
would assign to the name b
, which must have been previously defined (it may be a parameter; if it is a call by reference parameter, then the change will propagate outside to the calling rule).
It is also possible to use the !
symbol in the result tuple to ignore results, e.g.:
( a, !, b, ! ) = ( c, d, e, f ) ;
This is not particularly useful in an identity, but may be more useful in a call to a rule, terminal or action. A call to a terminal or rule looks like a call to a function in a conventional programming language, e.g.:
( a, b ) = rule1 ( c, d ) ; ( e, f ) = terminal1 () ;
Calls to actions have the same form, except that action names are surrounded by angle brackets, e.g.:
( g, h, i ) = <action1> ( a, e ) ;
In addition, one of the names in the result tuple of the call to the action may be the predicate result symbol ?
, in which case the action is used as a predicate (more details on predicates are given later).
When calling a rule, terminal or action, it is necessary to have declared it (or in the case of a rule declared or defined it) before the call.
If a rule or action is being invoked, and it takes one or more call by reference parameters, then the corresponding arguments should be prefixed by the &
symbol, e.g.:
length = <string-length> ( &string ) ;
If the rule, terminal or action has the zero-tuple as a result, then only the right hand side of the definition is required, e.g.:
rule2 ( a, b ) ; terminal2 () ; <action2> ( c, d ) ;
If the rule, terminal or action has the zero-tuple as a parameter, then the parameter tuple may be omitted, e.g.:
( a, b ) = rule3 ; terminal3 ; c = <action3> ;
In older versions of sid, it used to be possible to have ambiguous items, e.g.:
a = b ;
where b
was both a rule and a name. As local names may not shadow non-local and global names, then this is no longer a problem.
In each case, the data flow through the rule is indicated using names. In the previous example of a production, both alternatives have the same data flow: the call to mul-expr
returns one value, which is stored in the name v1
, and the call to expr
returns one value, which is stored in the name v2
. Both of these names are passed to the action (add
in the first alternative, subtract
in the second), which returns a value into the name v
(presumably the sum or difference of the previous two values). The name v
is the name of the result, so this will be returned as the result of the rule. The exception handler (which is invoked if something fails) calls the action ex
to produce the result v
.
It is necessary that the types of the data flow through the production are correct. It is also necessary to define all of the result names for the production in each of the alternatives in that production.
An anonymous rule is written in the same way as the body of a normal rule, e.g.:
list : () -> ( l : ListT ) = { n = number ; /* START ANONYMOUS RULE */ { ? = <at-eof> ; l = <make-list> ( n ) ; || comma ; l1 = list ; l = <cons> ( n, l1 ) ; ## l = <com-or-eof> ( n ) ; } ; /* END ANONYMOUS RULE */ } ;
An anonymous rule is always inlined.
The rule name may be followed by a sequence of definitions and declarations surrounded by the [
and ]
symbols (which are followed by the rest of the rule). In this case, the definitions are local to the rule, e.g.:
x-list [ x = { terminal1 ; terminal2 ; || terminal3 ; } ; ] = { x ; || x ; x-list ; } ;
In this case, the rule x
would be local to the rule x-list
and no other rule would be able to use it. In error messages, the name of the rule x
would be written as x-list::x
. All declarations and definitions that occur inside of the [
and ]
symbols have the scope of the enclosing rule, unless they are preceded by the ::
symbol, in which case they have global scope. This is particularly necessary for actions, as actions can only be defined with global scope.
It is also possible to define non-local names. These are declared as an identifier (the name), followed by the :
symbol, followed by another identifier (its type), in a similar manner to an entry in a type tuple. Non-local names are not allowed at the outermost level (so they may not be prefixed with the ::
symbol either). When a non-local name is defined, it is in scope for all of the rules in its scope that are defined after it is, plus its defining rule.
Non-local names have their values saved on entry to their defining rule, and the value will be restored when the rule is exited. This includes exiting the rule tail recursively or because of an exception (if the rule has an exception handler, the non-local name will not be restored until the exception handler has exited). In almost all other respects non-local names are the same as local names. An example follows:
rule1 [ name1 : Type1T ; rule1_1 = { <action1> ( name1 ) ; rule2 ( name1 ) ; } ; ] = { <action2> ( &name1 ) ; rule1_1 ; } ;
It is possible to associate an initialiser with a non-local name, by following the type name with a =
symbol and the action name in angle brackets, e.g.:
rule1 [ name1 : Type1T = <action1> ; ] = { // .... } ;
In this case the action is called at the start of the rule to initialise the non-local name. The action should return an object of the same type as the non-local name. Normally, the action takes no parameters, however it may take one parameter of the same type as the non-local name (or a reference to that type), in which case it will be given the saved value of the non-local name as an argument (this may be used to build a stack automatically for example).
3.5. The grammar entry points section
The final section lists the entry points to the grammar. It begins with the section header, followed by a comma separated list of rule names, terminated by a semicolon, e.g.:
%entry% expr ;
If you are going to use a rule as an entry point into the grammar (i.e. you wish to call the associated function), you must list it in the entry points list. If not, the function may not exist.
4. The C information file
- 4.1. Lexical conventions
- 4.2. The prefixes section
- 4.3. The persistent section
- 4.4. The maps section
- 4.5. The header section
- 4.6. The assignments section
- 4.7. The parameter assignments section
- 4.8. The result assignments section
- 4.9. The terminal result extraction section
- 4.10. The action definition section
- 4.11. The trailer section
The grammar specification itself is not sufficient to produce a parser. There also needs to be output language specific information to allow the parser to interface with the program it is to be part of. In the case of the C output routines, sid needs to know the following information:
-
What code should precede and succeed the automatically generated code.
-
How to map the sid identifiers into C identifiers.
-
How to do assignments for each type.
-
How to get the current terminal number.
-
How to get the result of the current terminal.
-
How to advance the lexical analyser, to get the next terminal.
-
What the actions are defined as, and how to pass parameters to them.
-
How to save and restore the current terminal when an error occurs.
Eventually almost all of this should be user suppliable. At the moment, some of the information is supplied by the user in the C information file, some through macros, and some is built in. sid currently gets the information as follows:
-
The C information file has a header and a trailer section, which define code that precedes and succeeds the code that sid generates.
-
The C information file has a section that allows the user to specify mappings from sid identifiers into C identifiers. These are only valid for the following types of identifiers: types, functions (implementations of rules) and terminals. For other identifier types (or when no mapping is supplied), sid uses some default rules:
Firstly, sid applies a transform to the sid identifier name, to make it a legal C identifier. At present this maps
_
to__
,-
to_H
and:
(this occurs in scoped names) to_C
. All other characters are left unmodified. This transform cannot be changed.sid also puts a prefix before all identifiers, to try to prevent clashes (and also to make automatically generated - i.e. numeric - identifiers legal). These prefixes can be redefined for each class of identifier, in the C information file. They should be chosen so as not to clash with any other identifiers (i.e. no other identifiers should begin with that prefix).
By default, the following prefixes are used:
Prefix Meaning ZT
This prefix is used before type identifiers, for the type name itself. ZR
This prefix is used before rule identifiers, for the rule's implementation function. ZL
This prefix is used before rule identifiers, for the rule's label when tail recursion is being eliminated. In this case, a number is added to the suffix before the identifier name, to prevent clashes when a rule is inlined twice in the same function. It is also used before other labels that are automatically generated and are just numbered. ZI
This prefix is used before name identifiers used as parameters to functions, or in normal usage. It is also used by non-local names (which doesn't cause a problem as they always occur scoped, and local names never do). ZO
This prefix is used before name identifiers used as results of functions. Results are passed as reference parameters, and this suffix is used then. Another identifier with the ZI
prefix is also used within the function, and the type reference assignment operator is used at the end of the function to assign the results to the reference parameters.ZB
This prefix is used before the terminal symbol names in the generated header file. Table 1. Identifier prefixes -
Normally, sid will do assignments using the C assignment operator. Sometimes, this will not do the right thing, so the user can define a set of assignment operations for any type in the C information file.
-
sid expects the
CURRENT_TERMINAL
macro to be defined, and its definition should return an integer that is the current terminal. The macro should be an expression, not a statement. -
It is necessary to define how to extract the results of all terminals in the C information file (if a terminal doesn't return anything, then it is not necessary to define how to get the result).
-
sid expects the
ADVANCE_LEXER
macro to be defined, and its definition should cause the lexical analyser to read a new token. The new terminal number should be accessible through theCURRENT_TERMINAL
macro. On entry into the parserCURRENT_TERMINAL
should give the first terminal number. -
All actions, and their parameter and result names are defined in the C information file.
-
sid expects the
SAVE_LEXER
andRESTORE_LEXER
macros to be defined. The first is called with an argument which is the error terminal value. The macro should save the current terminal's value, and set the current terminal to be the error terminal value. The second macro is called without arguments, and should restore the saved value of the current terminal.SAVE_LEXER
will never be called more than once without a call toRESTORE_LEXER
, so the save stack only needs one element. -
sid expects the
ERROR_TERMINAL
macro to be defined if the-s no-numeric-terminals
option is given. This is expected to specify a value which is not a valid terminal number. This is required as with non-numeric terminals (that is, symbolic names) a non-terminal value would not otherwise be known.
The remainder of this section describes the layout of the C information file. The lexical conventions are described first, followed by a description of the sections in the order in which they should appear. Unlike the sid grammar file, not all sections are mandatory.
4.1. Lexical conventions
The lexical conventions of the C information file are very similar to those of the sid grammar file. There is a second class of identifier: the C identifier, which is a subset of the valid sid identifiers; there is also the C code block.
A C code block begins with @{
and is terminated by @}
. The code block consists of all of the characters between the start and end of the code block, subject to substitutions. All substitutions begin with the @
character. The following substitutions are recognised:
@@
-
This substitutes the
@
character itself. @:label
-
This form marks a label, which will be substituted for in the output code. This is necessary, because an action may be inlined into the same function more than once. If this happens, then without doing label substitution there would be two identical labels in the same scope. With label substitution, this problem is avoided. In general, all references to a label within an action should be prefixed with
@:
. This substitution may not be used in header and trailer code blocks. @identifier
-
This form marks a parameter or result identifier substitution. If parameter and result identifiers are not prefixed with an
@
character, then they will not be substituted. It is an error if the identifier is not a parameter or a result. Header and trailer code blocks have no parameters or results, so it is always an error to use identifier substitution in them. It is an error if any of the result identifiers are not substituted at least once.Result identifiers may be assigned to using this form of identifier substitution, but parameter identifiers may not be (nor may their address be taken - they are immutable). To try to prevent this, parameters that are substituted may be cast to their own type, which makes them unmodifiable in ISO C (see the notes on the
casts
language specific option). @&identifier
-
This form marks a parameter identifier whose address is to be substituted, but whose contents will not be modified. The effects of modifying the identifier are undefined. It is an error to use this in parameter assignment operator definitions.
@=identifier
-
This form marks a parameter identifier that will be modified. For this to be useful, the parameter should be a call by reference parameter, so that the effect of the modification will be propagated. This substitution is only valid in actions (assignment operators are not allowed to modify their parameters).
@!
-
This form marks an exception raise. In the generated code, a jump to the current exception handler will be substituted. This substitution is only valid in actions and terminal extraction rules.
@.
-
This form marks an attempt to access the current terminal. This substitution is only valid in actions.
@>
-
This form marks an attempt to advance the lexical analyser. This substitution is only valid in actions.
@$terminal
-
This form introduces a terminal, as would be referenced by the parser itself. This serves two purposes; firstly it acts as a convenience for consistency to the grammar (as opposed to writing the underlying C symbols), and secondly the expansion of
@$
is subject to the same rules as references to terminals elsewhere in the grammar. Most notably, this includes-s numeric-terminals
causing the terminal name to expand numerically.
All other forms are illegal. Note that in the case of labels and identifiers, no white space is allowed between the @:
, @
, @&
or @=
and the identifier name. An example of a code block is:
@{ /* A code block */ { int i ; if ( @param ) { @! ; } @result = 0 ; for ( i = 0 ; i < 100 ; i++ ) { printf ( "{%d}\n", i ) ; @result += i ; } @=param += @result ; if ( @. == @$SEMI ) { @> ; } } @}
4.2. The prefixes section
The first section in the C information file is the prefix definition section. This section is optional. It begins with the section header, followed by a list of prefix definitions. A prefix definition begins with the prefix name, followed by a =
symbol, followed by a C identifier that is the new prefix, and terminated by a semicolon. The following example shows all of the prefix names, and their default values:
%prefixes% type = ZT ; function = ZR ; label = ZL ; input = ZI ; output = ZO ; terminal = ZB ;
4.3. The persistent section
sid supports passing local variables through rules in the grammar, which are eventually passed on to actions. This helps keep the generated parser thread-safe, since each variable may be passed through from the entry point. However in practise, often grammars tend to build up a structure which is conceptually global to all rules (commonly some sort of parse tree
under construction). To pass this through each rule in the grammar and on to all actions is certainly possible, but a little inconvenient:
rule: ( l1 : ParsetreeT, ... ) -> ( ... ) = { ... }
However, adding this declaration to each rule and action would be tiresome and error-prone. Instead, it reads more naturally to view this variable as if it were global
. This keeps it out of the grammar entirely, as it is a concept specific to the action file. Persistent variables provide a mechanism to automate this process of passing-through as described above, whilst leaving the grammar file untouched. (This could be done by hand, though it would require passing variables through rules in the grammar; hence persistent variables are not necessary, but merely nice to have.)
From a user's perspective, persistent variables act as globals specific to each invocation. They are accessible by every rule and every parsing instance. Since they originate from an entry point, they persist only for each invocation of an entry into the parser.
Persistent variables are declared in their own section. This section is optional.
%persistents% pv1 : Type1T ; pv2 : Type2T ;
The persistent variables declared may be used in actions in the same manner as actions' parameters:
<append-node> : ( l1 : Type3 ) -> ( ) = @{ f ( @pv1, @pv1, @l1 ) ; @} ;
These are passed in at the entry point to the parser.
Since the ADVANCE_LEXER
macro is expanded inside generated functions that represent rules, it too may access persistent variables, as they are in scope in all rules.
4.4. The maps section
The section that follows the prefixes section is the maps section. This section is also optional. It begins with its section header, followed by a list of identifier mappings. An identifier mapping begins with a sid identifier (either a type, a rule or a terminal), followed by the ->
symbol, followed by the C identifier it is to be mapped to, and terminated by a semicolon. An example follows:
%maps% NumberT -> unsigned ; calculator -> calculator ;
Note that it is not possible to map type identifiers to be arbitrary C types. It will be necessary to typedef
or macro define the type name in the C file.
It is recommended that all types, terminals and entry point rules have their names mapped in this section, although this is not necessary. If the names are not mapped, they will have funny names in the rest of the program.
4.5. The header section
After the maps section comes the header section. This begins with the section header, followed by a code block, followed by a comma, followed by a second code block, and terminated with a semicolon. The first code block will be inserted at the beginning of the generated parser file; the second code block will be inserted at the start of the generated header file. An example is:
%header% @{ #include "lexer.h" LexerT token ; #define CURRENT_TERMINAL token.t #define ADVANCE_LEXER next_token () extern void terminal_error () ; extern void syntax_error () ; @}, @{ @} ;
4.6. The assignments section
The assignments section follows the header section. This section is optional. Normally, assignment between two identifiers will be done using the C assignment operator. In some cases this will not do the correct thing, and it is necessary to do the assignment differently. All types for which this applies should have an entry in the assignments section. The section begins with its header, followed by definitions for each type that needs its own assignment operator. Each definition should have one parameter, and one result. The action's name should be the name of the type. An example follows:
%assignments% ListT : ( l1 ) -> ( l2 ) = @{ if ( @l2.head = @l1.head ) { @l2.tail = @l1.tail ; } else { @l2.tail = &( @l2.head ) ; } @} ;
If a type has an assignment operator defined, it must also have a parameter assignment operator type defined and a result assignment operator defined (more precisely it must have either no assignment operations defined, or all three assignment operations defined).
4.7. The parameter assignments section
The parameter assignments section is very similar to the assignments section (which it follows), and is also optional. If a type has an assignment section entry, it must have a parameter assignment entry as well.
The parameter assignment operator is used in function calls to ensure that the object is copied correctly: if no parameter assignment operator is provided for a type, the standard C call by copy mechanism is used; if a parameter assignment operator is provided for a type, then the address of the object is passed by the calling function, and the called function declares a local of the same type, and uses the parameter assignment operator to copy the object (this should be remembered when passing parameters to entry points that have arguments of a type that has a parameter assignment operator defined).
The difference between the parameter assignment operator and the assignment operator is that the parameter identifier to the parameter assignment operator is a pointer to the object being manipulated, rather than the object itself. An example reference assignment section is:
%parameter-assignments% ListT : ( l1 ) -> ( l2 ) = @{ if ( @l2.head = @l1->head ) { @l2.tail = @l1->tail ; } else { @l2.tail = &( @l2.head ) ; } @} ;
4.8. The result assignments section
The result assignments section is very similar to the assignments section and the parameter assignments section (which it follows), and is also optional. If a type has an assignment section entry, it must also have a result assignment entry. The only difference between the two is that the result identifier of the result assignment operation is a pointer to the object being manipulated, rather than the object itself. Result assignments are only used when the results of a rule are assigned back through the reference parameters passed into the function. An example result assignment section is:
%result-assignments% ListT : ( l1 ) -> ( l2 ) = @{ if ( @l2->head = @l1.head ) { @l2->tail = @l1.tail ; } else { @l2->tail = &( @l2->head ) ; } @} ;
4.9. The terminal result extraction section
The terminal result extraction section follows the reference assignment section. It defines how to extract the results from terminals. The section begins with its section header, followed by the terminal extraction definitions.
There must be a definition for every terminal in the grammar that returns a result. It is an error to include a definition for a terminal that doesn't return a result. The result of the definition should be the same as the result of the terminal. An example of the terminal result extraction section follows:
%terminals% number : () -> ( n ) = @{ @n = token.u.number ; @} ; identifier : () -> ( i ) = @{ @i = token.u.identifier ; @} ; string : () -> ( s ) = @{ @s = token.u.string ; @} ;
4.10. The action definition section
The action definition section follows the terminal result extractor definition section. The format is similar to the previous sections: the section header followed by definitions for all of the actions. An action definition has the following form:
<action-name> : ( parameters ) -> ( results ) = code-block ;
This is similar to the form of all previous definitions, except that the name is surrounded in angle brackets. What follows is also true of the other definitions as well (unless they state otherwise).
The action-name
is a sid identifier that is the name of the action being defined; parameters
is a comma separated list of C identifiers that will be the names of the parameters passed to the action, and results
is a comma separated list of C identifiers that will be the names of the result parameters passed to the action. The code-block
is the C code that defines the action. It is expected that this will assign a valid result to each of the result identifier names.
The parameter and result tuples have the same form as in the language independent file, except that the types are optional. Like the language independent file, if the type of an action is zero-tuple to zero-tuple, then the type can be omitted, e.g.:
<action> = @{ /* .... */ @} ;
An example action definition section is:
%actions% <add> : ( v1, v2 ) -> ( v3 ) = @{ @v3 = @v1 + @v2 ; @} ; <subtract> : ( v1 : NumberT, v2 : NumberT ) -> ( v3 : NumberT ) = @{ @v3 = @v1 - @v2 ; @} ; <multiply> : ( v1 : NumberT, v2 ) -> ( v3 ) = @{ @v3 = @v1 * @v2 ; @} ; <divide> : ( v1, v2 ) -> ( v3 : NumberT ) = @{ @v3 = @v1 / @v2 ; @} ; <print> : ( v ) -> () = @{ printf ( "%u\n", @v ) ; @} ; <error> = @{ fprintf ( stderr, "ERROR\n" ) ; exit ( EXIT_FAILURE ) ; @} ;
Do not define static variables in action definitions; if you do, you will get unexpected results. If you wish to use static variables in actions definitions, then define them in the header block.
4.11. The trailer section
After the action definition section comes the trailer section. This has the same form as the header section. An example is:
%trailer% @{ int main () { next_token () ; calculator ( NULL ) ; return 0 ; } @}, @{ @} ;
The code blocks will be appended to the generated parser, and the generated header file respectively.
5. Features
This chapter draws attention to a few of the more interesting features of sid-specified grammars, and how they may be used.
5.1. Predicates
Predicates provide the user with a mechanism for altering the control flow in a manner that terminals alone cannot do.
During the factorisation process, rules that begin with predicates are expanded if necessary to ensure that predicates that may be used to select which alternative to go down always begin the alternative, e.g.:
rule1 = { rule2 ; /* .... */ || /* .... */ } ; rule2 = { ? = <predicate> ; /* .... */ || /* .... */ } ;
would be expanded into:
rule1 = { ? = predicate ; /* .... */ /* .... */ || /* .... */ /* .... */ || /* .... */ } ;
Also, if a predicate is used to select which alternative to use, it must be the first thing in the alternative, so the following would not be allowed:
rule = { <action> ; ? = <predicate> ; /* .... */ || /* .... */ } ;
When predicates begin a rule, they are executed (in some arbitrary order) until one of them returns true. The alternative that this predicate begins is then selected. If no predicates return true, then one of the remaining alternatives is selected based upon the current terminal (or an error occurs).
It is important that predicates do not contain dependencies upon the order of evaluation. In practice, predicates are likely to be simple, so this shouldn't be a problem.
When predicates are used within an alternative, they behave like terminals. If they evaluate to true, then parsing continues. If they evaluate to false, then an exception is raised.
5.2. Error handling
If the input given to the parser is valid, then the parser will not need to produce any errors. Unfortunately this is not always the case, so sid provides a mechanism for handling errors.
When an error occurs, an exception is raised. This passes control to the nearest enclosing exception handler. If there is no exception handler at all, the entry point function will return with the current terminal set to the error value.
An exception handler is just an alternative that is executed when a terminal or predicate fails. This should obviate the need to rely upon language specific mechanisms (such as setjmp
and longjmp
) for error recovery.
5.3. Call by reference
The default behaviour of sid is to do argument passing using call by copy semantics, and to not allow mutation of parameters of rules and actions (however inlined rules, and rules created during factoring have call by reference parameters). However it is possible to give rule and action parameters call by reference semantics, using the &
symbol in the type specification (as described earlier). It is also possible to mutate parameters of actions, using the @=
substitution in the action body (also described earlier). It is important to do the correct substitutions in action definitions, as sid uses this information to decide where it can optimise the output code.
If a call by copy parameter is mutated, then sid will introduce a new temporary variable and copy the parameter into it - this temporary will then be mutated. Similar code will be output for rules that have call by copy parameters that are mutated (e.g. as a call by reference argument to an action that mutates its parameters).
5.4. Calling entry points
When calling a function that implements an entry point rule, it should be called with the rule's parameters as the first arguments, followed by the addresses of the rule's results as the remaining arguments. The parameters should have their addresses passed if they are of a type that has a parameter assignment operator defined, or if the parameter is a call by reference parameter.
For example, given the following rule:
rule1 : ( :Type1T, :Type2T, :Type3T & ) -> ( :Type4T ) ;
where Type2T
has a parameter assignment operator defined, and rule1
is mapped to rule1
(and the type names are mapped to themselves), the call would be something like:
Type1T a = make_type1 () ; Type2T b = make_type2 () ; Type3T c = make_type3 () ; Type4T d ; rule1 ( a, b, &c, &d ) ;
A. Understanding error messages
- A.1. Left recursion elimination errors
- A.2. First set computation errors
- A.3. Factoring errors
- A.4. Checking errors
This section tries to explain what some of the error messages that are reported by the sid transforms mean. It does not contain descriptions of messages like "type 'some type' is unknown", as these should be self-explanatory.
A.1. Left recursion elimination errors
- The parameter or result types of the left recursive calls in the following productions do not match: PRODUCTIONS:
-
This means that there is a set of rules which call each other left recursively (i.e. the first item in some of the alternatives in each rule is a call to another rule in the set), and they do not all have the same parameter and result types, e.g.:
rule1 : ( a : Type1T, b : Type1T, c : Type2T, d : Type2T ) -> () = { rule2 ( a, b ) ; || terminal1 ; } ; rule2 : ( a : Type1T, b : Type2T ) -> () = { rule1 ( a, a, b, b ) ; || terminal2 ; } ;
- The exception handlers in the left recursion involving the following productions do not match: PRODUCTIONS:
-
This means that there is a set of productions which call each other left recursively (i.e. the first item in an alternative is a call to another production in the set), and they do not all have the same exception handler, e.g.:
rule1 = { rule2 ; || terminal1 ; ## <action1> ; } ; rule2 = { rule1 ; || terminal2 ; ## <action2> ; } ;
It is quite likely that when using exception handlers, it may be necessary to do the left recursion elimination manually to ensure that the exception handlers occur at the correct place.
- The argument names of the left recursive calls in the following productions do not match: PRODUCTIONS:
-
This means that there is a set of productions which call each other left recursively (i.e. the first item in an alternative is a call to another production in the set), and the arguments to one of the left recursive calls are not the same as the parameters of the calling rule, e.g.:
rule1 : ( a : Type1T, b : Type1T ) -> () = { rule1 ( b, a ) ; || terminal1 ; } ;
- A non-local name in the rule 'RULE' is not in scope in the rule 'RULE' in the left recursive cycle involving the following productions: PRODUCTIONS:
-
This means that there is a set of productions which call each other left recursively (i.e. the first item in an alternative is a call to another production in the set), and the first named rule uses a non-local name that is not in scope in the second named rule, e.g.:
rule1 [ name1 : Type1T ; rule1_1 [ name1_1 : Type1T ; ] = { rule1 ; <action1_1> ( name1_1 ) ; || terminal1 ; } ; ] = { terminal2 ; || rule1_1 ; <action1> ( name1 ) ; } ;
- The rule 'RULE' declares non-local names in the left recursive cycle with more than one entry point involving the following productions: PRODUCTIONS:
-
This means that there is a set of productions which call each other left recursively (i.e. the first item in an alternative is a call to another production in the set), and the named rule defines non-local variables even though it is not the only entry point to the cycle, e.g.:
rule1 [ name1 : Type1T ; rule1_1 = { <action1_1> ( name1 ) ; } ; ] = { terminal1 ; rule1_1 ; || rule2 ; <action1> ( name1 ) ; } ; rule2 = { rule1 ; <action2> ; || terminal2 ; } ;
- No cycle termination for the left recursive set involving the following rules: RULES:
-
This means that there is a set of productions which call each other left recursively (i.e. the first item in an alternative is a call to another production in the set), and they do not contain an alternative that begins with a non-left recursive call, e.g.:
rule1 = { rule2 ; || rule3 ; } ; rule2 = { rule1 ; || rule3 ; } ; rule3 = { rule1 ; || rule2 ; } ;
A.2. First set computation errors
- Cannot compute first set for PRODUCTION:
-
This means that sid cannot compute the set of terminals and predicates that may start the production. This is normally because there is a recursive call (or cycle) that contains no terminals, e.g.:
rule1 = { <action1> ; rule1 ; || terminal1 ; } ;
This is not removed by the left recursion elimination phase, as the call is not the leftmost item in the alternative.
- Can see through to predicate 'PREDICATE' in production PRODUCTION:
-
This means that there is a predicate that isn't the first item in its alternative, but is preceded only by see-through items, e.g.:
rule1 = { <action1> ; ? = <predicate> ; || terminal1 ; } ;
- Can see through to predicates in rule 'RULE' in production PRODUCTION:
-
This means that the first rule has at least one predicate in its first set, and the second rule calls it in a position where it is not the first item in the alternative and is preceded only by see-through items, e.g.:
rule1 = { ? = <predicate> ; || terminal1 ; } ; rule2 = { <action> ; rule1 ; || terminal2 ; } ;
- The rule 'RULE' has all terminals in its first set and has a redundant see-through alternative:
-
This means that the rule's first set (the set of all terminals that can start the rule) includes all possible input terminals, and the rule also has a see-through alternative. The see-through alternative will never be used, as one of the other alternatives will always be chosen.
A.3. Factoring errors
- Too many productions (NUMBER) created during factorisation:
-
This normally means that sid cannot factor the grammar. You will need to rewrite the offending part. Unfortunately there is no easy way to do this. Start by looking at the dump file for a set of rules that seem to have been expanded a lot.
- The rule 'RULE' cannot be expanded into 'RULE' as the exception handlers don't match:
-
When sid performs factoring, it needs to expand calls to certain rules into the rules that calls them (this is described in the overview section). If the called rule has an exception handler and it is not the same as the exception handler of the calling rule, then the expansion will fail.
- The rule 'RULE' cannot be expanded into 'RULE' as it contains non-local name definitions:
-
When sid performs factoring, it needs to expand calls to certain rules into the rules that calls them (this is described in the overview section). If the called rule defines any non-local names, then the expansion will fail.
A.4. Checking errors
- Collision of terminal(s) TERMINALS in rule 'RULE':
-
This error means that more than one alternative in the named rule begins with the named terminals, e.g.:
rule1 = { <action1> ; terminal1 ; || terminal1 ; } ;
Normally, the factoring process will remove the problem, but when something like the above happens to stop the factoring occurring, this error will be produced.
- Collision of predicate 'PREDICATE' in rule 'RULE':
-
This error occurs when more than one alternative in the named rule begins with the named predicate, e.g.:
rule1 = { ( a, ? ) = <predicate> ; <action1> ( a ) ; || ( ?, b ) = <predicate> ; <action2> ( b ) ; } ;
Again, it is normally the case that the factoring process will remove this problem, but if the same predicate uses different predicate results in different alternatives, this error will be produced.
- The terminal(s) TERMINALS can start rule 'RULE' which is see-through, and the same terminal(s) may appear in the following situations: ALTERNATIVES:
-
This means that there are one or more terminals that can start the named rule (which is see-through), and may also follow it, e.g.:
rule1 = { terminal1 ; || $ ; } ; rule2 = { rule1 ; terminal1 ; || terminal2 ; } ;
The alternatives listed are the alternatives which call the rule, and contain (some of) the named terminals after the call. The call is highlighted.
- The predicate(s) PREDICATES can start rule 'RULE' which is see-through and the same predicate(s) may appear in the following situations: ALTERNATIVES:
-
This means that there are one or more predicates that can start the named rule (which is see-through), and may also follow it, e.g.:
rule1 = { ? = <predicate> ; || $ ; } ; rule2 = { terminal1 ; rule1 ; ? = <predicate> ; || terminal2 ; } ;
The alternatives listed are the alternatives which call the rule, and contain (some of) the named predicates after the call. The call is highlighted.
- The rule 'RULE' contains more than one see-through alternative:
-
This error occurs if the rule has more than one alternative that doesn't need to read a terminal or a predicate, e.g.:
rule1 = { <action1> ; || <action2> ; } ;
B. Advice on writing parsers with sid
- B.1. Handling EOF
- B.2. Adding common mistakes to the grammar
- B.3. Throwing exceptions inside exception handlers
- B.4. Continuing from an exception handler
- B.5. Manifesting semantic checks as syntax errors
- B.6. Implementing “panic mode”
- B.7. Duplicating token values
This appendix lists a few points of advice for writing parsers using sid. This list is intended to cover a few of the areas which might be confusing (particularly to somebody coming from a background with another parser generator), and an explanation of generally what is considered good technique when using sid. Hopefully this list will expand over time.
In general, the examples included with sid are designed to help orientate the reader and illustrate common practises. Also look at the implementation of sid itself (both its .sid
file and .act
file parsers are written using itself), which demonstrates more complex uses. There are also several other programs using sid throughout the TenDRA repository which may be of interest, not least of which is the compiler itself.
B.1. Handling EOF
Given the action:
<is-eof>: () -> (eof:Boolean) = @{ @eof = (@. == @$EOF); @};
used in a grammar as a predicate, we can have an empty alternative which is only reached during EOF. This can be used as a “get-out” clause to end recursion; for example, in a shell-like language, the entry point may be expressed as:
script = { ? = <is-eof>; || list-of-statements; }; %entry% script;
where list-of-statements
is the usual recursive structure for expressing lists. Since predicates used in alternatives only permit parsing to continue if they return true, the <is-eof>
alternative is only parsed if EOF has been reached. Otherwise, the predicate raises an exception.
Usually EOF would be considered an error if encountered during a more complex rule, and hence would not appear elsewhere in the grammar. Similarly a parser for a file format would expect an eof
token at the end of its input stream. sid itself is a good example for this typical requirement; the last rule for its entry point reads simply:
{ eof; ## <expected-eof>; };
where <expected-eof>
produces an error.
B.2. Adding common mistakes to the grammar
For a given language there is often a well-known set of common mistakes people make. One approach to producing more useful error messages is to add these well-known mistakes as rules in the grammar, so that these specific cases may be identified. (These production would of course go on to call actions that raise an error).
This technique is particularly appropriate for errors which are not simply “missing” items. Instead, those are best handled with empty alternatives, such as:
{ semicolon; ## <err-expected-semicolon>; }
The approach of encoding common mistakes into the grammar as productions also lends itself particularly well to warnings; for example in C a missing semicolon after a struct at the end of a header, or the semicolon in for (...); { ... }
.
B.3. Throwing exceptions inside exception handlers
Error handling may be centralised to a higher-level point whilst keeping the specific errors which caused the parse to fail local to their respective areas. This allows for specific error messages without having to repeat the same handling mechanisms. One way to achieve this is to have local error handlers call actions which raise a new exception (by calling the @!
command). This newly-raised exception may then be caught by an error handler further up the grammar.
B.4. Continuing from an exception handler
An exception handler in sid may contain rules just as any other alternative does, so that the parse may continue after an error. This is helpful for having the parser perform semantic checks for non-syntax errors such as overflows or type mismatches:
variable-declaration = { v = variable; equals; l = literal; <instantiate-variable>(v, l); }; statement = { variable-declaration; || ... }; list-of-statements = { statement; || statement; semicolon; list-of-statements; ## list-of-statements; };
Here the <instantiate-variable>
action may raise an exception (say, that the literal used to instantiate the variable is of an incompatible type); this exception bubbles up in the grammar to the list-of-statements
rule. Since the error is relatively minor (especially as it is syntactically valid), rather than abort at this point it may be more convenient to the user to continue parsing in order to identify any other possible errors later on in the input.
In the example above, the most sensible approach seems to be to recur into the list-of-statements
rule when such errors are encountered. In this case, this permits additional valid lists of statements to follow.
It is also especially good for interpreters, where exiting the program would be inappropriate. In the case of the calculator example, which reads and executes expressions, after a user has entered something invalid the parser is recurred through an error handler all of the time for the remainder of execution.
B.5. Manifesting semantic checks as syntax errors
Often grammars express constructs which are syntactically identical, but who's validity depends on other context. For example, a language offering declarations of several optional items may consider re-declaring the same item as an error. One approach is to simply list these items in order in the grammar, and make each optional (as sid's grammar does for the %types%
, %terminals%
etc sections).
Another approach would be to consider the order of these irrelevant, as long as each item is present once at most. Hence in the following example, A
and/or B
may be present, in any order, but it is a syntax error to define either A
or B
twice:
rule = { ? = <a-not-defined>; A; || ? = <b-not-defined>; B; ## rule; /* TODO check this: to handle predicates raising on false */ };
The predicates responsible for the semantic checks direct the syntax accordingly, altering what is henceforth considered to be a valid parse. A following redefinition will be a syntax error according to the permissible rules in the grammar (now excluding the predicates' alternatives). It (perhaps) seems natural to present these redefinitions as syntax errors to the user.
B.6. Implementing “panic mode”
When implementing parsers for compilers, it is often more useful to the user to try to continue a little way rather than exit on the first error encountered. This can help show unrelated errors, so that the user may address those whilst debugging and not resort to tediously dealing with a single error for each compilation attempt.
The traditional scheme for providing this is to implement a “panic mode” for parsing which returns to a high-level parsing rule when a syntax or lexical error is detected. Commonly it is sensible to skip the erroneous region and re-synchronise the input stream at a statement rule (or whatever suits the grammar).
After a syntax error has been encountered, panic mode is provided for by skipping over tokens until a statement separator (commonly a semicolon) is found [a] and tokens are discarded until the separator is reached. The parse may then continue. This is not perfect, but often helpful enough to the user.
The exception handling mechanism of sid provides a means to change the parse for error situations (without resorting to non-local goto
commands such as longjmp
for C); this exception handle may be entered from any of the children of the given rule:
rule = { ... ## <panic>; }
In sid, panic mode is quite naturally implemented by providing an action which performs this token-skipping:
<panic>: () -> () = @{ while (@. != @$SEMI && @. != @$EOF) { /* Advance to the next token */ @>; } @};
For larger languages, it may be sensible to skip until more appropriate tokens depending which point in the grammar has failed to parse. This works nicely with exception handlers strategically placed throughout the grammar, each calling an action to skip to an appropriate token. The parser for sid's .sid
files serve as a good example for this.
B.7. Duplicating token values
Often a parser needs to retrieve the values lexed when producing a token - to obtain the spelling for identifiers, for example. Consider the following fragment:
rule = { w = word; command = <duplicate>(w); empty-list = <empty-list>; argument-list = list-of-words(empty-list); <do-execute>(execute, command, argument-list); }
Here the terminal word
produces a value (the spelling for a word, as stored by the lexer), which is passed to the <duplicate>
action:
<duplicate>: (in :STRING) -> (out :STRING) = @{ @out = strdup(@in); if (@out == NULL) { perror("strdup"); exit(EXIT_FAILURE); } @};
So the value from the token buffer which stored this spelling is duplicated in memory such that the buffer may be reused for the next token. Or at least, that was the intention.
As it turns out, what actually happens is that the one-token look-ahead for the LL(1) grammar causes the parser to advance one token before executing the <duplicate>
action. So when <duplicate>
is called, a new token has already been read into the buffer. This gives the effect of these tokens appearing to be “along one”, as each value is one ahead of where it ought to be.
To avoid this, we can have the word
terminal perform this duplication itself, and instead simply call:
w = word; empty-list = <empty-list>; argument-list = list-of-words(empty-list); <do-execute>(execute, w, argument-list);
This also reads more naturally as the need for a separate <duplicate>
action is removed.
- [a]
Apparently this is why so many languages use semicolons.