The C information file

4. The C information file

4.1. Lexical conventions
4.2. The prefixes section
4.3. The persistent section
4.4. The maps section
4.5. The header section
4.6. The assignments section
4.7. The parameter assignments section
4.8. The result assignments section
4.9. The terminal result extraction section
4.10. The action definition section
4.11. The trailer section

The grammar specification alone is not sufficient to produce a parser. sid also needs information specific to the output language, so that the parser can interface with the program it is to be part of. In the case of the C output routines, sid needs to know the following information:

What code should precede and succeed the automatically generated code.
How to map the sid identifiers into C identifiers.
How to do assignments for each type.
How to get the current terminal number.
How to get the result of the current terminal.
How to advance the lexical analyser, to get the next terminal.
What the actions are defined as, and how to pass parameters to them.
How to save and restore the current terminal when an error occurs.

Eventually almost all of this should be user suppliable. At the moment, some of the information is supplied by the user in the C information file, some through macros, and some is built in. sid currently gets the information as follows:

The C information file has a header and a trailer section, which define code that precedes and succeeds the code that sid generates.

The C information file has a section that allows the user to specify mappings from sid identifiers into C identifiers. These are only valid for the following types of identifiers: types, functions (implementations of rules) and terminals. For other identifier types (or when no mapping is supplied), sid uses some default rules:

Firstly, sid applies a transform to the sid identifier name, to make it a legal C identifier. At present this maps _ to __, - to _H and : (this occurs in scoped names) to _C. All other characters are left unmodified. This transform cannot be changed.

sid also puts a prefix before all identifiers, to try to prevent clashes (and also to make automatically generated - i.e. numeric - identifiers legal). These prefixes can be redefined for each class of identifier, in the C information file. They should be chosen so as not to clash with any other identifiers (i.e. no other identifiers should begin with that prefix).

By default, the following prefixes are used:

Prefix	Meaning
`ZT`	This prefix is used before type identifiers, for the type name itself.
`ZR`	This prefix is used before rule identifiers, for the rule's implementation function.
`ZL`	This prefix is used before rule identifiers, for the rule's label when tail recursion is being eliminated. In this case, a number is added to the prefix before the identifier name, to prevent clashes when a rule is inlined twice in the same function. It is also used before other labels that are automatically generated and are just numbered.
`ZI`	This prefix is used before name identifiers used as parameters to functions, or in normal usage. It is also used by non-local names (which doesn't cause a problem as they always occur scoped, and local names never do).
`ZO`	This prefix is used before name identifiers used as results of functions. Results are passed as reference parameters, and this prefix is used for them. Another identifier with the `ZI` prefix is also used within the function, and the type's result assignment operator is used at the end of the function to assign the results to the reference parameters.
`ZB`	This prefix is used before the terminal symbol names in the generated header file.

Table 1. Identifier prefixes
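
The transform and prefixing rules above can be sketched as a small C helper. This is only an illustration of the documented mapping, not sid's own code: the function name make_c_name and its signature are hypothetical, and the number that sid adds for labels is not modelled.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of sid): write `prefix` followed by the
 * transformed `name` into `out`, which holds `size` bytes.
 * Returns 0 on success, or -1 if `out` is too small.
 */
static int
make_c_name(const char *prefix, const char *name, char *out, size_t size)
{
	size_t used = strlen(prefix);

	if (used + 1 > size) {
		return -1;
	}
	memcpy(out, prefix, used);

	for ( ; *name != '\0'; name++) {
		char single[2] = { *name, '\0' };
		const char *rep = single;   /* default: character is unchanged */
		size_t len;

		switch (*name) {
		case '_': rep = "__"; break;
		case '-': rep = "_H"; break;
		case ':': rep = "_C"; break;
		default: break;
		}

		len = strlen(rep);
		if (used + len + 1 > size) {
			return -1;
		}
		memcpy(out + used, rep, len);
		used += len;
	}

	out[used] = '\0';
	return 0;
}
```

Under these rules, and with the default ZR prefix, a rule named expr-list would be expected to become ZRexpr_Hlist in the generated C.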

Normally, sid will do assignments using the C assignment operator. Sometimes, this will not do the right thing, so the user can define a set of assignment operations for any type in the C information file.
sid expects the CURRENT_TERMINAL macro to be defined, and its definition should return an integer that is the current terminal. The macro should be an expression, not a statement.
It is necessary to define how to extract the results of all terminals in the C information file (if a terminal doesn't return anything, then it is not necessary to define how to get the result).
sid expects the ADVANCE_LEXER macro to be defined, and its definition should cause the lexical analyser to read a new token. The new terminal number should be accessible through the CURRENT_TERMINAL macro. On entry into the parser CURRENT_TERMINAL should give the first terminal number.
All actions, and their parameter and result names are defined in the C information file.
sid expects the SAVE_LEXER and RESTORE_LEXER macros to be defined. The first is called with an argument which is the error terminal value. The macro should save the current terminal's value, and set the current terminal to be the error terminal value. The second macro is called without arguments, and should restore the saved value of the current terminal. SAVE_LEXER will never be called more than once without a call to RESTORE_LEXER, so the save stack only needs one element.
sid expects the ERROR_TERMINAL macro to be defined if the -s no-numeric-terminals option is given. It should expand to a value that is not a valid terminal number. This is required because, with non-numeric terminals (that is, symbolic names), sid cannot otherwise know a value that is guaranteed not to be a terminal.
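
A minimal sketch of the macros described above, over a toy lexer, might look as follows. The token names, the fixed token stream and the variable names are all assumptions made for illustration; RESTORE_LEXER is written as an object-like macro on the assumption that "called without arguments" means it is used without parentheses.

```c
/* Hypothetical terminal numbers; a real program would take these from
 * the header file that sid generates. */
enum { TOK_NUMBER = 0, TOK_PLUS = 1, TOK_EOF = 2, TOK_ERROR = 3 };

/* Hypothetical lexer state: a fixed token stream and a cursor into it. */
static const int stream[] = { TOK_NUMBER, TOK_PLUS, TOK_NUMBER, TOK_EOF };
static int position = 0;

/* On entry to the parser this already holds the first terminal. */
static int current_terminal = TOK_NUMBER;

/* One-element save "stack": SAVE_LEXER is never nested. */
static int saved_terminal;

static void
next_token(void)
{
	if (stream[position] != TOK_EOF) {
		position++;
	}
	current_terminal = stream[position];
}

#define CURRENT_TERMINAL  (current_terminal)   /* an expression */
#define ADVANCE_LEXER     next_token()
#define SAVE_LEXER(t)     (saved_terminal = current_terminal, \
                           current_terminal = (t))
#define RESTORE_LEXER     (current_terminal = saved_terminal)
#define ERROR_TERMINAL    TOK_ERROR            /* not a valid terminal */
```

Because only one value is ever saved, a single variable is enough; no stack is needed.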

The remainder of this section describes the layout of the C information file. The lexical conventions are described first, followed by a description of the sections in the order in which they should appear. Unlike the sid grammar file, not all sections are mandatory.

4.1. Lexical conventions

The lexical conventions of the C information file are very similar to those of the sid grammar file. There is a second class of identifier: the C identifier, which is a subset of the valid sid identifiers; there is also the C code block.

A C code block begins with @{ and is terminated by @}. The code block consists of all of the characters between the start and end of the code block, subject to substitutions. All substitutions begin with the @ character. The following substitutions are recognised:

@@

This substitutes the @ character itself.

@:label

This form marks a label, which will be substituted for in the output code. This is necessary, because an action may be inlined into the same function more than once. If this happens, then without doing label substitution there would be two identical labels in the same scope. With label substitution, this problem is avoided. In general, all references to a label within an action should be prefixed with @:. This substitution may not be used in header and trailer code blocks.

@identifier

This form marks a parameter or result identifier substitution. If parameter and result identifiers are not prefixed with an @ character, then they will not be substituted. It is an error if the identifier is not a parameter or a result. Header and trailer code blocks have no parameters or results, so it is always an error to use identifier substitution in them. It is an error if any of the result identifiers are not substituted at least once.

Result identifiers may be assigned to using this form of identifier substitution, but parameter identifiers may not be (nor may their address be taken - they are immutable). To try to prevent this, parameters that are substituted may be cast to their own type, which makes them unmodifiable in ISO C (see the notes on the casts language specific option).

@&identifier

This form marks a parameter identifier whose address is to be substituted, but whose contents will not be modified. The effects of modifying the identifier are undefined. It is an error to use this in parameter assignment operator definitions.

@=identifier

This form marks a parameter identifier that will be modified. For this to be useful, the parameter should be a call by reference parameter, so that the effect of the modification will be propagated. This substitution is only valid in actions (assignment operators are not allowed to modify their parameters).

@!

This form marks an exception raise. In the generated code, a jump to the current exception handler will be substituted. This substitution is only valid in actions and terminal extraction rules.

@.

This form marks an attempt to access the current terminal. This substitution is only valid in actions.

@>

This form marks an attempt to advance the lexical analyser. This substitution is only valid in actions.

@$terminal

This form introduces a terminal, as it would be referenced by the parser itself. This serves two purposes: firstly, it keeps the code consistent with the grammar (as opposed to writing the underlying C symbols); secondly, the expansion of @$ is subject to the same rules as references to terminals elsewhere in the grammar. Most notably, -s numeric-terminals causes the terminal name to expand numerically.

All other forms are illegal. Note that in the case of labels and identifiers, no white space is allowed between the @:, @, @& or @= and the identifier name. An example of a code block is:

@{
	/* A code block */
	{
		int i ;

		if ( @param ) {
			@! ;
		}

		@result = 0 ;
		for ( i = 0 ; i < 100 ; i++ ) {
			printf ( "{%d}\n", i ) ;
			@result += i ;
		}

		@=param += @result ;
		if ( @. == @$SEMI ) {
			@> ;
		}
	}
@}
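
The need for label substitution follows from C's rule that labels have function scope. The sketch below imagines one action, whose body jumps to a label written as @:skip, inlined twice into the same function. The function and the label names ZL1 and ZL2 are illustrative assumptions; the names sid actually generates may differ.

```c
/* Hypothetical result of inlining one action twice into one function. */
static int
sum_non_negative(int a, int b)
{
	int total = 0;

	{	/* first copy: the label received one unique name */
		if (a < 0) {
			goto ZL1;
		}
		total += a;
	ZL1:
		;
	}

	{	/* second copy: a different name, so there is no clash */
		if (b < 0) {
			goto ZL2;
		}
		total += b;
	ZL2:
		;
	}

	return total;
}
```

Had both copies used the same literal label, the function would not compile, even though each copy sits in its own block.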

4.2. The prefixes section

The first section in the C information file is the prefix definition section. This section is optional. It begins with the section header, followed by a list of prefix definitions. A prefix definition begins with the prefix name, followed by a = symbol, followed by a C identifier that is the new prefix, and terminated by a semicolon. The following example shows all of the prefix names, and their default values:

%prefixes%
type      = ZT ;
function  = ZR ;
label     = ZL ;
input     = ZI ;
output    = ZO ;
terminal  = ZB ;

4.3. The persistent section

sid supports passing local variables through rules in the grammar, which are eventually passed on to actions. This helps keep the generated parser thread-safe, since each variable may be passed through from the entry point. However, in practice, grammars often build up a structure which is conceptually global to all rules (commonly some sort of parse tree under construction). To pass this through each rule in the grammar and on to all actions is certainly possible, but a little inconvenient:

rule: ( l1 : ParsetreeT, ... ) -> ( ... ) = {
	...
}

However, adding this declaration to each rule and action would be tiresome and error-prone. Instead, it reads more naturally to view this variable as if it were global. This keeps it out of the grammar entirely, as it is a concept specific to the action file. Persistent variables provide a mechanism to automate this process of passing-through as described above, whilst leaving the grammar file untouched. (This could be done by hand, though it would require passing variables through rules in the grammar; hence persistent variables are not necessary, but merely nice to have.)

From a user's perspective, persistent variables act as globals specific to each invocation. They are accessible by every rule and every parsing instance. Since they originate from an entry point, they persist only for each invocation of an entry into the parser.

Persistent variables are declared in their own section. This section is optional.

%persistents%

	pv1 : Type1T ;
	pv2 : Type2T ;

The persistent variables declared may be used in actions in the same manner as actions' parameters:

<append-node> : ( l1 : Type3 ) -> ( ) = @{
	f ( @pv1, @pv2, @l1 ) ;
@} ;

These are passed in at the entry point to the parser.

Since the ADVANCE_LEXER macro is expanded inside generated functions that represent rules, it too may access persistent variables, as they are in scope in all rules.
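
The passing-through that persistent variables automate can be pictured with hand-written C. This is a conceptual sketch only, not sid's generated code: the function names, the nodes field and the choice to pass a pointer are all assumptions.

```c
/* Hypothetical parse state that a grammar might keep in a persistent
 * variable. */
typedef struct {
	int nodes;   /* number of nodes appended so far */
} ParsetreeT;

/* Stands in for an action body that uses the persistent variable. */
static void
append_node(ParsetreeT *tree)
{
	tree->nodes++;
}

/* Stand in for rule functions: each receives the variable and hands it
 * on unchanged. */
static void
rule_item(ParsetreeT *tree)
{
	append_node(tree);
}

static void
rule_list(ParsetreeT *tree)
{
	rule_item(tree);
	rule_item(tree);
}

/* Stands in for the entry point: the caller supplies the variable once
 * per invocation. */
static void
parse(ParsetreeT *tree)
{
	rule_list(tree);
}
```

Two invocations given different ParsetreeT objects do not interfere with each other, which is the property that a true global variable would lose.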

4.4. The maps section

The maps section follows the prefixes and persistent sections. This section is also optional. It begins with its section header, followed by a list of identifier mappings. An identifier mapping begins with a sid identifier (either a type, a rule or a terminal), followed by the -> symbol, followed by the C identifier it is to be mapped to, and terminated by a semicolon. An example follows:

%maps%
NumberT    -> unsigned ;
calculator -> calculator ;

Note that it is not possible to map type identifiers to be arbitrary C types. It will be necessary to typedef or macro define the type name in the C file.

It is recommended that all types, terminals and entry point rules have their names mapped in this section, although this is not necessary. If the names are not mapped, they will appear in the rest of the program under the prefixed, transformed names that sid generates by default.

4.5. The header section

After the maps section comes the header section. This begins with the section header, followed by a code block, followed by a comma, followed by a second code block, and terminated with a semicolon. The first code block will be inserted at the beginning of the generated parser file; the second code block will be inserted at the start of the generated header file. An example is:

%header% @{
	#include "lexer.h"

	LexerT token ;

	#define CURRENT_TERMINAL token.t
	#define ADVANCE_LEXER    next_token ()

	extern void terminal_error () ;
	extern void syntax_error () ;
@}, @{
@} ;

4.6. The assignments section

The assignments section follows the header section. This section is optional. Normally, assignment between two identifiers will be done using the C assignment operator. In some cases this will not do the correct thing, and it is necessary to do the assignment differently. All types for which this applies should have an entry in the assignments section. The section begins with its header, followed by definitions for each type that needs its own assignment operator. Each definition should have one parameter, and one result. The action's name should be the name of the type. An example follows:

%assignments%

ListT : ( l1 ) -> ( l2 ) = @{
	if ( @l2.head = @l1.head ) {
		@l2.tail = @l1.tail ;
	} else {
		@l2.tail = &( @l2.head ) ;
	}
@} ;
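
To see why plain assignment can do the wrong thing for a type like this, consider the following sketch. The layout of ListT is an assumption (the example only shows that it has head and tail members): here tail is taken to point at the slot where the next node would be stored, which for an empty list is the list's own head member.

```c
#include <stddef.h>

/* Assumed layout; the manual only shows the head and tail members. */
typedef struct NodeT { struct NodeT *next; int value; } NodeT;
typedef struct { NodeT *head; NodeT **tail; } ListT;

/* A list is well formed if tail points at the slot for the next node. */
static int
list_is_well_formed(ListT *l)
{
	if (l->head == NULL) {
		return l->tail == &l->head;
	}
	return *l->tail == NULL;
}

/* Same logic as the ListT assignment operator, written as a function. */
static void
list_assign(ListT *dst, ListT *src)
{
	dst->head = src->head;
	if (dst->head != NULL) {
		dst->tail = src->tail;
	} else {
		dst->tail = &dst->head;   /* point into the copy, not the source */
	}
}
```

Copying an empty list with the C assignment operator leaves the copy's tail pointing into the original object, whereas list_assign re-points it at the copy.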

If a type has an assignment operator defined, it must also have a parameter assignment operator defined and a result assignment operator defined (more precisely, it must have either no assignment operations defined, or all three assignment operations defined).

4.7. The parameter assignments section

The parameter assignments section is very similar to the assignments section (which it follows), and is also optional. If a type has an assignment section entry, it must have a parameter assignment entry as well.

The parameter assignment operator is used in function calls to ensure that the object is copied correctly: if no parameter assignment operator is provided for a type, the standard C call by copy mechanism is used; if a parameter assignment operator is provided for a type, then the address of the object is passed by the calling function, and the called function declares a local of the same type, and uses the parameter assignment operator to copy the object (this should be remembered when passing parameters to entry points that have arguments of a type that has a parameter assignment operator defined).

The difference between the parameter assignment operator and the assignment operator is that the parameter identifier to the parameter assignment operator is a pointer to the object being manipulated, rather than the object itself. An example parameter assignment section is:

%parameter-assignments%

ListT : ( l1 ) -> ( l2 ) = @{
	if ( @l2.head = @l1->head ) {
		@l2.tail = @l1->tail ;
	} else {
		@l2.tail = &( @l2.head ) ;
	}
@} ;
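
The calling convention described above can be sketched by hand. The layout of ListT, the macro name and the function name are assumptions for illustration, and the code sid actually generates will differ in its details.

```c
#include <stddef.h>

/* Assumed layout; the manual only shows the head and tail members. */
typedef struct NodeT { struct NodeT *next; int value; } NodeT;
typedef struct { NodeT *head; NodeT **tail; } ListT;

/* Parameter assignment: the source is a pointer, the destination an
 * object, mirroring @l1->head and @l2.head in the example. */
#define LIST_PARAM_ASSIGN(dst, src) \
	do { \
		if (((dst).head = (src)->head) != NULL) { \
			(dst).tail = (src)->tail; \
		} else { \
			(dst).tail = &(dst).head; \
		} \
	} while (0)

/* Hypothetical rule function: it receives the address of the caller's
 * object and copies it into a local of the same type. */
static int
rule_list_length(ListT *arg)
{
	ListT local;
	NodeT *n;
	int length = 0;

	LIST_PARAM_ASSIGN(local, arg);
	for (n = local.head; n != NULL; n = n->next) {
		length++;
	}
	return length;
}
```

A caller of such a function passes &list rather than list, which is the point to remember when calling entry points by hand.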

4.8. The result assignments section

The result assignments section is very similar to the assignments section and the parameter assignments section (which it follows), and is also optional. If a type has an assignment section entry, it must also have a result assignment entry. The only difference between the two is that the result identifier of the result assignment operation is a pointer to the object being manipulated, rather than the object itself. Result assignments are only used when the results of a rule are assigned back through the reference parameters passed into the function. An example result assignment section is:

%result-assignments%

ListT : ( l1 ) -> ( l2 ) = @{
	if ( @l2->head = @l1.head ) {
		@l2->tail = @l1.tail ;
	} else {
		@l2->tail = &( @l2->head ) ;
	}
@} ;
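
Assigning a result back through a reference parameter can be sketched in the same way. The ZI and ZO prefixes are the defaults from the prefix table; the layout of ListT, the macro and the function are assumptions, so this is a sketch of the idea rather than real sid output.

```c
#include <stddef.h>

/* Assumed layout; the manual only shows the head and tail members. */
typedef struct NodeT { struct NodeT *next; int value; } NodeT;
typedef struct { NodeT *head; NodeT **tail; } ListT;

/* Result assignment: the destination is a pointer, the source an
 * object, mirroring @l2->head and @l1.head in the example. */
#define LIST_RESULT_ASSIGN(dst, src) \
	do { \
		if (((dst)->head = (src).head) != NULL) { \
			(dst)->tail = (src).tail; \
		} else { \
			(dst)->tail = &(dst)->head; \
		} \
	} while (0)

/* Hypothetical rule function with one result: the body works on a
 * local (ZI) and the result is copied out through the pointer (ZO). */
static void
rule_empty_list(ListT *ZOlist)
{
	ListT ZIlist;

	ZIlist.head = NULL;
	ZIlist.tail = &ZIlist.head;

	LIST_RESULT_ASSIGN(ZOlist, ZIlist);
}
```

After the call, the caller's list is empty and its tail points at its own head, not at the local that has gone out of scope.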

4.9. The terminal result extraction section

The terminal result extraction section follows the result assignments section. It defines how to extract the results from terminals. The section begins with its section header, followed by the terminal extraction definitions.

There must be a definition for every terminal in the grammar that returns a result. It is an error to include a definition for a terminal that doesn't return a result. The result of the definition should be the same as the result of the terminal. An example of the terminal result extraction section follows:

%terminals%

number : () -> ( n ) = @{
	@n = token.u.number ;
@} ;

identifier : () -> ( i ) = @{
	@i = token.u.identifier ;
@} ;

string : () -> ( s ) = @{
	@s = token.u.string ;
@} ;
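
The extraction definitions above assume a lexer that stores each token's value in a union alongside its terminal number. One possible shape for that token type is sketched below. The member types, the TOK_ names and the helper function are assumptions; only the names token, t and u.number, u.identifier and u.string come from the examples in this section.

```c
/* Hypothetical terminal numbers. */
enum { TOK_NUMBER, TOK_IDENTIFIER, TOK_STRING };

/* One possible definition of the LexerT used in the header example. */
typedef struct {
	int t;                        /* read by CURRENT_TERMINAL */
	union {
		unsigned    number;
		const char *identifier;
		const char *string;
	} u;                          /* value carried by the terminal */
} LexerT;

static LexerT token;

/* Hypothetical lexer step that recognises a number. */
static void
set_number_token(unsigned value)
{
	token.t = TOK_NUMBER;
	token.u.number = value;
}
```

With this layout, the number extraction simply copies token.u.number into the result; it should only be relied on while token.t identifies a number.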

4.10. The action definition section

The action definition section follows the terminal result extraction section. The format is similar to the previous sections: the section header followed by definitions for all of the actions. An action definition has the following form:

<action-name> : ( parameters ) -> ( results ) = code-block ;

This is similar to the form of all previous definitions, except that the name is surrounded by angle brackets. What follows also applies to the other definitions (unless they state otherwise).

The action-name is a sid identifier that is the name of the action being defined; parameters is a comma separated list of C identifiers that will be the names of the parameters passed to the action, and results is a comma separated list of C identifiers that will be the names of the result parameters passed to the action. The code-block is the C code that defines the action. It is expected that this will assign a valid result to each of the result identifier names.

The parameter and result tuples have the same form as in the language independent file, except that the types are optional. As in the language independent file, if the type of an action is zero-tuple to zero-tuple, then the type can be omitted, e.g.:

<action> = @{ /* .... */ @} ;

An example action definition section is:

%actions%

<add> : ( v1, v2 ) -> ( v3 ) = @{
	@v3 = @v1 + @v2 ;
@} ;

<subtract> : ( v1 : NumberT, v2 : NumberT ) -> ( v3 : NumberT ) = @{
	@v3 = @v1 - @v2 ;
@} ;

<multiply> : ( v1 : NumberT, v2 ) -> ( v3 ) = @{
	@v3 = @v1 * @v2 ;
@} ;

<divide> : ( v1, v2 ) -> ( v3 : NumberT ) = @{
	@v3 = @v1 / @v2 ;
@} ;

<print> : ( v ) -> () = @{
	printf ( "%u\n", @v ) ;
@} ;

<error> = @{
	fprintf ( stderr, "ERROR\n" ) ;
	exit ( EXIT_FAILURE ) ;
@} ;

Do not define static variables in action definitions; if you do, you will get unexpected results. If you wish to use static variables in action definitions, then define them in the header block.
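
One plausible way such unexpected results arise, given that an action may be inlined more than once, is that each inlined copy gets its own static object. The sketch below imitates this with two functions standing in for two inlined copies. All names are hypothetical, and this is an illustration of the hazard rather than a description of what sid generates.

```c
/* Two stand-ins for inlined copies of an action that declares its own
 * static counter: each copy has a separate object. */
static int
copy_one(void)
{
	static int calls = 0;
	return ++calls;
}

static int
copy_two(void)
{
	static int calls = 0;
	return ++calls;
}

/* The recommended alternative: one file-scope variable, as would be
 * declared in the header block, shared by every copy. */
static int shared_calls = 0;

static int
shared_one(void)
{
	return ++shared_calls;
}

static int
shared_two(void)
{
	return ++shared_calls;
}
```

Calling copy_one and then copy_two yields 1 both times, while the shared versions count 1 and then 2.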

4.11. The trailer section

After the action definition section comes the trailer section. This has the same form as the header section. An example is:

%trailer% @{
	int main ()
	{
		next_token () ;
		calculator ( NULL ) ;
		return 0 ;
	}
@}, @{
@} ;

The code blocks will be appended to the generated parser, and the generated header file respectively.