4. The C information file

  4.1. Lexical conventions
  4.2. The prefixes section
  4.3. The persistent section
  4.4. The maps section
  4.5. The header section
  4.6. The assignments section
  4.7. The parameter assignments section
  4.8. The result assignments section
  4.9. The terminal result extraction section
  4.10. The action definition section
  4.11. The trailer section

The grammar specification itself is not sufficient to produce a parser. There also needs to be output language specific information to allow the parser to interface with the program it is to be part of. In the case of the C output routines, sid needs to know the following information:

Eventually almost all of this should be user suppliable. At the moment, some of the information is supplied by the user in the C information file, some through macros, and some is built in. sid currently gets the information as follows:

The remainder of this section describes the layout of the C information file. The lexical conventions are described first, followed by a description of the sections in the order in which they should appear. Unlike the sid grammar file, not all sections are mandatory.

4.1. Lexical conventions

The lexical conventions of the C information file are very similar to those of the sid grammar file, with two additions. The first is a second class of identifier, the C identifier, which is a subset of the valid sid identifiers. The second is the C code block.

A C code block begins with @{ and is terminated by @}. The code block consists of all of the characters between the start and end of the code block, subject to substitutions. All substitutions begin with the @ character. The following substitutions are recognised:

@@

This substitutes the @ character itself.

@:label

This form marks a label, which will be substituted for in the output code. This is necessary, because an action may be inlined into the same function more than once. If this happens, then without doing label substitution there would be two identical labels in the same scope. With label substitution, this problem is avoided. In general, all references to a label within an action should be prefixed with @:. This substitution may not be used in header and trailer code blocks.

@identifier

This form marks a parameter or result identifier substitution. If parameter and result identifiers are not prefixed with an @ character, then they will not be substituted. It is an error if the identifier is not a parameter or a result. Header and trailer code blocks have no parameters or results, so it is always an error to use identifier substitution in them. It is an error if any of the result identifiers are not substituted at least once.

Result identifiers may be assigned to using this form of identifier substitution, but parameter identifiers may not be (nor may their address be taken - they are immutable). To try to prevent this, parameters that are substituted may be cast to their own type, which makes them unmodifiable in ISO C (see the notes on the casts language specific option).

@&identifier

This form marks a parameter identifier whose address is to be substituted, but whose contents will not be modified. The effects of modifying the identifier are undefined. It is an error to use this in parameter assignment operator definitions.

@=identifier

This form marks a parameter identifier that will be modified. For this to be useful, the parameter should be a call by reference parameter, so that the effect of the modification will be propagated. This substitution is only valid in actions (assignment operators are not allowed to modify their parameters).

@!

This form marks an exception raise. In the generated code, a jump to the current exception handler will be substituted. This substitution is only valid in actions and terminal extraction rules.

@.

This form marks an attempt to access the current terminal. This substitution is only valid in actions.

@>

This form marks an attempt to advance the lexical analyser. This substitution is only valid in actions.

@$terminal

This form introduces a terminal, as would be referenced by the parser itself. This serves two purposes. Firstly, it is a convenience that keeps the code consistent with the grammar (as opposed to writing the underlying C symbols). Secondly, the expansion of @$ is subject to the same rules as references to terminals elsewhere in the grammar. Most notably, the -s numeric-terminals option causes the terminal name to expand numerically.

All other forms are illegal. Note that in the case of labels and identifiers, no white space is allowed between the @:, @, @& or @= and the identifier name. An example of a code block is:

@{
	/* A code block */
	{
		int i ;

		if ( @param ) {
			@! ;
		}

		@result = 0 ;
		for ( i = 0 ; i < 100 ; i++ ) {
			printf ( "{%d}\n", i ) ;
			@result += i ;
		}

		@=param += @result ;
		if ( @. == @$SEMI ) {
			@> ;
		}
	}
@}
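To make the substitutions more concrete, the following is a hand-written C sketch of roughly what the code block above might turn into once it is placed inside a generated function. It is an illustration only, not sid's actual output: the function name expanded_action, the enum value TOK_SEMI, the counter advance_count and the label ZL1 are all hypothetical, and the assumption that @. and @> end up as the CURRENT_TERMINAL and ADVANCE_LEXER macros is based on the macros shown in the header section example. The printf call is left out to keep the sketch quiet.

```c
/* Hypothetical stand-ins for names that would really come from the
 * header block and from sid itself. */
enum { TOK_SEMI = 1 };
static int current_terminal = TOK_SEMI;
static int advance_count = 0;

#define CURRENT_TERMINAL current_terminal          /* assumed target of @. */
#define ADVANCE_LEXER    ((void) advance_count++)  /* assumed target of @> */

/* Hand-expanded form of the example code block.
 * Returns 0 normally and -1 when the exception path is taken. */
int expanded_action(int *param, int *result)
{
	{
		int i;

		if (*param) {
			goto ZL1;                /* @! : jump to the handler */
		}

		*result = 0;                     /* @result */
		for (i = 0; i < 100; i++) {
			*result += i;
		}

		*param += *result;               /* @=param : modified in place */
		if (CURRENT_TERMINAL == TOK_SEMI) {  /* @. == @$SEMI */
			ADVANCE_LEXER;           /* @> */
		}
	}
	return 0;

ZL1:	/* hypothetical handler label, named using the default label prefix */
	return -1;
}
```

Because the parameter is modified through @=param, the sketch passes it by address, which mirrors the call by reference requirement described above.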

4.2. The prefixes section

The first section in the C information file is the prefix definition section. This section is optional. It begins with the section header, followed by a list of prefix definitions. A prefix definition begins with the prefix name, followed by a = symbol, followed by a C identifier that is the new prefix, and terminated by a semicolon. The following example shows all of the prefix names, and their default values:

%prefixes%
type      = ZT ;
function  = ZR ;
label     = ZL ;
input     = ZI ;
output    = ZO ;
terminal  = ZB ;

4.3. The persistent section

sid supports passing local variables through rules in the grammar, which are eventually passed on to actions. This helps keep the generated parser thread-safe, since each variable may be passed through from the entry point. In practice, however, grammars often build up a structure which is conceptually global to all rules (commonly some sort of parse tree under construction). To pass this through each rule in the grammar and on to all actions is certainly possible, but a little inconvenient:

rule: ( l1 : ParsetreeT, ... ) -> ( ... ) = {
	...
}

However, adding this declaration to each rule and action would be tiresome and error-prone. Instead, it reads more naturally to view this variable as if it were global. This keeps it out of the grammar entirely, as it is a concept specific to the action file. Persistent variables provide a mechanism to automate this process of passing-through as described above, whilst leaving the grammar file untouched. (This could be done by hand, though it would require passing variables through rules in the grammar; hence persistent variables are not necessary, but merely nice to have.)

From a user's perspective, persistent variables act as globals specific to each invocation. They are accessible by every rule and every parsing instance. Since they originate from an entry point, they persist only for each invocation of an entry into the parser.

Persistent variables are declared in their own section. This section is optional.

%persistents%

	pv1 : Type1T ;
	pv2 : Type2T ;

The persistent variables declared may be used in actions in the same manner as actions' parameters:

<append-node> : ( l1 : Type3 ) -> ( ) = @{
	f ( @pv1, @pv2, @l1 ) ;
@} ;

These are passed in at the entry point to the parser.

Since the ADVANCE_LEXER macro is expanded inside generated functions that represent rules, it too may access persistent variables, as they are in scope in all rules.
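The passing-through that persistent variables automate can be pictured in plain C. The sketch below is hand-written and is not what sid generates; ParseTreeT, append_node, rule_item, rule_list and parse_entry are hypothetical names chosen for illustration. It shows a value supplied once at the entry point and handed on, unchanged, through every rule until an action finally uses it.

```c
/* Hypothetical parse-tree type, standing in for ParsetreeT. */
typedef struct {
	int node_count;
} ParseTreeT;

/* Action body: the only place that really uses the tree. */
static void append_node(ParseTreeT *tree)
{
	tree->node_count++;
}

/* Rules that do nothing with the tree except hand it on. */
static void rule_item(ParseTreeT *tree)
{
	append_node(tree);
}

static void rule_list(ParseTreeT *tree)
{
	rule_item(tree);
	rule_item(tree);
}

/* Entry point: the tree is supplied once per invocation. */
void parse_entry(ParseTreeT *tree)
{
	rule_list(tree);
}
```

Since no file-scope variable is involved, two invocations given different trees do not interfere with each other, which is the property that passing state from the entry point is meant to preserve.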

4.4. The maps section

The maps section comes after the prefixes and persistent sections. This section is also optional. It begins with its section header, followed by a list of identifier mappings. An identifier mapping begins with a sid identifier (either a type, a rule or a terminal), followed by the -> symbol, followed by the C identifier it is to be mapped to, and terminated by a semicolon. An example follows:

%maps%
NumberT    -> unsigned ;
calculator -> calculator ;

Note that it is not possible to map type identifiers to be arbitrary C types. It will be necessary to typedef or macro define the type name in the C file.

It is recommended that all types, terminals and entry point rules have their names mapped in this section, although this is not necessary. If the names are not mapped, they will appear in the rest of the program under automatically generated names, which are awkward to use.
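As an illustration of the typedef requirement noted above, a pointer type such as "struct node *" is not a single C identifier and so cannot appear on the right of a mapping. The sketch below, in which struct node, NodePtrT and first_value are hypothetical names, gives the type a one-word name in the C code; a mapping could then refer to NodePtrT.

```c
/* Hypothetical list node used only for illustration. */
struct node {
	struct node *next;
	unsigned     value;
};

/* "struct node *" is not a single identifier, so it cannot be the
 * target of a mapping; a typedef gives it a name that can be. */
typedef struct node *NodePtrT;

/* Trivial use of the typedef'd name. */
unsigned first_value(NodePtrT list)
{
	return list ? list->value : 0u;
}
```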

4.5. The header section

After the maps section comes the header section. This begins with the section header, followed by a code block, followed by a comma, followed by a second code block, and terminated with a semicolon. The first code block will be inserted at the beginning of the generated parser file; the second code block will be inserted at the start of the generated header file. An example is:

%header% @{
	#include <stdio.h>
	#include <stdlib.h>
	#include "lexer.h"

	LexerT token ;

	#define CURRENT_TERMINAL token.t
	#define ADVANCE_LEXER    next_token ()

	extern void terminal_error () ;
	extern void syntax_error () ;
@}, @{
@} ;
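The header example relies on a lexer that is not shown. The following is a minimal sketch of what such a lexer might look like, assuming a token structure with a terminal number t and a union u of terminal results. Everything here except the names LexerT, token, next_token, CURRENT_TERMINAL and ADVANCE_LEXER is hypothetical, including the terminal numbers, lexer_init and the input handling.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical terminal numbers; a real parser would use sid's. */
enum { T_EOF, T_NUMBER, T_PLUS, T_ERROR };

typedef struct {
	int t;                          /* terminal number */
	union { unsigned number; } u;   /* result carried by the terminal */
} LexerT;

LexerT token;
static const char *input = "";

#define CURRENT_TERMINAL token.t
#define ADVANCE_LEXER    next_token ()

/* Hypothetical helper: point the lexer at a string to scan. */
void lexer_init(const char *text)
{
	input = text;
}

/* Read the next terminal from the input into the global token. */
void next_token(void)
{
	while (isspace((unsigned char) *input)) {
		input++;
	}

	if (*input == '\0') {
		token.t = T_EOF;
	} else if (isdigit((unsigned char) *input)) {
		char *end;

		token.u.number = (unsigned) strtoul(input, &end, 10);
		token.t = T_NUMBER;
		input = end;
	} else if (*input == '+') {
		token.t = T_PLUS;
		input++;
	} else {
		token.t = T_ERROR;
		input++;
	}
}
```

Keeping the terminal number and its result together in one structure means the two macros can be simple expressions over that structure.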

4.6. The assignments section

The assignments section follows the header section. This section is optional. Normally, assignment between two identifiers will be done using the C assignment operator. In some cases this will not do the correct thing, and it is necessary to do the assignment differently. All types for which this applies should have an entry in the assignments section. The section begins with its header, followed by definitions for each type that needs its own assignment operator. Each definition should have one parameter, and one result. The action's name should be the name of the type. An example follows:

%assignments%

ListT : ( l1 ) -> ( l2 ) = @{
	if ( @l2.head = @l1.head ) {
		@l2.tail = @l1.tail ;
	} else {
		@l2.tail = &( @l2.head ) ;
	}
@} ;
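The example suggests a list whose tail field points at its own head field when the list is empty. Under that assumption, the plain C sketch below shows why the built-in assignment is not enough for such a type: a structure copy leaves the tail of the copy pointing into the original. The definitions of NodeT and ListT and the helpers list_init and list_assign are hypothetical and only mirror the logic of the operator above.

```c
#include <stddef.h>

/* Hypothetical list representation matching the example's fields. */
typedef struct NodeT {
	struct NodeT *next;
	int           value;
} NodeT;

typedef struct {
	NodeT  *head;
	NodeT **tail;   /* where the next node is to be linked in */
} ListT;

/* An empty list has its tail pointing at its own head field. */
void list_init(ListT *l)
{
	l->head = NULL;
	l->tail = &l->head;
}

/* Same logic as the assignment operator in the example. */
void list_assign(ListT *dst, const ListT *src)
{
	if ((dst->head = src->head) != NULL) {
		dst->tail = src->tail;
	} else {
		dst->tail = &dst->head;   /* re-point at our own head */
	}
}
```

After a plain structure assignment from an empty list, the tail of the copy is the address of the source's head field, whereas list_assign leaves it pointing at the copy's own head field.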

If a type has an assignment operator defined, it must also have a parameter assignment operator defined and a result assignment operator defined (more precisely, it must have either no assignment operations defined, or all three assignment operations defined).

4.7. The parameter assignments section

The parameter assignments section is very similar to the assignments section (which it follows), and is also optional. If a type has an assignment section entry, it must have a parameter assignment entry as well.

The parameter assignment operator is used in function calls to ensure that the object is copied correctly. If no parameter assignment operator is provided for a type, the standard C call by copy mechanism is used. If one is provided, the calling function passes the address of the object, and the called function declares a local of the same type and uses the parameter assignment operator to copy the object into it. This should be remembered when passing parameters to entry points that have arguments of a type that has a parameter assignment operator defined.

The difference between the parameter assignment operator and the assignment operator is that the parameter identifier to the parameter assignment operator is a pointer to the object being manipulated, rather than the object itself. An example parameter assignment section is:

%parameter-assignments%

ListT : ( l1 ) -> ( l2 ) = @{
	if ( @l2.head = @l1->head ) {
		@l2.tail = @l1->tail ;
	} else {
		@l2.tail = &( @l2.head ) ;
	}
@} ;
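The calling convention described in this section can be sketched in plain C. This is a hand-written illustration, not generated code: the type definitions and the function rule_sees_empty_list are hypothetical. The caller passes the address of its list, and the called function copies it into a local using the same logic as the operator above, with the source accessed through a pointer and the destination as an object.

```c
#include <stddef.h>

/* Hypothetical list representation matching the example's fields. */
typedef struct NodeT {
	struct NodeT *next;
	int           value;
} NodeT;

typedef struct {
	NodeT  *head;
	NodeT **tail;
} ListT;

/* Hypothetical rule function for a type with a parameter assignment
 * operator: it receives the address of the caller's object. */
int rule_sees_empty_list(ListT *arg)
{
	ListT local;

	/* Parameter assignment: source is a pointer, destination an object. */
	if ((local.head = arg->head) != NULL) {
		local.tail = arg->tail;
	} else {
		local.tail = &local.head;
	}

	/* 1 if the local copy is a well-formed empty list, 0 otherwise. */
	return local.head == NULL && local.tail == &local.head;
}
```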

4.8. The result assignments section

The result assignments section is very similar to the assignments section and the parameter assignments section (which it follows), and is also optional. If a type has an assignment section entry, it must also have a result assignment entry. The only difference from the assignment operator is that the result identifier of the result assignment operation is a pointer to the object being manipulated, rather than the object itself. Result assignments are only used when the results of a rule are assigned back through the reference parameters passed into the function. An example result assignment section is:

%result-assignments%

ListT : ( l1 ) -> ( l2 ) = @{
	if ( @l2->head = @l1.head ) {
		@l2->tail = @l1.tail ;
	} else {
		@l2->tail = &( @l2->head ) ;
	}
@} ;
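The way a result travels back through a reference parameter can be sketched in plain C as follows. This is an illustration under assumed names (the type definitions and rule_make_empty are hypothetical), not sid's output. The function builds its result in a local and then copies it out with the same logic as the operator above, where the destination is accessed through a pointer.

```c
#include <stddef.h>

/* Hypothetical list representation matching the example's fields. */
typedef struct NodeT {
	struct NodeT *next;
	int           value;
} NodeT;

typedef struct {
	NodeT  *head;
	NodeT **tail;
} ListT;

/* Hypothetical rule function that returns a list through a
 * reference parameter supplied by its caller. */
void rule_make_empty(ListT *out)
{
	ListT result;

	result.head = NULL;
	result.tail = &result.head;

	/* Result assignment: destination is a pointer, source an object. */
	if ((out->head = result.head) != NULL) {
		out->tail = result.tail;
	} else {
		out->tail = &out->head;
	}
}
```

Without the else branch, the caller's tail would be left pointing at the local variable, which no longer exists once the function returns.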

4.9. The terminal result extraction section

The terminal result extraction section follows the result assignments section. It defines how to extract the results from terminals. The section begins with its section header, followed by the terminal extraction definitions.

There must be a definition for every terminal in the grammar that returns a result. It is an error to include a definition for a terminal that doesn't return a result. The result of the definition should be the same as the result of the terminal. An example of the terminal result extraction section follows:

%terminals%

number : () -> ( n ) = @{
	@n = token.u.number ;
@} ;

identifier : () -> ( i ) = @{
	@i = token.u.identifier ;
@} ;

string : () -> ( s ) = @{
	@s = token.u.string ;
@} ;

4.10. The action definition section

The action definition section follows the terminal result extraction section. The format is similar to the previous sections: the section header followed by definitions for all of the actions. An action definition has the following form:

<action-name> : ( parameters ) -> ( results ) = code-block ;

This is similar to the form of all previous definitions, except that the name is surrounded by angle brackets. What follows is also true of the other definitions (unless they state otherwise).

The action-name is a sid identifier that is the name of the action being defined; parameters is a comma separated list of C identifiers that will be the names of the parameters passed to the action, and results is a comma separated list of C identifiers that will be the names of the result parameters passed to the action. The code-block is the C code that defines the action. It is expected that this will assign a valid result to each of the result identifier names.

The parameter and result tuples have the same form as in the language independent file, except that the types are optional. As in the language independent file, if the type of an action is zero-tuple to zero-tuple, then the type can be omitted, e.g.:

<action> = @{ /* .... */ @} ;

An example action definition section is:

%actions%

<add> : ( v1, v2 ) -> ( v3 ) = @{
	@v3 = @v1 + @v2 ;
@} ;

<subtract> : ( v1 : NumberT, v2 : NumberT ) -> ( v3 : NumberT ) = @{
	@v3 = @v1 - @v2 ;
@} ;

<multiply> : ( v1 : NumberT, v2 ) -> ( v3 ) = @{
	@v3 = @v1 * @v2 ;
@} ;

<divide> : ( v1, v2 ) -> ( v3 : NumberT ) = @{
	@v3 = @v1 / @v2 ;
@} ;

<print> : ( v ) -> () = @{
	printf ( "%u\n", @v ) ;
@} ;

<error> = @{
	fprintf ( stderr, "ERROR\n" ) ;
	exit ( EXIT_FAILURE ) ;
@} ;

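Do not define static variables in action definitions. An action may be inlined at more than one place in the generated parser, and each inlined copy would then get its own separate static variable, giving unexpected results. If you wish to use static variables in action definitions, define them in the header block.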
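The problem with static variables can be reproduced in plain C. In the sketch below, rule_a and rule_b are hypothetical functions standing for two places where the same action body has been inlined. The block-scope static is duplicated along with the body, whereas a file-scope variable, as a definition in the header block would be, exists only once.

```c
/* Defined once at file scope, as a header block definition would be. */
static int shared_calls = 0;

/* Two hypothetical rule functions, each holding its own inlined copy
 * of the same action body. */
int rule_a(void)
{
	int seen;
	{
		static int calls = 0;   /* private to this inlined copy */

		seen = ++calls;
		shared_calls++;
	}
	return seen;
}

int rule_b(void)
{
	int seen;
	{
		static int calls = 0;   /* a different variable from rule_a's */

		seen = ++calls;
		shared_calls++;
	}
	return seen;
}
```

Each copy counts only its own executions, so neither block-scope counter reflects the total number of times the action has run; only the file-scope counter does.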

4.11. The trailer section

After the action definition section comes the trailer section. This has the same form as the header section. An example is:

%trailer% @{
	int main ()
	{
		next_token () ;
		calculator ( NULL ) ;
		return 0 ;
	}
@}, @{
@} ;

The code blocks will be appended to the generated parser, and the generated header file respectively.