The SID Developers' Guide
- i. Introduction
- 1. sid's Organisation
- 2. The Internal Representation of a Grammar
- 3. The Representation of an Action Code
- 4. The Scope System
- 5. API Organisation
© , The TenDRA Project.
First published .
Revision History
kate | Moved out much of the datastructure comments to the code itself. This reverse engineering better serves its purpose kept with the code, so that it may evolve as the code changes. The developer guide (now slightly spartan) can grow anew giving structual and organisational information rather than a field-by-field analysis. | |
kate | Initial import (and conversion from LaTeX) of the SID Developer Guide, researched and written by Kevin. |
i. Introduction
sid is an LL(1) parser. It reads a grammar specification (documented in the sid users' guide) then outputs a parser.
In this documentation, we explain sid's internal representation of grammars. It was reverse engineered from the code itself and the accompanying comments. Most of the reverse engineering was done reading sid's own grammar file and most notably the actions in the parser.act file. This document does not explain in detail how sid parses the .sid and the .act files.
[TODO this guide has been gutted; what remains gives a skeletal outline which will serve as a tour of the use and contents of the internal structures, without concerning itself with the implementation minuate. (Comments by the structures and API routines serve more accurately there, and will not fall out of sync with the code so easily). Here we can state how they all fit together, and what one might do with the data. dumping implementation to graphviz would be great, too.]
ii. Prerequisites
Before reading this guide, you should have read the sid users' guide and be able to use sid to output a simple parser.
iii. Who Should Read This Document?
Anyone interested in the post-processing steps sid does to a grammar after having parsed the .sid and the .act files. In particular, if you want to work on the grammar transforms and improve them you need to know sid's internal representations of grammars or at least know the interface to the internal representation. You may not want to reverse engineer this representation yourself.
1. sid's Organisation
- 1.1. The
main
Function - 1.2. Adding a New Output Language:
main_language_list
- 1.3. Code Organisation and Conventions
When you call sid, the main operations it performs are
-
Reads the grammar .sid file and stores its internal representation.
-
Reads the grammar output language specific .act file and complete the representation of the grammars with the action code. (After this step, sid only works on the internal representation.)
-
Transforms and Optimises the Grammar. Most notably, it removes left recursion and tries to transform the context free grammar provided in an equivalent LL(1) grammar.
-
Outputs the parser.
1.1. The main
Function
Let's look at some parts of the code in main.c. In main
, you can see HANDLE
and WITH
: these are macros that emulate an exception mechanism in C. These come from libexds.
The function main
itself doesn't do much: it initializes some structures, calls main_init
to process the command line options then calls main_1
to do all the interesting work. Now look at the function main_1
, it calls in order (forgetting about all the initialisation stuff and the error handling stuff)
sid_parse_grammar()
-
parses the .sid file and converts it into an internal representation.
grammar_check_complete(&grammar)
-
verifies that all rules are accessible.
(*(main_language->input_proc))(output_closure, &grammar)
-
parses the action file and completes the internal representation of the grammar.
grammar_remove_left_recursion(&grammar)
-
TODO
grammar_compute_first_set(&grammar)
-
computes the first set of each rule in the grammar.
grammar_factor(&grammar)
-
TODO
grammar_simplify(&grammar)
-
TODO
grammar_compute_inlining(&grammar)
-
TODO
grammar_check_collisions(&grammar)
-
TODO
grammar_recompute_alt_names(&grammar)
-
TODO
(*(main_language->output_proc))(output_closure, &grammar)
-
outputs the parser in the chosen language.
You may wonder what main_language
is. We explain it in the next section.
1.2. Adding a New Output Language: main_language_list
The global variable main_language
allows us to easily modify the output language. It's a pointer to a structure called LangListT
. This pointer will always point to an element in the table main_language_list
, which contains callbacks for the various stages of processing.
The first member indicates the option name. The second one is a pointer to the initialisation function. The third one contains a pointer to the top input routine for the action file. The fourth one is an integer indicating the number of input file (2 for outputting C, 1 for test). Then we have the top output language specific output function and finally the number of outputted file. Don't remove the last line: it serves as a guard. To add a new output language, add a line to main_language_list
and implement the new top level functions.
1.3. Code Organisation and Conventions
If you read this guide, it is probably because you want to modify sid. In this section, we say how sid is organised and how one should modify the code to keep the code readable.
sid defines many types for the internal representation of a grammar. These types are defined in the header files, begins with a majuscule and ends with T, e.g. RuleT
. If a type, MytypeT
is declared in myfile.h, then any function that directly touches the members of MytypeT
begins with mytype_
and is defined in file.c. No other function should touch MytypeT
directly. If you want to access an object of a certain type, do not access its members directly. Instead, use the interface declared in the header (the same header where the type is declared).
2. The Internal Representation of a Grammar
- 2.1. The
GrammarT
Type and Structure - 2.2. The
TableT
Type - 2.3. Internal Representation of Actions
- 2.4. Internal Representation of Rules
- 2.5. Internal Representation of Non-Locals
We present the internal representation in a top down structure. Many functions are provided that hides the implementation; all access to these structures is performed through their respective APIs. Those interfaces are not documented here; rather, a brief functional tour of content is given instead.
The implementation details behind the interfaces are documented in their respective datastructures; see the various header files for details per-field. Virtually all of these are stored in the adt directory.
2.1. The GrammarT
Type and Structure
sid stores grammars in a structure called GrammarT
.
It contains a TableT
hash map containing elements of type EntryT
. We'll talk about EntryT
later. For the moment remember that an EntryT
is a structure that can be used to store either an action, a rule, a type, a terminal, or a nonlocal.
Please note that in the code terminals are referred as basics in the code, probably for historical reasons.
2.2. The TableT
Type
It holds most of the information about a grammar, rules, actions, terminals, non-local names, types and local names along with their scopes.
Its definition is very simple, containing only a hash table of EntryT
.
2.3. Internal Representation of Actions
Actions are stored in a ActionT
. More exactly, they are stored in an EntryT
whose union member is an ActionT
.
2.4. Internal Representation of Rules
Look at an extremely simple sid rule:
basic-arithmetic = { number;plus;number; || number;minus;number; };
It has two alternatives, each one holding three items. Alternatives have their special type: AltT
.
It also contains a list of ItemT
. In our previous example, the list of items of the first alternate would be number
, plus
and number
.
An ItemT
is therefore a rule, an action call, a terminal or TODO(rename, name) already declared in table.
And now the RuleT
type; this structure is very long.
2.5. Internal Representation of Non-Locals
A non-local is an EntryT
whose type is ET_NON_LOCAL
and whose member u
is also an EntryT*
: the target of this pointer is a type giving the type of the non-local. The key
member is of course the name of the non-local but also contains its scope. IE, non-locals are accessible to all rules contained in the rule for which the non-local is defined. The scope is indicated in the key
string by outer-enclosing-rule::inner-enclosing-rule::name-of-non-local
.
3. The Representation of an Action Code
This depends on the output language but as only C is supported at the moment, we'll describe how the body of an action is represented.
3.1. The CCodeT
Type
To represent the body of an action when the output language is C, sid uses the type CCodeT
4. The Scope System
The scope system is only used when reading the grammar definition file (the .sid file). The only trace of it once the scope system has been deleted are the key members of the scoped rules and the always scope non-locals that begins with outer-enclosing-rule1::inner-enclosing-rule
.
5. API Organisation
- 5.1. Interface to TableT
- 5.2. Interface to EntryT
- 5.3. Interface to RuleT
- 5.4. Interface to BasicT
- 5.5. Interface to ActionT
- 5.6. Interface to NonLocalT
[TODO we'll automatically generate this using Puredocs]
Most of the implementation is hidden to the user. One should only use the following functions to access everything.
5.1. Interface to TableT
As a general rule, you add entries to a TableT by using the functions listed in adt/table.h.
5.2. Interface to EntryT
TODO
5.3. Interface to RuleT
TODO
5.4. Interface to BasicT
TODO
5.5. Interface to ActionT
TODO
5.6. Interface to NonLocalT
TODO