2. Structural Organisation
- 2.1. Source code modules
- 2.2. Type system
- 2.2.1. Primitive types
- 2.2.2.
CV_SPEC
- 2.2.3.
BUILTIN_TYPE
- 2.2.4.
BASE_TYPE
- 2.2.5.
INT_TYPE
- 2.2.6.
FLOAT_TYPE
- 2.2.7.
CLASS_INFO
- 2.2.8.
CLASS_USAGE
- 2.2.9.
CLASS_TYPE
- 2.2.10.
GRAPH
- 2.2.11.
VIRTUAL
- 2.2.12.
ENUM_TYPE
- 2.2.13.
TYPE
- 2.2.14.
DECL_SPEC
- 2.2.15.
HASHID
- 2.2.16.
QUALIFIER
- 2.2.17.
IDENTIFIER
- 2.2.18.
MEMBER
- 2.2.19.
NAMESPACE
- 2.2.20.
NAT
- 2.2.21.
FLOAT
- 2.2.22.
STRING
- 2.2.23.
NTEST
- 2.2.24.
RMODE
- 2.2.25.
EXP
- 2.2.26.
OFFSET
- 2.2.27.
TOKEN
- 2.2.28.
INSTANCE
- 2.2.29.
ERROR
- 2.2.30.
VARIABLE
- 2.2.31.
LOCATION
- 2.2.32.
POSITION
- 2.2.33.
BITSTREAM
- 2.2.34.
BUFFER
- 2.2.35.
OPTIONS
- 2.2.36.
PPTOKEN
This section describes the basic organisation of the source code for the C++ producer. This includes the division of the code into separate modules and the type system conventions.
2.1. Source code modules
For convenience, the source code is divided between a number of directories:
- base/
-
The base directory contains only the module containing the main function, the basic type descriptions and the Makefile.
- obj_c/
- obj_tok/
- obj_templ/
-
The directories
obj_c
andobj_tok
contain respectively the C and#pragma token
headers generated from the type algebra by calculus. The directoryobj_templ
contains certain calculus template files. - utility/
-
Routines for such utility operations as memory allocation and error reporting, including the error catalogue.
- parse/
-
Routines concerned with parsing and preprocessing the input, including the sid grammar.
- construct/
-
Routines for building up and analysing the internal representation of the parsed code.
- output/
-
Routines for outputting the internal representation in various formats including as a TDF capsule (see TDF Specification), a C++ spec file (see C/C++ Producer Implementation), or a tdfc2dump symbol table dump file.
Each module consists of a C source file, file.c say, containing function definitions, and a corresponding header file file.h containing the declarations of these functions. The header is included within its corresponding source file to check these declarations; it is protected against multiple inclusions by a macro of the form file_INCLUDED. The header contains a brief comment describing the purpose of the module; each function in the source file contains a comment describing its purpose, its inputs and its output.
The following table lists all the source modules in the C++ producer with a brief description of the purpose of each:
Module | Directory | Purpose |
---|---|---|
access | construct | member access control |
allocate | construct | new and delete expressions |
assign | construct | assignment expressions |
basetype | construct | basic type operations |
buffer | utility | buffer reading and writing routines |
c_class | obj_c | calculus support routines |
capsule | output | top-level TDF encoding routines |
cast | construct | cast expressions |
catalog | utility | error catalogue definition |
char | parse | character sets |
check | construct | expression checking |
chktype | construct | type checking |
class | construct | class and enumeration definitions |
compile | output | TDF tag definition encoding routines |
constant | parse | integer constant evaluation |
construct | construct | constructors and destructors |
convert | construct | standard type conversions |
copy | construct | expression copying |
debug | utility | development aids |
declare | construct | variable and function declarations |
decode | output | bitstream reading routines |
derive | construct | base class graphs; inherited members |
destroy | construct | garbage collection routines |
diag | output | TDF diagnostic output routines |
dump | output | symbol table dump routines |
encode | output | bitstream writing routines |
error | utility | error output routines |
exception | construct | exception handling |
exp | output | TDF expression encoding routines |
expression | construct | expression processing |
file | parse | low-level I/O routines |
function | construct | function definitions and calls |
hash | parse | hash table and identifier name routines |
identifier | construct | identifier expressions |
init | output | TDF initialiser expression encoding routines |
initialise | construct | variable initialisers |
instance | construct | template instances and specialisations |
inttype | construct | integer and floating point type routines |
label | construct | labels and jumps |
lex | parse | lexical analysis |
literal | parse | integer and string literals |
load | output | C++ spec reading routines |
macro | parse | macro expansion |
main | – | main routine; command-line arguments |
mangle | output | identifier name mangling |
member | construct | member selector expressions |
merge | construct | intermodule merge routines |
namespace | construct | namespaces; name look-up |
operator | construct | overloaded operators |
option | utility | compiler options |
overload | construct | overload resolution |
parse | parse | low-level parser routines |
pragma | parse | #pragma directives |
predict | parse | parser look-ahead routines |
preproc | parse | preprocessing directives |
utility | error argument printing routines | |
quality | construct | extra expression checks |
redeclare | construct | variable and function redeclarations |
rewrite | construct | inline member function definitions |
save | output | C++ spec writing routines |
shape | output | TDF shape encoding routines |
statement | construct | statement processing |
stmt | output | TDF statement encoding routines |
struct | output | TDF structure encoding routines |
syntax[0-9]* | parse | sid parser output |
system | utility | system dependent routines |
table | parse | portability table reading |
template | construct | template declarations and checks |
throw | output | TDF exception handling encoding routines |
tok | output | TDF standard tokens encoding |
tokdef | construct | token definitions |
token | construct | token declarations and expansion |
typeid | construct | run-time type information |
unmangle | output | identifier name unmangling |
variable | construct | variable analysis |
virtual | construct | virtual functions |
xalloc | utility | memory allocation routines |
2.2. Type system
This section describes the type system used in the C++ producer. Unless otherwise stated the types are declared using the calculus tool as part of the algebra, c_class.alg
. The design of this type algebra was clearly largely based on the concepts underlying the C++ language; however TDF provided an important influence, not merely as the intended target language, but also because of its clear presentation of essential language features.
2.2.1. Primitive types
The primitive types used within the algebra c_class
are defined as follows:
int = "int" ; unsigned = "unsigned" ; string = "character *" ; ulong_type (ulong) = "unsigned long" ; BITSTREAM_P (bits) = "BITSTREAM *" ; PPTOKEN_P (pptok) = "PPTOKEN *" ;
The integral types are self-explanatory. All string literals used in the C++ producer are based on the character type:
typedef unsigned char character ;
hence the definition of string
. The remaining primitive give links to those portions of the type system which are defined outside of the algebra. The types BITSTREAM
and PPTOKEN
are described below.
2.2.2. CV_SPEC
The enumeration type CV_SPEC
(short name cv
) is used to represent a C++ type qualifier. It takes the form of a bitfield, the elements of which can be or-ed together to represent combinations of type qualifiers. The cv-qualifiers are represented by cv_const
and cv_volatile
in the obvious manner. The value cv_lvalue
is used as a qualifier to indicate whether a type is an lvalue or an rvalue. Other values are used in function types to represent the function language linkage.
2.2.3. BUILTIN_TYPE
The enumeration type BUILTIN_TYPE
(ntype
) is used to represent the built-in C++ types (char
, float
, void
etc.). It is used chiefly as an index into tables of type information.
2.2.4. BASE_TYPE
The enumeration type BASE_TYPE
(btype
) is used to represent a C++ simple type specifier such as signed
, short
or int
. It takes the form of a bitfield, the elements of which can be or-ed together to represent combinations of type specifiers. Its chief use is when reading a type from the input file; the various simple type specifiers are combined to give a value of this type, which is then mapped to an actual C++ type.
2.2.5. INT_TYPE
The union type INT_TYPE
(itype
) is used to represent an integral or bitfield C++ type. The basic integral types are given by the basic
field. Bitfield types are represented by the bitfield
field. There are also fields representing target dependent integral promotion, arithmetic and integer literal types, plus VARIETY
tokens. Only one INT_TYPE
object is created for each integral type.
2.2.6. FLOAT_TYPE
The union type FLOAT_TYPE
(ftype
) is used to represent a floating point C++ type. The basic floating point types are given by the basic
field. There are also fields representing target dependent argument promotion and arithmetic types, plus FLOAT
tokens. Only one FLOAT_TYPE
object is created for each floating point type.
2.2.7. CLASS_INFO
The enumeration type CLASS_INFO
(cinfo
) is used to represent information relating to a class or enumeration definition. It takes the form of a bitfield, the elements of which can be or-ed together to represent various combinations of properties.
2.2.8. CLASS_USAGE
The enumeration type CLASS_USAGE
(cusage
) is used to represent information relating to the way a class is used. It takes the form of a bitfield, the elements of which can be or-ed together to represent various combinations of properties.
2.2.9. CLASS_TYPE
The union type CLASS_TYPE
(ctype
) is used to represent a C++ class or union. The main components are an identifier giving the class name, class information and class usage fields, a namespace giving the class members, a graph representing the base class structure, and a virtual function table. Only one CLASS_TYPE
object is created for each class or union.
Each class maintains a list, pals
, of class and function identifiers which are declared as friends of that class. It also maintains a list, chums
, of those class types which declare it to be a friend (this is what is actually used in the access checks). Similarly each function identifier maintains a list, chums
, of those class types which declare it to be a friend.
Each class maintains a list of its constructors, destructors and conversion functions (included inherited conversion functions). It also maintains a list of its virtual base classes. This information can be obtained by other means but it is more convenient to record it within the class type itself.
2.2.10. GRAPH
The union type GRAPH
(graph
) is used to represent a directed acyclic graph arising from the base classes of a class. Each node of the graph has a head
which is a class type, and several tails
which give the base class graphs for that class. Each node has pointers, top
, to the top of the graph (i.e. the most derived class), and up
, to the node of which the current node is a direct base. Each node also has an access
field which gives information on the base access, whether it is virtual or not, and so on, in the form of a DECL_SPEC
. Virtual bases are handled by the equal
field which defines an equivalence relation on the graph which identifies equivalent virtual bases.
2.2.11. VIRTUAL
The union type VIRTUAL
(virt
) is used to represent the virtual functions declared in a class. The table
field is used to represent a virtual function table, and consists primarily of a list of VIRTUAL
objects giving the virtual functions for the associated class. These virtual functions are of four kinds, each represented by a union field. A virtual function first declared in a class is represented by the simple
field; a virtual function in a class which overrides an inherited virtual function is represented by the override
field; an inherited, non-overridden virtual function which is not overridden in a base class is represented by the inherit
field; a inherited, non-overridden virtual function which is overridden in some base class is represented by the complex
field.
2.2.12. ENUM_TYPE
The union type ENUM_TYPE
(etype
) is used to represent a C++ enumeration type. This consists primarily of an identifier giving the enumeration name, a class information field, a type giving the underlying representation of the enumeration type, and a list of identifiers giving the enumerators comprising the enumeration.
2.2.13. TYPE
The union type TYPE
(type
) is used to represent a C++ type. Every type has an associated type qualifier, qual
, which determines whether the type is const
, volatile
or an lvalue. A type may also have an associated identifier, name
, giving the corresponding type name (the null identifier being used for unnamed types). The other type components are determined by the union tag. Each of the type constructs above has a corresponding field in the TYPE
union: integer
for integral types, floating
for floating point types, bitfield
for bitfield types, compound
for class or union types, and enumerate
for enumeration types. There are also fields top
and bottom
corresponding to void
and bottom (the type used to represent values which never return).
Other fields of the TYPE
union represent composite types; for example, the array
field, representing array types, comprises a base type, sub
, and an integer constant giving the array bound, size
. These are generally simple, apart from func
, representing a function type. This has the obvious components: a return type, ret
, a list of parameter types, ptypes
, and a flag indicating ellipsis functions, ellipsis
. It also has an associated namespace, pars
, in which the function parameters are declared. The parameter identifiers are extracted from this as a list, pids
. Member function qualifiers and language linkage information are represented by a CV_QUAL
, mqual
. The implicit extra parameter for member functions is recorded in the list mtypes
, which adds this extra type to the start of ptypes
. Finally except
gives any exception specifiers; the case where the exception specifier is absent being represented by the special value, univ_type_set
.
2.2.14. DECL_SPEC
The enumeration type DECL_SPEC
(dspec
) is used to represent information on the declaration and usage of an identifier. It takes the form of a bitfield, the elements of which can be or-ed together to represent various combinations of properties. The 32 bits in this bitfield (the maximum which can be represented portably) are a significant restriction. This means that the same member of DECL_SPEC
is often used to mean different things in different contexts. This can prove confusing on occasions.
2.2.15. HASHID
The union type HASHID
(hashid
) is used to represent a C++ identifier name. The simplest form of identifier name, name
, consists of just a string of characters, such as foo
. Extended identifier names, ename
, are similar, but may contain Unicode characters. There are however other forms of identifier name in C++: conversion function names (conv
) such as operator int
, overloaded operator names (op
) such as operator+
, constructor names (constr
), and destructor names (destr
). There are also names which are used for anonymous identifiers (anon
).
Note the distinction between an identifier name and an actual identifier. The latter is a meaning associated with a name in a particular context. Every identifier name has an associated underlying meaning, id
. This is used to handle keywords and macros, but for most identifier names this will be a dummy identifier. Nested underlying meanings (such as a macro hiding a keyword) are handled by linking the alias
fields of the corresponding identifiers. Every identifier name also has a cache
field which is used to record the look-up of this name as an unqualified identifier. This may be set to the null identifier to indicate that the look-up needs to be re-evaluated.
Identifier names are stored in one of a small number of hash tables, linked using their next
field. Each name has only one entry in these tables, allowing equality of names to be implemented as EQ_hashid
.
2.2.16. QUALIFIER
The enumeration type QUALIFIER
(qual
) is used to represent the various ways in which an identifier name can be qualified. For example, ::A::a
is represented by qual_full
. The value qual_mark
is used in the representation of function identifier expressions to indicate that overload resolution has been performed.
2.2.17. IDENTIFIER
The union type IDENTIFIER
(id
) is used to represent the various kinds of C++ identifiers. Every identifier has an associated identifier name, a parent namespace, a declaration information field, and a location for its declaration or definition. Each identifier also has an alias
field which is normally used to represent the aliasing which can occur in inheritance or using
declarations.
The various fields of the IDENTIFIER
union correspond to the various kinds of identifier which can arise in C++ - class names, functions, variables, class members, macros, keywords etc. Each field has appropriate components giving its type, its definition or whatever other information is required. For example, the variable
field has a type and two expressions, giving the constructor and destructor values for the object.
Most of these identifier components are self-explanatory, however the treatment of overloaded functions bears discussion. The various fields representing functions have an over
component which is used to link overloaded functions together. A set of overloaded functions is treated as if it were a single IDENTIFIER
- the first in the list - for the purposes of storing in a namespace member; the other overloaded meanings are accessed by chasing down the over
components. In other situations, whether a function identifier represents a single function or a set of overloaded functions can be worked out from the context. For example, in identifier expressions the identifier qualifier is used to mark whether overload resolution has taken place.
2.2.18. MEMBER
The union type MEMBER
(member
) is used to represent a member of a namespace. Each member contains two identifiers, id
and alt
. The id
field gives the meaning associated with a particular name in this namespace; the alt
field is used to represent a type name which may be hidden by a non-type name.
There are two kinds of member, small
and large
, corresponding to whether the namespace holds its members in a simple linked list or in a hash table.
2.2.19. NAMESPACE
The union type NAMESPACE
(nspace
) is used to represent the set of identifiers declared in a particular scope. For example, the members declared in a C++ class or namespace, the parameters declared in a function declarator and the local variables declared in a block all form scopes. The various kinds of scope are distinguished as different fields of the union, but there are basically two categories. The first, such as function blocks, which have relatively small numbers of elements, store their members as a simple linked lists. The second, such as classes, which have larger numbers of elements, store their members in hash tables. In both cases the elements are stored using the MEMBER
type.
The key operation on a namespace is to look up a particular identifier name in its linked list or hash table of members to find the meaning, if any, associated with that name in the namespace. This can be a complex operation because of the need to take base classes and using
directives (as stored in the use
component) into account.
2.2.20. NAT
The union type NAT
(nat
) is used to represent an integer constant expression. Values are represented as lists of 16 bit 'digits'. Values which fit into a single digit are represented by the small
field; larger values by the large
field. Negated values can be represented by the neg
field. Folding of integer constant expressions is performed in the producer, however the result can only be represented as described above if its value is target independent. Target dependent values are represented by the calc
field which contains an expression describing how to calculate the value. The token
field is used to represent NAT
tokens.
Objects representing small integer constants are created at the start of the program and stored in a table for ease of access. Larger constants are created as and when they are required.
2.2.21. FLOAT
The union type FLOAT
(flt
) is used to represent a floating point constant expression. There is only one field, simple
, which corresponds to a floating point literal. No folding of floating point constant expressions is attempted in the producer (it is virtually impossible to do so in a target independent manner).
Objects representing useful floating point constants (0.0, 1.0 etc.) are created for each floating point type and stored as part of the corresponding FLOAT_TYPE
. Other values are created as and when they are required.
2.2.22. STRING
The union type STRING
(str
) is used to represent a string constant expression. There is only one field, simple
, which corresponds to a character string literal, however the kind
field can be used to modify the interpretation put on the characters appearing in the text
field. By default, each character in text
corresponds to a single character in the literal; however an alternative representation, in which text
consists of a sequence of multibyte characters - one control character plus four value characters - is used in more complex cases.
All strings are stored in a hash table intended to ensure that the same STRING
object is used for equal string literals. This not only saves space during the processing of the input file, but also facilitates the output of shared string literals in the TDF capsule.
Note that the terminal zero character does not form part of the STRING
object. Instead information on this is stored as part of the type of a string literal expression. The text of the string literal is either truncated or padded with zeros until its length matches the size of the array bound in the type of the corresponding literal expression.
2.2.23. NTEST
The enumeration type NTEST
(ntest
) is used to represent the various C++ relational operators (==
, !=
, >
etc.). The values correspond to the encoding of the TDF NTEST
sort, which facilitates code generation. The values also have the property that the values for complementary operators (such as <
and >=
) always add up to the same value, ntest_negate
, allowing operators to be complemented in a straightforward manner.
2.2.24. RMODE
The enumeration type RMODE
(rmode
) is used to represent the various C++ rounding modes (towards zero, towards smaller etc.). The values correspond to the encoding of the TDF RMODE
sort, which facilitates code generation.
2.2.25. EXP
The union type EXP
(exp
) is used to represent a C++ expression or statement. Each expression has an associated type, type
, but most of the information about an expression is stored in one of the large number of fields of the EXP
union. Most of these fields are fairly simple. For example, there are fields corresponding to integer literals, floating point literals, string literals and identifiers. Composite expressions are formed in the normal way; for example, there are various binary operators comprising two argument expressions. The EXP
fields corresponding to statements are slightly more complex. They each have a parent
field which points to the enclosing statement. A couple of cases bear additional discussion.
The sequence
field represents a compound statement or block. This contains a namespace, in which any local variables are declared, and a list of expressions, giving the statements comprising the block. The null namespace is used if the block does not constitute a scope. The first statement in the list is always a dummy to enable first
and last
pointers to be maintained to the start and end of the list without having to worry about null lists.
The solve_stmt
field corresponds to the TDF labelled
construct (in early versions of TDF this construct was called solve
, hence the terminology). The problem is that C and C++ labels and goto
s are totally unstructured, whereas the TDF label constructs are structured. Any statement which contains unstructured labels is enclosed in a solve_stmt
construct, enclosing both the labelled statement and all jumps to it (in general this cannot be done until the end of the function). Any labels or variables which are bypassed by such unstructured jumps also need to be pulled out to the solve_stmt
construct. It is not just explicit labels which can cause such problems; complex switch
statements have the same effect.
2.2.26. OFFSET
The union type OFFSET
(off
) is used to represent an offset expression. This is used as an adjunct to the normal expression representation. The OFFSET
union has fields corresponding to a type offset (used in pointer arithmetic), the offset of a member of a class and the offset of a base class. There are also simple operations on offsets, such as multiplication by an expression.
2.2.27. TOKEN
The union type TOKEN
(tok
) is used to represent one of a number of different categories within the C++ language. It corresponds to the sort of a token declared using the tdfc2pragma #pragma token
syntax. Thus there are fields corresponding to expression, statement, integer constant, type, function, member and procedure tokens. The similarities between PROC
tokens and templates have been remarked above; for example, the parameters of the template:
template < class T, int n > class A { T a [n] ; // .... } ;
are essentially equivalent to those in the procedure token:
PROC ( TYPE T, EXP const : int : n ) ....
(recall that non-type template arguments are always constant expressions). Thus a field, templ
, of the TOKEN
union is used to represent lists of template parameters. Note that a further field, class
, is also required to represent template template parameters. A template type is represented by a field, templ
, of the union TYPE
, which comprises a template sort and a sub-type expressed in terms of the template parameters.
In addition to representing token and template sorts in this way, the TOKEN
union is used to represent token and template arguments. Each of the parameter sorts listed above has an appropriate value
component which can store a value of that sort. Many of the union types in the algebra, including types and expressions, have a field of the form:
token -> { IDENTIFIER tok ; LIST TOKEN args ; }
representing the given token identifier applied to the given list of arguments.
Template instances are represented slightly differently from token applications. Each instance of a template class or a template function gives rise to a new class or function identifier. This identifier has an underlying form giving the template identifier and the template arguments. This is expressed as a token
member of the TYPE
union (although it is not technically a type, this happens to be the most convenient representation). Each such form has an associated INSTANCE
component which gives further information about the template instance. The form for a template function instance is stored in the form
component of the corresponding identifier. The form for a template class instance is stored in the form
component of the corresponding class type.
Members of instances of template classes also have a form type, but in this case the form is an instance
type. This gives a link back to the corresponding member of the template class.
2.2.28. INSTANCE
The union type INSTANCE
(inst
) is used to represent a particular instance of a template or token. Each template sort has an associated list of all the instances of that template, which is used to ensure that the same template applied with the same arguments always has the same value. Information on partial or explicit specialisations and usage information are stored as part of the corresponding INSTANCE
. Each template instance identifier has a link back to its corresponding INSTANCE
via its form
component.
2.2.29. ERROR
The union type ERROR
(err
) is used to represent an error arising during the compilation of a C++ program. Errors are first class objects within the producer and can be passed to and from procedures. Each error has an associated severity
(serious, warning, none etc.). Simple errors are represented by the simple
field, which consists of an index, number
, into the error catalogue, plus a variable length list of error arguments. Errors can be combined into composite errors using the compound
field, which represents the join of two errors - head
followed by tail
.
The chief operation on an error after it has been built up is to report it. Each error report consists of an error object and a file location indicating where the error occurred.
2.2.30. VARIABLE
The structure type VARIABLE
(var
) is used to represent a variable state and is used in the variable analysis checks.
2.2.31. LOCATION
The structure type LOCATION
(loc
) is used to represent a location in an input file. It comprises a pointer to an input file position, posn
, modified by a line number, taking #line
directives into account, line
. Note that character positions within the line are not currently recorded.
2.2.32. POSITION
The structure type POSITION
(posn
) is used to represent a position in an input file. It consists of two file names, file
taking #line
directives into account, and input
giving the actual file name, plus a line number offset, offset
, which gives the difference between the line number taking #line
directives into account and the actual line number. Other information stored includes the datestamp on the input file, datestamp
, and a pointer to a file location which, for files included using #include
, gives the location the file was included from.
2.2.33. BITSTREAM
The structure BITSTREAM
is not part of the calculus
type system. It is used to represent a sequence of bits such as is used, for example, in the encoding of TDF.
2.2.34. BUFFER
The structure BUFFER
is not part of the calculus
type system. It is used to represent a sequence of characters.
2.2.35. OPTIONS
The structure OPTIONS
is not part of the calculus
type system. It is used to represent the state of the tdfc2pragma compiler options at a particular point in the input file.
2.2.36. PPTOKEN
The structure PPTOKEN
is not part of the calculus
type system. It is used to represent a linked list of preprocessing tokens. Each token has an associated sid
lexical token number, tok
, plus additional data dependent on the token type. Each token also records a pointer to the current OPTIONS
value.