The Pragma Token Syntax
- 1. Introduction
- 2. Program construction using TDF
- 3. The token syntax
- 4. Token identification
- 5. Expression tokens
- 6. Statement tokens
- 7. Type tokens
- 8. Specifying conversions using the token syntax
- 9. Selector tokens
- 10. Procedure tokens
- 11. Tokens and APIs
- 12. Token syntax
© The TenDRA Project.
© DERA.
Revision History
kate | Merged in token syntax type conversion from the C/C++ Checker Reference Manual.
kate | Moved out the DRA producers as a standalone tool.
kate | A start on a new document.
DERA | tdfc2/tcpplus 1.8.2; TenDRA 4.1.2 release.
1. Introduction
The token syntax is used to introduce references to program constructs such as types, expressions etc. that can be defined in other compilation modules. This can be thought of as a generalisation of function prototyping which is used to introduce references to functions defined elsewhere. The references introduced by the token syntax are called tokens because they are tokens for the program constructs that they reference. The token syntax is said to specify a token interface.
It is expected that the general user will have little direct contact with the token syntax, instead using the abstract standard headers provided or using the tspec tool [Ref. 5] to generate their own token interface header files automatically. However, it may occasionally be necessary to use the raw power of the token syntax directly.
As an example of the power of the token syntax consider the program below:
#pragma token TYPE FILE#
#pragma token EXP rvalue:FILE *:stderr#

int fprintf(FILE *, const char *, ...);

void f(void)
{
	fprintf(stderr, "hello world\n");
}
The first line of the program introduces a token, FILE, for a type. By using its identification, FILE, this token can be used wherever a type could have been used throughout the rest of the program. The compiler can then compile this program to TDF (the abstract TenDRA machine) even though it contains an undefined type. This is fundamental to the construction of portable software, where the developer cannot assume the definitions of various types as they may be different on different machines.
The second line of the example, which introduces a token for an expression, is somewhat more complicated. In order to make use of an expression, it is necessary to know its type and whether or not it is an lvalue (i.e. whether or not it can be assigned to). As can be seen from the example however, it is not necessary to know the exact type of the expression because a token can be used to represent its type.
The TenDRA compiler makes no assumptions about the possible definitions of tokens and will raise an error if a program requires information about an undefined token. In this way many errors resulting from inadvertent use of a definition present on the developer's system can be detected. For example, developers often assume that the type FILE will be implemented by a structure type when in fact the ISO C standard permits the implementation of FILE by any type. In the program above, any attempt to access members of stderr would cause the compiler to raise an error.
2. Program construction using TDF
Traditional program construction using the C language has two phases: compilation and linking.
In the compilation phase the source text written in the C language is mapped to an object code format. This object code is generally not complete in itself and must be linked with other program segments such as definitions from the system libraries.
When tokens are involved there is an extra stage in the construction process where undefined tokens in one program segment are linked with their definitions in another program segment. To summarise, program construction using TDF and the TenDRA tools has four basic operations:
- Source file compilation to TDF
-
The TDF produced may be incomplete in the sense that it may contain undefined tokens;
- TDF linking
-
The undefined tokens represented in TDF are linked to their definitions in other compilation modules or libraries. During linking, tokens with the same identifier are treated as the same token;
- TDF translation
-
This is the conversion of TDF into standard object file format for a particular machine system. The program is still incomplete at this stage in the sense that it may contain undefined library functions;
- Object file linking
-
This corresponds directly to standard object file linking.
3. The token syntax
The token syntax is an extension to the ISO C standard language to allow the use of tokens to represent program constructs. Tokens can be used either in place of, or as well as, the definitions required by a program. In the latter case, the tokens exist merely to enforce correct definitions and usage of the objects they reference. However it should be noted that the presence of a token introduction can alter the semantics of a program (examples are given in §5 Expression tokens). The semantics have been altered to force programs to respect token interfaces where they would otherwise fail to do so.
The token syntax takes the following basic form:
#pragma token token-introduction token-identification
It is introduced as a pragma to allow other compilers to ignore it, though if tokens are being used to replace the definitions needed by a program, ignoring these pragmas will generally cause the compilation to fail.
The token-introduction defines the kind of token being introduced along with any additional information associated with that kind of token. Currently there are five kinds of token that can be introduced, corresponding approximately to expressions, statements, type-names, member designators and function-like macros.
The token-identification provides the means of referring to the token, both internally within the program and externally for TDF linking purposes.
4. Token identification
The syntax for the token-identification is as follows:
token-identification :
	name-space? identifier # external-identifier?

name-space :
	TAG
There is a default name space associated with each kind of token and internal identifiers for tokens generally reside in these default name spaces. The ISO C standard describes the five name spaces as being:
- The label space
-
in which all label identifiers reside;
- The tag space
-
in which structure, union and enumeration tags reside;
- The member name space
-
in which structure and union member selectors reside;
- The macro name space
-
in which all macro definitions reside. Token identifiers in the macro name space have no definition and so are not expanded. However, they behave as macros in all other respects;
- The ordinary name space
-
in which all other identifiers reside.
The exception is compound type-token identifiers (see §7.4 Compound type tokens) which by default reside in the ordinary name space but can be forced to reside in the tag name space by setting the optional name-space to be TAG.
The first identifier of the token-identification provides the internal identification of the token. This is the name used to identify the token within the program. It must be followed by a #.
All further preprocessing tokens until the end of the line are treated as part of the external-identifier with non-empty white space sequences being replaced by a single space. The external-identifier specifies the external identification of the token which is used for TDF linking. External token identifications reside in their own name space which is distinct from the external name space for functions and objects. This means that it is possible to have both a function and a token with the same external identification. If the external-identifier is omitted it is assumed that the internal and external identifications are the same.
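For example (an illustrative sketch; the names here are invented), the first introduction below gives the token the internal name buf_t and the external identification posix.buffer for TDF linking, while the second uses size_t for both the internal and external names:
#pragma token TYPE buf_t # posix.buffer
#pragma token TYPE size_t#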
5. Expression tokens
There are various properties associated with expression tokens which are used to determine the operations that may be performed upon them.
-
Designation is the classification of the value delivered by evaluating the expression. The three possible designations, implied by the ISO C standard, section 6.3, are:
- value
-
expression describes the computation of a value;
- object
-
expression designates a variable which may have an associated type qualifier giving the access conditions;
- function designation
-
expression designates a function.
-
Type specifies the type of the expression ignoring any type qualification;
-
Constancy is the property of being a constant expression as specified in the ISO C Standard section 6.4.
The syntax for introducing expression tokens is:
exp-token :
	EXP exp-storage : type-name :
	NAT

exp-storage :
	rvalue
	lvalue
	const
Expression tokens can be introduced using either the EXP or NAT token introductions. Expression tokens introduced using NAT are constant value designations of type int, i.e. they reference constant integer expressions. All other expression tokens are assumed to be non-constant and are introduced using EXP.
-
The exp-storage is either lvalue or rvalue. If it is lvalue, then the token is an object designation without type qualification. If it is rvalue then the token is either a value or a function designation depending on whether or not its type is a function type.
-
The type-name is the type of the expression to which the token refers.
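For illustration (these introductions are modelled on the ISO C library but are not drawn from the original text):
#pragma token EXP lvalue : int : errno#
#pragma token EXP rvalue : char * : greeting#
#pragma token NAT BUFSIZ#
The first token designates an object of type int and may be assigned to; the second delivers a value of type char *; the third references a constant integer expression of type int.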
All internal expression token identifiers must reside in the macro name space and this is consequently the default name space for such identifiers. Hence the optional name-space, TAG, should not be present in an EXP token introduction. Although the use of an expression token after it has been introduced is very similar to that of an ordinary identifier, as it resides in the macro name space, it has the additional properties listed below:
-
expression tokens cannot be hidden by using an inner scope;
-
with respect to #ifdef, expression tokens are defined;
-
the scope of expression tokens can be terminated by #undef.
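The sketch below (not drawn from the original text) illustrates the last two properties:
#pragma token EXP rvalue : int : x#

#ifdef x	/* holds: with respect to #ifdef, expression tokens are defined */
int f(void)
{
	return x + 1;	/* x behaves as an int rvalue */
}
#endif

#undef x	/* the scope of the token x ends here */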
In order to make use of tokenised expressions, a new symbol, exp-token-name, has been introduced at translation phase seven of the syntax analysis as defined in the ISO C standard. When an expression token identifier is encountered by the preprocessor, an exp-token-name symbol is passed through to the syntax analyser. An exp-token-name provides information about an expression token in the same way that a typedef-name provides information about a type introduced using a typedef. This symbol can only occur as part of a primary-expression (ISO C standard section 6.3.1) and the expression resulting from the use of exp-token-name will have the type, designation and constancy specified in the token introduction. As an example, consider the pragma:
#pragma token EXP rvalue : int : x#
This introduces a token for an expression which is a value designation of type int with internal and external name x.
Expression tokens can either be defined using #define statements or by using externals. They can also be resolved as a result of applying the type-resolution or assignment-resolution operators (see §7.5). Expression token definitions are subject to the following constraints:
-
if the exp-token-name refers to a constant expression (i.e. it was introduced using the NAT token introduction), then the defining expression must also be a constant expression as expressed in the ISO C standard, section 6.4;
-
if the exp-token-name refers to an lvalue expression, then the defining expression must also designate an object and the type of the expression token must be resolvable to the type of the defining expression. All the type qualifiers of the defining expression must appear in the object designation of the token introduction;
-
if the exp-token-name refers to an expression that has function designation, then the type of the expression token must be resolvable to the type of the defining expression.
The program below provides two examples of the violation of the second constraint.
#pragma token EXP lvalue : int : i#
extern short k;
#define i 6
#define i k
The expression token i is an object designation of type int. The first violation occurs because the expression, 6, does not designate an object. The second violation is because the type of the token expression, i, is int, which cannot be resolved to the type short.
If the exp-token-name refers to an expression that designates a value, then the defining expression is converted, as if by assignment, to the type of the expression token using the assignment-resolution operator (see §7.5). With all other designations the defining expression is left unchanged. In both cases the resulting expression is used as the definition of the expression token. This can subtly alter the semantics of a program. Consider the program:
#pragma token EXP rvalue:long:li#
#define li 6

int f()
{
	return sizeof(li);
}
The definition of the token li causes the expression, 6, to be converted to long (this is essential to separate the use of li from its definition). The function, f, then returns sizeof(long). If the token introduction was absent, however, f would return sizeof(int).
Although they look similar, expression token definitions using #defines are not quite the same as macro definitions. A macro can be defined by any preprocessing tokens which are then computed in phase 3 of translation as defined in the ISO C standard, whereas tokens are defined by assignment-expressions which are computed in phase 7. One of the consequences of this is illustrated by the program below:
#pragma token EXP rvalue:int :X#
#define X M + 3
#define M sizeof(int)

int f(int x)
{
	return (x + X);
}
If the token introduction of X is absent, the program above will compile as, at the time the definition of X is interpreted (when evaluating x + X), both M and X are in scope. When the token introduction is present the compilation will fail as the definition of X, being part of translation phase 7, is interpreted when it is encountered, and at this stage M is not defined. This can be rectified by reversing the order of the definitions of X and M or by bracketing the definition of X, i.e.
#define X (M+3)
Conversely consider:
#pragma token EXP rvalue:int:X#
#define M sizeof(int)
#define X M + 3
#undef M
int M(int x) {
	return (x + X);
}
The definition of X is computed on line 3 when M is in scope, not on line 6 where it is used. Token definitions can be used in this way to relieve some of the pressures on name spaces by undefining macros that are only used in token definitions. This facility should be used with care as it may not be a straightforward matter to convert the program back to a conventional C program.
Expression tokens can also be defined by declaring the exp-token-name that references the token to be an object with external linkage e.g.
#pragma token EXP lvalue:int:x#
extern int x;
The semantics of this program are effectively the same as the semantics of:
#pragma token EXP lvalue:int:x#
extern int _x;
#define x _x
6. Statement tokens
The syntax for introducing a statement token is simply:
#pragma token STATEMENT init_globs#
int g(int);

int f(int x)
{
	init_globs
	return g(x);
}
Internal statement token identifiers reside in the macro name space. The optional name space, TAG, should not appear in statement token introductions.
The use of statement tokens is analogous to the use of expression tokens (see §5). A new symbol, stat-token-name, has been introduced into the syntax analysis at phase 7 of translation as defined in the ISO C standard. This token is passed through to the syntax analyser whenever the preprocessor encounters an identifier referring to a statement token. A stat-token-name can only occur as part of the statement syntax (ISO C standard, section 6.6).
As with expression tokens, statement tokens are defined using #define statements. An example of this is shown below:
#pragma token STATEMENT i_globs#
#define i_globs { int i = x; x = 3; }
The constraints on the definition of statement tokens are:
-
the use of labels is forbidden unless the definition of the statement token occurs at the outer level (i.e. outside of any compound statement forming a function definition);
-
the use of return within the defining statement is not allowed.
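As an illustration of the second constraint (this fragment is invented, not from the original text), the following definition would be rejected:
#pragma token STATEMENT do_exit#
#define do_exit { return; }	/* error: return not allowed in the defining statement */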
The semantics of the defining statement are precisely the same as the semantics of a compound statement forming the definition of a function with no parameters and void result. The definition of statement tokens carries the same implications for phases of translation as the definition of expression tokens (see §5 Expression tokens).
7. Type tokens
- 7.1. General type tokens
- 7.2. Integral type tokens
- 7.3. Arithmetic type tokens
- 7.4. Compound type tokens
- 7.5. Type token compatibility, definitions etc.
Type tokens are used to introduce references to types. The ISO C standard, section 6.1.2.5, identifies the following classification of types:
-
the type char;
-
signed integral types;
-
unsigned integral types;
-
floating types;
-
character types;
-
enumeration types;
-
array types;
-
structure types;
-
union types;
-
function types;
-
pointer types.
These types fall into the following broader type classifications:
-
integral types - consisting of the signed integral types, the unsigned integral types and the type char;
-
arithmetic types - consisting of the integral types and the floating types;
-
scalar types - consisting of the arithmetic types and the pointer types;
-
aggregate types - consisting of the structure and array types;
-
derived types - consisting of array, structure, union, function and pointer types;
-
derived declarator types - consisting of array, function and pointer types.
The classification of a type determines which operations are permitted on objects of that type. For example, the ! operator can only be applied to objects of scalar type. In order to reflect this, there are several type token introductions which can be used to classify the type to be referenced, so that the compiler can perform semantic checking on the application of operators. The possible type token introductions are:
type-token :
	TYPE
	VARIETY
	ARITHMETIC
	STRUCT
	UNION
7.1. General type tokens
The most general type token introduction is TYPE. This introduces a type of unknown classification which can be defined to be any C type. Only a few generic operations can be applied to such type tokens, since the semantics must be defined for all possible substituted types. Assignment and function argument passing are effectively generic operations, apart from the treatment of array types. For example, according to the ISO C standard, even assignment is not permitted if the left operand has array type and we might therefore expect assignment of general token types to be illegal. Tokens introduced using the TYPE token introduction can thus be regarded as representing non-array types with extensions to represent array types provided by applying non-array semantics as described below.
Once general type tokens have been introduced, they can be used to construct derived declarator types in the same way as conventional type declarators. For example:
#pragma token TYPE t_t#
#pragma token TYPE t_p#
#pragma token NAT n#
typedef t_t *ptr_type;		/* introduces pointer type */
typedef t_t fn_type(t_p);	/* introduces function type */
typedef t_t arr_type[n];	/* introduces array type */
The only standard conversion that can be performed on an object of general token type is the lvalue conversion (ISO C standard section 6.2). Lvalue conversion of an object with general token type is defined to return the item stored in the object. The semantics of lvalue conversion are thus fundamentally altered by the presence of a token introduction. If type t_t is defined to be an array type, the lvalue conversion of an object of type t_t will deliver a pointer to the first array element. If, however, t_t is defined to be a general token type, which is later defined to be an array type, lvalue conversion on an object of type t_t will deliver the components of the array.
This definition of lvalue conversion for general token types is used to allow objects of general tokenised types to be assigned to and passed as arguments to functions. The extensions to the semantics of function argument passing and assignment are as follows: if the type token is defined to be an array then the components of the array are assigned and passed as arguments to the function call; in all other cases the assignment and function call are the same as if the defining type had been used directly.
The default name space for the internal identifiers for general type tokens is the ordinary name space and all such identifiers must reside in this name space. The local identifier behaves exactly as if it had been introduced with a typedef statement and is thus treated as a typedef-name by the syntax analyser.
7.2. Integral type tokens
The token introduction VARIETY is used to introduce a token representing an integral type. A token introduced in this way can only be defined as an integral type and can be used wherever an integral type is valid.
Values which have integral tokenised types can be converted to any scalar type. Similarly values with any scalar type can be converted to a value with a tokenised integral type. The semantics of these conversions are exactly the same as if the type defining the token were used directly. Consider:
#pragma token VARIETY i_t#

short f(void)
{
	i_t x_i = 5;
	return x_i;
}

short g(void)
{
	long x_i = 5;
	return x_i;
}
Within the function f there are two conversions: the value, 5, of type int, is converted to i_t, the tokenised integral type, and a value of tokenised integral type i_t is converted to a value of type short. If the type i_t were defined to be long then the function f would be exactly equivalent to the function g.
The usual arithmetic conversions described in the ISO C standard (section 6.3.1.5) are defined on integral type tokens and are applied where required by the ISO C standard.
The integral promotions are defined according to the rules introduced in Chapter 4. These promotions are first applied to the integral type token and then the usual arithmetic conversions are applied to the resulting type.
As with general type tokens, integral type tokens can only reside in their default name space, the ordinary name space (the optional name-space, TAG, cannot be specified in the token introduction). They also behave as though they had been introduced using a typedef statement.
7.3. Arithmetic type tokens
The token introduction ARITHMETIC introduces an arithmetic type token. In theory, such tokens can be defined by any arithmetic type, but the current implementation of the compiler only permits them to be defined by integral types. These type tokens are thus exactly equivalent to the integral type tokens introduced using the token introduction VARIETY.
7.4. Compound type tokens
For the purposes of this document, a compound type is a type describing objects which have components that are accessible via member selectors. All structure and union types are thus compound types, but, unlike structure and union types in C, compound types do not necessarily have an ordering on their member selectors. In particular, this means that some objects of compound type cannot be initialised with an initialiser-list (see ISO C standard section 6.5.7).
Compound type tokens are introduced using either the STRUCT or UNION token introductions. A compound type token can be defined by any compound type, regardless of the introduction used. It is expected, however, that programmers will use STRUCT for compound types with non-overlapping member selectors and UNION for compound types with overlapping member selectors. The compound type token introduction does not specify the member selectors of the compound type - these are added later (see §9).
Values and objects with tokenised compound types can be used anywhere that a structure and union type can be used.
Internal identifiers of compound type tokens can reside in either the ordinary name space or the tag name space. The default is the ordinary name space; identifiers placed in the ordinary name space behave as if the type had been declared using a typedef statement. If the identifier, id say, is placed in the tag name space, it is as if the type had been declared as struct id or union id. Examples of the introduction and use of compound type tokens are shown below:
#pragma token STRUCT n_t#
#pragma token STRUCT TAG s_t#
#pragma token UNION TAG u_t#

void f()
{
	n_t x1;
	struct n_t x2;	/* Illegal, n_t not in the tag name space */
	s_t x3;		/* Illegal, s_t not in the ordinary name space */
	struct s_t x4;
	union u_t x5;
}
7.5. Type token compatibility, definitions etc.
A type represented by an undefined type token is incompatible (ISO C standard section 6.1.3.6) with all other types except for itself. A type represented by a defined type token is compatible with everything that is compatible with its definition.
Type tokens can only be defined by using one of the operations known as type-resolution and assignment-resolution. Note that, as type token identifiers do not reside in the macro name space, they cannot be defined using #define statements.
Type-resolution operates on two types and is essentially identical to the operation of type compatibility (ISO C standard section 6.1.3.6) with one major exception. In the case where an undefined type token is found to be incompatible with the type with which it is being compared, the type token is defined to be the type with which it is being compared, thereby making them compatible.
The ISO C standard prohibits the repeated use of typedef statements to define a type. However, in order to allow type resolution, the compiler allows types to be consistently redefined using multiple typedef statements if:
-
there is a resolution of the two types;
-
as a result of the resolution, at least one token is defined.
As an example, consider the program below:
#pragma token TYPE t_t#
typedef t_t *ptr_t_t;
typedef int **ptr_t_t;
The second definition of ptr_t_t causes a resolution of the types t_t * and int **. The rules of type compatibility state that two pointers are compatible if their dependent types are compatible, thus type resolution results in the definition of t_t as int *.
Type-resolution can also result in the definition of other tokens. The program below results in the expression token N being defined as (4 * sizeof(int)).
#pragma token EXP rvalue:int:N#
typedef int arr[N];
typedef int arr[4 * sizeof(int)];
The type-resolution operator is not symmetric; a resolution of two types, A and B say, is an attempt to resolve type A to type B. Thus only the undefined tokens of A can be defined as a result of applying the type-resolution operator. In the examples above, if the typedefs were reversed, no type-resolution would take place and the types would be incompatible.
Assignment-resolution is similar to type-resolution but it occurs when converting an object of one type to another type for the purposes of assignment. Suppose the conversion is not possible and the type to which the object is being converted is an undefined token type. If the token can be defined in such a way that the conversion is possible, then that token will be suitably defined. If there is more than one possible definition, the definition causing no conversion will be chosen.
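A minimal sketch of assignment-resolution (an invented example): given an undefined general type token, an assignment can supply its definition.
#pragma token TYPE t_t#

void f(int *p)
{
	t_t x;
	x = p;	/* t_t is resolved to int *, the definition requiring no conversion */
}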
8. Specifying conversions using the token syntax
The token syntax described in the preceding sections can be used to fine-tune the conversion analysis and integer range checks described in the C/C++ Checker Reference Manual (section 3.2.2 and chapter 4 respectively).
8.1. User-defined conversions
In the example:
#pragma token TYPE IP#
#pragma no_def IP
#pragma token PROC { TYPE t, EXP rvalue : t* : e | EXP e } EXP rvalue:IP:p_to_ip#
#pragma TenDRA conversion p_to_ip allow

void f(void)
{
	IP i, *pip;
	i = pip;
}
the conversion from type pointer-to-IP to IP would normally result in an error. However the pragma:
#pragma TenDRA conversion conv_list allow
allows users to define their own conversions between types that would otherwise be incompatible. Each identifier in the conv_list must be a local identifier for a PROC token (see §10 Procedure tokens), taking exactly one argument which must be an expression. The procedure must also deliver an expression and both the parameter and the result must be rvalues. When attempting the conversion of a value from one type to another, either by assignment or casting, if that conversion would not normally be permitted, then, for each token introduced as a conversion token by the pragma above:
-
An attempt is made to resolve the type of the token result to the type to which the value is being converted.
-
If the result is resolved and the value to be converted is a suitable argument for the token procedure, the token procedure is applied to implement the conversion.
-
If no suitable conversion token can be found, an error will be raised.
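One possible definition of the p_to_ip token above might be the following; this is only a sketch, assuming IP is implemented so that the conversion amounts to a simple dereference, and is not drawn from the original text:
#define p_to_ip(E) (*(E))	/* assumed: an IP * converts to an IP by dereferencing */
With the token defined and allowed as a conversion, the assignment i = pip in the example would be implemented as if it read i = p_to_ip(pip).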
8.2. Specifying integer promotions
Whenever an integral type is used, e.g. in a call to a traditionally defined function, its promotion must be computed. If no promotion type has been specified, the compiler will raise an error. Although programs will generally use the default rules provided in the built-in compilation modes, the TenDRA compiler also allows programmers to specify their own set of integer promotion rules. Such promotions can be split into two categories: literal promotions and computed promotions.
- Literal promotions
-
The promoted type is supplied directly using the pragma:
#pragma promote type-name : type-name
The second type-name specifies the promoted type of the first type-name. Both types must be integral types.
- Computed promotions
-
The pragma:
#pragma compute promote proc-token
allows the programmer to provide the identification of a procedure token (see §10 Procedure tokens) for computing the promotion type of any integral type. This token is then called whenever the promotion of a type without a literal promotion is required. The procedure token proc-token must be declared as:
#pragma token PROC ( VARIETY ) VARIETY proc-token #
The TenDRA technology provides two pre-defined procedure tokens, identified by ~promote and ~sign-promote. These procedures perform integer promotions according to the ISO/ANSI promotion rules or according to the traditional signed promotion rules respectively. The built-in compilation environments (see chapter 2) use these tokens to specify the appropriate set of integer promotion rules.
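For example (an illustrative sketch, not drawn from the original text), an environment might specify the promotion of short literally and fall back on the traditional signed rules for all other integral types:
#pragma promote short : int
#pragma compute promote ~sign-promote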
9. Selector tokens
The use of selector tokens is the primary method of adding member selectors to compound type tokens. (The only other method is to define the compound type token to be a particular structure or union type.) The introduction of new selector tokens can occur at any point in a program and they can thus be used to add new member selectors to existing compound types.
The syntax for introducing member selector tokens is as follows:
selector-token :
	MEMBER selector-type-name : type-name :

selector-type-name :
	type-name
	type-name % constant-expression
The selector-type-name specifies the type of the object selected by the selector token. If the selector-type-name is a plain type-name, the member selector token has that type. If the selector-type-name consists of a type-name and a constant-expression separated by a % sign, the member selector token refers to a bitfield of type type-name and width constant-expression. The second type-name gives the compound type to which the member selector belongs. For example:
#pragma token STRUCT TAG s_t#
#pragma token MEMBER char* : struct s_t : s_t_mem#
introduces a compound token type, s_t, which has a member selector, s_t_mem, which selects an object of type char *.
Internal identifiers of member selector tokens can only reside in the member name space of the compound type to which they belong. Clearly this is also the default name space for such identifiers.
When structure or union types are declared, according to the ISO C standard there is an implied ordering on the member selectors. In particular this means that:
-
during initialisation with an initialiser-list the identified members of a structure are initialised in the order in which they were declared. The first identified member of a union is initialised;
-
the addresses of structure members will increase in the order in which they were declared.
The member selectors introduced as selector tokens are not related to any other member selectors until they are defined. There is thus no ordering on the undefined tokenised member selectors of a compound type. If a compound type has only undefined token selectors, it cannot be initialised with an initialiser-list. There will be an ordering on the defined members of a compound type and in this case, the compound type can be initialised automatically.
The decision to allow unordered member selectors has been taken deliberately in order to separate the decision of which members belong to a structure from that of where such member components lie within the structure. This makes it possible to represent extensions to APIs which require extra member selectors to be added to existing compound types.
As an example of the use of token member selectors, consider the structure lconv specified in the ISO C Standard library (section 7.4.3.1). The standard does not specify all the members of struct lconv or the order in which they appear. This type cannot be represented naturally by existing C types, but can be described by the token syntax.
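A sketch of part of such a description might be (decimal_point is a member the standard does specify; these introductions are illustrative, not a complete interface):
#pragma token STRUCT TAG lconv#
#pragma token MEMBER char * : struct lconv : decimal_point#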
There are two methods for defining selector tokens, one explicit and one implicit. As selector token identifiers do not reside in the macro name space they cannot be defined using #define statements.
Suppose A is an undefined compound token type and mem is an undefined selector token for A. If A is later defined to be the compound type B, and B has a member selector with identifier mem, then A.mem is defined to be B.mem providing the type of A.mem can be resolved to the type of B.mem. This is known as implicit selector token definition.
In the program shown below the redefinition of the compound type s_t causes the token for the selector mem_x to be implicitly defined to be the second member of struct s_tag. The consequential type resolution leads to the token type t_t being defined to be int.
#pragma token TYPE t_t#
#pragma token STRUCT s_t#
#pragma token MEMBER t_t : s_t : mem_x#
#pragma token MEMBER t_t : s_t : mem_y#

struct s_tag { int a, mem_x, b; };
typedef struct s_tag s_t;
Explicit selector token definition takes place using the pragma:
#pragma DEFINE MEMBER type-name : identifier member-designator

member-designator :
	identifier
	identifier . member-designator
The type-name specifies the compound type to which the selector belongs.
The identifier provides the identification of the member selector within that compound type.
The member-designator provides the definition of the selector token. It must identify a selector of a compound type.
If the member-designator is an identifier, then the identifier must be a member of the compound type specified by the type-name. If the member-designator is an identifier, id say, followed by a further member-designator, M say, then:
-
the identifier id must be a member identifying a selector of the compound type specified by type-name;
-
the type of the selector identified by id must have compound type, C say;
-
the member-designator M must identify a member selector of the compound type C.
As with implicit selector token definitions, the type of the selector token must be resolved to the type of the selector identified by the member-designator.
In the example shown below, the selector token mem is defined to be the second member of struct s which in turn is the second member of struct s_t.
#pragma token STRUCT s_t#
#pragma token MEMBER int : s_t : mem#

typedef struct {
	int x;
	struct {
		char y;
		int z;
	} s;
} s_t;

#pragma DEFINE MEMBER s_t : mem s.z
10. Procedure tokens
- 10.1. General procedure tokens
- 10.2. Simple procedure tokens
- 10.3. Function procedure tokens
- 10.4. Defining procedure tokens
Consider the macro SWAP defined below:
#define SWAP(T, A, B) { \
	T x; \
	x = B; \
	B = A; \
	A = x; \
}
SWAP can be thought of as a statement that is parameterised by a type and two expressions.
Procedure tokens are based on this concept of parameterisation. Procedure tokens reference program constructs that are parameterised by other program constructs.
There are three methods of introducing procedure tokens. These are described in the sections below.
10.1. General procedure tokens
The syntax for introducing a general procedure token is:
general-procedure :
	PROC { bound-toks? | prog-pars? } token-introduction

simple-procedure :
	PROC ( bound-toks? ) token-introduction

bound-toks :
	bound-token
	bound-token , bound-toks

bound-token :
	token-introduction name-space? identifier

prog-pars :
	program-parameter
	program-parameter , prog-pars

program-parameter :
	EXP identifier
	STATEMENT identifier
	TYPE type-name-identifier
	MEMBER type-name-identifier : identifier
The final token-introduction specifies the kind of program construct being parameterised. In the current implementation of the compiler, only expressions and statements may be parameterised. The internal procedure token identifier is placed in the default name space of the program construct which it parameterises. For example, the internal identifier of a procedure token parameterising an expression would be placed in the macro name space.
The bound-toks are the bound token dependencies which describe the program constructs upon which the procedure token depends. These should not be confused with the parameters of the token. The procedure token introduced in:
#pragma token PROC {TYPE t, EXP rvalue:t**:e | EXP e} EXP rvalue:t:dderef#
is intended to represent a double dereference and depends upon the type of the expression to be dereferenced and upon the expression itself, but takes only one argument, namely the expression, from which both dependencies can be deduced.
The bound token dependencies are introduced in exactly the same way as the tokens described in the previous sections with the identifier corresponding to the internal identification of the token. No external identification is allowed. The scope of these local identifiers terminates at the end of the procedure token introduction, and whilst in scope, they hide all other identifiers in the same name space. Such tokens are referred to as bound because they are local to the procedure token.
Once a bound token dependency has been introduced, it can be used throughout the rest of the procedure token introduction in the construction of other components.
The prog-pars are the program parameters. They describe the parameters with which the procedure token is called. The bound token dependencies are deduced from these program parameters.
Each program parameter is introduced with a keyword expressing the kind of program construct that it represents. The keywords are as follows:
- EXP
-
The parameter is an expression and the identifier following EXP must be the identification of a bound token for an expression. When the procedure token is called, the corresponding parameter must be an assignment-expression and is treated as the definition of the bound token, thereby providing definitions for all dependencies relating to that token. For example, the call of the procedure token dderef, introduced above, in the code below:
char f(char **c_ptr_ptr)
{
	return dderef(c_ptr_ptr);
}
causes the expression, e, to be defined to be c_ptr_ptr, thus resolving the type t** to char **. The type t is hence defined to be char, also providing the type of the expression obtained by the application of the procedure token dderef;
- STATEMENT
-
The parameter is a statement. Its semantics correspond directly to those of EXP;
- TYPE
-
The parameter is a type. When the procedure token is applied, the corresponding argument must be a type-name. The parameter type is resolved to the argument type in order to define any related dependencies;
- MEMBER
-
The parameter is a member selector. The type-name specifies the compound type to which the member selector belongs and the identifier is the identification of the member selector. When the procedure token is applied, the corresponding argument must be a member-designator of the compound type.
Currently PROC tokens cannot be passed as program parameters.
10.2. Simple procedure tokens
In cases where there is a direct, one-to-one correspondence between the bound token dependencies and the program parameters a simpler form of procedure token introduction is available.
Consider the two procedure token introductions below, corresponding to the macro SWAP described earlier.
/* General procedure introduction */
#pragma token PROC { TYPE t, EXP lvalue:t:e1, EXP lvalue:t:e2 | \
	TYPE t, EXP e1, EXP e2 } STATEMENT SWAP#

/* Simple procedure introduction */
#pragma token PROC ( TYPE t, EXP lvalue:t:, EXP lvalue:t: ) STATEMENT SWAP#
The simple-token syntax is similar to the bound-token syntax, but it also introduces a program parameter for each bound token. The bound token introduced by the simple-token syntax is defined as though it had been introduced with the bound-token syntax. If the final identifier is omitted, then no name space can be specified, the bound token is not identified and in effect there is a local hidden identifier.
10.3. Function procedure tokens
One of the commonest uses of simple procedure tokens is to represent function in-lining. In this case, the procedure token represents the in-lining of the function, with the function parameters being the program arguments of the procedure token call, and the program construct resulting from the call of the procedure token being the corresponding in-lining of the function. This is a direct parallel to the use of macros to represent functions.
The syntax for introducing function procedure tokens is:
function-procedure :
	FUNC type-name :
The type-name must be a prototyped function type. The pragma results in the declaration of a function of that type with external linkage and the introduction of a procedure token suitable for an in-lining of the function. (If an ellipsis is present in the prototyped function type, it is used in the function declaration but not in the procedure token introduction.) Every parameter type and result type is mapped onto the token introduction:
EXP rvalue:
The example below:
#pragma token FUNC int(int): putchar#
declares a function, putchar, which returns an int and takes an int as its argument, and introduces a procedure token suitable for in-lining putchar. Note that:
#undef putchar
will remove the procedure token but not the underlying function.
10.4. Defining procedure tokens
All procedure tokens are defined by the same mechanism. Since simple and function procedure tokens can be transformed into general procedure tokens, the definition will be explained in terms of general procedure tokens.
The syntax for defining procedure tokens is given below and is based upon the standard parameterised macro definition. However, as in the definitions of expressions and statements, the #defines of procedure token identifiers are evaluated in phase 7 of translation as described in the ISO C standard.
#define identifier ( id-list? ) assignment-expression
#define identifier ( id-list? ) statement

id-list :
	identifier
	identifier , id-list
The id-list must correspond directly to the program parameters of the procedure token introduction. There must be precisely one identifier for each program parameter. These identifiers are used to identify the program parameters of the procedure token being defined and have a scope that terminates at the end of the procedure token definition. They are placed in the default name spaces for the kinds of program constructs which they identify.
None of the bound token dependencies can be defined during the evaluation of the definition of the procedure token since they are effectively provided by the arguments of the procedure token each time it is called. To illustrate this, consider the example below based on the dderef token used earlier.
#pragma token PROC { TYPE t, EXP rvalue:t**:e | EXP e } EXP rvalue:t:dderef#
#define dderef(A) (**(A))
The identifiers t and e are not in scope during the definition, being merely local identifiers for use in the procedure token introduction. The only identifier in scope is A. A identifies an expression token which is an rvalue whose type is a pointer to a pointer to a type token. The expression token and the type token are provided by the arguments at the time of calling the procedure token.
Again, the presence of a procedure token introduction can alter the semantics of a program. Consider the program below.
#pragma token PROC { TYPE t, EXP lvalue:t:, EXP lvalue:t: } STATEMENT SWAP#
#define SWAP(T, A, B) { \
	T x; \
	x = B; \
	B = A; \
	A = x; \
}

void f(int x, int y)
{
	SWAP(int, x, y)
}
Function procedure tokens are introduced with tentative implicit definitions, defining them to be direct calls of the functions they reference and effectively removing the in-lining capability. If a genuine definition is found later in the compilation, it overrides the tentative definition. An example of a tentative definition is shown below:
#pragma token FUNC int(int, long) : func#
#define func(A, B) (func) (A, B)
11. Tokens and APIs
In Chapter 1 we mentioned that one of the main problems in writing portable software is the lack of separation between specification and implementation of APIs. The TenDRA technology uses the token syntax described in the previous sections to provide an abstract description of an API specification. Collections of tokens representing APIs are called interfaces. Tchk can compile programs with these interfaces in order to check applications against API specifications independently of any particular implementation that may be present on the developer's machine.
In order to produce executable code, definitions of the interface tokens must be provided on all target machines. This is done by compiling the interfaces with the system headers and libraries.
When developing applications, programmers must ensure that they do not accidentally define a token expressing an API. Implementers of APIs, however, do not want to inadvertently fail to define a token expressing that API. Token definition states have been introduced to enable programmers to instruct the compiler to check that tokens are defined when and only when they wish them to be. This is fundamental to the separation of programs into portable and unportable parts.
When tokens are first introduced, they are in the free state. This means that the token can be defined or left undefined and if the token is defined during compilation, its definition will be output as TDF.
Once a token has been given a valid definition, its definition state moves to defined. Tokens may only be defined once. Any attempt to define a token in the defined state is flagged as an error.
There are three more token definition states which may be set by the programmer. These are as follows:
- Indefinable
-
The token is not defined and must not be defined. Any attempt to define the token will cause an error. When compiling applications, interface tokens should be in the indefinable state. It is not possible for a token to move from the state of defined to indefinable;
- Committed
-
The token must be defined during the compilation. If no definition is found the compiler will raise an error. Interface tokens should be in the committed state when being compiled with the system headers and libraries to provide definitions;
- Ignored
-
Any definition of the token that is assigned during the compilation of the program will not be output as TDF.
These token definition states are set using the pragmas:
#pragma token-op token-id-list?

token-op :
	define
	no_def
	ignore
	interface

token-id-list :
	TAG? identifier dot-list? token-id-list?

dot-list :
	. member-designator
The token-id-list is the list of tokens to which the definition state applies. The tokens in the token-id-list are identified by an identifier, optionally preceded by TAG. If TAG is present, the identifier refers to the tag name space, otherwise the macro and ordinary name spaces are searched for the identifier. If there is no dot-list present, the identifier must refer to a token. If the dot-list is present, the identifier must refer to a compound type and the member-designator must identify a member selector token of that compound type.
The token-op specifies the definition state to be associated with the tokens in the token-id-list. There are three literal operators and one context dependent operator, as follows:
-
no_def causes the token state to move to indefinable;
-
define causes the token state to move to committed;
-
ignore causes the token state to move to ignored;
-
interface is the context dependent operator and is used when describing extensions to existing APIs.
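As a sketch (using the FILE and stderr tokens from the introduction; the two pragmas would of course appear in different compilations):
#pragma token TYPE FILE#
#pragma token EXP rvalue : FILE * : stderr#

/* When checking an application against the API: */
#pragma no_def FILE stderr

/* When compiling the implementation of the API: */
#pragma define FILE stderr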
As an example of an extension API, consider the POSIX stdio.h. This is an extension of the ANSI stdio.h and uses the same tokens to represent the common part of the interface. When compiling applications, nothing can be assumed about the implementation of the ANSI tokens accessed via the POSIX API so they should be in the indefinable state. When the POSIX tokens are being implemented, however, the ANSI implementations can be assumed. The ANSI tokens are then in the ignored state. (Since the definitions of these tokens will have been output already during the definition of the ANSI interface, they should not be output again.)
The interface operator has a variable interpretation to allow the correct definition state to be associated with these `base-API tokens'. The compiler associates a compilation state with each file it processes. These compilation states determine the interpretation of the interface operator within that file.
The default compilation state is the standard state. In this state the interface operator is interpreted as the no_def operator. This is the standard state for compiling applications in the presence of APIs.
Files included using:
#include header
have the same compilation state as the file from which they were included.
The implementation compilation state is associated with files included using:
#pragma implement interface header
In this context the interface operator is interpreted as the define operator.
Including a file using:
#pragma extend interface header
causes the compilation state to be extension unless the file from which it was included was in the standard state, in which case the compilation state is the standard state. In the extension state the interface operator is interpreted as the ignore operator.
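As a sketch of how the POSIX stdio.h example above might be arranged (the header names and the choice of fileno as the POSIX-only token are hypothetical):
/* posix/stdio.h: describes the POSIX extension of the ANSI interface */
#pragma extend interface "ansi/stdio.h"
#pragma token FUNC int(FILE *) : fileno#
#pragma interface fileno

/* On a target machine, the implementation is compiled with: */
#pragma implement interface "posix/stdio.h"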
12. Token syntax
The C and C++ producers allow place-holders for various categories of syntactic classes to be expressed using directives of the form:
#pragma TenDRA token token-spec
or simply:
#pragma token token-spec
These place-holders are represented as TDF tokens and hence are called tokens. These tokens stand for a certain type, expression or whatever which is to be represented by a certain named TDF token in the producer output. This mechanism is used, for example, to allow C API specifications to be represented target independently. The types, functions and expressions comprising the API can be described using #pragma token directives and the target dependent definitions of these tokens, representing the implementation of the API on a particular machine, can be linked in later. This mechanism is described in detail elsewhere.
A summary of the grammar for the #pragma token directives accepted by the C++ producer is given in tdfc2pragma.
12.1. Token specifications
A token specification is divided into two components, a token-introduction giving the token sort, and a token-identification giving the internal and external token names:
token-spec :
	token-introduction token-identification

token-introduction :
	exp-token
	statement-token
	type-token
	member-token
	procedure-token

token-identification :
	token-namespace? identifier # external-identifier?

token-namespace :
	TAG

external-identifier :
	-
	preproc-token-list
The TAG qualifier is used to indicate that the internal name lies in the C tag namespace. This only makes sense for structure and union types. The external token name can be given by any sequence of preprocessing tokens. These tokens are not macro expanded. If no external name is given then the internal name is used. The special external name - is used to indicate that the token does not have an associated external name, and hence is local to the current translation unit. Such a local token must be defined. White space in the external name (other than at the start or end) is used to indicate that a TDF unique name should be used. The white space serves as a separator for the unique name components.
12.1.1. Expression tokens
Expression tokens are specified as follows:
exp-token :
	EXP exp-storage? : type-id :
	NAT
	INTEGER
representing an expression of the given type, a non-negative integer constant and a general integer constant, respectively. Each expression has an associated storage class:
exp-storage :
	lvalue
	rvalue
	const
indicating whether it is an lvalue, an rvalue or a compile-time constant expression. An absent exp-storage is equivalent to rvalue. All expression tokens lie in the macro namespace; that is, they may potentially be defined as macros.
For backwards compatibility with the C producer, the directive:
#pragma TenDRA++ rvalue token as const allow
causes rvalue tokens to be treated as const tokens.
12.1.2. Statement tokens
Statement tokens are specified as follows:
statement-token :
	STATEMENT
All statement tokens lie in the macro namespace.
12.1.3. Type tokens
Type tokens are specified as follows:
type-token :
	TYPE
	VARIETY
	VARIETY signed
	VARIETY unsigned
	FLOAT
	ARITHMETIC
	SCALAR
	CLASS
	STRUCT
	UNION
representing a generic type, an integral type, a signed integral type, an unsigned integral type, a floating point type, an arithmetic (integral or floating point) type, a scalar (arithmetic or pointer) type, a class type, a structure type and a union type respectively.
Floating-point, arithmetic and scalar token types have not yet been implemented correctly in either the C or C++ producers.
12.1.4. Member tokens
Member tokens are specified as follows:
member-token :
	MEMBER access-specifier? member-type-id : type-id :
where an access-specifier of public is assumed if none is given. The member type is given by:
member-type-id :
	type-id
	type-id % constant-expression
where % is used to denote bitfield members (since : is used as a separator). The second type denotes the structure or union the given member belongs to. Different types can have members with the same internal name, but the external token name must be unique. Note that only non-static data members can be represented in this form.
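For example (an invented illustration), a structure token flags with a three-bit bitfield member mode might be described by:
#pragma token STRUCT TAG flags#
#pragma token MEMBER int % 3 : struct flags : mode#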
Two declarations for the same MEMBER token (including token definitions) should have the same type; however the directive:
#pragma TenDRA++ incompatible member declaration allow
allows declarations with different types, provided these types have the same size and alignment requirements.
12.1.5. Procedure tokens
Procedure, or high-level, tokens are specified in one of three ways:
procedure-token :
	general-procedure
	simple-procedure
	function-procedure
All procedure tokens (except ellipsis functions - see below) lie in the macro namespace. The most general form of procedure token specifies two sets of parameters. The bound parameters are those which are used in encoding the actual TDF output, and the program parameters are those which are specified in the program. The program parameters are expressed in terms of the bound parameters. A program parameter can be an expression token parameter, a statement token parameter, a member token parameter, a procedure token parameter or any type. The bound parameters are deduced from the program parameters by a similar process to that used in template argument deduction.
general-procedure :
	PROC { bound-toks? | prog-pars? } token-introduction

bound-toks :
	bound-token
	bound-token , bound-toks

bound-token :
	token-introduction token-namespace? identifier

prog-pars :
	program-parameter
	program-parameter , prog-pars

program-parameter :
	EXP identifier
	STATEMENT identifier
	TYPE type-id
	MEMBER type-id : identifier
	PROC identifier
The simplest form of a general-procedure is one in which the prog-pars correspond precisely to the bound-toks. In this case the syntax:
simple-procedure :
	PROC ( simple-toks? ) token-introduction

simple-toks :
	simple-token
	simple-token , simple-toks

simple-token :
	token-introduction token-namespace? identifier?
may be used. Note that the parameter names are optional.
A function token is specified as follows:
function-procedure :
	FUNC type-id :
where the given type is a function type. This has two effects: firstly a function with the given type is declared; secondly, if the function type has the form:
r ( p1, ...., pn )
a procedure token with sort:
PROC ( EXP rvalue : p1 :, ...., EXP rvalue : pn : ) EXP rvalue : r :
is declared. For ellipsis function types only the function, not the token, is declared. Note that the token behaves like a macro definition of the corresponding function. Unless explicitly enclosed in a linkage specification, a function declared using a FUNC token has C linkage. Note that it is possible for two FUNC tokens to have the same internal name, because of function overloading; however external names must be unique.
The directive:
#pragma TenDRA incompatible interface declaration allow
can be used to allow incompatible redeclarations of functions declared using FUNC tokens. The token declaration takes precedence.
Certain of the more complex examples of PROC tokens, such as tokens with PROC parameters, have not been implemented in either the C or C++ producers.
12.2. Token arguments
As mentioned above, the program parameters for a PROC token are those specified in the program itself. These arguments are expressed as a comma-separated list enclosed in brackets, the form of each argument being determined by the corresponding program parameter.
An EXP argument is an assignment expression. This must be an lvalue for lvalue tokens and a constant expression for const tokens. The argument is converted to the token type (for lvalue tokens this is essentially a conversion between the corresponding reference types). A NAT or INTEGER argument is an integer constant expression. In the former case this must be non-negative.
A STATEMENT argument is a statement. This statement should not contain any labels or any goto or return statements.
A type argument is a type identifier. This must name a type of the correct category for the corresponding token. For example, a VARIETY token requires an integral type.
A member argument must describe the offset of a member or nested member of the given structure or union type. The type of the member should agree with that of the MEMBER token. The general form of a member offset can be described in terms of member selectors and array indexes as follows:
member-offset :
	::? id-expression
	member-offset . ::? id-expression
	member-offset [ constant-expression ]
A PROC argument is an identifier. This identifier must name a PROC token of the appropriate sort.
12.3. Defining tokens
Given a token specification of a syntactic object and a normal language definition of the same object (including macro definitions if the token lies in the macro namespace), the producers attempt to unify the two by defining the TDF token in terms of the given definition. Whether the token specification occurs before or after the language definition is immaterial. Unification also takes place in situations where, for example, two types are known to be compatible. Multiple consistent explicit token definitions are allowed by default where the language permits them; this behaviour is controlled by the directive:
#pragma TenDRA compatible token allow
The default unification behaviour may be modified using the directives:
#pragma TenDRA no_def token-list
#pragma TenDRA define token-list
#pragma TenDRA reject token-list
or equivalently:
#pragma no_def token-list
#pragma define token-list
#pragma ignore token-list
which set the state of the tokens given in token-list. A state of no_def means that no unification is attempted and that any attempt to explicitly define the token results in an error. A state of define means that unification takes place and that the token must be defined somewhere in the translation unit. A state of reject means that unification takes place as normal, but any resulting token definition is discarded and not output to the TDF capsule.
If a token with the state define is not defined, then the behaviour depends on the sort of the token. A FUNC token is implicitly defined in terms of its underlying function, such as:
#define f( a1, ...., an ) ( f ) ( a1, ...., an )
Other undefined tokens cause an error. This behaviour can be modified using the directives:
#pragma TenDRA++ implicit token definition allow
#pragma TenDRA++ no token definition allow
respectively.
The primitive operations, no_def, define and reject, can also be expressed using the context sensitive directive:
#pragma TenDRA interface token-list
or equivalently:
#pragma interface token-list
By default this is equivalent to no_def, but may be modified by inclusion using one of the directives:
#pragma TenDRA extend header-name
#pragma TenDRA implement header-name
or equivalently:
#pragma extend interface header-name
#pragma implement interface header-name
These are equivalent to:
#include header-name
except that the form [....] is allowed as a header name. This is equivalent to <....> except that it starts the directory search after the point at which the including file was found, rather than at the start of the path (i.e. it is equivalent to the #include_next directive found in some preprocessors). The effect of the extend directive on the state of the interface directive is as follows:
no_def -> no_def
define -> reject
reject -> reject
The effect of the implement directive is as follows:
no_def -> define
define -> define
reject -> reject
That is to say, an implement directive will cause all the tokens in the given header to be defined and their definitions output. Any tokens included in this header by extend may be defined, but their definitions will not be output. This is precisely the behaviour which is required to ensure that each token is defined exactly once in an API library build.
The lists of tokens in the directives above are expressed in the form:
token-list :
	token-id token-list?
	# preproc-token-list
where a token-id represents an internal token name:
token-id :
	token-namespace? identifier
	type-id . identifier
Note that member tokens are specified by means of both the member name and its parent type. In this type specifier, TAG, rather than class, struct or union, may be used in elaborated type specifiers for structure and union tokens. If the token-id names an overloaded function then the directive is applied to all FUNC tokens of that name. It is possible to be more selective using the # form which allows the external token name to be specified. Such an entry must be the last in a token-list.
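For example (an invented illustration, using tokens in the style of §9), a directive might name an expression token, a structure token and one of its member tokens together:
#pragma token STRUCT s_t#
#pragma token MEMBER int : s_t : mem#
#pragma token EXP rvalue : int : val#

#pragma interface val s_t . mem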
A related directive has the form:
#pragma TenDRA++ undef token token-list
which undefines all the given tokens so that they are no longer visible.
As noted above, a macro is only considered as a token definition if the token lies in the macro namespace. Tokens which are not in the macro namespace, such as types and members, cannot be defined using macros. Occasionally API implementations do define member selectors as macros in terms of other member selectors. Such a token needs to be explicitly defined using a directive of the form:
#pragma TenDRA member definition type-id : identifier member-offset
where member-offset is as above.