Tokens and APIs

11. Tokens and APIs

11.1. Application programming interfaces
11.2. Linking to APIs
11.3. Target independent headers, unique_extern
11.4. Language programming interfaces
11.5. Namespaces and APIs

All of the examples of the use of TOKENs given so far have really been abbreviations for commonly used constructs, e.g. the EXP OFFSETs for fields of structures. However, the real justification for TOKENs is their use as abstractions for things defined in libraries or application programming interfaces (APIs).

11.1. Application programming interfaces

APIs usually do not give complete language definitions of the operations and values that they contain; generally, they are defined informally in English giving relationships between the entities within them. An API designer should allow implementors the opportunity of choosing actual definitions which fit their hardware and the possibility of changing them as better algorithms or representations become available.

The most commonly quoted example is the representation of the type FILE and its related operations in C. The ANSI C definition gives no common representation for FILE; its implementation is defined to be platform-dependent. A TDF producer can assume nothing about FILE, not even that it is a structure. The only things that can alter or create FILEs are also entities in the ANSI C API, and they will always refer to FILEs via a C pointer. Thus TDF abstracts FILE as a SHAPE TOKEN with no parameters, make_tok(T_FILE) say. Any program that uses FILE would have to include a TOKDEC introducing T_FILE:

make_tokdec(T_FILE, empty, shape())

and anywhere that it wished to refer to the SHAPE of FILE it would do:

shape_apply_token(make_tok(T_FILE), ())

Before this program is translated on a given platform, the actual SHAPE of FILE must be supplied. This would be done by linking a TDF CAPSULE which supplies the TOKDEF for the SHAPE of FILE which is particular to the target platform.

Many of the C operations which use FILEs are explicitly allowed to be expanded either as procedure calls or as macros. For example, putc(c, f) may be implemented either as a procedure call or as the expansion of a macro which uses the fields of f directly. Thus, it is quite natural for putc(c, f) to be represented in TDF as an EXP TOKEN with two EXP parameters, which allows it to be expanded in either way. Of course, this would be quite distinct from the use of putc as a value (as a proc parameter of a procedure, for example), which would require some other representation. One such representation that comes to mind might be simply to make a TAGDEC for the putc value, supplying its TAGDEF in the ANSI API CAPSULE for the platform. This might prove to be rather short-sighted, since it denies us the possibility that the putc value itself might be expanded from other values; hence it would be better as another parameterless TOKEN. I have not come across an actual API expansion for the putc value as other than a simple TAG; however, the FILE* value stdin is sometimes expressed as:

#define stdin &_iob[0]

which illustrates the point. It is better to have all of the interface of an API expressed as TOKENs to give both generality and flexibility across different platforms.
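The two uses of putc distinguished above can be seen in ordinary C. The following is only a sketch: the names emit_applied, putc_fn, putc_as_value and emit_via_value are invented for illustration and are not part of any API. It relies on the ANSI C rule that a function-like macro is expanded only when its name is followed by an opening parenthesis, so the bare name putc always denotes the underlying procedure.

```c
#include <stdio.h>

/* Use 1: putc applied to its arguments. The implementation is free to
   expand this as a macro that uses the fields of *f directly, or as an
   ordinary procedure call. */
int emit_applied(int c, FILE *f)
{
    return putc(c, f);
}

/* Use 2: putc as a value. The name is not followed by '(' here, so a
   function-like macro definition of putc is not expanded and the name
   refers to the real procedure. */
typedef int (*putc_fn)(int, FILE *);

putc_fn putc_as_value(void)
{
    return putc;
}

/* Calling through the value always results in a procedure call. */
int emit_via_value(int c, FILE *f)
{
    return putc_as_value()(c, f);
}
```

In TDF terms, the first use corresponds to the two-parameter EXP TOKEN and the second to the separate representation of the putc value.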

11.2. Linking to APIs

In general, each API requires platform-dependent definitions to be supplied by a combination of TDF linking and system linking for that platform. This is illustrated in the following diagram giving the various phases involved in producing a runnable program.

Figure 9. TDF Production, Linking and Translating

There will be CAPSULEs for each API on each platform giving the expansions for the TOKENs involved, usually as uses of identifiers which will be supplied by system linking from some libraries. These CAPSULEs would be derived from the header files on the platform for the API in question, usually using some automatic tools. For example, there will be a TDF CAPSULE (derived from <stdio.h>) which defines the TOKEN T_FILE as the SHAPE for FILE, together with definitions for the TOKENs for putc, stdin, etc., in terms of identifiers which will be found in the library libc.a.

11.3. Target independent headers, unique_extern

Any producer which uses an API will use system independent information to give the common interface TOKENs for this API. In the C producer, this is provided by header files using pragmas, which tell the producer which TOKENs to use for the particular constructs of the API. In any target-independent CAPSULE which uses the API, these TOKENs would be introduced as TOKDECs and made globally accessible by using make_linkextern. For a world-wide standard API, the EXTERNAL "name" for a TOKEN used by make_linkextern should be provided by an application of unique_extern on a UNIQUE drawn from a central repository of names for entities in standard APIs; this repository would form a kind of super-standard for naming conventions in all possible APIs. The mechanism for controlling this super-standard has yet to be set up, so at the moment all EXTERN names are created by string_extern.

An interesting example in the use of TOKENs comes in abstracting field names. Often, an API will say something like "the type Widget is a structure with fields alpha, beta ..." without specifying the order of the fields or whether the list of fields is complete. The field selection operations for Widget should then be expressed using EXP OFFSET TOKENs; each field would have its own TOKEN giving its offset which will be filled in when the target is known. This gives implementors on a particular platform the opportunity to reorder fields or add to them as they like; it also allows for extension of the standard in the same way.
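A rough C analogue may make this concrete. In the sketch below, struct Widget is one hypothetical platform's layout (with an extra field and the fields in an arbitrary order), and the macros WIDGET_OFF_ALPHA and WIDGET_OFF_BETA stand in for the per-field OFFSET TOKENs; all of these names are assumptions made for illustration. The accessor code depends only on the offsets, never on the layout.

```c
#include <stddef.h>

/* Hypothetical layout chosen by one platform: the fields are reordered
   and an implementation-private field has been added. */
struct Widget {
    int    internal;
    double beta;
    int    alpha;
};

/* Stand-ins for the OFFSET TOKENs; only these definitions would change
   from platform to platform. */
#define WIDGET_OFF_ALPHA offsetof(struct Widget, alpha)
#define WIDGET_OFF_BETA  offsetof(struct Widget, beta)

/* Portable field selection: the code below uses only the offsets. */
int *widget_alpha(void *w)
{
    return (int *)((char *)w + WIDGET_OFF_ALPHA);
}

double *widget_beta(void *w)
{
    return (double *)((char *)w + WIDGET_OFF_BETA);
}
```

Reordering the fields of struct Widget, or adding new ones, leaves widget_alpha and widget_beta unchanged, which is the freedom the API description is meant to preserve.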

The most common SORTs of TOKENs used for APIs are SHAPEs to represent types, and EXPs to represent values, including procedures and constants. NATs and VARIETYs are also sometimes used where the API does not specify the types of integers involved. The other SORTs are rarely used in APIs; indeed it is difficult to imagine any realistic use of TOKENs of SORT BOOL. However, the criterion for choosing which SORTs are available for TOKENisation is not their immediate utility, but that the structural integrity and simplicity of TDF is maintained. It is fairly obvious that having BOOL TOKENs will cause no problems, so we may as well allow them.

11.4. Language programming interfaces

So far, I have been speaking as though a TOKENised API could only be some library interface, built on top of some language, like xpg3, posix, X etc. on top of C. However, it is possible to consider the constructions of the language itself as ideal candidates for TOKENisation. For example, the C for-statement could be expressed as a TOKEN with four parameters. This TOKEN could be expanded in TDF in several different ways, all giving the correct semantics of a for-statement. A translator (or other tool) could choose the expansion it wants depending on context and the properties of the parameters. The C producer could give a default expansion which a lazy translator writer could use, but others might use expansions which are more advantageous. This idea could be extended to virtually all the constructions of the language, giving what is in effect a C-language API; perhaps this might more properly be called a language programming interface (LPI). Thus, we would have TOKENs for C for-statements, C conditionals, C procedure calls, C procedure definitions etc. ^[h]
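The idea of one four-parameter construct with several interchangeable expansions can be imitated with C macros. This is only an analogy, not how a producer or translator would actually express it in TDF; the macro names C_FOR_TOP and C_FOR_BOTTOM and the two summing procedures are invented for illustration. The sketch also assumes that the body contains no continue statement, since neither expansion handles that case as a real for-statement would.

```c
/* Expansion 1: test at the top of the loop. */
#define C_FOR_TOP(init, cond, step, body) \
    do { init; while (cond) { body; step; } } while (0)

/* Expansion 2: one guarding test, then a loop with the test at the
   bottom, which some translators might prefer. */
#define C_FOR_BOTTOM(init, cond, step, body) \
    do { init; if (cond) { do { body; step; } while (cond); } } while (0)

/* The same source-level loop, written once against each expansion. */
int sum_below_top(int n)
{
    int i, s = 0;
    C_FOR_TOP(i = 0, i < n, i++, s += i);
    return s;
}

int sum_below_bottom(int n)
{
    int i, s = 0;
    C_FOR_BOTTOM(i = 0, i < n, i++, s += i);
    return s;
}
```

Both procedures compute the same result for every n; which expansion is chosen is a matter for whoever supplies the definition, not for the code that uses the construct.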

The notion of a producer for any language working to an LPI specific to the constructs of the language is very attractive. It could use different TOKENs to reflect the subtle differences between uses of similar constructs in different languages which might be difficult or impossible to detect from their expansions, but which could allow better optimisations in the object code. For example, Fortran procedures are slightly different from C procedures in that they do not allow aliasing between parameters and globals. While application of the standard TDF procedure calls would be semantically correct, knowledge that the non-aliasing rule applies would allow some procedures to be translated to more efficient code. A translator without knowledge of the semantics implicit in the TOKENs involved would still produce correct code, but one which knew about them could take advantage of that knowledge.

I also think that LPIs would be a very useful tool for crystallising ideas on how languages should be translated, allowing one to experiment with expansions not thought of by the producer writer. This decoupling is also an escape clause allowing the producer writer to defer the implementation of a construct completely to translate-time or link-time, as is done at the moment in C for off-stack allocation. As such it also serves as a useful test-bed for TOKEN constructions which may in future become new constructors of core TDF.

11.5. Namespaces and APIs

Namespace problems are amongst the most difficult faced by standard defining bodies (for example, the ANSI and POSIX committees) and they often go to great lengths to specify which names should, and should not, appear when certain headers are included. (The position is set out in D. F. Prosser, Header and name space rules for UNIX systems (private communication), USL, 1993.)

For example, the intention, certainly in ANSI, is that each header should operate as an independent sub-API. Thus va_list is prohibited from appearing in the namespace when stdio.h is included (it is defined only in stdarg.h) despite the fact that it appears in the prototype:

int vprintf ( const char *, va_list ) ;

This seeming contradiction is worked round on most implementations by defining a type __va_list in stdio.h which has exactly the same definition as va_list, and declaring vprintf as:

int vprintf ( const char *, __va_list ) ;

This is only legal because __va_list is deemed not to corrupt the namespace because of the convention that names beginning with __ are reserved for implementation use.

This particular namespace convention is well-known, but there are others defined in these standards which are not generally known (and since no compiler I know tests them, not widely adhered to). For example, the ANSI header errno.h reserves all names given by the regular expression:

E[0-9A-Z][0-9a-z_A-Z]+

against macros (i.e. in all namespaces). By prohibiting the user from using names of this form, the intention is to protect against namespace clashes with extensions of the ANSI API which introduce new error numbers. It also protects against a particular implementation of these extensions - namely that new error numbers will be defined as macros.

A better example of protecting against particular implementations comes from POSIX. If sys/stat.h is included, names of the form:

st_[0-9a-z_A-Z]+

are reserved against macros (as member names). The intention here is not only to reserve field selector names for future extensions to struct stat (which would only affect API implementors, not ordinary users), but also to reserve against the possibility that these field selectors might be implemented by macros. So our st_atime example in section 2.2.3 is strictly illegal because the procedure name st_atime lies in a restricted namespace. Indeed the namespace is restricted precisely to disallow this program.

As an exercise for the reader, how many of your programs use names from the following restricted namespaces (all drawn from ANSI, all applying to all namespaces)?

is[a-z][0-9a-z_A-Z]+ (ctype.h)
to[a-z][0-9a-z_A-Z]+ (ctype.h)
str[a-z][0-9a-z_A-Z]+ (stdlib.h)
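Since, as noted above, compilers do not generally test these rules, a small checker is easy to sketch. The following C fragment tests a name against the ANSI patterns quoted in this section (the errno.h pattern and the three just listed), exactly as they are written here. The procedure name reserving_header and its helpers are invented for illustration, and the sketch assumes the default "C" locale for the character classification.

```c
#include <ctype.h>
#include <stddef.h>

/* One or more characters, each drawn from [0-9a-z_A-Z]. */
static int word_tail(const char *s)
{
    if (*s == '\0')
        return 0;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (!isalnum(c) && c != '_')
            return 0;
    }
    return 1;
}

/* Is name the given prefix, then one lower-case letter, then a tail? */
static int prefix_lower_tail(const char *name, const char *prefix)
{
    for (; *prefix != '\0'; prefix++, name++) {
        if (*name != *prefix)
            return 0;
    }
    return islower((unsigned char)*name) && word_tail(name + 1);
}

/* Returns the header quoted as reserving name, or NULL if none does. */
const char *reserving_header(const char *name)
{
    unsigned char second =
        (unsigned char)(name[0] != '\0' ? name[1] : '\0');

    if (name[0] == 'E' && (isdigit(second) || isupper(second))
        && word_tail(name + 2))
        return "errno.h";
    if (prefix_lower_tail(name, "is") || prefix_lower_tail(name, "to"))
        return "ctype.h";
    if (prefix_lower_tail(name, "str"))
        return "stdlib.h";
    return NULL;
}
```

Innocent-looking identifiers such as island, total and strategy all fall inside the restricted namespaces, while count and is_ok do not.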

With the TDF approach of describing APIs in abstract terms using the #pragma token syntax, most of these namespace restrictions are seen to be superfluous. When a target independent header is included, precisely the objects defined in that header in that version of the API appear in the namespace. There are no worries about what else might happen to be in the header, because there is nothing else. Also, implementation details are separated off to the TDF library building, so possible namespace pollution through particular implementations does not arise.

Currently TDF does not have a neat way of solving the va_list problem. The present target independent headers use a similar workaround to that described above (exploiting a reserved namespace). (See the footnote in section 3.4.1.1.)

None of this is intended as criticism of the ANSI or POSIX standards. It merely shows some of the problems that can arise from the insufficient separation of code.

^[h]
Exercise for the reader: what are the SORTs of these parameters?
The current C producer does this for some of the constructs, but not in any systematic manner; perhaps it will change.