3. Commands
- 3.1. Pre-pass Mappings
- 3.2. Character Group Definitions
- 3.3. Whitespace Definition
- 3.4. Token Definitions
- 3.5. Default definition
- 3.6. Keyword definitions
- 3.7. Zone Definitions
- 3.8. Type Definitions
- 3.9. Action prototypes
Lexical analysers are described to Lexi by a sequence of commands. This section provides an explanation of each possible command, and explains their respective intended uses.
3.1. Pre-pass Mappings
The lexical analysis runs in two passes. The first pass, or pre-pass stage, permits replacements to be substituted before the main pass, during which tokenisation takes place. This gives a convenient mechanism for expressing trigraph-like substitutions as found in C. The syntax to define pre-pass replacements is:
MAPPING sequence + "->" + char ;
The string on the right (i.e. the value with which the matched string is replaced) may only contain one character, or an escape sequence which yields one character.
For example, to replace the trigraph ??= with a single #:
MAPPING "??=" -> "#" ;
This would replace instances of ??= with # before any tokenisation takes place. So the input a??=b would match the token definition:
TOKEN "a#b" -> a ;
(and so would simply a#b, as usual).
A group may be included in the character sequence to be replaced. For example:
MAPPING "[alpha]" -> " " ;
will replace any alphabetic character by a blank, assuming the alpha group is suitably defined at that point. See §2.5 for details of including groups in sequences.
A group may also be negated, so that the mapping matches any character not in the group. For example:
MAPPING "[^alpha]" -> " " ;
will replace any non-alphabetic character by a blank.
Mappings are substituted repeatedly until no further mappings match. The order of replacement for mappings matching strings of equal length is undefined, and so it is an error to define a mapping which produces a character used at the start of any mapping, including itself. For example:
MAPPING "???" -> "?" ;
is illegal. To see why, consider the input aab for the (illegal) mappings:
MAPPING "aa" -> "x" ;
MAPPING "xb" -> "y" ;
Since the order of substitution for mappings matching strings of equal length is undefined, it is unclear whether this will result in xb or y. Notice that C does not define a ??? trigraph, perhaps for this very reason (or perhaps simply because it is redundant). This restriction applies no matter how the string defining the characters to be mapped is formed: for example, it is also illegal to define a mapping which maps to a character present in a group included at the start of another mapping.
Mappings bind from left to right. For example, given the mappings:
MAPPING "ab" -> "d" ;
MAPPING "bc" -> "d" ;
the input abc will produce dc, not ad.
Mappings matching longer sequences take precedence over mappings matching shorter sequences which are prefixes of the longer ones. For example, given the mappings:
MAPPING "abcdef" -> "x" ;
MAPPING "abcd" -> "y" ;
the input abcdef will produce x, not yef.
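The binding and precedence rules above can be sketched in C. This is only an illustration of the described behaviour, not Lexi's generated code: the names struct mapping, mapping_pass and apply_mappings are hypothetical, and the sketch assumes the mappings are legal in the sense described above (otherwise the loop may not terminate).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration only; this is not code generated by Lexi. */
struct mapping {
    const char *from; /* sequence to be replaced */
    char to;          /* the single replacement character */
};

/*
 * One left-to-right pass over buf, rewriting it in place. At each
 * position the longest matching mapping wins. Returns the number of
 * replacements made. In-place rewriting is safe because the output is
 * never longer than the input.
 */
static size_t mapping_pass(const struct mapping *maps, size_t n, char *buf)
{
    const char *in = buf;
    char *out = buf;
    size_t replaced = 0;

    while (*in != '\0') {
        const struct mapping *best = NULL;
        size_t best_len = 0;
        size_t i;

        for (i = 0; i < n; i++) {
            size_t len = strlen(maps[i].from);
            if (len > best_len && strncmp(in, maps[i].from, len) == 0) {
                best = &maps[i];
                best_len = len;
            }
        }

        if (best != NULL) {
            *out++ = best->to;
            in += best_len;
            replaced++;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
    return replaced;
}

/* Substitute repeatedly until no further mappings match. */
static void apply_mappings(const struct mapping *maps, size_t n, char *buf)
{
    while (mapping_pass(maps, n, buf) > 0)
        ;
}
```

With the two mappings "ab" -> "d" and "bc" -> "d", calling apply_mappings() on a buffer holding abc leaves dc in the buffer, since the leftmost match is taken first.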
3.2. Character Group Definitions
A group is an unordered set of characters; a group definition names such a set for use in strings. The syntax of a group definition is:
group-defn := "GROUP" + identifier + "=" + chars ;
The identifier specified is the name by which the group may be referenced.
For example, the following are valid group definitions:
GROUP alpha = {A-Z} + {a-z} + "_" ;
GROUP digit = {0-9} ;
GROUP odds = "13579" ;
GROUP even = "02468" ;
GROUP vowels = "aeiou" ;
GROUP anything = "atrf" + "HGMP" + {0-9} ;
Groups may include the sets of previously-defined groups. Any character in the referenced set will be included into the group definition. For example:
GROUP hexdigit = "[digit]ABCDEFabcdef" ;
GROUP alphanum = "[alpha]" + {0-9} ;
assuming the groups alpha and digit have already been defined at that point. See §2.5 for details of the syntax for the chars production.
Groups may not contain characters which are present in other groups (i.e. they may not overlap). See §3.4 for further discussion of why this is so.
Macros to test if a character belongs to a group are provided in the generated code, of the form is_groupname(). These must be passed the index into the look-up table containing the given character, obtained by lookup_char(c). For example:
is_alpha(lookup_char('a'))
would yield true, assuming the group alpha is suitably defined.
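To make the relationship between lookup_char() and the is_groupname() macros concrete, here is a minimal sketch of one way such a table could be laid out, with one bit per group. The layout, the GROUP_* constants and the init_lookup() helper are assumptions made for illustration; the table and macros that Lexi actually generates may differ.

```c
#include <limits.h>

/* Hypothetical layout: one bit per group. Lexi's generated table may differ. */
enum { GROUP_ALPHA = 1 << 0, GROUP_DIGIT = 1 << 1 };

/* One entry per possible character value, holding that character's group bits. */
static unsigned char lookup_tab[UCHAR_MAX + 1];

/* Mark every character of chars as a member of the group identified by bit. */
static void add_to_group(const char *chars, unsigned char bit)
{
    for (; *chars != '\0'; chars++)
        lookup_tab[(unsigned char)*chars] |= bit;
}

/* Fill the table for the alpha and digit groups defined earlier in this section. */
static void init_lookup(void)
{
    add_to_group("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_",
                 GROUP_ALPHA);
    add_to_group("0123456789", GROUP_DIGIT);
}

/* Fetch the table entry for a character, then test it against a group. */
#define lookup_char(c) (lookup_tab[(unsigned char)(c)])
#define is_alpha(t)    (((t) & GROUP_ALPHA) != 0)
#define is_digit(t)    (((t) & GROUP_DIGIT) != 0)
```

In this sketch the table must be filled by calling init_lookup() once before any macro is used; a generated lexer would more plausibly ship the table as a static initialiser.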
The group name white may not be used for groups other than the whitespace definition; see §3.3 for details.
A group name is unique amongst groups; groups may only be defined once.
3.3. Whitespace Definition
Consecutive whitespace characters outside of tokens are skipped by the lexical analyser before each token is recognised. Whitespace is treated with the semantics of a single token delimiter. Lexi specifies whitespace by the special group name white, which may not be used as an identifier to name other groups.
For special cases where whitespace has significance (a common example is inside string literals), token definitions may call user-defined functions which purposefully circumvent the whitespace-skipping features of Lexi.
The syntax is the same as for any group definition (§3.2), but with the special group name white:
white-defn := "GROUP" + "white" + "=" + chars ;
The whitespace definition may be omitted, in which case it defaults to space, horizontal tabulation and newline. Therefore it is always present, even if the declaration is implicit. As with any group, it may not be defined multiple times.
For example:
GROUP white = " \t\n\r";
Aside from the additional semantics explained above, the whitespace group behaves as any other group: it is present in the API as is_white(), and may be included in sequences (§2.5) as [white].
It is illegal to define the whitespace group to contain characters which are present in token definitions, including groups which those tokens use.
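The skipping behaviour described in this section can be pictured with a small C sketch that uses the default whitespace set (space, horizontal tabulation and newline). The names white_chars, is_white_char and skip_white are hypothetical and chosen to avoid confusion with the generated is_white() macro; this is not how Lexi itself implements skipping.

```c
#include <stddef.h>
#include <string.h>

/* The default whitespace set: space, horizontal tab and newline. */
static const char white_chars[] = " \t\n";

/* Hypothetical helper: non-zero if c is in the whitespace set. */
static int is_white_char(int c)
{
    /* strchr() would also match the terminating NUL, so exclude it. */
    return c != '\0' && strchr(white_chars, c) != NULL;
}

/* Return a pointer to the first non-whitespace character at or after s. */
static const char *skip_white(const char *s)
{
    while (is_white_char((unsigned char)*s))
        s++;
    return s;
}
```

A lexer built along these lines would call skip_white() before attempting to recognise each token, so that any run of whitespace acts as a single delimiter.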
3.4. Token Definitions
Tokens are sequences of characters read by the lexical analyser and produced as output. Each token has a unique identifier, which is passed to code calling Lexi, along with the characters read which form the token's spelling.
Tokens are usually the main component of a lexical analyser. In Lexi's case, the only situation in which there would be no token declarations is if the lexical analyser is to exclusively perform pre-pass mappings. The effect of specifying neither tokens nor pre-pass mappings is undefined.
The syntax for specifying tokens is:
token-defn := "TOKEN" + chars + "->" + action-list ;
command-list := "TOKEN" + chars + "->" command + "," + command-list ;
A command is either a return terminal command, a discard command, or a function call command. Inline actions (Sid-like) will be introduced in version 2.0 but are not ready in trunk yet.
- The token discard command
  It is represented by the token $$. It should be in the last position. If this is the only instruction, the token will be read and its spelling discarded.
- The return terminal command
  A return terminal command is either a Sid identifier or a non-prefixed identifier.
  return-terminal-command := "$" + identifier | identifier ;
  An example of these two forms:
  TOKEN "token1" -> $sid-identifier ;
  TOKEN "token2" -> nonsid_identifier ;
  A Sid identifier is an identifier that will be prefixed by lex_ or the prefix given in the -t option. Non-prefixed identifiers will not be prefixed at all. The non-prefixed form is considered obsolete and might be removed without notice. The first form of token definition states that upon encountering token1 the lexer should return the terminal corresponding to sid-identifier. A return terminal command must appear last in the list of commands.
- The function call command
  A function call command has the form
  function-call-command := identifier + "(" + argument-list + ")" ;
  An argument-list is a comma-separated list. Here are examples of arguments:
  TOKEN "[alpha1][digit]" -> get_identifier1(##); // Old form, will probably be obsoleted; calls get_identifier1(c0,c1)
  TOKEN "[alpha2][digit]" -> get_identifier2(); // Old form, equivalent to the previous form; will probably be obsoleted
  TOKEN "[alpha3]" -> get_identifier3(#$); // calls get_identifier3() with no argument
  TOKEN "[alpha4]a" -> get_identifier3(#); // calls get_identifier4()
  TOKEN "[alpha4][digit]b" -> get_identifier4("globalbuffer", #1,#0); // calls get_identifier4(globalbuffer,c1,c0)
  TOKEN "[alpha5]a" -> get_identifier5(#*); // should call get_identifier5(char*), not completely implemented yet.
  where c0, c1, c2, ... match the first, second, third, ... characters of the token. It is possible to combine the arguments (except for #$, which is only valid if it is the only argument of the list).
  The return value of a function will be ignored unless this is the last function call in the command list, in which case it is expected to return a valid terminal. If you do not want the last called function to return a terminal, you have to add a trailing discard $$:
  TOKEN "[digit]" -> push_buffer(#0), $$; // do not return a terminal.
  TOKEN "[alpha]" -> get_identifier(#0); // return a terminal
- Inline action calls
  An inline action call has the form
  (/* comma-separated output list */) = <action-name>(/* comma-separated argument list */);
  The < and > delimiters may or may not be removed from the syntax. The decision will happen prior to 2.0. The argument list can contain either local variables or #-style arguments. Actions and types must have been previously declared.
See §5.2 for the C representation of the terminals returned by read_token().
The second form states that the lexer should return the result of the call to the given function identifier. See §5.3 for details of the function call made in C.
In more complex cases (most notably where tokens need to include whitespace), tokens are mapped to user-defined functions. For example, for comments in a C-style language, the lexical analyser is expected to discard characters until the end of the comment is found. In Lexi, this is specified as:
TOKEN "/*" -> get_comment() ;
where get_comment() is an externally defined function which simply reads characters until the corresponding */ is found. See the functions section for further details of calling functions.
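A user-defined get_comment() along these lines might look like the following sketch. The input pointer and next_char() are hypothetical stand-ins for whatever character-reading interface the lexer provides, and the int status return is a simplification: as noted above, a function that ends a command list is expected to return a terminal.

```c
#include <stddef.h>

/* Hypothetical input source standing in for the lexer's character reader. */
static const char *input = "";

/* Return the next input character, or -1 once the input is exhausted. */
static int next_char(void)
{
    return *input != '\0' ? (unsigned char)*input++ : -1;
}

/*
 * Called once the opening delimiter of a comment has been matched.
 * Discards characters up to and including the closing star-slash
 * delimiter. Returns 0 on success, or -1 if the input ends first.
 */
static int get_comment(void)
{
    int prev = 0;
    int c;

    while ((c = next_char()) != -1) {
        if (prev == '*' && c == '/')
            return 0;
        prev = c;
    }
    return -1; /* unterminated comment */
}
```

Tracking only the previous character is enough here, because the closing delimiter is two characters long; runs of asterisks before the final slash are handled without special cases.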
In most languages, keywords are usually a subset of identifiers. In order to handle these and simplify the user-defined read_identifier() function (or equivalent), Lexi provides the keywords mechanism discussed in §3.6.
Note that this example does not illustrate storing the characters read. A real-world case would usually store spellings in order to be useful to a later stage, such as parsing.
Unlike many lexical analysers, tokens in Lexi are not specified by regular expressions. However, as sequences of characters may contain references to groups (which are treated as sets), similar effects may be achieved for simple cases. For example:
TOKEN "[alpha]" -> get_identifier () ;
TOKEN "$[alpha]" -> get_sid_identifier () ;
assuming the group alpha is defined. Another example:
TOKEN "A[digit]" -> $papersize ;
would match paper sizes such as A3, A4 and so on. A token may only be defined once, but different tokens may share the same terminal or call the same function. So to extend our (rather lax) implementation of ISO 216 paper sizes:
TOKEN "A[digit]" -> $papersize ;
TOKEN "A10" -> $papersize ;
TOKEN "C[digit]" -> $envelopesize ;
TOKEN "B[digit]" -> $envelopesize ;
TOKEN "DL" -> $envelopesize ; /* BS 4264 specifies a transparent window for DL */
TOKEN "ID-[digit]" -> $identificationcardsize ;
Note that A10 maps to the same terminal as the single-digit A sizes. See §2.4 for further examples of multiple tokens sharing one function; further examples of using sets within sequences are given elsewhere in this guide.
Using groups in this way is especially useful in combination with functions for reading variable-length tokens. For example:
TOKEN "[alpha]" -> get_identifier() ;
TOKEN "$[alpha]" -> get_sid_identifier() ;
3.5. Default definition
A new feature in 2.0 is the ability to specify default actions (usually returning a token). For example:
TOKEN "[alpha]" -> get_identifier() ;
TOKEN DEFAULT -> $unknown ;
specifies that the terminal $unknown should be returned upon encountering a sequence of characters that cannot be mapped to any other specified token.
3.6. Keyword definitions
The syntax of keywords resembles the syntax used for tokens:
keyword-defn := "KEYWORD" + string + "->" + either-identifier ;
For example:
KEYWORD "keyword" -> $key ;
KEYWORD "special" -> get_special () ;
Usually keywords are simply identifiers with a special meaning. Using the main pass to identify keywords with the token constructs is possible but awkward, since the set of keyword spellings is usually a small subset of the much bigger set of identifiers. The keyword construct facilitates the identification of keywords after a token has been found; keywords have effect only under the -k option and are otherwise not present in the output generated by Lexi. Therefore the only difference between keywords and tokens (and indeed their purpose) is the programmatic interface that they provide.[a]
Code generated by Lexi under the -k option consists of a succession of calls to define each keyword, one per KEYWORD statement:
MAKE_KEYWORD ( "keyword", "lex_key" )
MAKE_KEYWORD ( "special", "get_special ()" )
where the identifier has been transformed according to the rules for Sid identifiers. It is then left to the user to implement MAKE_KEYWORD, usually by way of a macro. The generated keyword code is intended to be included with a #include directive. Suppose that keyword.h contains the keyword code; then, building on existing token definitions, the intended use for keywords is as follows (for example, with a lexer required to identify variable names):
KEYWORD "if" -> keyword_if ;
KEYWORD "else" -> keyword_else ;
TOKEN "[alpha]" -> get_variable() ;
where get_variable() checks whether the given token is actually a keyword, like so:
<type> get_variable(int c) {
    char *token; /* token points to the characters read */
    ...
#define MAKE_KEYWORD(A, B) \
    if (!strcmp(token, (A))) return(B);
#include "keyword.h"
    return(lex_variable);
    ...
}
Here keyword.h was generated by Lexi's -k option. If the variable name read by get_variable() and pointed at by char *token is a keyword, keyword.h's calls to MAKE_KEYWORD() will result in string comparisons of token to each possible keyword in turn (that is, token is compared to "if" and "else"). If either of these matches, the identifier keyword_if or keyword_else is returned, respectively. Otherwise, if no keywords match, the token is known to be a variable name, and so get_variable() falls through to return the lex_variable identifier.
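The pattern described above can be tried out in a single file with the following sketch. Because no generated keyword.h is available here, the KEYWORDS macro stands in for its contents (its shape is an assumption based on the description above), the terminal values are a hypothetical enum, and classify() is a hypothetical stand-in for the keyword check inside get_variable().

```c
#include <string.h>

/* Hypothetical terminal values; real ones would come from the parser. */
enum { lex_variable = 1, keyword_if, keyword_else };

/*
 * Stand-in for the contents of keyword.h as generated under -k
 * (assumed shape): one MAKE_KEYWORD call per KEYWORD statement.
 */
#define KEYWORDS \
    MAKE_KEYWORD("if", keyword_if) \
    MAKE_KEYWORD("else", keyword_else)

/* Return the keyword terminal for token, or lex_variable if it is not a keyword. */
static int classify(const char *token)
{
    /* Each MAKE_KEYWORD call expands to one string comparison. */
#define MAKE_KEYWORD(A, B) if (strcmp(token, (A)) == 0) return (B);
    KEYWORDS
#undef MAKE_KEYWORD
    return lex_variable; /* no keyword matched */
}
```

The comparisons are made in the order in which the keywords are listed, so the cost grows linearly with the number of keywords; a different definition of MAKE_KEYWORD could instead populate a hash table once at start-up.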
Unlike functions associated with tokens, functions associated with keywords are generated to be called with no arguments passed:
KEYWORD "sx" -> fx() ;
Results in the generated call:
MAKE_KEYWORD("sx", fx());
And so fx() should be declared to accept no parameters, that is, with the prototype:
<type> fx(void);
3.7. Zone Definitions
Zones are the major new feature for 2.0. A zone is a part of a file where lexing rules change. The general syntax is
ZONE zonename : "entrytoken" [ -> $entry-terminal ] ... "leavingtoken" [ -> $leaving-terminal ] {
    /* Zone body */
}
Upon encountering the entry token, the lexer changes the lexing rules and uses the ones defined in the zone body until it encounters the leaving token. Optionally, a terminal may be returned on zone entry, on zone leaving, or both. In the zone body, one can have whitespace definitions, group definitions, token definitions, default token definitions, mappings and other zone definitions. Keyword definitions are not allowed inside zones for now, but this is a planned feature.
Zones are used in place of user-defined functions. Their goal is to make the lxi language more expressive. For example, we can define comments using zones:
ZONE comment: "/*" ... "*/" {
    GROUP white = "";
    TOKEN DEFAULT -> $$; /* $$ means discard the token */
}
ZONE linecomment: "//" ... "\n" {
    GROUP white = "";
    TOKEN DEFAULT -> $$; /* $$ means discard the token */
}
It is also possible to use zones to express strings:
ZONE string : "\"" -> clear_buffer(#$) [...] "\"" -> $string {
    GROUP white = "";
    TOKEN DEFAULT -> push_buffer(#0), $$;
}
It is planned to add buffers to Lexi 2.1, removing the need for user functions such as push_buffer. The range operators ... and [...] are equivalent. To express identifiers, one needs to use the [...) operator, for which the leaving token is not considered a part of the zone:
ZONE string : "[alpha]" -> clear_buffer(#$) [...) "[^alphanum]" -> $identifier {
    GROUP white = "";
    TOKEN DEFAULT -> push_buffer(#0), $$;
}
Syntactic sugar for identifiers, comments and strings should be added in either 2.1 or 2.2.
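The string and identifier zones above rely on the user-supplied functions clear_buffer() and push_buffer(). A minimal sketch of what they might look like follows; the fixed-size global buffer, the BUFFER_MAX limit and the int status returned by push_buffer() are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical fixed-size spelling buffer shared by the zone's actions. */
#define BUFFER_MAX 256

static char buffer[BUFFER_MAX];
static size_t buffer_len;

/* Called on zone entry with no arguments (#$): start a new spelling. */
static void clear_buffer(void)
{
    buffer_len = 0;
    buffer[0] = '\0';
}

/*
 * Called for each character inside the zone (#0): append it, keeping
 * the buffer NUL-terminated. Returns 0 on success, or -1 if the buffer
 * is full and the character was dropped.
 */
static int push_buffer(int c)
{
    if (buffer_len + 1 >= BUFFER_MAX)
        return -1;
    buffer[buffer_len++] = (char)c;
    buffer[buffer_len] = '\0';
    return 0;
}
```

Since each push_buffer(#0) call in the zone body is followed by $$, its return value is ignored by the lexer; the status is only useful to other user code.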
3.8. Type Definitions
Types may be declared in Lexi. They are used for type checking of the instructions (mostly inline actions) that must happen upon encountering a token. Type declarations may only appear in the global zone. Here are examples:
TYPE buffer;
TYPE integer;
There are several predefined types used by various action parameters; these need mapping to appropriate C types by typedef if used. The predefined types are:
Type | Use |
---|---|
CHARACTER | TODO: $# etc |
STRING | TODO |
TERMINAL | TODO: $xyz etc |
3.9. Action prototypes
Any inline action called inside a list of instructions must have been previously prototyped.
action-decl := "ACTION" + identifier [ + ":" + "(" + typelist + ")" + "->" + "(" + typelist + ")" ] ;
Here is an example:
ACTION actionname : (:TYPE1,:TYPE2) ->(:TYPE3,:TYPE4);
[a] This historic interface for producing keywords is expected to change drastically for the next version of Lexi.