What is a lexical token?
A lexical token is a sequence of characters that can be treated as a unit in the grammar of a programming language. A programming language classifies lexical tokens into a finite set of token types.
Known approaches to implementing a scanner:
- Using regular expressions that define tokens
- Through explicit control
The former is rather straightforward, and is the basis of popular scanner generators (lex, flex, etc). The latter, would mean explicitly running each possible transition of a DFA.
An example in pseudocode for an explicit control scanner:
CurrentChar <- Read()
if CurrentChar = '/':
then
CurrentChar <- Read()
if CurrentChar = '/'
repeat
CurrentChar <- Read()
until CurrentChar != {EOF, EOL}
else Error()
else Error()
if CurrentChar = EOL
then Accept(CurrentChar)
else Error()
For each of those, it is a useful practice to draw the NFA/DFA models (and leverage useful algorithms e.g addDFA and epislon-closure[1]) as a way help organize and structure the code.
While the code in Figures 2.5 and 2.6 serves to illustrate the nature of a scanner, we emphasize that the most reliable and expedient methods for constructing scanners do so automatically from regular expressions, as covered in Chapter 3. Such scanners are reasonably efficient and correct by construction, given a correct set of regular-expression specifications for the tokens.
Crafting Compiler
In any case, we want our lexer to extract tokens using a maximal munch heuristic.
Resource sink
Existing implementations
Theoretical reference
What is a lexical token?
Known approaches to implementing a scanner:
The former is rather straightforward, and is the basis of popular scanner generators (lex, flex, etc). The latter, would mean explicitly running each possible transition of a DFA.
An example in pseudocode for an explicit control scanner:
For each of those, it is a useful practice to draw the NFA/DFA models (and leverage useful algorithms e.g
addDFAandepislon-closure[1]) as a way help organize and structure the code.In any case, we want our lexer to extract tokens using a maximal munch heuristic.
Resource sink
Existing implementations
Theoretical reference