Skip to content

Lexer: Resources #5

Description

@ruiconti

What is a lexical token?

A lexical token is a sequence of characters that can be treated as a unit in the grammar of a programming language. A programming language classifies lexical tokens into a finite set of token types.

Known approaches to implementing a scanner:

  1. Using regular expressions that define tokens
  2. Through explicit control

The former is rather straightforward, and is the basis of popular scanner generators (lex, flex, etc). The latter, would mean explicitly running each possible transition of a DFA.

An example in pseudocode for an explicit control scanner:

CurrentChar <- Read()
if CurrentChar = '/':
then
    CurrentChar <- Read()
    if CurrentChar = '/'
        repeat
            CurrentChar <- Read()
        until CurrentChar != {EOF, EOL}
    else Error()
else Error()
if CurrentChar = EOL
then Accept(CurrentChar)
else Error()

For each of those, it is a useful practice to draw the NFA/DFA models (and leverage useful algorithms e.g addDFA and epislon-closure[1]) as a way help organize and structure the code.

While the code in Figures 2.5 and 2.6 serves to illustrate the nature of a scanner, we emphasize that the most reliable and expedient methods for constructing scanners do so automatically from regular expressions, as covered in Chapter 3. Such scanners are reasonably efficient and correct by construction, given a correct set of regular-expression specifications for the tokens.

Crafting Compiler

In any case, we want our lexer to extract tokens using a maximal munch heuristic.

Resource sink

Existing implementations

Theoretical reference

Metadata

Metadata

Assignees

No one assigned

    Labels

    documentationImprovements or additions to documentation

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions