Lexical Analysis

The following contents are my reading notes for Compiler (the Dragon Book)-Chapter III. An implementation of the Lexical analyzer is provided in the last section.

The Role of the Lexical Analyzer

The very first layer of the front end of a compiler’s architecture

Lexical Analysis Versus Parsing

The reason why the analysis portion of a compiler is normally separated into two lexical analysis and parsing:

(Most Importantly) Simplify the design of compile
Improve compiler efficiency
enhance compiler portability

Tokens, Patterns, and Lexems

TOKEN: a pair consisting of a token name and optional attribute value
PATTERN: a description of the form that lexemes of a token may take
LEXEME: s sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token

examples: see page 112.

Attributes for Token

WHY WE NEED ATTRIBUTES: Since more than one lexemes can match a single pattern, such as both 0 and 1 matches number, it is necessary to use an attribute for a token to distinguish different concrete lexemes
IMPORTANT EXAMPLE: The most important example is the token id, which we associate with lots of information. Normally the information about an id, such as {lexemes, type, location, …}, is kept in the symbol table. Thus the proper attribute value for an id is a pointer to the symbol-table entry for that identifier

Lexical Errors

Errors like the typo ”fi”:
- This is not the error we are discussing here!
- because: the lexical analyzer cannot tell whether fi is a misspelling keyword if, or an undeclared function identifier. Since **fi *************is a valid lexeme for the token id. Thus, the lexeme analyzer must return the token id to the parser and let some other phases of the compiler handle such errors.
Errors like the lexical analyzer being unable to proceed due to none of the patterns for tokens matching any prefix of the remaining input
- These are the errors we are discussing here and there’re several error-recovery actions:
- delete successive character
- delete one character
- insert a missing character
- replace a character
- transpose two adjacent characters

Input Buffering

Buffer Pair techniques:|——BUFFER_i———|——BUFFER_ii———|maintaining two pointers:
- Pointer lexemeBegin, marks the beginning of the current lexeme
- Pointer forward scans ahead until a pattern match is found
By using such technique, for each character read we have to make two tests: end of buffer? && what char is read?

Improvement SENTINELS:Combining two tests by inserting extra EOF at end of each buffer

switch(*forward++) {
	case eof:
		if(forward is at end of first buffer) {
				reload second buffer;
				forward = beginning of the second buffer;
		}
		else if(forward is at the end of second buffer) {
				reload first buffer;
				forward = beginning of the first buffer;
		}
		else /* end of the input source, terminate lexical analysis*/
}

Specification of Tokens

a regular expression, aka regex, is important in specifying patterns we actually need for tokens.

String and languages

STRING: a string over an alphabet $\Sigma$ is a finite sequence of symbols drawn from that alphabet.
LANGUAGE: a language is any countable set of strings over some fixed alphabet.

Operations on Languages

Union $L\cup M = \{s | s \in L \vee s \in M\}$
Concatenation $LM = \{ st | s \in L \wedge m \in M\}$
Kleene closure $L^*$
Positive closure $L^+$

notice that $LM \neq ML$

Regular Expression

BASIS
INDUCTION

Extension to Regex

one or more instances: +
zero or one instance: ?
character classes: [0,1,2,3] or [0-3]

Recognition of Tokens

Transition Diagram

the diagram has a collection of nodes or circles, called state
Important Conventions of diagram:
- certain states are said to be accepting or final state
- if it’s necessary to retract forward pointer one position, we additionally place a * near the accepting state
- One state is designated the start state or initial state

Reserved Words and Identifier

reserved words and identifiers are very similar, and there are two ways to distinguish them:

place reserved words in the symbol table beforehand
prioritize diagrams for each keyword so that reserved-word tokens are recognized in preference to id

Architecture of a Transition-Diagram-Based Lexical Analyzer

such one diagram can be written as a C++ function that returns a structure containing a token and attribute. The function is basically a switch statement or multiway branch that determines the next state.

Examples, see Page135

How do these diagrams work in the big picture of the analyzer?

one by one, but prioritize keyword diagram
“in parallel”, take the longest prefix of the input that matches any pattern
combining diagrams altogether, take the longest prefix of the input that matches any pattern (this combining process can be hard)

LEX

Use of Lex (Pipeline)

Lex source program (lex.l) → [Lex Compile] → lex.yy.c

lex.yy.c → [C compiler] → a.out

Input stream → [a.out] → Sequence of tokens

Structure of Lex Programs

There are basically three sections separated by %% in lex.l file:

declarations
translate rules
auxiliary functions

The I and III sections will be copied directly to lex.yy.c.

Conflict Resolution in Lex

always prefers a longer prefix to a shorter prefix
If the longest possible prefix matches two or more patterns, prefer the pattern listed first in the Lex Program

The lookahead operator /

what follows / is additional pattern that must be matched before we can decide that the token in question was seen, but what matches this second pattern is not part of the lexeme.

Finite Automata

automata are essentially graphs, with few diffs:

finite automata are recognizers, simply say “yes” or “no” about each possible input string
Finite automata come in two flavors
- NFA
- DFA

NFA

consists of:

A finite SET of states S
A set of input symbols $\Sigma$
A transition function
A state $s_0$ from S that is distinguished as the start state
A set of states $F$, a subset of S, that is distinguished as the accepting states.

This graph is very much like a transition graph except:

one symbol can label edges from one state to several states
an edge may be labeled by $\epsilon$, the empty string $\phi$, symbols from $\Sigma$

Transition Tables

Instead of representing finite automata as a graph, we can use a transition table.

Example, see Page 149

Acceptance of Input String by Automata

NFA accepts a string as long as exists one path labeled by that string leads from the start state to an accepting state.
$L(A)$ stands for the language accepted by automaton $A$

DFA

A special case of NFA.

no move on input $\epsilon$
for each state s and input symbol a, there is exactly one edge out of s labeled a.

Lexical Analysis

Contents:

The Role of the Lexical Analyzer

Lexical Analysis Versus Parsing

Tokens, Patterns, and Lexems

Attributes for Token

Lexical Errors

Input Buffering

Specification of Tokens

String and languages

Operations on Languages

Regular Expression

Extension to Regex

Recognition of Tokens

Transition Diagram

Reserved Words and Identifier

Architecture of a Transition-Diagram-Based Lexical Analyzer

LEX

Use of Lex (Pipeline)

Structure of Lex Programs

Conflict Resolution in Lex

The lookahead operator /

Finite Automata

NFA

Transition Tables

Acceptance of Input String by Automata

DFA

Comments

Leave a Reply Cancel reply

More posts

Notes of “The Age of AI has begun”

Lexical Analysis

Image Abstraction using Layer-wise Image Vectorization

More about Matrix – Differentiation