Engineering a Compiler by Keith Cooper, Linda Torczon

February 21, 2011
Publisher: Morgan Kaufmann
ISBN: 978-0120884780



  • Compilation
  • Scanners
  • Parsers - a parser takes a string of characters and turns it into an abstract syntax tree, a collection of objects representing the grammatical structure of the text.
  • Context-sensitive analysis
  • Intermediate representations
  • Procedure abstraction
  • Optimization
  • Data-flow analysis
  • Scalar optimization
  • Instruction selection and scheduling
  • Register allocation

Compilers are computer programs that translate a program written in one language into a program written in another language. The compiler has a front end to deal with the source language and a back end to deal with the target language. Connecting the front end and the back end, it has a formal structure for representing the program in an intermediate representation (IR) whose meaning is independent of either language.

In a two-phase compiler, the front end must ensure that the source program is well formed, and it must map that code into the IR. The back end must map the IR program into the instruction set and the finite resources of the target machine. Because the back end only processes IR created by the front end, it can assume that the IR contains no syntactic or semantic errors.

The compiler writer can insert a third phase between the front and back end, the optimizer. The optimizer takes an IR program as its input and produces a semantically equivalent IR program as its output. The optimizer is an IR-to-IR transformer that tries to improve the IR program in some way.

Fundamental principles of compilers:

  • The compiler must preserve the meaning of the program being compiled.
  • The compiler must improve the input program in some discernible way.

To check the syntax of the input program, the compiler must compare the program's structure against a definition for the language. Mathematically, the source language is a set, usually infinite, of strings defined by some finite set of rules, called a grammar. Two passes in the front end, the scanner and the parser, determine whether or not the input code is a member of the set of valid programs defined by the grammar.

The scanner identifies distinct words in the input program and classifies each word with a part of speech. It takes a stream of characters and converts it to a stream of classified words i.e., pairs (p, s) where p is the word's part of speech and s is its spelling, for example: (noun, "Ala"), (verb, "ma"), (adjective, "czarnego"), (noun, "kota"), (endmark, "."). The scanner applies a set of rules that describe the lexical structure of the input programming language called microsyntax. The microsyntax of a programming language specifies how to group characters into words and how to separate words that run together. To recognize keywords, the scanner can either use dictionary lookup or encode the keywords directly into its microsyntax rules.

The parser tries to match the stream of categorized words against the rules that specify syntax for the input language. In other words, the parser determines if the input stream is a sentence in the source language.

REP: The scanner's task is to transform a stream of characters into a stream of words in the input language. A hand-crafted scanner can be faster than a generated one because the implementation can optimize away a portion of the overhead that cannot be avoided in a generated scanner.

The recognizer is a program that identifies specific words in a stream of characters. The simplest algorithm to recognize words is a character-by-character formulation. A transition diagram represents the recognizer. The transition diagram is comprised of an initial state, an accepting state, intermediate states, and transitions from state to state based on the input character. To recognize multiple words, we can create multiple transitions that leave a given state. Finite automata specify recognizers.

A finite automaton (FA) is a formalism for recognizers that has a finite set of states, an alphabet, a transition function, a start state, and one or more accepting states. A transition diagram is equivalent of the corresponding finite automaton. We van simplify the FA significantly if we allow the transition diagram to have cycles.

For any FA, we can describe its language using a notation called a regular expression (RE). The language described by an RE is called regular language. An RE is built up from three basic operations: alternation, concatenation, and the Kleene closure. Parentheses have the highest precedence, followed by closure, concatenation, and alternation.

IMPORTANT: The cost of operating an FA is proportional to the length of the input, not to the length or complexity of the RE that generates the FA.

Regular expressions are closed under many operations - that is, if we apply the operation to an RE or a collection of REs, the result is an RE.