cs.lectures: 08/02/21

Quick Recap

Starting with what is compiler, compiler is a Computer software that translates (compiles) source code written in a high-level language (e.g., C++) into a set of machine-level language. There are various phases, mainly analysis phase & synthesis phase, these phases are again divided into 6 phases namely- lexical, syntax, semantic, ICG, CG, CO.

The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes , and the o/p will be tokens, coming to 2nd phase, syntax analysis, which create a tree-like representation that depicts the grammatical structure of the token stream. 3rd phase- semantic analysis, check the source program for semantic consistency.

Coming to the 4th phase, ICG

In the analysis-synthesis model of a compiler, the front end analyzes a source
program and creates an intermediate representation, from which the back end
generates target code. Ideally, details of the source language are confined to the
front end, and details of the target machine to the back end. With a suitably
defined intermediate representation, a compiler for language i and machine j
can then be built by combining the front end for language i with the back
end for machine j . This approach to creating suite of compilers can save a
considerable amount of effort: m x n compilers can be built by writing just m
front ends and n back ends.

Static checking includes type checking , which ensures that operators are applied to compatible operands. It also includes any syntactic checks that remain after parsing. For example, static checking assures that a break-statement in C is enclosed within a while-, for-, or switch-statement; an error is reported if such an enclosing statement does not exist.

In the process of translating a program in a given source language into code for a given target machine, a compiler may construct a sequence of intermediate representations.

Why do we need intermediate code of representation?

Intermediate code eliminates the need of a new full compiler for every unique machine by keeping the analysis portion same for all the compilers.

#High-level representations are close to the source language and low-level representations are close to the target machine. Syntax trees are high level; they depict the natural hierarchical structure of the source program and are well suited to tasks like static type checking.

#A low-level representation is suitable for machine-dependent tasks like register allocation and instruction selection. Three-address code can range from high- to low-level, depending on the choice of operators. For expressions, the differences between syntax trees and three-address code are superficial.

Different forms of ICG/ Intermediate representations in use

1. Post-fix notation - notation where the operator comes after the operands. eg: x y +

2. Syntax tree

3. Three address code

#Variants of Syntax Trees

Nodes in a syntax tree represent constructs in the source program.
The children of a node represent the meaningful components of a construct.
A directed acyclic graph (hereafter called a DAG ) for an expression identifies the common subexpressions (subexpressions that occur more than once) of the expression.

Directed Acyclic Graphs for Expressions

Directed Acyclic Graph (DAG) is a special kind of Abstract Syntax Tree.
Each node of it contains a unique value.
It does not contain any cycles in it, hence called Acyclic.
A DAG is usually constructed using Three Address Code.

Example: ( a + b ) x ( a + b + c )

Three Address Code for the given expression is-

T1 = a + b

T2 = T1 + c

T3 = T1 x T2

Now, Directed Acyclic Graph is-

From the constructed DAG, we observe-

The common sub-expression (a+b) has been expressed into a single node in the DAG.
The computation is carried out only once and stored in the identifier T1 and reused later.

This illustrates how the construction scheme of a DAG identifies the common sub-expression and helps in eliminating its re-computation later.

The Value-Number Method for Constructing DAG's

Often, the nodes of a syntax tree or DAG are stored in an array of records.
Each row of the array represents one record, and therefore one node.
In each record, the first field is an operation code, indicating the label of the node.

Variants of Syntax Trees

In Fig. 6.6(b), leaves have one additional field, which holds the lexical value (either a symbol-table pointer or a constant, in this case), and interior nodes have two additional fields indicating the left and right children.

In this array, we refer to nodes by giving the integer index of the record
for that node within the array. This integer historically has been called the
value number for the node or for the expression represented by the node.

Algorithm 6.3: The value-number method for constructing the nodes of a DAG.

INPUT: Label op , node l , and node r .
OUTPUT: The value number of a node in the array with signature h op ; l; r i .
METHOD:

Search the array for a node M with label op , left child l , and right child r .
If there is such a node, return the value number of M .
If not, create in the array a new node N with label op , left child l , and right child r , and return its value number.

While Algorithm 6.3 yields the desired output, searching the entire array
every time we are asked to locate one node is expensive, especially if the array
holds expressions from an entire program.

A more efficient approach is to use a hash table, in which the nodes are put into "buckets", each of which typically will have only a few nodes. The hash table is one of several data structures that support dictionaries efficiently.

A dictionary is an abstract data type that allows us to insert and delete elements of a set, and to determine whether a given element is currently in the set. A good data structure for dictionaries, such as a hash table, performs each of these operations in time that is constant or close to constant, independent of the size of the set.

To construct a hash table for the nodes of a DAG,

we need a hash function h that computes the index of the bucket for a signature h op ; l; r i , in a way that distributes the signatures across buckets, so that it is unlikely that any one bucket will get much more than a fair share of the nodes. The bucket index h ( op ; l; r ) is computed deterministically from op , l , and r , so that we may repeat the calculation and always get to the same bucket index for node h op ; l; r i .

The buckets can be implemented as linked lists, as in Fig. 6.7. An array,
indexed by hash value, holds the bucket headers , each of which points to the
first cell of a list. Within the linked list for a bucket, each cell holds the value
number of one of the nodes that hash to that bucket. That is, node h op ; l; r i
can be found on the list whose header is at index h ( op ; l; r ) of the array.

cs.lectures

Intermediate Code Generation || Directed Acyclic Graph (DAG)

Different forms of ICG/ Intermediate representations in use

1. Post-fix notation - notation where the operator comes after the operands. eg: x y +

2. Syntax tree

DAG

3. Three address code

Monk and Inversions