Top-down parsing

Parsing (or syntactic analysis) is the process of deriving a string in a context-free language.  A parse tree is created as the byproduct of the derivation.

In top-down parsing, the parse tree is constructed based on a leftmost derivation of the input (which is read left-to-right).  The parser begins with the sentence symbol S, and then applies productions, at each step deriving a sequence of terminals and nonterminals, until finally arriving at the input program.  The parse tree is constructed as the derivation steps are performed.

The critical issues in determining the algorithm to follow are:

1.  At each step of the derivation, which nonterminal should be chosen for expansion?  In a top-down parser, the answer here is always the leftmost nonterminal in the current symbol sequence.
2.  Once a nonterminal has been chosen for expansion, there may be a choice of several productions to apply.  Which production should be chosen?  Here the parser will attempt to make the correct choice by matching the input token(s) with the first few symbols of the right sides of the productions.

A top-down parser may be predictive or backtracking.  It may be stack-based or recursive.  The most common technique for building a hand-coded parser is the method of recursive descent, which is a top-down, predictive, recursive parser.

In a backtracking parser, the parser simply tries all the productions, one after the other.  If it gets stuck (i.e., if at some point the next input token does not match the next token in the production), it backs up and tries the next one.  (Of course, this may involve backing up in the input stream also, so the parser must save the input stream in a buffer as it proceeds.)  In a predictive parser, the grammar is designed so that the production choice can always be made by examining a finite number of input symbols.  In this case, if the parser gets stuck, it means that the program has a syntax error.

In a stack-based parser, the parser keeps the current symbol sequence on a stack, with the leftmost symbol on the top of the stack.  At each step of the algorithm, it examines the top of the stack.  If it is a terminal, it is matched with the next input token.  If it is a nonterminal, a production is chosen.  The symbols on the right side of the production are pushed onto the stack, right-to-left (so that the leftmost is on the top of the stack.)

A recursive descent parser performs the same basic operations as the stack-based parser, except that the stack is implicit in the run-time implementation of recursive procedures.  It works as follows:

For every nonterminal, there is a procedure whose job is to recognize that nonterminal in the input stream.  The return value of the procedure is a parse tree for a segment of the program, with the nonterminal at the root.  The procedure first decides which production to use (by examining the next input token).  It then has a step for each symbol in the right side of the production.  If the symbol is a terminal, the procedure matches it with the next input token.  If the symbol is a nonterminal, it calls the procedure associated with that nonterminal, receiving a tree in return.  When it finishes these steps, it can construct a tree node whose children are the terminals it has matched from the input and the subtrees received from calls to the various nonterminal procedures.  The procedure exits with this tree node as its return value.

For example, a procedure to recognize a while statement might look like this:

TreeNode parseWhile(){
  if(!nextToken.match(WHILE))
    error;
  lexer.getToken();
  if(!nextToken.match(LEFTP))
    error;
  lexer.getToken();
  TreeNode e = parseExpression();
  if(!nextToken.match(RIGHTP))
    error;
  TreeNode s = parseStatement();
  return new WhileNode(e,s);
}
 

Left recursion revisited

Now we can see more clearly why left recursion causes a problem with a predictive parser.  In the stack-based version of the algorithm, the left recursive production results in a procedure with an endless loop.  In the recursive version, the result is an endless series of recursive calls.

The technique of left recursion removal we looked at earlier will avoid this problem, but it may generate a parse tree which does not reflect the semantics of our language well.  An alternative is to use a loop instead of recursion for part of the algorithm.  Consider the productions for "expression":

expr -->  term | expr addop term

This is equivalent to

expr --> term  {  addop term  } *

where * represents zero or more repetitions, as in our notation for regular expressions.  This production can be translated into a recursive procedure as follows:

TreeNode parseExpression()
{
  TreeNode left = parseTerm();
  while ( nextToken is + or - ) {
      TreeNode right = parseTerm();
      left = new TreeNode(addop,left,right);
      }
  return left;
}