Chapter 24 The Chomsky Hierarchy

Phrase-Structure Grammars (PSGs)

Think of the way in which a derivation is done with a context-free grammar. With context-free grammars we are always free to replace a nonterminal in a working string with the right-hand-side of any of its rules. It doesn't matter what is on either side of that nonterminal in the working string. In other words, the context of the nonterminal doesn't matter.

English and other natural languages are not built this way. In a natural language the context in which a word is used may change the meaning of the word. As a matter of fact, even programming languages such as C++ are not completely context-free. If you use the string "<<" in a C++ program and there is an ostream to its left, then this string is the insertion operator. But if "<<" has an integer to its left, then it is the bitwise left-shift operator.

For each class of languages prior to the recursive languages we saw both an acceptor (a machine) and a generator (a regular expression or a grammar.) But for the recursive and recursively enumerable languages we have seen only the acceptors. We shall now study grammars for languages accepted by Turing machines, beginning with grammars for the r.e. languages.

The grammars we study in this chaper are called phrase-structure grammars. A phrase-structure gramar is defined as follows:

a finite alphabet of terminals
a finite set of symbols called nonterminals that includes the starting nonterminal, S
a finite list of productions of the form string1 string2 where string1 is any string of terminals and nonterminals that must include at least one nonterminal and string2 is any string of terminals and nonterminals whatsoever.

A derivation with a phrase-structure grammar is done the same way as a derivation with a context-free grammar. The language generated by the PSG is an r.e. language, although this remains to be proven. Here is an example:

S     XS
     |  
X     aX
     |  a
aaaX  ba

Here is a derivation made from this grammar. At each step the set of symbols that gets replaced in the next step is underlined.

S  XS
   XXS
   XXXS
   XXX
   aXXX
   aaXXX
   aaaXXX
   baXX
   baaXX
   baaaXX
   bbaX
   bbaa

Notice that using the rule aaaX

ba makes the working string get shorter! The only rules in context-free grammars that caused working strings to get shorter were

-rules. With PSGs, terminals in working strings can disappear. Interestingly, leftmost derivations are not always possible with these grammars.

Note that a context-free grammar is a phrase-structure grammar. You can see this if you review the restrictions on phrase-structure grammars. So languages that can be generated by context-free grammars are a subset of languages that can be generated by phrase-structure grammars.

Recall that the language {aⁿbⁿaⁿ | n > 0} is not a context-free language. We can generate this language with a phrase-structure grammar:

S    aSBA
    |  abA
AB   BA
bB   bb
bA   ba
aA   aa

Here are some derivations using this grammar, first a derivation of aba and then a derivation of aabbaa.

S  abA
   aba

S  aSBA
   aabABA
   aabBAA
   aabbAA
   aabbaA
   aabbaa

We use the S rules to build up the leftmost group of a's in our working string, using the rule S

abA last. At this point our working string will contain a group of a's followed by one b and a sequence of alternating A's and B's beginning with an A and ending with an A. The next thing we need to do is shuffle the A's and B's using the AB

BA rule until all the B's come before all the A's. Then we begin turning A's into a's and B's into b's using the rules bB

bb, bA

ba, and aA

aa.

It is possible to "get stuck" during a derivation with a PSG by using rules in an order that leads nowhere. For example, consider what happens if we do not shuffle the A's and B's before we change them to little letters.

S  aSBA
   aabABA
   aabaBA
   ?

Now we're stuck. We can draw a total language tree for a phrase-structure grammar but some branches will end with working strings that cannot be used to derive any words. In the tree for the above grammar, one branch will end with the working string aabaBA.

Here is an interesting theorem.

Theorem. Given a PSG for language L, there is another PSG for L which has the same alphabet of terminals and whose rules are all of the form

string of nonterminals  string of terminals and nonterminals

where the left side cannot be

, but the right side can.

Proof. We give an algorithm for changing any PSG into a PSG meeting the restriction that the left-hand-sides of its rules contain only nonterminals. Make up a new nonterminal for each terminal and add one rule for each, rules such as

N₁  a
N₂  b
etc.

Then change all occurrences of each terminal into the nonterminal corresponding to that terminal. QED

The first grammar below is changed into the second by this algorithm:

S     XS
     |  
X     aX
     |  a
aaaX  ba


S     XS
     |  
X     N₁X
     |  N₁
N₁N₁N₁X  N₂N₁
N₁    a
N₂    b

A grammar of the form just described, in which each rule's left-hand-side is a string of only nonterminals, is called type 0 (type zero.) Because of the previous theorem we know that the phrase-structure grammars and the type 0 grammars are equivalent. Type 0 is one of the four classes of grammars that Noam Chomsky defined in 1959. The table below summarizes Chomsky's Hierarchy:

Type Name Restrictions on Rules
X Y Acceptor

0 Phrase-structure =
Recursively Enumerable X = string of nonterminals
Y = any string
TM

1 Context-sensitive X = string of nonterminals
Y = any string with length at least the length of X LBA (linear-bounded automaton, which is a TM with bounded tape)

2 Context-free X = one nonterminal
Y = any string PDA

3 Regular X = one nonterminal
Y = tN or Y = t, where t is a terminal and N is a nonterminal FA

Chomsky's descriptions of language classes match nicely with some of the ones we have studied. We have completely described types 2 and 3, although Chomsky did not subdivide his type 2 languages into deterministic and nondeterministic the way we did. Type 1 is a class that computer theorists don't work with very often, but we suspect that natural languages fall into this category. Note that the type 1 category does not correspond exactly to the class of recursive languages, but is a subclass of those languages. In other words, type 1 languages are recursive but not all recursive languages are type 1 languages. We can look at the rules of a grammar and determine that the grammar generates a context-sensitive (type 1) language, but it is impossible to look at the rules of a grammar and tell that the grammar generates a recursive language.

Here is our more detailed hierarchy of languages.

Our Hierarchy of Languages

No two of these classes are the same. We have placed a language in each class that is not a member of any of that class's subclasses. Note that the language L in the class of recursive languages is defined by our author in this chapter, but we will not discuss it. Suffice it to say that there are languages (like L) that are recursive but not context-sensitive.

Type 0 = Recursively Enumerable = Recognizable by TM

We have yet to prove that the set of languages that can be generated by phrase-structure grammars is the same set of languages that can be recognized by Turing machines, but this is true. The author gives algorithms to convert a phrase-structure (type 0) grammar into a Turing machine and to convert a Turing machine into a phrase-structure grammar. We will not learn these algorithms but will take them as proof that phrase-structure grammars and Turing machines correspond to the same set of languages.

Closure Properties of R.E. Languages

We proved in chapter 23 that the set of r.e. languages is closed under union and intersection. We did this by using the Turing machine for each of two languages and showing how we could build another Turing machine that was a union machine or an intersection machine. In this chapter we prove (using grammars) that the set is also closed under concatenation and Kleene star. We know from our work with the language, ALAN, and its complement, CWL'

MATHISON, that the set of r.e. languages is not closed under complement.

Theorem. Given L₁ and L₂, two r.e. languages, the language L₁L₂ is also r.e. In other words, the set of r.e. languages is closed under concatenation.

Proof. The proof we used to show that the set of context-free languages is closed under concatenation almost works here, except that we need to subscript both the terminals and the nonterminals in each of our original grammars to keep from mixing up the rules of the different grammars. Refer back to that proof in chapter 17. QED

Example: Consider the language LL where language L is defined by this grammar:

S   a
aS  b

Note that the only string that this grammar can produce is the string a, so the only string that we should produce by concatenating the languages is the string aa. We create two copies of this grammar, one subscripted with 1 and the other with 2. Think what would happen if we only subscripted the nonterminals of the grammars. We could do this derivation:

S  S₁S₂
   aS₂
   b

To avoid such mistakes, we subscript both terminals and nonterminals, as follows:

S₁    a₁
a₁S₁  b₁
 
S₂    a₂
a₂S₂  b₂

Now we build a new grammar by introducing a new nonterminal S and the rule S

S₁S₂. We also have to add the productions a₁

a, b₁

b, a₂

a, and b₂

S    S₁S₂
S₁    a₁
a₁S₁  b₁
a₁    a
b₁    b
S₂    a₂
a₂S₂  b₂
a₂    a
b₂    b

Theorem. If L is an r.e. language, then L* is also r.e. The set of recursively enumerable languages is closed under Kleene star.

Proof. Here we have to be careful not to allow strings of terminals from one occurrence of a word in L to attach themselves to a part of the derivation from another word in L, and possibly produce a word not in L*. We need to make two copies of our grammar, one subscripted with a 1 and one subscripted with a 2, then alternate between these grammars when producing strings in L*. We then add the rules

S  S₁S₂S
  | S₁
  |

We need to subscript both terminals and nonterminals as we did in the previous proof. QED

We have finished our study of languages for this semester. Here is a table to summarize our findings.

Type of Language Language Defined By Corresponding Acceptor Does Nondeterminism
Give More Power? Set Is Closed Under

regular language regular expression finite automaton, transition graph no union,
concatenation,
Kleene star,
intersection,
complement

context-free language context-free grammar pushdown automaton yes concatenation,
Kleene star,
union

recursive language phrase-structure grammar Turing machine
that never loops no concatenation,
Kleene star,
union,
intersection,
complement

recursively enumerable language phrase-structure grammar Turing machine no concatenation,
Kleene star,
union,
intersection

Type	Name	Restrictions on Rules X Y	Acceptor
0	Phrase-structure = Recursively Enumerable	X = string of nonterminals Y = any string	TM
1	Context-sensitive	X = string of nonterminals Y = any string with length at least the length of X	LBA (linear-bounded automaton, which is a TM with bounded tape)
2	Context-free	X = one nonterminal Y = any string	PDA
3	Regular	X = one nonterminal Y = tN or Y = t, where t is a terminal and N is a nonterminal	FA

Type of Language	Language Defined By	Corresponding Acceptor	Does Nondeterminism Give More Power?	Set Is Closed Under
regular language	regular expression	finite automaton, transition graph	no	union, concatenation, Kleene star, intersection, complement
context-free language	context-free grammar	pushdown automaton	yes	concatenation, Kleene star, union
recursive language	phrase-structure grammar	Turing machine that never loops	no	concatenation, Kleene star, union, intersection, complement
recursively enumerable language	phrase-structure grammar	Turing machine	no	concatenation, Kleene star, union, intersection