(This section involves a more advanced topic in XSB programming, metainterpretation. It may be helpful to read the later section on metainterpreters if the going gets tough here.)
As discussed above, parsing context-free grammars with tabling results in an
Earley-type parser. This is the parser we get when we write DCGs. We can also
write an XSB program that takes a grammar as input (defined in a database
predicate as in the example of first/3) and a string (defined by
the database word/3 predicate) and succeeds if the grammar accepts
the string. With such processing we can compute and use the FIRST sets to
make the grammar processing more deterministic. This is similar to what is done
in LL(k) and LR(k) parsing, but there the emphasis is on compile-time analysis
and complete determinacy. Here the approach is more interpretive and supports
nondeterminacy. However, if the grammars are indeed of the appropriate form
(LL(k) or LR(k)), the corresponding interpreters presented here will have linear
complexity.
It is very easy to write a simple context-free grammar parser in XSB. Again,
we assume that the grammar is stored in facts of the form NT ==>
Body where NT is a nonterminal symbol and Body
is a list of terminals and nonterminals.
:- table parse/3.
% parse(Sym,S0,S) if symbol Sym generates the string from S0 to S.
parse(Sym,S0,S) :-
word(S0,Sym,S).
parse(Sym,S0,S) :-
(Sym ==> Body),
parseSF(Body,S0,S).
% parseSF(SF,S0,S) if sentential form SF generates the string from S0 to S.
parseSF([],S,S).
parseSF([Sym|Syms],S0,S) :-
parse(Sym,S0,S1),
parseSF(Syms,S1,S).
The predicate parse/3 recognizes strings generated by a single
grammar symbol; the first clause recognizes terminal symbols directly and the
second clause uses parseSF/3 to recognize strings generated by the
sentential form that makes up the body of a rule for a nonterminal.
parseSF/3 simply maps parse/3 across the sequence of
grammar symbols in its sentential form argument.
Were we not to add the table declaration, we would get a recursive descent recognizer. But with the table declaration, we get the Earley-type recognizer of XSB as described above. The tabling reflects right through the programmed recognizer.
Next we add look-ahead to this recognizer by computing FIRST
sets and making calls only when the next symbols are in the first set of the
sentential form to be processed. This will give us an LL(k)-like recognizer.
Hovever, the form of FIRST we need here is slightly different from
the one above. Here we want to include the context in which a First
set is computed. For example, we may want to compute the FIRST_2
set of a symbol N, but N only generates one symbol.
The definition of first above would return a list of length one for
that symbol. Here we want to take into account the context in which
N is used. For example, we may know that in the current context
N is immediately followed by another nonterminal M,
and we know the FIRST of M. Then we can compute the
FIRST_2 of N in the following context of
M by extending out the one symbol in first of
N with symbols in FIRST of M.
The following definition of firstK/3 computes such first sets.
:- table firstK/3.
% firstK(+SF,+Follow,-First), where K = length(Follow)
firstK([],Follow,Follow).
firstK([Sym|SF],Follow,First) :-
firstK(SF,Follow,SymFollow),
((Sym ==> Body)
-> firstK(Body,SymFollow,First)
; First = [Sym|FirstTail]
append(FirstTail,[_],SymFollow),
).
The predicate firstK/3 takes a sentential form SF,
a follow string Follow, and returns in First a first
string of SF in the context of the follow string. The value of
k is taken to be the length of the list Follow; this will be
the length of the look-ahead.
We can now extend our parser to have a top-down look-ahead component, similar to an LL(k) recognizer:
TEST THIS SUCKER!
:- table parseLL/4. % without this, it is an LL(k)-like parser,
% but with it, what is it???
% parseLL(Sym,Follow,S0,S) if symbol Sym generates the string from S0 to S.
parseLL(Sym,Follow,S0,S) :-
(Sym ==> Body)
-> firstK(Body,Follow,First),
next_str(First,S0), % do the look-ahead, continuing only if OK
parseLLSF(Body,Follow,S0,S)
; word(S0,Sym,S).
% parseLLSF(SF,Follow,S0,S) if sentential form SF generates the string from S0 to S.
parseLLSF([],_Follow,S,S).
parseLLSF([Sym|Syms],Follow,S0,S) :-
firstK(Syms,Follow,SymFollow),
parseLL(Sym,SymFollow,S0,S1),
parseLLSF(Syms,Follow,S1,S).
next_str([],_).
next_str(['$'|_],S) :- \+ word(S,_,_). % $end of string
next_str([Sym|Syms],S) :- word(S,Sym,S1),next_str(Syms,S1).
The predicate parseLL/4 recognizes a string generated by a
grammar symbol, using lookahead. The condition tests whether Sym is
a nonterminal, and if it is and can be rewritten as Body, it checks
to see whether the next symbols in the input string belong to the first set of
the body of that rule. If not, it needn't process that rule because it cannot
succeed. For an LL(k) grammar, only one rule will ever apply; the others all
will be filtered out by this lookahead. So in this case a rule is never tried
unless it is the only rule that might lead to a parse. If the symbol is a
terminal, it simply checks to see whether the symbol matches the input.
parseLLSF/4 maps parseLL/4 across a sequence of
grammar symbols in a sentential form. It uses firstK/3 to compute
the follow strings of a symbol, which are needed in parseLL to
compute the first strings in the correct context.
In this LL(k)-like parser, we tested that the next symbols were in the first
set just before we called parseLLSF. We could also check the
lookahead just before returning. This gives us an LR(k)-like parser, as follows:
% parseLR with tabling to get an lr(k)-like algorithm.
:- table parse/4.
parseLR(Sym,Follow,Str0,Str) :-
word(Str0,Sym,Str),
next_str(Follow,Str).
parseLR(Sym,Follow,Str0,Str) :-
(Sym ==> RB),
parseLRSF(RB,Follow,Str0,Str).
parseLRSF([],Follow,Str,Str) :-
next_str(Follow,Str).
parseLRSF([Sym|SF],Follow,Str0,Str) :-
firstK(SF,Follow,SymFollow),
parseLR(Sym,SymFollow,Str0,Str1),
parseLRSF(SF,Follow,Str1,Str).
For this parser, we compute a follow string for a sentential form, and only return from parsing that sentential form if the next input symbols match that follow string. For an LR(k) grammar, the parser will not fail back over a successful return (unless the entire string is rejected.) This allows the parser to work in linear time.
The above program was written as a Prolog program, but we can write the identical program as a DCG and have the DCG transformation put in the string variables, as follows:
:- table parseLR/4.
parseLR(Sym,Follow) --> [Sym], look(Follow).
parseLR(Sym,Follow) -->
{(Sym ==> RHS)},
parseLRSF(RHS,Follow).
parseLRSF([],Follow) --> look(Follow).
parseLRSF([Sym|SF],Follow) -->
{firstK(SF,Follow,SymFollow)},
parseLR(Sym,SymFollow),
parseLRSF(SF,Follow).
look([]) --> [].
look([Word|Words]), [Word] --> [Word], look(Words).
This recognizer differs from an LR(k) recognizer in that it computes the
look-ahead strings as needed. Also it processes each look-ahead string
separately; i.e., Follow is a single string, not a set of strings.
In an LR(k) recognizer, the lookahead tables are computed once and stored.