Tree Selection


Since we are working with lexicalized TAGs (LTAGs), each word in the sentence selects at least one tree. The advantage of a lexicalized formalism like LTAG is that, rather than parsing with all the trees in the grammar, we can parse with only the trees selected by the words in the input sentence. In the XTAG system, the words select their trees in several steps. Each step attempts to reduce ambiguity, i.e., to reduce the number of trees selected by the words in the sentence.
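
As a minimal illustration of why lexicalization narrows the search, the sketch below keeps only the trees anchored by words that actually occur in the input. The lexicon, the tree names, and the function `select_trees` are hypothetical and chosen for illustration only; they are not the XTAG data or code.

```python
# Hypothetical toy lexicon: each word maps to the names of the trees it anchors.
TOY_LEXICON = {
    "the": ["betaDnx"],
    "dog": ["alphaNXN"],
    "cat": ["alphaNXN"],
    "barks": ["alphanx0V"],
    "sees": ["alphanx0Vnx1"],
}


def select_trees(sentence):
    """Return, for each word of the sentence, the trees it anchors."""
    return {word: TOY_LEXICON.get(word, []) for word in sentence.split()}


selected = select_trees("the dog barks")

# Trees the parser has to consider for this sentence.
candidate_trees = {tree for trees in selected.values() for tree in trees}

# Every tree in the toy grammar, for comparison.
all_trees = {tree for trees in TOY_LEXICON.values() for tree in trees}
```

Here `candidate_trees` is a strict subset of `all_trees`: the transitive-verb tree is never considered because no word in the sentence selects it.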

Morphological Analysis and POS Tagging: The input sentence is first submitted to the Morphological Analyzer and the Tagger. The morphological analyzer [karp92] consists of a disk-based database (a compiled version of the derivational rules), which maps an inflected word to its stem, part of speech, and feature equations corresponding to inflectional information. These features are inserted at the anchor node of the tree eventually selected by the stem. The POS Tagger can be disabled, in which case only information from the morphological analyzer is used. The morphology data was originally extracted from the Collins English Dictionary [ced79] and the Oxford Advanced Learner's Dictionary [oald74], available through ACL-DCI [liberman89], and then cleaned up and augmented by hand [karp92].
POS Blender: The outputs of the morphological analyzer and the POS tagger go into the POS Blender, which uses the output of the POS tagger as a filter on the output of the morphological analyzer. Any words that are not found in the morphological database are assigned the POS given by the tagger.
Syntactic Database: The syntactic database contains the mapping between particular stem(s) and the tree templates or tree-families stored in the Tree Database (see Table 3.1). The syntactic database also contains a list of feature equations that capture lexical idiosyncrasies. The output of the POS Blender is used to search the Syntactic Database, producing a set of lexicalized trees in which the feature equations associated with the word(s) in the syntactic database are unified with the feature equations associated with the trees. Note that the features in the syntactic database can be assigned to any node in the tree, not just to the anchor node. The syntactic database entries were originally extracted from the Oxford Advanced Learner's Dictionary [oald74] and the Oxford Dictionary of Current Idiomatic English [cie75], available through ACL-DCI [liberman89], and then modified and augmented by hand [EgediMartin94]. There are more than 31,000 syntactic database entries. Selected entries from this database are shown in Table 3.2.
Default Assignment: For words that are not found in the syntactic database, default trees and tree-families are assigned based on their POS tag.
Filters: Some of the lexicalized trees chosen in previous stages can be eliminated in order to reduce ambiguity. Two methods are currently used: structural filters, which eliminate trees that have impossible spans over the input sentence, and a statistical filter based on unigram probabilities of non-lexicalized trees (estimated from a hand-corrected set of approximately 6000 parsed sentences). These methods speed up the runtime by approximately 87%.
Supertagging: Before parsing, one can optionally supertag the sentence. This step uses statistical disambiguation to assign a unique elementary tree (or supertag) to each word in the sentence. These assignments can then be hand-corrected. The supertags are then used as a filter on the tree assignments made so far. More information on supertagging can be found in [srini97diss, srini97iwpt].
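
The steps above can be sketched as a small per-word pipeline. This is a simplified illustration under stated assumptions, not the XTAG implementation: the three tables, their entries, and the function names are all hypothetical. Feature-equation unification and the structural and statistical filters are left out. The fallback used when the tagger's POS matches no morphological analysis is also an assumption, since the text does not say what happens in that case.

```python
# Toy stand-ins for the XTAG resources; every entry is invented for illustration.
MORPH_DB = {
    "saw": [("see", "V"), ("saw", "N"), ("saw", "V")],
    "dogs": [("dog", "N"), ("dog", "V")],
}
SYN_DB = {
    ("see", "V"): ["alphanx0Vnx1"],
    ("dog", "N"): ["alphaNXN", "betaNn"],
}
DEFAULT_TREES = {"N": ["alphaNXN"], "V": ["alphanx0V"]}


def blend_pos(word, tagger_pos):
    """POS Blender: filter the morphological analyses by the tagger's POS."""
    analyses = MORPH_DB.get(word)
    if analyses is None:
        # Word missing from the morphological database: use the tagger's POS.
        return [(word, tagger_pos)]
    kept = [(stem, pos) for stem, pos in analyses if pos == tagger_pos]
    # Assumption: if the filter removes everything, keep the unfiltered analyses.
    return kept or analyses


def lookup_trees(stem, pos):
    """Syntactic database lookup, falling back to default trees for the POS."""
    return SYN_DB.get((stem, pos), DEFAULT_TREES.get(pos, []))


def select_trees(word, tagger_pos, supertag=None):
    """Collect the trees for one word, optionally filtered by a supertag."""
    trees = []
    for stem, pos in blend_pos(word, tagger_pos):
        for tree in lookup_trees(stem, pos):
            if tree not in trees:  # keep first occurrence, preserve order
                trees.append(tree)
    if supertag is not None:
        # Supertag filter: keep only the tree chosen by the supertagger.
        trees = [tree for tree in trees if tree == supertag]
    return trees
```

For example, `select_trees("dogs", "N")` keeps only the noun analysis and returns both noun trees from the toy syntactic database, while passing `supertag="alphaNXN"` narrows that result to a single tree. A word absent from both databases, such as `select_trees("blicket", "N")`, receives the default noun tree.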

XTAG Project
http://www.cis.upenn.edu/~xtag/