Since we are
working with lexicalized TAGs, each word in the sentence selects at least one
tree. The advantage of a lexicalized formalism like LTAGs is that rather than
parsing with all the trees in the grammar, we can parse with only the trees
selected by the words in the input sentence. In the XTAG system, the selection
of trees by the words is done in several steps. Each step attempts to reduce
ambiguity, i.e., to reduce the number of trees selected by the words in the
sentence.
- Morphological Analysis and POS Tagging
- The input sentence is first submitted to the Morphological Analyzer
and the Tagger. The morphological analyzer ([#!karp92!#]) consists
of a disk-based database (a compiled version of the derivational rules) which
is used to map an inflected word into its stem, part of speech and feature
equations corresponding to inflectional information. These features are
inserted at the anchor node of the tree eventually selected by the stem. The
POS Tagger can be disabled, in which case only information from the
morphological analyzer is used. The morphology data was originally extracted
from the Collins English Dictionary ([#!ced79!#]) and Oxford Advanced
Learner's Dictionary ([#!oald74!#]) available through ACL-DCI
([#!liberman89!#]), and then cleaned up and augmented by hand ([#!karp92!#]).
- POS Blender
- The outputs of the morphological analyzer and the POS tagger go into the
POS Blender, which uses the output of the POS tagger as a filter on the
output of the morphological analyzer. Any words that are not found in the
morphological database are assigned the POS given by the tagger.
- Syntactic Database
- The syntactic database contains the mapping between particular stem(s) and
the tree templates or tree-families stored in the Tree Database (see
Table 3.1).
The syntactic database also contains a list of feature equations that capture
lexical idiosyncrasies. The output of the POS Blender is used to search the
Syntactic Database to produce a set of lexicalized trees; the feature
equations associated with the word(s) in the syntactic database are
unified with the feature equations associated with the trees. Note that the
features in the syntactic database can be assigned to any node in the tree and
not just to the anchor node. The syntactic database entries were originally
extracted from the Oxford Advanced Learner's Dictionary ([#!oald74!#]) and
Oxford Dictionary for Contemporary Idiomatic English ([#!cie75!#]) available
through ACL-DCI ([#!liberman89!#]), and then modified and augmented by hand
([#!EgediMartin94!#]). There are more than 31,000 syntactic database
entries.
Selected entries from this database are shown in Table 3.2.
- Default Assignment
- For words that are not found in the syntactic database, default trees and
tree-families are assigned based on their POS tag.
- Filters
- Some of the lexicalized trees chosen in previous stages can be eliminated
in order to reduce ambiguity. Two methods are currently used: structural
filters, which eliminate trees that have impossible spans over the input
sentence, and a statistical filter based on unigram probabilities of
non-lexicalized trees (from a hand-corrected set of approximately 6000 parsed
sentences). These methods speed up the runtime by approximately 87%.
- Supertagging
- Before parsing, one can optionally supertag the sentence. This step uses
statistical disambiguation to assign a unique elementary tree (or supertag)
to each word in the sentence. These assignments can then be hand-corrected.
The supertags are used as a filter on the tree assignments made so far.
More information on supertagging can be found in
([#!srini97diss!#,#!srini97iwpt!#]).
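
The selection steps above can be illustrated with a small sketch. It is not the XTAG implementation: the toy databases, the tree names, and the function names (`blend`, `select_trees`, `apply_supertag`, `candidate_trees`) are all hypothetical, feature equations and the structural and statistical filters are left out, and two behaviours are assumptions of the sketch rather than statements from the text: all morphological analyses are kept when the tagger's POS matches none of them, and a supertag that is not among a word's candidate trees is ignored.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

Analysis = Tuple[str, str]  # (stem, part of speech)

# Hypothetical toy stand-ins for the morphological and syntactic databases.
MORPH_DB: Dict[str, List[Analysis]] = {
    "the": [("the", "D")],
    "dog": [("dog", "N"), ("dog", "V")],
    "walks": [("walk", "V"), ("walk", "N")],
}
SYN_DB: Dict[Analysis, Set[str]] = {
    ("the", "D"): {"det_tree"},
    ("dog", "N"): {"noun_tree"},
    ("walk", "V"): {"intrans_verb_tree", "trans_verb_tree"},
    ("walk", "N"): {"noun_tree"},
}
# Hypothetical default trees, keyed by POS tag.
DEFAULT_TREES: Dict[str, Set[str]] = {
    "N": {"noun_tree"},
    "V": {"intrans_verb_tree"},
    "Adv": {"adverb_tree"},
}


def blend(word: str, tagger_pos: str) -> List[Analysis]:
    """POS Blender: use the tagger's POS as a filter on the morphology output."""
    analyses = MORPH_DB.get(word)
    if analyses is None:
        # Word not in the morphological database: take the tagger's POS.
        return [(word, tagger_pos)]
    kept = [a for a in analyses if a[1] == tagger_pos]
    # Assumption: if the filter removes every analysis, keep them all.
    return kept or list(analyses)


def select_trees(stem: str, pos: str) -> Set[str]:
    """Syntactic database lookup, with default assignment by POS on a miss."""
    trees = SYN_DB.get((stem, pos))
    if trees is None:
        return set(DEFAULT_TREES.get(pos, set()))
    return set(trees)


def apply_supertag(trees: Set[str], supertag: Optional[str]) -> Set[str]:
    """Use an optional supertag as a filter on the trees selected so far."""
    if supertag is not None and supertag in trees:
        return {supertag}
    # Assumption: a missing or non-matching supertag leaves the set unchanged.
    return trees


def candidate_trees(
    tagged_sentence: Sequence[Tuple[str, str]],
    supertags: Optional[Sequence[Optional[str]]] = None,
) -> List[Set[str]]:
    """Return, for each (word, tagger POS) pair, its set of candidate trees."""
    result: List[Set[str]] = []
    for i, (word, pos) in enumerate(tagged_sentence):
        trees: Set[str] = set()
        for stem, stem_pos in blend(word, pos):
            trees |= select_trees(stem, stem_pos)
        supertag = supertags[i] if supertags is not None else None
        result.append(apply_supertag(trees, supertag))
    return result
```

Calling `candidate_trees([("the", "D"), ("dog", "N"), ("walks", "V"), ("quickly", "Adv")])` exercises each path: "dog" and "walks" have their noun/verb ambiguity reduced by the tagger, while "quickly", which is in neither toy database, receives the default tree for its POS tag. Passing a `supertags` list with one entry per word narrows a word's set further to a single tree.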
XTAG Project
http://www.cis.upenn.edu/~xtag/