The deterministic treatment of multiword expressions as single
tokens is problematic because many such expressions have alternate
analyses in different contexts. For instance, the string ``de même''
in French can be treated either as a single token, meaning
similarly, or as a sequence of two independent tokens: the
preposition ``de'' (of) followed by the adjective ``même''
(same). If the unambiguous tokenizer makes the wrong choice, it
may lead to a parse failure or to an incorrect semantic
interpretation. In such cases, a cautious tokenizer produces
alternative segmentations, postponing the decision to a later
processing stage.
With the techniques just introduced, it is easy to make a
tokenizer that produces alternative segmentations for some strings.
We start by creating a special multiword lexicon for strings such as
``de même'' that should be analyzed either as a single token or as a
sequence of tokens. If we are using the step-by-step approach in the
previous section, we introduce into the cascade a second, ambiguous
stapler transducer that optionally adjusts the output of the basic
tokenizer for these potential multiword items. This optional stapler
is defined exactly like the unambiguous stapler except that we
include the universal identity relation, ?*, to allow for
any string to be mapped to itself.
OptionalStapler = [[Identify .o. Staple .o. Cleanup] | ?*]
This optional stapler maps the output of the basic tokenizer for
``de même'' into two results: the unchanged two-token sequence
(identity) and the single multiword token ``de même'' (stapled).
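The effect of the optional stapler can be illustrated with a small
sketch, not the actual xfst transducer: given a multiword lexicon
(here containing only ``de même'', with invented example input), the
tokenizer emits both the stapled single-token reading and the
unchanged sequence of independent tokens.

```python
# Sketch of a cautious tokenizer: lexicon entries may optionally be
# "stapled" into one token, so some inputs get several segmentations.

MULTIWORD_LEXICON = {("de", "même")}  # expressions with two analyses

def segmentations(tokens):
    """Yield every segmentation, optionally stapling lexicon entries."""
    if not tokens:
        yield []
        return
    # Try the longest prefix first (stapled reading), then one token.
    for n in range(len(tokens), 0, -1):
        head = tuple(tokens[:n])
        if n == 1 or head in MULTIWORD_LEXICON:
            for rest in segmentations(tokens[n:]):
                yield [" ".join(head)] + rest

print(list(segmentations(["de", "même", "format"])))
# yields both ['de même', 'format'] and ['de', 'même', 'format']
```

The recursion mirrors the nondeterminism of the transducer: each
lexicon match contributes one extra path through the result.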
Making tokenization nondeterministic solves one problem but
introduces another one. The subsequent stages of processing have to
deal with the ambiguous representations of the input. This problem
was first addressed in the context of a constraint-based
finite-state parser for French [Chanod
and Tapanainen 1996, Chanod
and Tapanainen 1997] that builds a finite-state network for the
input sentence that represents not only the alternative
tokenizations but also all the additional ambiguities arising from
the morphosyntactic analysis of the tokens. Each path through the
network represents one possible tokenization and one possible
morphosyntactic analysis for each token. Because of alternative
tokenizations, the paths in general do not have the same number of
tokens.
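The shape of such a sentence network can be sketched as follows; the
readings and tags below are invented simplifications, not the real
French lexicon. Each path pairs one tokenization with one
morphosyntactic reading per token, so paths may differ in length.

```python
# Hypothetical sketch: a sentence "network" enumerated as the set of
# paths combining alternative tokenizations with the morphosyntactic
# readings of their tokens (tags are illustrative only).

from itertools import product

READINGS = {                   # token -> possible analyses
    "de même": ["de même +Adv"],
    "de":      ["de +Prep"],
    "même":    ["même +Adj", "même +Adv"],
    "format":  ["format +Noun"],
}

def sentence_network(tokenizations):
    """All paths: one analysis per token, for each tokenization."""
    paths = []
    for tokens in tokenizations:
        paths.extend(product(*(READINGS[t] for t in tokens)))
    return paths

paths = sentence_network([["de même", "format"],
                          ["de", "même", "format"]])
# The stapled path has 2 tokens; the split paths have 3.
```

A real implementation would of course keep the network as a
finite-state automaton rather than enumerating its paths.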
At this level, some of these paths can be quickly eliminated by
syntactic constraints. Syntactic constraints are expressed as
regular expressions, typically containing the restriction operator,
and compiled to networks. These automata are intersected with the
sentence network to prune out unwanted readings. In particular, they
remove unacceptable tokenization schemes, unless the ambiguity is
syntactically acceptable.
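Intersection with the constraint networks amounts to keeping exactly
the paths that every constraint accepts. A minimal sketch, with each
compiled constraint modeled as a predicate over tag sequences and an
invented stand-in constraint:

```python
# Pruning sketch: "intersecting" the sentence network with constraint
# automata, modeled here as filtering paths by predicates.

def prune(paths, constraints):
    """Return the paths accepted by every syntactic constraint."""
    return [p for p in paths if all(c(p) for c in constraints)]

# Toy constraint (not from the source): a determiner must not be
# sentence-final.
def det_not_final(tags):
    return not tags or tags[-1] != "Det"

print(prune([["Det", "Noun"], ["Noun", "Det"]], [det_not_final]))
# only the path satisfying the constraint survives
```

In the finite-state setting this filtering is done once, by automaton
intersection, rather than path by path.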
For instance, the sentence ``De même les boîtes de même format
sont classées ensemble'' (similarly, the boxes of the same format
are filed together)
is ambiguous at the tokenization level, because of the ambiguous
string ``de même''. This leads to four different paths in the input
network, as far as tokenization is concerned: each of the two
occurrences of ``de même'' may be read either as a single token or
as two tokens.
However, after syntactic analysis, only one analysis remains, in
which the first ``de même'' is recognized as a multiword adverbial
while the second is decomposed into two independent tokens:
de même +Adv +Cap +MWE +Adverbial
le +InvGen +PL +Def +Det +NounPrMod
boîte +Fem +PL +Noun +Subject
de +Prep
même +InvGen +SG +Adj +NounPrMod
format +Masc +SG +Noun +PPObj
être +IndP +PL +P3 +Verb +Copl +Auxi +MainV
classer +PaPrt +Fem +PL +Verb +PastPart
ensemble +Adv +Adverbial .
This is due to the syntactic constraints that reject unwanted
analyses, including incorrect tokenizations. For instance, the path
where the two occurrences of ``de m那me'' are split into two tokens
is rejected by several constraints, among which is the following
constraint:
Prep => _ [ Coord Prep |                                          (a)
            ~$[ NounHead | VerbHead | Prep ] PPObj |              (b)
            ~$[ NounHead | VerbHead | Prep ] [ Inf | PresPart ] | (c)
            Adverbial |                                           (d)
            NounPsMod ]                                           (e)
This constraint allows the Prep symbol to appear only in
the given five contexts that describe the possible continuations for
a preposition in a French sentence:
- (a) before a coordinated preposition
- (b) before a PPObj with no prior head noun, head verb, or
preposition
- (c) before an infinitival or participial verb
- (d) before an adverbial
- (e) before an adjective, on the condition that it is a noun
postmodifier.
None of these contextual constraints accepts the sequence
[Prep NounPrMod Det Subject] on the path corresponding
to ``De même les boîtes''.
This path is then eliminated from the input sentence network.
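The rejection can be traced with a small sketch of the constraint
above. Contexts (b) and (c) are approximated by scanning rightward
for a PPObj, infinitive, or present participle with no intervening
head noun, head verb, or preposition; the tag names follow the
source, but the simplification is mine.

```python
# Hedged sketch of the Prep restriction: every Prep on a path must be
# followed by one of the contexts (a)-(e) described above.

HEADS = {"NounHead", "VerbHead", "Prep"}

def licit_prep_context(tags):
    """Does the tag sequence after a Prep match one of (a)-(e)?"""
    if len(tags) >= 2 and tags[0] == "Coord" and tags[1] == "Prep":
        return True                                   # (a)
    for t in tags:
        if t in ("PPObj", "Inf", "PresPart"):         # (b), (c)
            return True
        if t in HEADS:                                # blocked by ~$[...]
            break
    return bool(tags) and tags[0] in ("Adverbial", "NounPsMod")  # (d), (e)

def accepts(path):
    """Every Prep occurrence must have a licit continuation."""
    return all(licit_prep_context(path[i + 1:])
               for i, t in enumerate(path) if t == "Prep")

print(accepts(["Prep", "NounPrMod", "Det", "Subject"]))  # rejected
print(accepts(["Noun", "Prep", "Adj", "PPObj"]))         # accepted
```

The first call models the split reading of ``De même les boîtes'':
no context (a)-(e) matches [NounPrMod Det Subject], so the path is
pruned, as in the text.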
If, for a given sentence, two ambiguous tokenization paths are
syntactically acceptable, they are both preserved after intersection
with the constraint networks. This is what happens with the
sentence ``Je pense bien qu'il parle'', where ``bien que'' is
ambiguous: as a multiword token it means although, while
``bien'' on its own is the adverb well. The sentence can be
read either as, literally, ``I think well that he speaks'', i.e.
I do think that he speaks, or as I think although he
speaks, which leaves two analyses (paths) in the output sentence
network.