Non-deterministic Tokenization



next up previous
Next: Light Parsing by Marking Up: Tokenization Previous: Tokenization by step-by-step transduction

 

The deterministic treatment of multiword expressions as single tokens is problematic because many such expressions have alternate analyses in different contexts. For instance, the string ``de même'' in French can be treated either as a single token, an adverbial meaning similarly, or as a sequence of two independent tokens: the preposition ``de'' of followed by the adjective ``même'' same. If the unambiguous tokenizer makes the wrong choice, the result may be a parse failure or an incorrect semantic interpretation. In such cases, a cautious tokenizer produces alternative segmentations, postponing the decision to a later processing stage.

With the techniques just introduced, it is easy to make a tokenizer that produces alternative segmentations for some strings. We start by creating a special multiword lexicon for strings such as ``de même'' that should be analyzed either as a single token or as a sequence of tokens. If we are using the step-by-step approach of the previous section, we introduce into the cascade a second, ambiguous stapler transducer that optionally adjusts the output of the basic tokenizer for these potential multiword items. This optional stapler is defined exactly like the unambiguous stapler except that we add, as a union, the universal identity relation, ?*, which allows any string to be mapped to itself.

    OptionalStapler = [[Identify .o. Staple .o. Cleanup] | ?*]

This optional stapler maps the output of the basic tokenizer, ``de □ même □'' (where □ stands for the token-boundary symbol), into both ``de □ même □'' (identity) and ``de même □'' (stapled).
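The effect of the optional stapler can be sketched in ordinary code. The following is a minimal Python stand-in, not the finite-state implementation: it assumes a one-entry multiword lexicon, and the names `MULTIWORD_LEXICON` and `optional_staple` are ours, not from the text.

```python
# Illustrative stand-in for the optional stapler: "de même" is assumed
# to be the only entry in the multiword lexicon.
MULTIWORD_LEXICON = {("de", "même")}

def optional_staple(tokens):
    """Return every segmentation: at each position, a listed multiword
    pair may be stapled into one token or kept as two (identity)."""
    if not tokens:
        return [[]]
    results = []
    if tuple(tokens[:2]) in MULTIWORD_LEXICON:
        # stapled reading: the pair becomes a single token
        results += [[" ".join(tokens[:2])] + rest
                    for rest in optional_staple(tokens[2:])]
    # identity reading: keep the first token as-is
    results += [[tokens[0]] + rest for rest in optional_staple(tokens[1:])]
    return results

print(optional_staple(["de", "même"]))
# → [['de même'], ['de', 'même']]
```

As with the transducer, both outcomes are produced for an ambiguous string, and unambiguous input passes through unchanged.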

Making tokenization nondeterministic solves one problem but introduces another: the subsequent stages of processing have to deal with ambiguous representations of the input. This problem was first addressed in the context of a constraint-based finite-state parser for French [Chanod and Tapanainen 1996, Chanod and Tapanainen 1997]. The parser builds a finite-state network for the input sentence that represents not only the alternative tokenizations but also all the additional ambiguities arising from the morphosyntactic analysis of the tokens. Each path through the network represents one possible tokenization together with one possible morphosyntactic analysis for each token. Because of alternative tokenizations, the paths do not in general contain the same number of tokens.
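The shape of such a sentence network can be sketched by listing its paths explicitly. The mini-lexicon below is invented for illustration; the real system stores all of this compactly in a single finite-state network rather than enumerating paths.

```python
# Invented mini-lexicon mapping each token to its candidate analyses.
ANALYSES = {
    "de même": [("de même", "Adv")],
    "de":      [("de", "Prep")],
    "même":    [("même", "Adj")],
}

def expand(tokenization):
    """Enumerate every (token, tag) path for one tokenization."""
    if not tokenization:
        return [[]]
    return [[analysis] + rest
            for analysis in ANALYSES[tokenization[0]]
            for rest in expand(tokenization[1:])]

# two alternative tokenizations of the same input string
tokenizations = [["de même"], ["de", "même"]]
paths = [p for t in tokenizations for p in expand(t)]
print([len(p) for p in paths])  # → [1, 2]
```

Note that the two paths have different lengths: the stapled reading contributes one token, the split reading two.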

At this level, some of these paths can be quickly eliminated by syntactic constraints. The constraints are expressed as regular expressions, typically containing the restriction operator, and compiled into networks. These automata are intersected with the sentence network to prune out unwanted readings. In particular, they eliminate syntactically unacceptable tokenization paths while preserving any tokenization ambiguity that the syntax alone cannot resolve.
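Over an explicit list of paths, this pruning amounts to keeping the paths on which every constraint holds. The sketch below models each constraint as a predicate over a tag sequence; the sample constraint is invented, and the real system intersects compiled constraint automata with the network instead of filtering procedurally.

```python
def prune(paths, constraints):
    """Keep only the paths on which every constraint holds."""
    return [p for p in paths if all(check(p) for check in constraints)]

# invented constraint: no determiner directly follows another determiner
no_double_det = lambda tags: all(
    not (a == "Det" and b == "Det") for a, b in zip(tags, tags[1:]))

paths = [["Adverbial", "Det", "Subject"], ["Det", "Det", "Subject"]]
print(prune(paths, [no_double_det]))  # → [['Adverbial', 'Det', 'Subject']]
```

Each constraint can only remove paths, never add them, so applying the constraints in any order yields the same surviving set, just as intersection of automata is order-independent.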

For instance, the sentence ``De même les boîtes de même format sont classées ensemble'' Similarly, the boxes of the same format are filed together is ambiguous at the tokenization level because of the ambiguous string ``de même''. This leads to four different paths in the input network, as far as tokenization is concerned:

[Figure: the four tokenization paths for the sentence]

However, after syntactic analysis only one analysis remains: the first ``de même'' is recognized as a multiword adverbial, while the second is decomposed into two independent tokens:

    de même     +Adv +Cap +MWE                   +Adverbial
    le          +InvGen +PL +Def +Det            +NounPrMod
    boîte       +Fem +PL +Noun                   +Subject
    de                                           +Prep
    même        +InvGen +SG +Adj                 +NounPrMod
    format      +Masc +SG +Noun                  +PPObj
    être        +IndP +PL +P3 +Verb +Copl +Auxi  +MainV
    classer     +PaPrt +Fem +PL +Verb            +PastPart
    ensemble    +Adv                             +Adverbial .

This is due to the syntactic constraints that reject the unwanted analyses, including the incorrect tokenizations. For instance, the path on which both occurrences of ``de même'' are split into two tokens is rejected by several constraints, among them the following:

    Prep  =>  _
          Coord Prep |                                           (a)
          ~$[ NounHead | VerbHead | Prep ] PPObj |               (b)
          ~$[ NounHead | VerbHead | Prep ] [ Inf | PresPart ] |  (c)
          Adverbial |                                            (d)
          NounPsMod                                              (e)

This constraint allows the Prep symbol to appear only in the given five contexts that describe the possible continuations for a preposition in a French sentence:

(a) before a coordinated preposition
(b) before a PPObj with no intervening noun head, verb head, or preposition
(c) before an infinitival or participial verb, again with no intervening head
(d) before an adverbial
(e) before an adjective, on the condition that it is a noun postmodifier.
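The five contexts can be checked procedurally as a sketch. The predicate below operates on a path's tag sequence, with tags as plain strings; this is our illustrative reformulation, whereas the real system compiles the restriction to an automaton and intersects it with the sentence network.

```python
HEADS = {"NounHead", "VerbHead", "Prep"}

def prep_ok(tags, i):
    """True if the Prep at position i occurs in one of contexts (a)-(e)."""
    rest = tags[i + 1:]
    if rest[:2] == ["Coord", "Prep"]:                           # (a)
        return True
    for tag in rest:                                            # (b), (c)
        if tag in HEADS:
            break               # a head intervenes: contexts (b), (c) fail
        if tag == "PPObj" or tag in ("Inf", "PresPart"):
            return True
    if rest[:1] == ["Adverbial"] or rest[:1] == ["NounPsMod"]:  # (d), (e)
        return True
    return False

# the accepted reading: ``de'' +Prep followed by ``même format'' (PPObj)
print(prep_ok(["Prep", "NounPrMod", "PPObj"], 0))           # → True
# the rejected path [Prep NounPrMod Det Subject] from the example
print(prep_ok(["Prep", "NounPrMod", "Det", "Subject"], 0))  # → False
```

On the rejected path the scan reaches Subject without meeting any licensed continuation, so the Prep reading fails, just as the compiled constraint removes that path from the network.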

None of these contextual constraints accepts the sequence [Prep NounPrMod Det Subject] on the path corresponding to ``De □ même □ les □ boîtes'' (□ marking a token boundary). This path is therefore eliminated from the input sentence network.

If, for a given sentence, two ambiguous tokenization paths are syntactically acceptable, both are preserved after intersection with the constraint networks. This is what happens with the sentence ``Je pense bien qu'il parle'', where ``bien que'' is ambiguous: it can be read either as the adverb ``bien'' well followed by the conjunction ``que'' that, or as the multiword conjunction ``bien que'' although. The sentence thus means either I do think that he speaks (lit. ``I think well that he speaks'') or I think although he speaks, which leaves two analyses (paths) in the output sentence network.


