Non-deterministic Tokenization



next up previous
Next: Light Parsing by Marking Up: Tokenization Previous: Tokenization by step-by-step transduction

 

The deterministic treatment of multiword expressions as single tokens is problematic because many such expressions have alternate analyses in different contexts. For instance, the string ``de même'' in French can be treated either as a single token, an adverbial meaning similarly, or as a sequence of two independent tokens: the preposition ``de'' of followed by the adjective ``même'' same. If the unambiguous tokenizer makes the wrong choice, the result may be a parse failure or an incorrect semantic interpretation. In such cases, a cautious tokenizer produces alternative segmentations, postponing the decision to a later processing stage.

With the techniques just introduced, it is easy to make a tokenizer that produces alternative segmentations for some strings. We start by creating a special multiword lexicon for strings such as ``de même'' that should be analyzed either as a single token or as a sequence of tokens. If we are using the step-by-step approach of the previous section, we introduce into the cascade a second, ambiguous stapler transducer that optionally adjusts the output of the basic tokenizer for these potential multiword items. This optional stapler is defined exactly like the unambiguous stapler except that we add, as a union, the universal identity relation, ?*, which allows any string to be mapped to itself.

    OptionalStapler = [[Identify .o. Staple .o. Cleanup] | ?*]

This optional stapler maps the output of the basic tokenizer, ``de □ même □'' (where □ stands for the token-boundary symbol), into both ``de □ même □'' (identity) and ``de même □'' (stapled).
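The effect of the optional stapler can be sketched in ordinary code. The following is a minimal Python stand-in, not the finite-state implementation: it assumes a one-entry multiword lexicon, and the names `MULTIWORD_LEXICON` and `optional_staple` are ours, not from the text.

```python
# Illustrative stand-in for the optional stapler: "de même" is assumed
# to be the only entry in the multiword lexicon.
MULTIWORD_LEXICON = {("de", "même")}

def optional_staple(tokens):
    """Return every segmentation: at each position, a listed multiword
    pair may be stapled into one token or kept as two (identity)."""
    if not tokens:
        return [[]]
    results = []
    if tuple(tokens[:2]) in MULTIWORD_LEXICON:
        # stapled reading: the pair becomes a single token
        results += [[" ".join(tokens[:2])] + rest
                    for rest in optional_staple(tokens[2:])]
    # identity reading: keep the first token as-is
    results += [[tokens[0]] + rest for rest in optional_staple(tokens[1:])]
    return results

print(optional_staple(["de", "même"]))
# → [['de même'], ['de', 'même']]
```

As with the transducer, both outcomes are produced for an ambiguous string, and unambiguous input passes through unchanged.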

Making tokenization nondeterministic solves one problem but introduces another: the subsequent stages of processing have to deal with ambiguous representations of the input. This problem was first addressed in the context of a constraint-based finite-state parser for French [Chanod and Tapanainen 1996, Chanod and Tapanainen 1997]. The parser builds a finite-state network for the input sentence that represents not only the alternative tokenizations but also all the additional ambiguities arising from the morphosyntactic analysis of the tokens. Each path through the network represents one possible tokenization together with one possible morphosyntactic analysis for each token. Because of alternative tokenizations, the paths do not in general contain the same number of tokens.
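The shape of such a sentence network can be sketched by listing its paths explicitly. The mini-lexicon below is invented for illustration; the real system stores all of this compactly in a single finite-state network rather than enumerating paths.

```python
# Invented mini-lexicon mapping each token to its candidate analyses.
ANALYSES = {
    "de même": [("de même", "Adv")],
    "de":      [("de", "Prep")],
    "même":    [("même", "Adj")],
}

def expand(tokenization):
    """Enumerate every (token, tag) path for one tokenization."""
    if not tokenization:
        return [[]]
    return [[analysis] + rest
            for analysis in ANALYSES[tokenization[0]]
            for rest in expand(tokenization[1:])]

# two alternative tokenizations of the same input string
tokenizations = [["de même"], ["de", "même"]]
paths = [p for t in tokenizations for p in expand(t)]
print([len(p) for p in paths])  # → [1, 2]
```

Note that the two paths have different lengths: the stapled reading contributes one token, the split reading two.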

At this level, some of these paths can be quickly eliminated by syntactic constraints. The constraints are expressed as regular expressions, typically containing the restriction operator, and compiled into networks. These automata are intersected with the sentence network to prune out unwanted readings. In particular, they eliminate syntactically unacceptable tokenization paths while preserving any tokenization ambiguity that the syntax alone cannot resolve.
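Over an explicit list of paths, this pruning amounts to keeping the paths on which every constraint holds. The sketch below models each constraint as a predicate over a tag sequence; the sample constraint is invented, and the real system intersects compiled constraint automata with the network instead of filtering procedurally.

```python
def prune(paths, constraints):
    """Keep only the paths on which every constraint holds."""
    return [p for p in paths if all(check(p) for check in constraints)]

# invented constraint: no determiner directly follows another determiner
no_double_det = lambda tags: all(
    not (a == "Det" and b == "Det") for a, b in zip(tags, tags[1:]))

paths = [["Adverbial", "Det", "Subject"], ["Det", "Det", "Subject"]]
print(prune(paths, [no_double_det]))  # → [['Adverbial', 'Det', 'Subject']]
```

Each constraint can only remove paths, never add them, so applying the constraints in any order yields the same surviving set, just as intersection of automata is order-independent.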

For instance, the sentence ``De même les boîtes de même format sont classées ensemble'' Similarly, the boxes of the same format are filed together is ambiguous at the tokenization level because of the ambiguous string ``de même''. This leads to four different paths in the input network, as far as tokenization is concerned:

[Figure: the four tokenization paths for the sentence]

However, after syntactic analysis only one analysis remains: the first ``de même'' is recognized as a multiword adverbial, while the second is decomposed into two independent tokens:

    de même     +Adv +Cap +MWE                   +Adverbial
    le          +InvGen +PL +Def +Det            +NounPrMod
    boîte       +Fem +PL +Noun                   +Subject
    de                                           +Prep
    même        +InvGen +SG +Adj                 +NounPrMod
    format      +Masc +SG +Noun                  +PPObj
    être        +IndP +PL +P3 +Verb +Copl +Auxi  +MainV
    classer     +PaPrt +Fem +PL +Verb            +PastPart
    ensemble    +Adv                             +Adverbial .

This is due to the syntactic constraints that reject the unwanted analyses, including the incorrect tokenizations. For instance, the path on which both occurrences of ``de même'' are split into two tokens is rejected by several constraints, among them the following:

    Prep  =>  _
          Coord Prep |                                           (a)
          ~$[ NounHead | VerbHead | Prep ] PPObj |               (b)
          ~$[ NounHead | VerbHead | Prep ] [ Inf | PresPart ] |  (c)
          Adverbial |                                            (d)
          NounPsMod                                              (e)

This constraint allows the Prep symbol to appear only in the given five contexts that describe the possible continuations for a preposition in a French sentence:

(a) before a coordinated preposition
(b) before a PPObj with no intervening noun head, verb head, or preposition
(c) before an infinitival or participial verb, again with no intervening head
(d) before an adverbial
(e) before an adjective, on the condition that it is a noun postmodifier.
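The five contexts can be checked procedurally as a sketch. The predicate below operates on a path's tag sequence, with tags as plain strings; this is our illustrative reformulation, whereas the real system compiles the restriction to an automaton and intersects it with the sentence network.

```python
HEADS = {"NounHead", "VerbHead", "Prep"}

def prep_ok(tags, i):
    """True if the Prep at position i occurs in one of contexts (a)-(e)."""
    rest = tags[i + 1:]
    if rest[:2] == ["Coord", "Prep"]:                           # (a)
        return True
    for tag in rest:                                            # (b), (c)
        if tag in HEADS:
            break               # a head intervenes: contexts (b), (c) fail
        if tag == "PPObj" or tag in ("Inf", "PresPart"):
            return True
    if rest[:1] == ["Adverbial"] or rest[:1] == ["NounPsMod"]:  # (d), (e)
        return True
    return False

# the accepted reading: ``de'' +Prep followed by ``même format'' (PPObj)
print(prep_ok(["Prep", "NounPrMod", "PPObj"], 0))           # → True
# the rejected path [Prep NounPrMod Det Subject] from the example
print(prep_ok(["Prep", "NounPrMod", "Det", "Subject"], 0))  # → False
```

On the rejected path the scan reaches Subject without meeting any licensed continuation, so the Prep reading fails, just as the compiled constraint removes that path from the network.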

None of these contextual constraints accepts the sequence [Prep NounPrMod Det Subject] on the path corresponding to ``De □ même □ les □ boîtes'' (□ marking a token boundary). This path is therefore eliminated from the input sentence network.

If, for a given sentence, two ambiguous tokenization paths are syntactically acceptable, both are preserved after intersection with the constraint networks. This is what happens with the sentence ``Je pense bien qu'il parle'', where ``bien que'' is ambiguous: it can be read either as the adverb ``bien'' well followed by the conjunction ``que'' that, or as the multiword conjunction ``bien que'' although. The sentence thus means either I do think that he speaks (lit. ``I think well that he speaks'') or I think although he speaks, which leaves two analyses (paths) in the output sentence network.


