The simplest kind of deterministic tokenizer is an
unambiguous
transducer that splits the input stream into a unique sequence of
tokens, one per line, taking into account some general character
classes but not using any language-specific information:
Letter = [ A | B | C | ... | x | y | z ]
WhiteSpace = " " | "\t" | "\n"
Other = ? - Letter - WhiteSpace
With these definitions, we can compile, from the following regular
expression, a simple tokenizing transducer that inserts newlines to
mark token boundaries. The expression is the composition of three
simple replace relations, as explained below.
WhiteSpace+ @-> " "
.o.
Letter+ @-> ... "\n", Other+ @-> ... "\n"
.o.
" " -> [] || .#. | "\n" _
The first relation reduces strings of whitespace characters into
a single blank using longest-match replacement. The second inserts a
newline as a token boundary after longest matches of letter
sequences and other non-whitespace sequences. The third formula, a
contexted replace expression, denotes a relation that eliminates any
initial space and all spaces that follow a token boundary. The
composite single transducer consists of 5 states and 170 arcs. It is
unambiguous: every sequence of input characters is mapped into a
unique sequence of tokens.
Exactly the same tokenization can also be obtained by compiling
the three replacements separately and applying the resulting
transducers in a sequence, each modifying the output of the previous
one. This step-by-step approach, a cascade of transductions, will be
discussed in section 4.1.2.
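As a sketch, the effect of this cascade can be approximated in Python
by applying three ordinary regular-expression substitutions in
sequence. This is only an approximation: Python's greedy `+` imitates
the longest-match behaviour of @-> for these simple patterns, and the
ASCII-only Letter class and the name tokenize are assumptions made
here for illustration.

```python
import re

def tokenize(text):
    # Step 1: reduce runs of whitespace to a single blank
    # (WhiteSpace+ @-> " ").
    text = re.sub(r'[ \t\n]+', ' ', text)
    # Step 2: insert a newline after each longest run of letters and
    # after each longest run of other non-letter, non-blank characters
    # (Letter+ @-> ... "\n", Other+ @-> ... "\n").  Tabs and newlines
    # were already collapsed to blanks in step 1.
    text = re.sub(r'[A-Za-z]+|[^A-Za-z ]+', '\\g<0>\n', text)
    # Step 3: delete a blank at the start of the string or right after
    # a token boundary (" " -> [] || .#. | "\n" _).
    text = re.sub(r'(?:(?<=\n)|^) ', '', text)
    return text

print(tokenize("  Hello,\t world! "))  # prints Hello , world ! one per line
```

Applying the three substitutions in this order reproduces the
behaviour of the composed transducer on such inputs.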
Extending this simple tokenizer, we can create more sophisticated,
language-specific tokenizer transducers. In order to
divide the input text into sentences and words recognized by
subsequent processing steps, these tokenizers need to know about
abbreviations, conjoined clitics, and multiword expressions that
contain internal spaces, hyphens, and other special symbols. A
lexicon compiled as a finite-state language
(TokenLexicon below) containing such exceptional tokens
can be included in the middle line of the preceding formula.
[Letter+ | TokenLexicon ] @-> ... "\n", Other+ @-> ... "\n"
Because we use the longest-match operator, any spaces and
punctuation characters in the entries of TokenLexicon are
mapped to themselves and do not count as token boundaries. For
example, if the lexicon includes the strings ``-tu'', ``l' '',
and ``à côté de'', a French tokenizer would split the string
``Vois-tu l'arbre à côté de la maison?''
into the following tokens, one per line (each newline marks a
token boundary):
Vois
-tu
l'
arbre
à côté de
la
maison
?
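The lexicon-extended tokenizer can be sketched the same way by listing
the lexicon entries ahead of the generic letter pattern in a
regular-expression alternation. Two hedges: Python's re chooses the
first alternative that matches, not the longest one, so sorting the
entries longest-first only approximates the longest-match semantics of
@->, and TOKEN_LEXICON with its three sample entries is an assumption
taken from the example above.

```python
import re

# Sample entries from the TokenLexicon discussed above.
TOKEN_LEXICON = ["-tu", "l'", "à côté de"]

# Lexicon entries come first, longest first, so that they win over the
# generic letter pattern; re tries alternatives in order, not by length.
pattern = re.compile(
    "|".join(re.escape(e) for e in sorted(TOKEN_LEXICON, key=len, reverse=True))
    + r"|[^\W\d_]+"      # runs of (Unicode) letters
    + r"|[^\w\s]+"       # runs of other non-whitespace symbols
)

def tokenize(text):
    # Return the list of tokens; whitespace between matches is dropped.
    return pattern.findall(text)

print(tokenize("Vois-tu l'arbre à côté de la maison?"))
# ['Vois', '-tu', "l'", 'arbre', 'à côté de', 'la', 'maison', '?']
```

Because the entry ``à côté de'' is tried before the letter pattern,
its internal spaces are kept inside a single token, as in the example.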
Multiword tokens are of several types:
- adverbial expressions (``all of a sudden'', ``a priori'', ``to and fro'')
- prepositions (``in front of'', ``in spite of'', ``with respect to'')
- dates (``January 2nd''), time expressions (``2:00 am'')
- proper names (``New York'', ``Rank Xerox'')
- and other units
A word of warning is in order. An unambiguous tokenizer analyzes
every multiword expression consistently as a single token, even if
the component words should be separated in a given context. For
example, if ``in general'' is included in the multiword lexicon,
then it will be tokenized as a unit in both of the examples: ``in
general, he ...'' and ``in general meetings''. Therefore,
multiword lexicons used for unambiguous tokenization must be
carefully and conservatively designed.