MLTT
Deterministic Tokenization

next up previous
Next: Tokenization by step-by-step transduction Up: Tokenization Previous: Tokenization

 

The simplest kind of deterministic tokenizer is an unambiguous transducer that splits the input stream into a unique sequence of tokens, one per line, taking into account some general character classes but not using any language-specific information:

    Letter =     [ A | B | C |  ......  | x | y | z ]
    WhiteSpace = " " | "\t" | "\n"
    Other =      ? - Letter - WhiteSpace

With these definitions, we can compile a simple tokenizing transducer that inserts newlines to mark token boundaries from the following regular expression. It represents the composition of three simple replace relations, as explained below.

           WhiteSpace+ @-> " "
           .o.
           Letter+ @-> ... "\n", Other+ @-> ... "\n"
           .o.
           " " -> [] || .#. | "\n"  _

The first relation reduces strings of whitespace characters into a single blank using longest-match replacement. The second inserts a newline as a token boundary after longest matches of letter sequences and other non-whitespace sequences. The third formula, a contexted replace expression, denotes a relation that eliminates any initial space and all spaces that follow a token boundary. The composite single transducer consists of 5 states and 170 arcs. It is unambiguous: every sequence of input characters is mapped into a unique sequence of tokens.
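The behavior of these three replace relations can be sketched with ordinary regular-expression substitutions applied in sequence. This is only a Python simulation of the cascade, not a compiled finite-state transducer, and it restricts the Letter class to ASCII letters as in the definition above:

```python
import re

def tokenize(text: str) -> str:
    """Simulate the three replace relations as sequential substitutions."""
    # 1. Reduce runs of whitespace to a single blank (longest match).
    text = re.sub(r'[ \t\n]+', ' ', text)
    # 2. Insert a newline after maximal runs of letters and after
    #    maximal runs of other non-space characters.
    text = re.sub(r'[A-Za-z]+|[^A-Za-z ]+', r'\g<0>\n', text)
    # 3. Delete a blank at the start of the string or after a newline.
    text = re.sub(r'^ |(?<=\n) ', '', text)
    return text

print(tokenize("Hello,  world!"))   # Hello / , / world / ! on separate lines
```

Because Python's regex engine also prefers the longest match for `+`, each input string is mapped to a unique token sequence, mirroring the unambiguity of the composite transducer.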

Exactly the same tokenization can also be obtained by compiling the three replacements separately and applying the resulting transducers in a sequence, each modifying the output of the previous one. This step-by-step approach, a cascade of transductions, will be discussed in section 4.1.2.

Extending this simple tokenizer, we can create more sophisticated, language-specific tokenizing transducers. To divide the input text into the sentences and words expected by subsequent processing steps, these tokenizers need to know about abbreviations, conjoined clitics, and multiword expressions that contain internal spaces, hyphens, and other special symbols. A lexicon compiled as a finite-state language (TokenLexicon below) containing such exceptional tokens can be included in the middle line of the preceding formula:

    [Letter+ | TokenLexicon ] @-> ... "\n", Other+ @-> ... "\n"

Because we use the longest-match operator, any spaces and punctuation characters in the entries of TokenLexicon are mapped to themselves and do not count as token boundaries. For example, if the lexicon includes the strings ``-tu'', ``l' '', and ``à côté de'', a French tokenizer would split the string ``Vois-tu l'arbre à côté de la maison?'' ('Do you see the tree next to the house?') into the following tokens (token-bounding newlines are written as ↵):

Vois ↵ -tu ↵ l' ↵ arbre ↵ à côté de ↵ la ↵ maison ↵ ? ↵
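The longest-match preference for lexicon entries can be illustrated with a small Python sketch. The mini-lexicon and the greedy left-to-right matching below are only an approximation of the compiled transducer's @-> semantics, and they return a token list rather than a newline-delimited stream:

```python
import re

# Hypothetical mini-lexicon of exceptional tokens; entries may contain
# internal hyphens, apostrophes, and spaces.
TOKEN_LEXICON = ["-tu", "l'", "à côté de"]

LETTERS = re.compile(r"[^\W\d_]+")   # a maximal run of letters (Unicode-aware)
OTHER = re.compile(r"[^\w\s]+")      # a maximal run of other non-space symbols

def tokenize(text: str) -> list[str]:
    """Greedy left-to-right tokenization: at each position take the
    longest match among lexicon entries, letter runs, and other runs."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():        # plain spaces separate tokens
            i += 1
            continue
        candidates = [e for e in TOKEN_LEXICON if text.startswith(e, i)]
        for pattern in (LETTERS, OTHER):
            m = pattern.match(text, i)
            if m:
                candidates.append(m.group(0))
        best = max(candidates, key=len)   # longest match wins
        tokens.append(best)
        i += len(best)
    return tokens

print(tokenize("Vois-tu l'arbre à côté de la maison?"))
# ['Vois', '-tu', "l'", 'arbre', 'à côté de', 'la', 'maison', '?']
```

Note how the internal space in ``à côté de'' is consumed as part of the lexicon entry and never treated as a token boundary.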

Multiword tokens are of several types:

  • adverbial expressions (``all of a sudden'', ``a priori'', ``to and fro'')
  • prepositions (``in front of'', ``in spite of'', ``with respect to'')
  • dates (``January 2nd''), time expressions (``2:00 am'')
  • proper names (``New York'', ``Rank Xerox'')
  • and other units

A word of warning is in order. An unambiguous tokenizer analyzes every multiword expression consistently as a single token, even if the component words should be separated in a given context. For example, if ``in general'' is included in the multiword lexicon, then it will be tokenized as a unit in both of the examples: ``in general, he ...'' and ``in general meetings''. Therefore, multiword lexicons used for unambiguous tokenization must be carefully and conservatively designed.
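This pitfall can be seen in a minimal sketch, using a hypothetical lexicon containing ``in general'' and the same greedy longest-match policy: both contexts receive the multiword analysis, even though the second should arguably be split.

```python
import re

LEXICON = ["in general"]   # hypothetical multiword entry

def tokens(text: str) -> list[str]:
    """Longest-match sketch: a lexicon entry beats a shorter plain match."""
    out, i = [], 0
    while i < len(text):
        if text[i] == " ":
            i += 1
            continue
        entry = next((e for e in LEXICON if text.startswith(e, i)), None)
        m = re.match(r"[A-Za-z]+|[^A-Za-z ]+", text[i:])
        pick = entry if entry and len(entry) >= len(m.group(0)) else m.group(0)
        out.append(pick)
        i += len(pick)
    return out

print(tokens("in general, he agreed"))   # ['in general', ',', 'he', 'agreed']
print(tokens("in general meetings"))     # ['in general', 'meetings']
```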

