The simplest kind of deterministic tokenizer is an
unambiguous
transducer that splits the input stream into a unique sequence of
tokens, one per line, taking into account some general character
classes but not using any language-specific information:
Letter = [ A | B | C | ... | x | y | z ]
WhiteSpace = " " | "\t" | "\n"
Other = ? - Letter - WhiteSpace
With these definitions, we can compile, from the following regular
expression, a simple tokenizing transducer that inserts newlines to
mark token boundaries. The expression is the composition of three
simple replace relations, as explained below.
WhiteSpace+ @-> " "
.o.
Letter+ @-> ... "\n", Other+ @-> ... "\n"
.o.
" " -> [] || .#. | "\n" _
The first relation reduces strings of whitespace characters into
a single blank using longest-match replacement. The second inserts a
newline as a token boundary after longest matches of letter
sequences and other non-whitespace sequences. The third formula, a
contexted replace expression, denotes a relation that eliminates any
initial space and all spaces that follow a token boundary. The
composite single transducer consists of 5 states and 170 arcs. It is
unambiguous: every sequence of input characters is mapped into a
unique sequence of tokens.
Exactly the same tokenization can also be obtained by compiling
the three replacements separately and applying the resulting
transducers in a sequence, each modifying the output of the previous
one. This step-by-step approach, a cascade of transductions, will be
discussed in section 4.1.2.
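As a sketch, the effect of this cascade can be approximated in Python
by applying three ordinary regular-expression substitutions in
sequence. This is only an approximation: Python's greedy `+` imitates
the longest-match behaviour of @-> for these simple patterns, and the
ASCII-only Letter class and the name tokenize are assumptions made
here for illustration.

```python
import re

def tokenize(text):
    # Step 1: reduce runs of whitespace to a single blank
    # (WhiteSpace+ @-> " ").
    text = re.sub(r'[ \t\n]+', ' ', text)
    # Step 2: insert a newline after each longest run of letters and
    # after each longest run of other non-letter, non-blank characters
    # (Letter+ @-> ... "\n", Other+ @-> ... "\n").  Tabs and newlines
    # were already collapsed to blanks in step 1.
    text = re.sub(r'[A-Za-z]+|[^A-Za-z ]+', '\\g<0>\n', text)
    # Step 3: delete a blank at the start of the string or right after
    # a token boundary (" " -> [] || .#. | "\n" _).
    text = re.sub(r'(?:(?<=\n)|^) ', '', text)
    return text

print(tokenize("  Hello,\t world! "))  # prints Hello , world ! one per line
```

Applying the three substitutions in this order reproduces the
behaviour of the composed transducer on such inputs.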
Extending this simple tokenizer, we can create more sophisticated,
language-specific tokenizer transducers. In order to
divide the input text into sentences and words recognized by
subsequent processing steps, these tokenizers need to know about
abbreviations, conjoined clitics, and multiword expressions that
contain internal spaces, hyphens, and other special symbols. A
lexicon compiled as a finite-state language
(TokenLexicon below) containing such exceptional tokens
can be included in the middle line of the preceding formula.
[Letter+ | TokenLexicon ] @-> ... "\n", Other+ @-> ... "\n"
Because we use the longest-match operator, any spaces and
punctuation characters in the entries of TokenLexicon are
mapped to themselves and do not count as token boundaries. For
example, if the lexicon includes the strings ``-tu'', ``l' '',
and ``à côté de'', a French tokenizer would split the string
``Vois-tu l'arbre à côté de la maison?''
into the following tokens, one per line (each newline marks a
token boundary):
Vois
-tu
l'
arbre
à côté de
la
maison
?
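The lexicon-extended tokenizer can be sketched the same way by listing
the lexicon entries ahead of the generic letter pattern in a
regular-expression alternation. Two hedges: Python's re chooses the
first alternative that matches, not the longest one, so sorting the
entries longest-first only approximates the longest-match semantics of
@->, and TOKEN_LEXICON with its three sample entries is an assumption
taken from the example above.

```python
import re

# Sample entries from the TokenLexicon discussed above.
TOKEN_LEXICON = ["-tu", "l'", "à côté de"]

# Lexicon entries come first, longest first, so that they win over the
# generic letter pattern; re tries alternatives in order, not by length.
pattern = re.compile(
    "|".join(re.escape(e) for e in sorted(TOKEN_LEXICON, key=len, reverse=True))
    + r"|[^\W\d_]+"      # runs of (Unicode) letters
    + r"|[^\w\s]+"       # runs of other non-whitespace symbols
)

def tokenize(text):
    # Return the list of tokens; whitespace between matches is dropped.
    return pattern.findall(text)

print(tokenize("Vois-tu l'arbre à côté de la maison?"))
# ['Vois', '-tu', "l'", 'arbre', 'à côté de', 'la', 'maison', '?']
```

Because the entry ``à côté de'' is tried before the letter pattern,
its internal spaces are kept inside a single token, as in the example.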
Multiword tokens are of several types:
- adverbial expressions (``all of a sudden'', ``a priori'', ``to and fro'')
- prepositions (``in front of'', ``in spite of'', ``with respect to'')
- dates (``January 2nd''), time expressions (``2:00 am'')
- proper names (``New York'', ``Rank Xerox'')
- and other units
A word of warning is in order. An unambiguous tokenizer analyzes
every multiword expression consistently as a single token, even if
the component words should be separated in a given context. For
example, if ``in general'' is included in the multiword lexicon,
then it will be tokenized as a unit in both of the examples: ``in
general, he ...'' and ``in general meetings''. Therefore,
multiword lexicons used for unambiguous tokenization must be
carefully and conservatively designed.