One of the very first steps in any natural language processing
system is applying a tokenizer to the input text. A tokenizer
is a device that segments an input stream into an ordered sequence
of tokens, each token corresponding to an inflected word
form, a number, a punctuation mark, or some other kind of unit to be
passed on to subsequent processing. If the output never contains
alternative segmentations for any part of the input, the tokenizer
is called deterministic. Deterministic tokenization is
commonly seen as an independent preprocessing step unambiguously
producing items for subsequent morphological analysis.
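Such a deterministic tokenizer can be sketched with a single ordered pattern that scans the input left to right and emits exactly one token sequence. The token classes below (numbers, word forms, punctuation) and the longest-match, first-alternative strategy are illustrative assumptions, not the tokenizer described later in this chapter:

```python
import re

# Ordered alternation: numbers are tried before word forms so that
# "3.50" comes out as one token rather than three.  Each position in
# the input yields exactly one match, so the segmentation is unique.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")

def tokenize(text):
    """Return the single, ordered token sequence for `text`."""
    return TOKEN_RE.findall(text)

print(tokenize("Mr. Smith paid 3.50 dollars!"))
# -> ['Mr', '.', 'Smith', 'paid', '3.50', 'dollars', '!']
```

Because the pattern admits no alternative segmentations, every input maps to exactly one output sequence, which is what makes the tokenizer deterministic in the sense defined above.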
In our approach, by contrast, tokenization is an integral part of
language processing and can be adapted to the needs of the subsequent
analysis steps: depending on those steps, one may want to invoke
different tokenization algorithms. In
Section 4.1.1,
we describe a deterministic tokenizer which is useful for stochastic
part-of-speech disambiguation. Then in Section 4.1.3
we sketch a situation in which we might need non-deterministic
tokenization and describe how that is achieved.