One of the very first steps in any natural language processing
system is applying a tokenizer to the input text. A tokenizer
is a device that segments an input stream into an ordered sequence
of tokens, each token corresponding to an inflected word
form, a number, a punctuation mark, or some other kind of unit to be
passed on to subsequent processing. If the output never contains
alternative segmentations for any part of the input, the tokenizer
is called deterministic. Deterministic tokenization is
commonly seen as an independent preprocessing step unambiguously
producing items for subsequent morphological analysis.
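Such a deterministic tokenizer can be sketched with a single ordered pattern that scans the input left to right and emits exactly one token sequence. The token classes below (numbers, word forms, punctuation) and the longest-match, first-alternative strategy are illustrative assumptions, not the tokenizer described later in this chapter:

```python
import re

# Ordered alternation: numbers are tried before word forms so that
# "3.50" comes out as one token rather than three.  Each position in
# the input yields exactly one match, so the segmentation is unique.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")

def tokenize(text):
    """Return the single, ordered token sequence for `text`."""
    return TOKEN_RE.findall(text)

print(tokenize("Mr. Smith paid 3.50 dollars!"))
# -> ['Mr', '.', 'Smith', 'paid', '3.50', 'dollars', '!']
```

Because the pattern admits no alternative segmentations, every input maps to exactly one output sequence, which is what makes the tokenizer deterministic in the sense defined above.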
In our approach, by contrast, tokenization is an integral part of
language processing and can be adapted to the needs of the subsequent
analysis steps: depending on those steps, one may want to invoke
different tokenization algorithms. In
Section 4.1.1,
we describe a deterministic tokenizer which is useful for stochastic
part-of-speech disambiguation. Then in Section 4.1.3
we sketch a situation in which we might need non-deterministic
tokenization and describe how that is achieved.