
 

One of the first steps in any natural language processing system is applying a tokenizer to the input text. A tokenizer is a device that segments an input stream into an ordered sequence of tokens, each corresponding to an inflected word form, a number, a punctuation mark, or some other kind of unit to be passed on to subsequent processing. If the output never contains alternative segmentations for any part of the input, the tokenizer is called deterministic. Deterministic tokenization is commonly treated as an independent preprocessing step that unambiguously produces items for subsequent morphological analysis.
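The behaviour of a deterministic tokenizer can be illustrated with a minimal sketch (illustrative only; this is a simple regular-expression scanner, not the finite-state implementation described in this paper). It scans the input left to right and always produces exactly one segmentation, in which each token is a word form, a number, or a punctuation mark:

```python
import re

# One alternation per token class; the leftmost-longest match at each
# position yields a single, unambiguous segmentation of the input.
TOKEN_PATTERN = re.compile(
    r"[A-Za-z]+(?:'[A-Za-z]+)?"   # inflected word forms, incl. "don't"
    r"|\d+(?:\.\d+)?"             # integers and decimal numbers
    r"|[^\w\s]"                   # single punctuation marks
)

def tokenize(text):
    """Return the unique ordered token sequence for `text`."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("He paid 3.50 dollars."))
```

Because the pattern is applied greedily at each position, no part of the input ever receives two competing analyses, which is precisely the property that makes the tokenizer deterministic.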

In our approach, tokenization is an integral part of language processing and can be adapted flexibly to the needs of the subsequent analysis steps: depending on those steps, one may want to invoke different tokenization algorithms. In Section 4.1.1, we describe a deterministic tokenizer that is useful for stochastic part-of-speech disambiguation. In Section 4.1.3, we sketch a situation that calls for non-deterministic tokenization and describe how it is achieved.



