MLTT
Tokenization by step-by-step transduction



If the multiword lexicons are very large, compiling them into a single transducer may cost too much time and space. Moreover, different NLP applications may require different multiword lexicons. For these reasons, it may be advantageous to use another approach, in which multiple transducers apply in sequence and yield exactly the same end result as the single tokenizer just described. In general, such a sequence consists of

a basic tokenizer, which segments any sequence of input characters into simple tokens (i.e. no multiword units), and
one or several multiword staplers, which identify multiwords and group them together as single units.

The basic tokenizer is compiled as described in the previous section. The multiword staplers are built in the following way. We first define a multiword language, MWL, containing the units to be recognized. Because the definition assumes that basic tokenization has already been done, the word separator inside a multiword is a newline instead of a space. For example, if the multiword expressions consist of "ad hoc" and "and so on", we define the language as follows:

    MWL = [a d "\n" h o c | a n d "\n" s o "\n" o n ]
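As an illustration only, the same language can be emulated with an ordinary Python regular expression. This is a sketch and not the finite-state compilation described here; the helper name `multiword_language` is hypothetical.

```python
import re

def multiword_language(expressions):
    # Build a regex that matches any listed multiword expression,
    # with a newline (not a space) between its words.
    variants = ["\n".join(re.escape(word) for word in expr.split())
                for expr in expressions]
    # Python alternation is ordered, so list the longest variants first.
    variants.sort(key=len, reverse=True)
    return "(?:" + "|".join(variants) + ")"

MWL = multiword_language(["ad hoc", "and so on"])
```

The pattern accepts "ad" and "hoc" only when they are separated by a newline, which mirrors the assumption that basic tokenization has already been applied.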

In order to define a relation that staples together the MWL expressions, it is useful to start with some auxiliary definitions.

    BEG = ["<<"]   END = [">>"]   BND = [BEG | END]   LIM = ["\n" | .#.]

The BEG and END brackets mark the two ends of a multiword string. The LIM expression is used to check the surrounding context, making sure that the beginning and the end of the multiword expression are not part of some other token. The stapler is composed from the three auxiliary relations below:

    Identify = [~$[BND] .o. [MWL @-> BEG ... END || LIM _ LIM ]]
    Staple =   ["\n" -> " " || BEG ~$[BND] _ ~$[BND] END]
    Cleanup =  [BND -> []]

The Identify relation wraps the multiword expressions in MWL inside a pair of auxiliary brackets, << >>, under the left-to-right, longest-match regimen imposed by @-> and under the constraint that the multiword string is properly delimited. The Staple relation converts every newline inside a marked region into a space; a newline that follows the region lies outside the brackets and is left unchanged. The Cleanup relation eliminates the auxiliary brackets.
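The three relations can be imitated step by step with Python's `re` module. The sketch below is an approximation under stated assumptions: Python regexes are not transducers, the longest-match behaviour is obtained only by listing the longer alternative first, and the function names `identify`, `staple`, `cleanup` and `stapler` are chosen here for illustration.

```python
import re

BEG, END = "<<", ">>"
# MWL with the longest alternative first; \n separates the words.
MWL = r"(?:and\nso\non|ad\nhoc)"
# LIM on both sides: a newline or the edge of the string.
IDENTIFY = re.compile(r"(?<![^\n])" + MWL + r"(?![^\n])")
# A marked region: BEG, then the shortest stretch of text up to END.
REGION = re.compile(re.escape(BEG) + r"(.*?)" + re.escape(END), re.DOTALL)

def identify(text):
    # The input must not already contain brackets (the ~$[BND] filter).
    if BEG in text or END in text:
        raise ValueError("input already contains auxiliary brackets")
    return IDENTIFY.sub(lambda m: BEG + m.group(0) + END, text)

def staple(text):
    # Turn newlines into spaces, but only inside marked regions.
    return REGION.sub(
        lambda m: BEG + m.group(1).replace("\n", " ") + END, text)

def cleanup(text):
    # Delete the auxiliary brackets.
    return text.replace(BEG, "").replace(END, "")

def stapler(text):
    return cleanup(staple(identify(text)))
```

Applied to the token string `"and\nso\non\n."`, `identify` brackets the first three tokens, `staple` joins them with spaces, and `cleanup` removes the brackets, while the newline before the full stop is untouched.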

The multiword stapler for the MWL expressions is the composition of the three relations defined above:

    Stapler = [Identify .o. Staple .o. Cleanup]

The sequential application of the basic tokenizer and the multiword stapler is illustrated in the figure below. To save space, each newline symbol is written as \n.

    one, two, and so on.
            |  basic tokenizer
            v
    one\n,\ntwo\n,\nand\nso\non\n.
            |  multiword stapler
            v
    one\n,\ntwo\n,\nand so on\n.

The basic tokenizer can of course be composed with the multiword staplers to form a single, larger transducer if increasing the speed of application is more important than the size of the network.
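The cascade of a basic tokenizer followed by one or several staplers can be sketched in Python as plain function composition. Everything here is an assumption made for illustration: `basic_tokenizer` is a simple stand-in and not the tokenizer of the previous section, `make_stapler` collapses the identify, staple and cleanup steps into one substitution, and `compose` only chains functions, so it does not reproduce the size and speed trade-off of composing real transducers.

```python
import re
from functools import reduce

def basic_tokenizer(text):
    # Stand-in tokenizer (assumption): words and single punctuation
    # marks, separated by newlines.
    return "\n".join(re.findall(r"\w+|[^\w\s]", text))

def make_stapler(multiwords):
    # One stapler per multiword lexicon; longest expressions first.
    variants = sorted(
        ("\n".join(re.escape(w) for w in mw.split()) for mw in multiwords),
        key=len, reverse=True)
    pattern = re.compile(
        r"(?<![^\n])(?:" + "|".join(variants) + r")(?![^\n])")

    def stapler(tokens):
        # Join the words of each delimited multiword with spaces.
        return pattern.sub(lambda m: m.group(0).replace("\n", " "), tokens)

    return stapler

def compose(*steps):
    # Apply the steps from left to right, each to the previous output.
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)

tokenize = compose(basic_tokenizer,
                   make_stapler(["ad hoc"]),
                   make_stapler(["and so on"]))
```

Calling `tokenize("one, two, and so on.")` runs the three steps in order; splitting the lexicon across two staplers shows how different applications could plug in different multiword lists.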

If the stapling of some multiword expressions is made optional, the tokenization becomes nondeterministic because the multiword interpretation of the MWL string is an alternative to the sequence of single word tokens produced by the basic tokenizer. The next section discusses some cases where it is advantageous not to use deterministic tokenization.
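To make the nondeterminism concrete, the following Python sketch returns every output in which each delimited multiword match is either stapled or left as separate tokens. It is a simplification under assumptions: only the leftmost, longest matches are considered, and the function name `optional_staple` is hypothetical.

```python
import re
from itertools import product

def optional_staple(tokens, multiwords):
    # Return all alternative tokenizations of a newline-separated
    # token string, stapling each multiword match or leaving it alone.
    variants = sorted(
        ("\n".join(re.escape(w) for w in mw.split()) for mw in multiwords),
        key=len, reverse=True)
    pattern = re.compile(
        r"(?<![^\n])(?:" + "|".join(variants) + r")(?![^\n])")
    matches = list(pattern.finditer(tokens))
    results = []
    # One True/False choice per match: stapled or not.
    for choices in product([True, False], repeat=len(matches)):
        pieces, pos = [], 0
        for match, stapled in zip(matches, choices):
            pieces.append(tokens[pos:match.start()])
            unit = match.group(0)
            pieces.append(unit.replace("\n", " ") if stapled else unit)
            pos = match.end()
        pieces.append(tokens[pos:])
        results.append("".join(pieces))
    return results
```

For a string with one multiword match the function returns two alternatives, the stapled reading and the sequence of single word tokens; with no match it returns the input unchanged as the only result.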
