If the multiword lexicons are very large, their compilation into
a single transducer may lead to time and space problems. Moreover,
different NLP applications may require different multiword lexicons.
For these reasons, it may be advantageous to use another approach,
based on multiple transducers that apply in a sequence to yield
exactly the same end result as the single tokenizer just described.
In general, such a sequence consists of
- a basic tokenizer which segments any sequence of input
  characters into simple tokens (i.e. no multiword units) and
- one or several multiword staplers which identify
  multiwords and group them together as single units.
The basic tokenizer is compiled as described in the previous
section. The multiword staplers are built in the following way. We
first define a multiword language, MWL, containing the
units to be recognized. The definition assumes the basic
tokenization has already been done; the internal word separator is a
newline instead of a space. For example, if the multiword
expressions consist of "ad hoc" and "and so on", we define the
language as follows:
MWL = [a d "\n" h o c | a n d "\n" s o "\n" o n ]
In order to define a relation that staples together the MWL
expressions, it is useful to start with some auxiliary
definitions.
BEG = ["<<"]
END = [">>"]
BND = [BEG | END]
LIM = ["\n" | .#.]
The BEG and END brackets are markers for the
multiword string. The LIM expression is used to check the
surrounding context making sure that the beginning and the end of
the multiword expression are not part of some other token. The
stapler is composed from the three auxiliary relations below:
Identify = [~$[BND] .o. [MWL @-> BEG ... END || LIM _ LIM ]]
Staple = ["\n" -> " " || BEG ~$[BND] _ ~$[BND] END]
Cleanup = [BND -> []]
The Identify relation wraps the multiword expressions in
MWL inside a pair of auxiliary brackets, << >>, under the
left-to-right, longest-match regimen imposed by @-> and under
the constraint that the multiword string is properly delimited.
The Staple relation converts every newline inside a marked
region into a space, leaving the newline that follows the region
unchanged. The Cleanup relation eliminates the auxiliary brackets.
The multiword stapler for the MWL expressions is the
composition of the three relations defined above:
Stapler = [Identify .o. Staple .o. Cleanup]
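The three-step composition can be imitated outside the finite-state calculus. The Python sketch below is only an approximation under stated assumptions: it uses the standard re module instead of transducers, lists the alternatives longest first to imitate the longest-match behavior of @->, and expresses the LIM contexts as lookarounds. The name make_stapler is hypothetical and does not come from the text.

```python
import re

BEG, END = "<<", ">>"  # auxiliary brackets, as in BEG and END


def make_stapler(multiwords):
    """Build a function that imitates Identify .o. Staple .o. Cleanup.

    `multiwords` holds expressions written with spaces, e.g. "and so on".
    After basic tokenization their internal separator is a newline.
    """
    # MWL: the multiwords with newline as the internal word separator.
    # Longest first, so the regex alternation prefers the longest match.
    mwl = sorted((mw.replace(" ", "\n") for mw in multiwords),
                 key=len, reverse=True)
    alternatives = "|".join(re.escape(m) for m in mwl)

    # LIM _ LIM: a newline or the string boundary on both sides.
    identify_re = re.compile(r"(?<![^\n])(?:" + alternatives + r")(?![^\n])")
    region_re = re.compile(re.escape(BEG) + r"(.*?)" + re.escape(END),
                           re.DOTALL)

    def identify(text):
        # ~$[BND]: the input must not already contain the brackets.
        if BEG in text or END in text:
            raise ValueError("input already contains auxiliary brackets")
        return identify_re.sub(lambda m: BEG + m.group(0) + END, text)

    def staple(text):
        # Turn newlines into spaces, but only inside a bracketed region.
        return region_re.sub(
            lambda m: BEG + m.group(1).replace("\n", " ") + END, text)

    def cleanup(text):
        # BND -> []: delete the auxiliary brackets.
        return text.replace(BEG, "").replace(END, "")

    def stapler(text):
        return cleanup(staple(identify(text)))

    return stapler
```

Calling make_stapler(["ad hoc", "and so on"]) returns a function that can be applied to the newline-separated output of a basic tokenizer. Unlike a compiled transducer, this sketch cannot be composed with other networks; it only mirrors the input/output behavior of the three relations.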
The sequential application of the basic tokenizer and the
multiword stapler is illustrated in the figure below. To save
space, each newline is written as \n, as in the definitions above.
Input:            one, two, and so on.
Basic tokenizer:  one\n,\ntwo\n,\nand\nso\non\n.\n
Stapler:          one\n,\ntwo\n,\nand so on\n.\n
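The two-step application shown in the figure can also be sketched at the level of token lists. In the Python sketch below, basic_tokenize is only a stand-in for the basic tokenizer of the previous section (it assumes that tokens are runs of word characters or single punctuation marks, each followed by a newline), and staple_multiwords is a hypothetical helper that groups token sequences greedily, left to right, preferring the longest candidate.

```python
import re


def basic_tokenize(text):
    """Stand-in basic tokenizer (an assumption, not the paper's network):
    words and single punctuation marks, each terminated by a newline."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return "".join(token + "\n" for token in tokens)


def staple_multiwords(tokenized, multiwords):
    """Group known multiwords in newline-separated tokenizer output."""
    tokens = tokenized.split("\n")
    if tokens and tokens[-1] == "":
        tokens.pop()  # drop the empty piece after the final newline

    # Candidates as token lists, longest first (longest match).
    candidates = sorted((mw.split(" ") for mw in multiwords),
                        key=len, reverse=True)

    out, i = [], 0
    while i < len(tokens):
        for cand in candidates:
            if tokens[i:i + len(cand)] == cand:
                out.append(" ".join(cand))  # one stapled token
                i += len(cand)
                break
        else:
            out.append(tokens[i])  # no multiword starts here
            i += 1
    return "".join(token + "\n" for token in out)


tokenized = basic_tokenize("one, two, and so on.")
stapled = staple_multiwords(tokenized, ["ad hoc", "and so on"])
```

Because whole tokens are compared, the delimiting condition expressed by LIM holds automatically: a multiword can never begin or end in the middle of another token.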
The basic tokenizer can of course be composed with the multiword
staplers to form a single, larger transducer if increasing the speed
of application is more important than the size of the network.
If the stapling of some multiword expressions is made optional,
the tokenization becomes nondeterministic because the multiword
interpretation of the MWL string is an alternative to the
sequence of single word tokens produced by the basic tokenizer. The
next section discusses some cases where it is advantageous not to
use deterministic tokenization.
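The effect of optional stapling can be illustrated by enumerating the alternative outputs. The Python sketch below assumes that each longest-match occurrence of a multiword is independently either stapled or left as a sequence of single tokens; an optional replace operator in the calculus may license further alternatives that are not modeled here. The function name optional_staple is hypothetical.

```python
import re
from itertools import product


def optional_staple(tokenized, multiwords):
    """Return every output in which each multiword occurrence is either
    stapled (newlines become spaces) or left as separate tokens."""
    mwl = sorted((mw.replace(" ", "\n") for mw in multiwords),
                 key=len, reverse=True)
    pattern = re.compile(
        r"(?<![^\n])(?:" + "|".join(re.escape(m) for m in mwl) + r")(?![^\n])")
    matches = list(pattern.finditer(tokenized))

    results = []
    # One boolean per occurrence: False = leave alone, True = staple.
    for choices in product((False, True), repeat=len(matches)):
        pieces, pos = [], 0
        for match, do_staple in zip(matches, choices):
            pieces.append(tokenized[pos:match.start()])
            span = match.group(0)
            pieces.append(span.replace("\n", " ") if do_staple else span)
            pos = match.end()
        pieces.append(tokenized[pos:])
        results.append("".join(pieces))
    return results
```

For an input containing a single occurrence of "and so on", the function returns two tokenizations: the sequence of single-word tokens produced by the basic tokenizer and the stapled multiword reading.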