MLTT
Tokenization by step-by-step transduction
Next: Non-deterministic Tokenization Up: Tokenization Previous: Deterministic Tokenization

 

If the multiword lexicons are very large, their compilation into a single transducer may lead to time and space problems. Moreover, different NLP applications may require different multiword lexicons. For these reasons, it may be advantageous to use another approach, based on multiple transducers that apply in a sequence to yield exactly the same end result as the single tokenizer just described. In general, such a sequence consists of

  • a basic tokenizer, which segments any sequence of input characters into simple tokens (i.e., no multiword units), and

  • one or several multiword staplers, which identify multiwords and group them together as single units.
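For concreteness, the behavior of the first stage can be sketched in Python. This is an illustration of the intended input/output behavior, not the actual compiled transducer; the tokenization rule used here is a simplistic stand-in:

```python
import re

def basic_tokenize(text):
    """Split input into simple tokens, one per line (no multiword units).

    A rough stand-in for the basic tokenizer transducer: words and
    punctuation marks become separate tokens, each terminated by a
    newline, which serves as the token boundary.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return "".join(tok + "\n" for tok in tokens)

print(basic_tokenize("one, two, and so on."))
# one
# ,
# two
# ,
# and
# so
# on
# .
```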

The basic tokenizer is compiled as described in the previous section. The multiword staplers are built in the following way. We first define a multiword language, MWL, containing the units to be recognized. The definition assumes the basic tokenization has already been done; the internal word separator is a newline instead of a space. For example, if the multiword expressions consist of ``ad hoc'' and ``and so on'', we define the language as follows:

    MWL = [a d "\n" h o c | a n d "\n" s o "\n" o n ]
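For a lexicon of realistic size, such a definition is more easily generated than written by hand. A minimal Python sketch (the helper name and the exact quoting conventions are our own assumptions, not part of the formalism):

```python
def mwl_regex(multiwords):
    """Build an xfst-style MWL definition from a list of multiword strings.

    Each character becomes a separate symbol; internal spaces become the
    quoted newline symbol "\n", matching the post-tokenization form in
    which the multiwords will actually appear.
    """
    def expand(mw):
        return " ".join('"\\n"' if c == " " else c for c in mw)
    return "[" + " | ".join(expand(mw) for mw in multiwords) + "]"

print(mwl_regex(["ad hoc", "and so on"]))
# [a d "\n" h o c | a n d "\n" s o "\n" o n]
```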

In order to define a relation that staples together the MWL expressions, it is useful to start with some auxiliary definitions.

    BEG = ["<<"]   END = [">>"]   BND = [BEG | END]   LIM = ["\n" | .#.]

The BEG and END brackets are markers for the multiword string. The LIM expression is used to check the surrounding context making sure that the beginning and the end of the multiword expression are not part of some other token. The stapler is composed from the three auxiliary relations below:

    Identify = [~$[BND] .o. [MWL @-> BEG ... END || LIM _ LIM ]]
    Staple =   ["\n" -> " " || BEG ~$[BND] _ ~$[BND] END]
    Cleanup =  [BND -> []]

The Identify relation wraps the multiword expressions in MWL inside a pair of auxiliary brackets, << >>, under the left-to-right, longest-match regimen imposed by @-> and under the constraint that the multiword string is properly delimited. The Staple relation converts every internal newline in a marked region into a space, leaving the final newline unchanged. The Cleanup relation eliminates the auxiliary brackets.

The multiword stapler for the MWL expressions is the composition of the three relations defined above:

Stapler = [Identify .o. Staple .o. Cleanup]
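The effect of the three relations, and of their composition, can be simulated with ordinary regular-expression substitutions. The following Python sketch mimics their behavior on a newline-separated token stream; it approximates the longest-match regimen of @-> by trying longer MWL strings first, and is an illustration rather than a transducer implementation:

```python
import re

# The MWL expressions in their post-tokenization form (newline-separated).
MWL = ["ad\nhoc", "and\nso\non"]

def identify(text):
    """Bracket MWL matches with << >>; like the LIM contexts, a match
    must be delimited by a newline or by the edge of the string."""
    # Longer alternatives first: a rough stand-in for longest match.
    pattern = "|".join(re.escape(mw)
                       for mw in sorted(MWL, key=len, reverse=True))
    return re.sub(rf"(?:(?<=\n)|^)({pattern})(?=\n|$)", r"<<\1>>", text)

def staple(text):
    """Turn every newline inside a bracketed region into a space."""
    return re.sub(r"<<.*?>>", lambda m: m.group().replace("\n", " "),
                  text, flags=re.S)

def cleanup(text):
    """Remove the auxiliary brackets."""
    return text.replace("<<", "").replace(">>", "")

def stapler(text):
    """Composition of the three steps: Identify .o. Staple .o. Cleanup."""
    return cleanup(staple(identify(text)))

tokens = "one\n,\ntwo\n,\nand\nso\non\n.\n"
print(repr(identify(tokens)))  # 'one\n,\ntwo\n,\n<<and\nso\non>>\n.\n'
print(repr(stapler(tokens)))   # 'one\n,\ntwo\n,\nand so on\n.\n'
```

Printing the intermediate result of identify shows the bracketed region before stapling; the final output matches the figure below.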

The sequential application of the basic tokenizer and the multiword stapler is illustrated in the figure below. As before, we use NL to represent the newline symbol in order to save space.

    one, two, and so on.
            |
            |  (basic tokenizer)
            v
    one NL , NL two NL , NL and NL so NL on NL . NL
            |
            |  (multiword stapler)
            v
    one NL , NL two NL , NL and so on NL . NL

The basic tokenizer can of course be composed with the multiword staplers to form a single, larger transducer if increasing the speed of application is more important than the size of the network.

If the stapling of some multiword expressions is made optional, the tokenization becomes nondeterministic because the multiword interpretation of the MWL string is an alternative to the sequence of single word tokens produced by the basic tokenizer. The next section discusses some cases where it is advantageous not to use deterministic tokenization.

