Tokenization

As mentioned in the previous section, it is often useful to represent a text as a list of tokens. The process of breaking a text up into its constituent tokens is known as tokenization. Tokenization can occur at a number of different levels: a text could be broken up into paragraphs, sentences, words, syllables, or phonemes. And for any given level of tokenization, there are many different algorithms for breaking up the text. For example, at the word level, it is not immediately clear how to treat such strings as "can't," "$22.50," "New York," and "so-called."

NLTK defines a general interface for tokenizing texts, the TokenizerI class. This interface is used by all tokenizers, regardless of what level they tokenize at or what algorithm they use. It defines a single method, tokenize, which takes a string, and returns a list of Tokens.

NLTK Interfaces

TokenizerI is the first "interface" class we've encountered; at this point, we'll take a short digression to explain how interfaces are implemented in NLTK.

An interface gives a partial specification of the behavior of a class, including specifications for methods that the class should implement. For example, a "comparable" interface might specify that a class must implement a comparison method. Interfaces do not give a complete specification of a class; they only specify a minimum set of methods and behaviors which should be implemented by the class. For example, the TokenizerI interface specifies that a tokenizer class should implement a tokenize method, which takes a string, and returns a list of Tokens; but it does not specify what other methods the class should implement (if any).

The notion of "interfaces" can be very useful in ensuring that different classes work together correctly. Although the concept of "interfaces" is supported in many languages, such as Java, there is no native support for interfaces in Python.

NLTK therefore implements interfaces using classes, all of whose methods raise the NotImplementedError exception. To distinguish interfaces from other classes, they are always named with a trailing "I". If a class implements an interface, then it should be a subclass of the interface. For example, the WSTokenizer class implements the TokenizerI interface, and so it is a subclass of TokenizerI.
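
To make this concrete, the following sketch shows roughly how an interface class and an implementing class can be defined in this style. The class bodies are simplified illustrations, not the actual NLTK source; in particular, the sketch returns plain strings rather than Tokens with locations:
    class TokenizerI:
        """Interface: a class that breaks a string up into tokens."""
        def tokenize(self, text):
            # Interface methods only raise NotImplementedError;
            # subclasses must override them.
            raise NotImplementedError()

    class WSTokenizer(TokenizerI):
        """A tokenizer that splits a string on whitespace."""
        def tokenize(self, text):
            # Simplified: return plain substrings rather than Token
            # objects with locations.
            return text.split()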

The whitespace tokenizer

A simple example of a tokenizer is the WSTokenizer, which breaks a text into words, assuming that words are separated by whitespace (space, newline, and tab characters). We can use the WSTokenizer constructor to build a new whitespace tokenizer:
    >>> tokenizer = WSTokenizer() 
Once we have built the tokenizer, we can use it to process texts:
    >>> tokenizer.tokenize(text_str) 
    ['Hello'@[0w], 'world.'@[1w], 'This'@[2w], 'is'@[3w], 
     'a'@[4w], 'test'@[5w], 'file.'@[6w]]
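
As the output suggests, the underlying operation is essentially a split on whitespace. The following lines show the same splitting on a plain Python string; the contents of text_str are assumed from the output above, and the plain split returns ordinary strings rather than Tokens with locations:
    >>> text_str = 'Hello world. This is a test file.'
    >>> text_str.split()
    ['Hello', 'world.', 'This', 'is', 'a', 'test', 'file.']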

However, this tokenizer is not ideal for many tasks. For example, we might want punctuation to be included as separate tokens; or we might want names like "New York" to be included as single tokens.

The regular expression tokenizer

The RETokenizer is a more powerful tokenizer, which uses a regular expression to determine how a text should be split up. This regular expression specifies the format of a valid word. For example, if we wanted to mimic the behavior of WSTokenizer, we could define the following RETokenizer:
    >>> tokenizer = RETokenizer(r'[^\s]+') 
    >>> tokenizer.tokenize(example_text) 
    ['Hello.'@[0w], "Isn't"@[1w], 'this'@[2w], 'fun?'@[3w]]
(In this regular expression, \s matches any whitespace character, so [^\s]+ matches any maximal sequence of non-whitespace characters.)
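
The core of this approach can be illustrated directly with Python's re module. The following lines apply the same pattern with re.findall as a plain-string illustration, not the actual RETokenizer implementation; the contents of example_text are assumed from the output above:
    >>> import re
    >>> re.findall(r'[^\s]+', "Hello. Isn't this fun?")
    ['Hello.', "Isn't", 'this', 'fun?']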

To define a tokenizer that includes punctuation as separate tokens, we could use:
    >>> regexp = r'\w+|[^\w\s]+'
    >>> tokenizer = RETokenizer(regexp) 
    >>> tokenizer.tokenize(example_text) 
    ['Hello'@[0w], '.'@[1w], 'Isn'@[2w], "'"@[3w], 't'@[4w], 
     'this'@[5w], 'fun'@[6w], '?'@[7w]]
The regular expression in this example will match either a sequence of alphanumeric characters (letters and digits), or a sequence of punctuation characters (characters that are neither alphanumeric nor whitespace).
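
Applying the same pattern directly with re.findall (again as a plain-string illustration, not the RETokenizer itself) shows how it treats the problematic strings mentioned at the start of this section:
    >>> import re
    >>> re.findall(r'\w+|[^\w\s]+', "can't")
    ['can', "'", 't']
    >>> re.findall(r'\w+|[^\w\s]+', '$22.50')
    ['$', '22', '.', '50']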

There are a number of ways we might want to improve this regular expression. For example, it currently breaks the string "$22.50" into four tokens, but we might want it to be treated as a single token. One approach to making this change would be to add a new clause to the tokenizer's regular expression, which is specialized for handling strings of this form:
    >>> example_text = 'That poster costs $22.50.'
    >>> regexp = r'(\w+)|(\$\d+\.\d+)|([^\w\s]+)'
    >>> tokenizer = RETokenizer(regexp) 
    >>> tokenizer.tokenize(example_text) 
    ['That'@[0w], 'poster'@[1w], 'costs'@[2w], '$22.50'@[3w], '.'@[4w]]
Of course, more general solutions to this problem are also possible, using different regular expressions.
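
For instance, one possible generalization (sketched here with re.findall, under the assumption that the same pattern could be passed to RETokenizer) folds monetary amounts and plain numbers into a single clause with an optional currency sign and an optional decimal part:
    >>> import re
    >>> regexp = r'\$?\d+(?:\.\d+)?|\w+|[^\w\s]+'
    >>> re.findall(regexp, 'It costs $22.50, or about 23 dollars.')
    ['It', 'costs', '$22.50', ',', 'or', 'about', '23', 'dollars', '.']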