Example: Processing Tokenized Text

In this section, we show how you can use NLTK to examine the distribution of word lengths in a document. This is a deliberately simple NLP problem, chosen to show how the tools we have introduced can be combined to solve one. The distribution of word lengths in a document can give clues to its other properties, such as its style or its language.

We present three different approaches to this problem; each one illustrates techniques that may be useful for other problems as well.

Word Length Distributions 1: Using a List

To begin with, we'll need to extract the words from the corpus we wish to examine. We'll use the WSTokenizer to tokenize the corpus:
    >>> corpus = open('corpus.txt').read() 
    >>> tokens = WSTokenizer().tokenize(corpus) 
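
(If this version of NLTK is unavailable, whitespace tokenization can be approximated with Python's built-in str.split; note that this sketch yields plain strings rather than NLTK tokens, so the token.type() calls in the examples below would become the strings themselves:)
    >>> corpus = open('corpus.txt').read()
    >>> words = corpus.split()    # plain strings, not NLTK tokens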

Now, we will construct a list wordlen_count_list, which gives the number of words that have a given length. In particular, wordlen_count_list[i] is the number of words whose length is i.

When constructing this list, we must be careful not to index past the end of the list. Therefore, whenever we encounter a word that is longer than any previous word, we append enough zeros to wordlen_count_list to make room for the new word's count:
    >>> wordlen_count_list = []
    >>> for token in tokens:
    ...     wordlen = len(token.type())
    ...     # Add zeros until wordlen_count_list is long enough
    ...     while wordlen >= len(wordlen_count_list):
    ...         wordlen_count_list.append(0)
    ...     # Increment the count for this word length
    ...     wordlen_count_list[wordlen] += 1
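
To see the shape of the result, here is the same loop traced on a tiny made-up input, with plain strings standing in for tokens:
    >>> counts = []
    >>> for word in ['a', 'cat', 'sat']:
    ...     wordlen = len(word)
    ...     while wordlen >= len(counts):
    ...         counts.append(0)
    ...     counts[wordlen] += 1
    >>> counts
    [0, 1, 0, 2]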

In order to plot the results, we must create a list of points. These points are simply (wordlen, count) pairs. We can construct this list with:
    >>> points = [(i, wordlen_count_list[i]) 
    ...           for i in range(len(wordlen_count_list))]
(For more information on list comprehensions, see the "Advanced Python Features" tutorial.)
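
An equivalent formulation uses the built-in enumerate function, which pairs each list element with its index:
    >>> points = list(enumerate(wordlen_count_list))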

Finally, we can plot the results, using the nltk.draw.plot_graph module:
    >>> plot(Marker(points))
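
(If the nltk.draw.plot_graph module is unavailable, the third-party matplotlib library can draw a similar scatter plot; this sketch is an alternative, not part of the tutorial's API, and assumes the points list from above:)
    >>> import matplotlib.pyplot as plt
    >>> xs = [wordlen for (wordlen, count) in points]
    >>> ys = [count for (wordlen, count) in points]
    >>> plt.plot(xs, ys, 'o')    # 'o' draws unconnected markers
    >>> plt.xlabel('word length')
    >>> plt.ylabel('count')
    >>> plt.show()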

The complete code for this example, including necessary import statements, is:
from nltk.token import WSTokenizer
from nltk.draw.plot_graph import plot, Marker

# Extract a list of words from the corpus
corpus = open('corpus.txt').read() 
tokens = WSTokenizer().tokenize(corpus) 

# Count up how many times each word length occurs
wordlen_count_list = []
for token in tokens:
    wordlen = len(token.type())
    # Add zeros until wordlen_count_list is long enough
    while wordlen >= len(wordlen_count_list):
        wordlen_count_list.append(0)
    # Increment the count for this word length
    wordlen_count_list[wordlen] += 1

# Construct a list of (wordlen, count) and plot the results.
points = [(i, wordlen_count_list[i]) 
           for i in range(len(wordlen_count_list))]
plot(Marker(points))

Word Length Distributions 2: Using a Dictionary

We have been examining the function from word lengths to token counts. In this example, the domain of the function (i.e., the set of word lengths) is ordered and relatively small. However, we often wish to examine functions whose domains are not so well behaved. In such cases, dictionaries can be a powerful tool. The following code uses a dictionary to count up the number of times each word length occurs:
    >>> wordlen_count_dict = {}
    >>> for token in tokens:
    ...     word_length = len(token.type())
    ...     if word_length in wordlen_count_dict:
    ...         wordlen_count_dict[word_length] += 1
    ...     else:
    ...         wordlen_count_dict[word_length] = 1
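
The test-and-branch above can also be written in a single statement using the dictionary's get method, which returns a default value (here 0) when the key is absent:
    >>> wordlen_count_dict = {}
    >>> for token in tokens:
    ...     word_length = len(token.type())
    ...     wordlen_count_dict[word_length] = (
    ...         wordlen_count_dict.get(word_length, 0) + 1)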

To plot the results, we need a list of (wordlen, count) pairs. This is simply the items of the dictionary:
    >>> points = wordlen_count_dict.items() 
    >>> plot(Marker(points))
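
(Dictionaries impose no order on their keys, so these points may come out in any order. Since Marker plots each point independently this should not matter, but sorting by word length makes the list easier to inspect:)
    >>> points = sorted(wordlen_count_dict.items())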

The complete code for this example, including necessary import statements, is:
from nltk.token import WSTokenizer
from nltk.draw.plot_graph import plot, Marker

# Extract a list of words from the corpus
corpus = open('corpus.txt').read() 
tokens = WSTokenizer().tokenize(corpus) 

# Construct a dictionary mapping word lengths to token counts
wordlen_count_dict = {}
for token in tokens:
    word_length = len(token.type())
    if word_length in wordlen_count_dict:
        wordlen_count_dict[word_length] += 1
    else:
        wordlen_count_dict[word_length] = 1

# Construct a list of (wordlen, count) and plot the results.
points = wordlen_count_dict.items() 
plot(Marker(points))
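
As an aside, the standard library's collections.Counter (Python 2.7 and later) implements the same count-by-key pattern directly; a minimal sketch, assuming the tokens list from above:

from collections import Counter

# Counter behaves like a dictionary whose missing values default to 0
wordlen_counts = Counter(len(token.type()) for token in tokens)
points = sorted(wordlen_counts.items())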

Word Length Distributions 3: Using a Frequency Distribution

The nltk.probability module defines two interfaces, FreqDistI and ProbDistI, for modeling frequency distributions and probability distributions, respectively. In this example, we use a frequency distribution to find the relationship between word lengths and token counts.

We will use a SimpleFreqDist, which is a simple (but sometimes inefficient) implementation of the FreqDistI interface. For this example, three methods of SimpleFreqDist are relevant: inc(sample), which increments the count for a given sample; samples(), which returns a list of the samples that have been counted; and count(sample), which returns the number of times a given sample occurred.

First, we construct the frequency distribution for the word lengths:
    >>> wordlen_freqs = SimpleFreqDist()
    >>> for token in tokens:
    ...     wordlen_freqs.inc(len(token.type()))

Next, we extract the set of word lengths that were found in the corpus:
    >>> wordlens = wordlen_freqs.samples()

Finally, we construct a list of (wordlen, count) pairs, and plot it:
    >>> points = [(wordlen, wordlen_freqs.count(wordlen))
    ...           for wordlen in wordlens]
    >>> plot(Marker(points))
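
The same methods support simple queries. For instance, using only the samples and count methods described above, we can find the most common word length:
    >>> most_common_length = max(wordlens, key=wordlen_freqs.count)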

The complete code for this example, including necessary import statements, is:
from nltk.token import WSTokenizer
from nltk.draw.plot_graph import plot, Marker
from nltk.probability import SimpleFreqDist

# Extract a list of words from the corpus
corpus = open('corpus.txt').read() 
tokens = WSTokenizer().tokenize(corpus) 

# Construct a frequency distribution of word lengths
wordlen_freqs = SimpleFreqDist()
for token in tokens:
    wordlen_freqs.inc(len(token.type()))

# Extract the set of word lengths found in the corpus
wordlens = wordlen_freqs.samples()

# Construct a list of (wordlen, count) and plot the results.
points = [(wordlen, wordlen_freqs.count(wordlen))
          for wordlen in wordlens]
plot(Marker(points))

For more information about frequency distributions, see the Probability Tutorial.