next up previous contents
Next: Acknowledgements Up: Fast Transformation-Based Learning Toolkit Previous: 4. Using The Toolkit   Contents


5. Test Cases


5.1 Base Noun Phrase Chunking

We will start the example section with the task of base NP chunking, where the goal is to identify the basic, non-recursive noun phrases in a sentence. This task can be converted into a classification task, where each word is assigned a class corresponding to its relative position in a noun phrase: beginning a noun phrase, inside a noun phrase, or outside any noun phrase. Figure 5.1 shows an actual sentence from the Wall Street Journal corpus, together with its part-of-speech labeling and chunk labels.

[Figure 5.1: Base NP transformation to a classification task - a WSJ sentence ({[} A.P. Green {]} currently has ...) shown with its part-of-speech tags and chunk labels; the figure table is not reproduced here.]

There are many ways to map from base-NP brackets to base-NP chunk tags (as described in [SV99]); in this particular example we will use B for words that start a base NP, I for words that are inside or end one, and O for words that are outside any noun phrase.
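The bracket-to-tag conversion just described can be sketched as follows (an illustrative Python helper, not part of the toolkit; the function name is hypothetical):

```python
def brackets_to_bio(tokens):
    """Map a token sequence containing '[' and ']' base-NP brackets to
    (word, tag) pairs: B for the first word of a base NP, I for the
    remaining words inside one, O for words outside any base NP."""
    tags = []
    inside = False   # are we currently inside a base NP?
    first = False    # is the next word the first word of the NP?
    for tok in tokens:
        if tok == '[':
            inside, first = True, True
        elif tok == ']':
            inside = False
        else:
            if inside and first:
                tags.append((tok, 'B'))
                first = False
            elif inside:
                tags.append((tok, 'I'))
            else:
                tags.append((tok, 'O'))
    return tags
```

For the fragment [ A.P. Green ] currently has, this yields A.P./B Green/I currently/O has/O.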

The samples in this task are 2-dimensional and are of the form

$\displaystyle \left( \textrm{word},\textrm{pos}\right) $

The POS associated with the word is in this case fixed, and is obtained by tagging the data with your favorite tagger (in our case, the data is the one provided by Ramshaw and Marcus, so the tagger is Brill's tagger).

[Figure 5.2: Sample files for base NP chunking - (a) the parameter file for baseNP chunking; the sample declaration word pos chunk => tchunk is shown; the remaining figure contents are not reproduced here.]

Figure 5.2(a) displays an example parameter file for the baseNP task. The constraints file (constraints.chunk.templ) can contain constraints of the following form:


Table 5.1: Sample rule template file for base NP chunking

chunk_-1 chunk_0 => chunk
chunk_0 chunk_1 => chunk
word_-1 word_0 => chunk
word_0 word_1 => chunk
chunk_-1 chunk_0 word_0 => chunk
chunk_0 chunk_1 word_0 => chunk
word_-1 word_0 chunk_0 => chunk
word_0 word_1 chunk_0 => chunk
word:[-3,-1] => chunk
word:[1,3] => chunk
chunk:[-3,-1] => chunk
chunk:[1,3] => chunk

pos_0 chunk_0 => chunk
pos_-1 pos_0 => chunk
pos_0 pos_1 => chunk
word_0 pos_0 pos_1 => chunk
pos_-1 pos_0 chunk_-1 chunk_0 => chunk
pos:[-3,-1] => chunk
pos:[1,3] => chunk
pos_-2 pos_-1 pos_0 chunk_0 => chunk
pos_-1 pos_0 pos_1 chunk_0 => chunk
pos_0 pos_1 pos_2 chunk_0 => chunk
chunk_-2 chunk_-1 chunk_0 => chunk
chunk_-1 chunk_0 chunk_1 => chunk
chunk_0 chunk_1 chunk_2 => chunk


Examples of rule templates one could use are listed in Table 5.1. A much larger list, consisting of the templates that we found useful, is provided in the directory test-cases/baseNP of the main distribution. Experimentation with different templates is usually needed for good results, so feel free to adjust these and create new rule templates as you see fit.


5.1.1 Creating the Training and Test Data

We start by assuming that you have the data in a 3 column format - a sequence of samples

$\displaystyle word\, pos\, tchunk$

where word is the word, pos is the part-of-speech tag associated with it (either the true POS or one assigned by a POS tagger) and tchunk is the true chunk tag associated with the word (I, O or B). Sentences are assumed to be separated by one blank line.

The first step is to determine the most likely chunk tag for each sample; basically, one needs to compute either $ ML\left( chunk\vert word\right) $ or $ ML\left( chunk\vert pos\right) $. We found the second one (the most likely chunk given the POS) to be more informative, since base noun phrases depend strongly on the part of speech. The following sequence of commands assumes that you want to create the lexicon based on pos; if the word-based one is desired, change the commands accordingly:
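The $ ML\left( chunk\vert pos\right) $ computation itself is just an argmax over co-occurrence counts; a minimal Python sketch of it (illustrative only - the toolkit ships Perl scripts for this step):

```python
from collections import Counter, defaultdict

def most_likely_chunk(samples, key=1):
    """Compute ML(chunk | pos) from (word, pos, tchunk) samples.
    key=1 conditions on pos; key=0 conditions on word instead."""
    counts = defaultdict(Counter)
    for sample in samples:
        counts[sample[key]][sample[2]] += 1  # co-occurrence counts
    # for each conditioning value, pick the most frequent chunk tag
    return {k: c.most_common(1)[0][0] for k, c in counts.items()}
```

For example, if NN co-occurs twice with I and once with B, the lexicon maps NN to I.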

The same processing must be performed on the test data as well.

Additionally, you might want to create constraints, which ensure that the rules do not assign previously unseen chunk tags to the samples. The pos-chunk.lexicon file can be used as constraints on pos and tchunk. If you also want constraints on words (for instance, never assign the chunk O to the word the, or the chunk I to the word '.'), you could run the command:

mcreate_lexicon.prl -d '0=>2' [-n <threshold>] myfile > word-chunk.lexicon
where threshold is the minimum count for a word to enter the constraints. This will create a list of chunk tags for each word with count above the threshold, where the tags are sorted in the order of their co-occurrence count. Then you should create a constraint template, for instance:

word tchunk word-chunk.lexicon
pos tchunk pos-chunk.lexicon
and update the field CONSTRAINTS_FILE in the parameter file to point to the correct file.
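The behavior of mcreate_lexicon.prl described above can be approximated as follows (a Python sketch under the stated assumptions, not the actual script):

```python
from collections import Counter, defaultdict

def make_constraint_lexicon(samples, threshold=1):
    """For each word seen at least `threshold` times in the
    (word, pos, tchunk) samples, list its observed chunk tags
    sorted by decreasing co-occurrence count."""
    word_counts = Counter(s[0] for s in samples)
    cooc = defaultdict(Counter)
    for word, _pos, chunk in samples:
        cooc[word][chunk] += 1
    return {w: [t for t, _ in cooc[w].most_common()]
            for w in cooc if word_counts[w] >= threshold}
```

A rule is then constrained to assign, for a word in the lexicon, only a tag from that word's list.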

5.1.2 Training the baseNP Chunker

The training can be performed using the command:

fnTBL-train myfile.init chunker.rls -F <param_file> [-threshold <threshold>]

The threshold you choose can have a big impact on both the training time and the performance of the program. If your data is small, a low threshold (0) is useful, as learning all the possible transformations can help. If your data is quite large, a threshold of 0 can make the program run for a long time (as there are a lot of rules with a count of 1); a threshold of 1 or 2 can prove more useful, as rules with a score of 1 can actually hurt performance, especially if the training and test data are not very similar.

One useful property of TBL is that training can be restarted from where it stopped: simply apply the rules learned so far to the training data and resume training on the result. Also, if you learned all the rules down to a score of 1, for instance, you can eliminate the score-1 rules from the rule list (they are easy to identify, since the score is present in the rule files) and retest the performance of the system.
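Eliminating the score-1 rules amounts to a simple filter over the rule file; the sketch below assumes (hypothetically) that each rule line begins with its integer score - adjust the parsing to the actual rule-file format:

```python
def drop_low_score_rules(rule_lines, min_score=2):
    """Keep only rules whose score is at least min_score.
    ASSUMPTION: each rule line starts with its integer score as the
    first whitespace-separated field (a hypothetical format)."""
    kept = []
    for line in rule_lines:
        fields = line.split()
        if fields and fields[0].lstrip('-').isdigit() \
                and int(fields[0]) < min_score:
            continue  # drop rules below the score cutoff
        kept.append(line)  # keep high-scoring or unparseable lines
    return kept
```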

In conclusion: if you use the entire WSJ data to train the baseNP chunker, use a threshold of 1 if your machine is not very fast; otherwise use a threshold of 0, with the option of eliminating the rules with a score of 1 if they hurt performance. If you train on the smaller WSJ data (sections 15-18), use a threshold of 0, as the training time is relatively short.

5.1.3 Testing using fnTBL

To test, you need to put the test data in the same 4-column format (using the commands from Section 5.1.1) and run the command:

fnTBL testfile.init -F <param_file>

If, in addition, you want to see which rules applied to which samples, add the flag -printRuleTrace at the end. This will append to each sample the character '|', followed by the indices of the rules that applied to that particular sample; the first index is 0. You can use the script number-rules.prl, provided in the distribution, to see the rules' indices.

Finally, the README file present in test-cases/baseNP contains all the command lines needed to run the chunker, and the chunker.wsj-small.rls and chunker.wsj-large.rls files contain the rules learned by the program on sections 15-19 and 02-21 of the WSJ corpus, respectively.


5.2 POS Tagging

The TBL solution to the POS tagging problem is broken down into 2 subproblems: guessing the POS tags of unknown words (lexical tagging, Section 5.2.1) and correcting the tags in context (contextual tagging, Section 5.2.2).

Though it would be possible for the two tasks to be combined into just one big task, there are good reasons for keeping them separate:

The next two sections describe the types of rules that can be used for each process, and Section 5.2.3 describes 2 scripts that automate most of the POS tagging process; for people in a hurry, that is the section to read.


5.2.1 Lexical POS Tagging - Guessing the Unknown Words

The approach taken by fnTBL follows the guidelines set by Brill: the training data is split into 2 separate parts. Lexical priors (in fact, only the most likely tag per word) are learned from the first part, and the second part is then used to train a TBL system for unknown words.

We propose that the initial POS assignment on the second part of the corpus be done according to:

$\displaystyle POS\left( w\right) =\left\{ \begin{array}{cl}
POS_{t}\left( w\right)  & \textrm{if } C\left( w\right) >T\\
NNP & \textrm{if } w\textrm{ begins with a capital letter}\\
NN & \textrm{otherwise}
\end{array}\right. $

where $ w $ is a word, $ POS_{t}\left( w\right) $ is the most likely tag for $ w $ as computed on the first part, $ NNP $ is the POS tag corresponding to the proper noun, $ NN $ is the POS tag corresponding to the common noun, $ C\left( w\right) $ is the number of times the word $ w $ appeared in the first part of the training set, and $ T $ is a count threshold. Let us observe that if $ T=0 $ then the assignment is done the same way as in Brill's tagger.
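A direct Python transcription of this initial assignment (a sketch only; the NN/NNP tag names follow the text above):

```python
def initial_pos(word, counts, most_likely_pos, T=0):
    """Initial POS for words of the second corpus half, following the
    piecewise definition above: the most likely tag if the word was
    seen more than T times in the first half, NNP if it starts with
    a capital letter, NN otherwise."""
    if counts.get(word, 0) > T:
        return most_likely_pos[word]
    if word[:1].isupper():
        return 'NNP'
    return 'NN'
```

With T=0, any word seen at least once in the first half keeps its most likely tag, exactly as in Brill's scheme.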

The rule types that can be used for lexical tagging are:

  1. The prefix/suffix of the word is a specified sequence of characters:

    $\displaystyle \left\langle feature_{i}\right\rangle ::\left\langle number\right\rangle \textasciitilde \textasciitilde \quad \vert \quad \left\langle feature_{i}\right\rangle ::\textasciitilde \textasciitilde \left\langle number\right\rangle $

    For instance the rule

    $\displaystyle word::3\textasciitilde \textasciitilde \, =pre\textasciitilde \textasciitilde \, \Rightarrow \, pos=JJ$

    fires on every word that starts with the letters pre and the rule

    $\displaystyle word::\textasciitilde \textasciitilde 4=\, \textasciitilde \textasciitilde able\, \Rightarrow \, pos=JJ$

    fires only on words that end with the letters able (unable, enjoyable).
  2. Adding a specified prefix/suffix results in a word (i.e. the result is in a specified large vocabulary):

    $\displaystyle \left\langle feature_{i}\right\rangle ::\left\langle number\right\rangle ++\quad \vert \quad \left\langle feature_{i}\right\rangle ::++\left\langle number\right\rangle $

    For instance

    $\displaystyle word::++2=++ly\, \Rightarrow \, pos=JJ$

    will fire only on words like like, converse, etc. (words to which adding the suffix ly yields another word - likely, conversely);
  3. Subtracting a specified prefix/suffix results in a word:

    $\displaystyle \left\langle feature_{i}\right\rangle ::\left\langle number\right\rangle --\quad \vert \quad \left\langle feature_{i}\right\rangle ::--\left\langle number\right\rangle $

    For instance the rule

    $\displaystyle word::2--=un--\, \Rightarrow \, pos=JJ$

    will fire on words like unspecified, undo.
  4. The word in question contains a specified character (such as -):

    $\displaystyle \left\langle feature\right\rangle ::\left\langle number\right\rangle <>$

    For instance the rule

    $\displaystyle word::1<>=.<>\, \Rightarrow \, pos=CD$

    will fire on each word that has the dot char '.' in it, transforming it into a cardinal.
  5. The word in question appears to the left/right of a specified word in a long list of bigrams (e.g. appearing after the word the somewhere in a list of bigrams makes a word likely to be a noun):

    $\displaystyle \left\langle feature\right\rangle \textasciicircum \textasciicircum 1\quad \vert \quad \left\langle feature\right\rangle \textasciicircum \textasciicircum -1$

    For instance the rule

    $\displaystyle word\textasciicircum \textasciicircum -1=to\, \Rightarrow \, pos=VB$

    will fire on all words that appear after to in the list of bigrams.
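The predicate types above can be mirrored by small Python checks (illustrative only, not the toolkit's implementation; the bigram predicate is omitted since it is just a lookup in a bigram list):

```python
def prefix_is(word, n, s):
    """word::n~~ = s : the first n characters of word equal s."""
    return word[:n] == s

def suffix_is(word, n, s):
    """word::~~n = s : the last n characters of word equal s."""
    return word[-n:] == s

def adding_suffix_is_word(word, suffix, vocab):
    """word::++n = s : appending the suffix yields a known word."""
    return (word + suffix) in vocab

def removing_prefix_is_word(word, prefix, vocab):
    """word::n-- = s : stripping the prefix yields a known word."""
    return word.startswith(prefix) and word[len(prefix):] in vocab

def contains_char(word, ch):
    """word::n<> = c : the word contains the character c."""
    return ch in word
```

For example, suffix_is('enjoyable', 4, 'able') holds, matching the word::~~4=~~able rule above.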
Table 5.2 presents a sample of how the rule template file should look - it is the actual rule file present in the distribution:


Table 5.2: Sample of lexical rule template file

word => pos 
# These are the rules using 5 characters 
# Suffix/prefix identity rules 
pos word::5~~ => pos 
pos word::~~5 => pos 
# Suffix addition only at this level 
pos word::++5 => pos 
# Suffix/prefix subtraction rules 
pos word::--5 => pos 
pos word::5-- => pos 
# Rules at level 4, etc. 
pos word::4~~ => pos 
pos word::~~4 => pos 
pos word::4++ => pos 
pos word::++4 => pos 
pos word::--4 => pos 
pos word::4-- => pos 
pos word::3~~ => pos 
pos word::~~3 => pos 
pos word::++3 => pos 
pos word::3++ => pos 
pos word::3-- => pos 
pos word::--3 => pos 
pos word::2~~ => pos 
pos word::~~2 => pos 
pos word::--2 => pos 
pos word::2-- => pos 
pos word::++2 => pos 
pos word::2++ => pos 
pos word::1~~ => pos 
pos word::~~1 => pos 
pos word::++1 => pos 
pos word::1++ => pos 
pos word::--1 => pos 
pos word::1-- => pos 
pos word::1<> => pos 
pos word^^1 => pos 
pos word^^-1 => pos

 

  
# The same rules as the preceding ones,  
# but without conditioning on the POS 
word::5~~ => pos 
word::~~5 => pos 
word::--5 => pos 
word::5-- => pos 
word::4~~ => pos 
word::~~4 => pos 
word::4++ => pos 
word::++4 => pos 
word::--4 => pos 
word::4-- => pos 
word::3~~ => pos 
word::~~3 => pos 
word::++3 => pos 
word::3++ => pos 
word::3-- => pos 
word::--3 => pos 
word::2~~ => pos 
word::~~2 => pos 
word::--2 => pos 
word::2-- => pos 
word::++2 => pos 
word::2++ => pos 
word::1~~ => pos 
word::~~1 => pos 
word::1<> => pos 
word::++1 => pos 
word::1++ => pos 
word::--1 => pos 
word::1-- => pos 
# Bigram cooccurrence predicates: 
# Test on the previous word 
word^^-1 => pos 
# Test on the next word 
word^^1 => pos

 


There are several steps which need to be taken when performing the lexical training:

  1. The training data (already in the column format) has to be split into 2 parts (equal or not);
  2. From the first part, the most likely POS for each word needs to be extracted;
  3. Using the most likely POS information, assign the initial tag to the unknown words in the second part; a word from the second part is considered unknown if it did not appear in the first part;
  4. Generate the unigram list from the first part, or from some large corpus;
  5. Generate a bigram list from the current corpus, a large corpus or both;
  6. Create the rule template file and the parameter file;
  7. Run the fnTBL-train program.
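Steps 1-3 above can be sketched in Python (illustrative only; the pos-train.prl script described in Section 5.2.3 performs the real work):

```python
from collections import Counter, defaultdict

def prepare_lexical_training(samples, ratio=0.5, T=0):
    """Sketch of steps 1-3: split (word, pos) samples in two parts,
    compute the most likely POS per word on part 1, and build the
    unknown-word samples from part 2 with their initial tag guess."""
    cut = int(len(samples) * ratio)
    part1, part2 = samples[:cut], samples[cut:]
    counts = Counter(w for w, _ in part1)
    by_word = defaultdict(Counter)
    for w, p in part1:
        by_word[w][p] += 1
    ml = {w: c.most_common(1)[0][0] for w, c in by_word.items()}
    unknown = []
    for w, true_pos in part2:
        if counts.get(w, 0) > T:
            continue  # known word: not used for unknown-word training
        guess = 'NNP' if w[:1].isupper() else 'NN'
        unknown.append((w, guess, true_pos))
    return ml, unknown
```

The unknown-word triples (word, initial guess, true tag) are then what the lexical TBL learner trains on.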


5.2.2 Contextual POS Tagging

After the initial stage, in which the POS of unknown words is guessed, a second learning process is applied - one that learns to correct POS tags in context. To do this, the learner can use a combination of the following basic predicate types:

An example template file is presented below:

pos_0 word_0 word_1 word_2 => pos
pos_0 word_-1 word_0 word_1 => pos
pos_0 word_0 word_-1 => pos
pos_0 word_0 word_1 => pos
pos_0 word_0 word_2 => pos
pos_0 word_0 word_-2 => pos
pos_0 word:[1,2] => pos
pos_0 word:[-2,-1] => pos
pos_0 word:[1,3] => pos
pos_0 word:[-3,-1] => pos
pos_0 word_0 pos_2 => pos
pos_0 word_0 pos_-2 => pos
pos_0 word_0 pos_1 => pos
pos_0 word_0 pos_-1 => pos
pos_0 word_0 => pos
pos_0 word_-2 => pos
pos_0 word_2 => pos
pos_0 word_1 => pos
pos_0 word_-1 => pos
pos_0 pos_-1 pos_1 => pos

pos_0 pos_1 pos_2 => pos
pos_0 pos_-1 pos_-2 => pos
pos_0 pos_1 => pos
pos_0 pos_-1 => pos
pos_0 pos_-2 => pos
pos_0 pos_2 => pos
pos_0 pos:[1,3] => pos
pos_0 pos:[1,2] => pos
pos_0 pos:[-3,-1] => pos
pos_0 pos:[-2,-1] => pos
pos_0 pos_1 word_0 word_1 => pos
pos_0 pos_1 word_0 word_-1 => pos
pos_-1 pos_0 word_-1 word_0 => pos
pos_-1 pos_0 word_0 word_1 => pos
pos_-2 pos_-1 pos_0 => pos
pos_-2 pos_-1 word_0 => pos
pos_1 word_0 word_1 => pos
pos_1 word_0 word_-1 => pos
pos_0 pos_1 pos_2 => pos
pos_0 pos_1 pos_2 word_1 => pos

 


Example of contextual rule templates

The steps that need to be done to run the contextual tagging are:

  1. Compute, from the training data, the most likely POS tag associated with each word. This was probably computed well enough at the previous step, but if you want to recompute it using the entire data, now is the time.
  2. Use the lexical rules obtained at the previous step to guess the POS of all the unknown words in the training data - this is done by applying fnTBL to the training data, with appropriate flags:

    $\displaystyle \textrm{fnTBL }<training\_data>\textrm{ }<lexical\_rules>\textrm{ }-F\textrm{ }\left\langle lexical\_param\_file\right\rangle \textrm{ }-o\textrm{ }\left\langle train\_output\_file\right\rangle $

  3. Run the learning command

    $\displaystyle \textrm{fnTBL}-\textrm{train }\left\langle train\_output\_file\right\rangle \textrm{ }\left\langle context\_rule\_file\right\rangle \textrm{ }-F\textrm{ }\left\langle context\_param\_file\right\rangle \textrm{ }\left\langle options\right\rangle $

    where the options are presented in Section 4.2.
Some observations:


5.2.3 TBL POS Tagging Without Headaches

Fortunately, there are 2 scripts provided, pos-train.prl and pos-apply.prl, which perform most of these tasks with little manual intervention.

5.2.3.1 Learning the Rule List

For the purpose of training a TBL POS tagger, one can use the script pos-train.prl. It should be called as:

pos-train.prl <parameters> <train_file>
where the options are:


-B <bigram_file>

defines the file containing the large quantity of bigrams; if undefined, the bigrams are extracted from the training file;

-f <cutoff>

defines the cutoff to be used for contextual features with the contextual training;

-o <outfile1>,<outfile2>

defines the names that the two split files will have; by default they are <train_file>.part1 and <train_file>.part2;

-r <ratio>

defines the ratio of the split between the first file and second file (0.3 would split it 0.3/0.7, with 0.3 being the one from which the counts are extracted);

-t <NC>,<NP>

defines the POS symbols for the common noun and proper noun; by default the values are NN and NNP, but they need to be specified for other languages;


-R <lexrulefile>,<contextrulefile>


specifies the names of the 2 rule files to be output: the lexical rule file and the contextual rule file - by default lexical.rls and context.rls;

-T <thresh1>,<thresh2>

defines the learning thresholds for the lexical/contextual tagging;

-u <unigram_file>

defines the file containing the large vocabulary; if not present, the unigrams are extracted from the training file;

-v

turns on verbose output - useful for debugging and/or seeing what sequence of commands the program executes;

Some observations about the training procedure:

5.2.3.2 Applying a Rule List

The application of a rule list to a corpus also consists of 2 phases: first the lexical rules are applied, to guess the POS of unknown words, and then the contextual rules are applied. One can call the fnTBL program directly, or use the tbl-apply.prl script, which is called as:

tbl-apply.prl <options> <test-file>
where the options are:



-F <lexparam>,<contextparam>

defines the 2 parameter files; as a start, one could use the tbl.lexical.params and tbl.context.pos.params files provided with the distribution;

-t <NC>,<NP>

defines the POS symbols for the common noun and proper noun; by default the values are NN and NNP, but they need to be specified for other languages;

-R <lexrulefile>,<contextrulefile>

specifies the 2 rule files: the lexical rule file and the contextual rule file;

-m <most_likely_tag_file>

defines which file should be used to initialize the test data - it should be the "largest" data available;

-o <output_file>

defines the name of the output file; if not present, the output will be written to a file called <test_file>.res;

-v

turns on verbose output - useful for debugging and to see what the script does.

<test_file>

the text to be tagged; it can be in 1 or 2 column format.

Some observations about the process of applying a POS rule list:


5.3 Word Sense Disambiguation

Word sense disambiguation is the task of assigning predefined senses to particular words in context. It can be cast either as a lexical choice task (where the goal is to determine the sense of one particular word in a given context) or as an all-words task (assign senses to all nouns, verbs, adjectives and adverbs in a sentence) - the task we present here is the lexical choice task. In this task the samples are independent, as the classification of one instance of a word can be considered independent of the other classifications.

5.3.1 Data Preparation

The samples in this task consist of a number of sentences around or before the word of interest; a number of such samples is presented for each word, and they are (supposed to be) balanced across the senses. As preprocessing, we applied the following steps:

For more information on the processing, see [YCF$^+$01].

5.3.2 Rule Types

The TBL system has access to the following feature types:

The difference between the labeled words (e.g. word-3left) and the unlabeled ones (word-in-a-3-word-window) is that a rule depending on the predicate word-3left=manufacturing will fire only if the word manufacturing appears exactly 3 words before the target position, while the predicate word-in-a-3-word-window=manufacturing will be true if the word manufacturing appears anywhere in a 3-word window around the target word.
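The two predicate flavors can be sketched as follows (a hypothetical Python helper; the feature names follow the text above):

```python
def window_features(words, i, k):
    """Features for position i: labeled offset features (word-1left,
    word-2right, ...) and the unlabeled bag of words in a k-word
    window around position i (word-in-a-k-word-window)."""
    labeled = {}
    for off in range(1, k + 1):
        if i - off >= 0:
            labeled['word-%dleft' % off] = words[i - off]
        if i + off < len(words):
            labeled['word-%dright' % off] = words[i + off]
    # unlabeled: any word within k positions, regardless of offset
    bag = set(words[max(0, i - k):i] + words[i + 1:i + k + 1])
    return labeled, bag
```

A rule on word-3left=manufacturing checks labeled['word-3left'], while word-in-a-3-word-window=manufacturing checks membership in the bag.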

One way to create this data is the following (we assume that we want 2 window sizes, 7 and 100, for the set features):

and repeat the process for lemmas as well. Then add the following line to the parameter file:

$\displaystyle \textrm{NULL}\_\textrm{FEATURES}=-;$

this will prevent the toolkit from considering predicates of the form word-in-a-3-word-window=-, which are not really useful and would just increase the computation time.

Since the input to this task can vary considerably, we do not provide any tools to prepare the data; it should not be very hard for the reader to create the data files, as they require just normal perl processing. In the test-cases/wsd directory we provide a parameter file, together with the file.templ and rule.templ files corresponding to the 3- and 100-word windows described in this section. Also provided are training and test data for the verb strike, as extracted from the Senseval2 data.

Another interesting observation is that, in the Senseval data, some samples had multiple classifications. The fnTBL toolkit can handle this case if one separates the values with a special character, defined in the parameter file. For instance,

$\displaystyle \textrm{TRUTH}\_\textrm{SEPARATOR}=\vert;$

will define the separator character to be '|' - values such as strike%2:35:03::|strike_a_chord%2:31:00:: will be correctly interpreted as a set of senses, and rules will be generated such that they correct at least one of the senses; the concatenation is not treated as a new truth. If the parameter file does not contain such a definition, then values such as the one described above will each be considered a single classification.
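The resulting scoring behavior - a prediction counts as correct if it matches any of the separated senses - can be sketched as (an illustrative helper, not toolkit code):

```python
def is_correct(predicted, truth, separator='|'):
    """With TRUTH_SEPARATOR defined, a prediction is correct if it
    matches any of the separator-delimited true senses; without it,
    the whole concatenated string would be a single class."""
    return predicted in truth.split(separator)
```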

5.3.3 Training the System

The training is done in the usual way, once the files are created. The command to be run is

fnTBL-train myfile.init wsd.rls -F <param_file> [-threshold <n1>]
    [-allPositiveRules <n2> [-minPositiveScore <n3>]]

Some observations:

When TBL is allowed to consider the redundant rules, its performance is very similar to that of a decision-list-based algorithm (such as the one described in [Yar95]) that has access to the same features as the TBL algorithm.

5.3.4 Testing the System

To test the system, one runs the command:

fnTBL mytest.init wsd.rls -F <param.file> [-printRuleTrace] [-o <outfile>]

The directory test-cases/wsd contains a README file which has all the commands needed to run the fnTBL system on the verb strike data provided.


Radu Florian 2001-09-12