next up previous contents
Next: 5. Test Cases Up: Fast Transformation-Based Learning Toolkit Previous: 3. System Description   Contents


4. Using The Toolkit

4.1 Creating the Training/Test Files

The first step in using the fnTBL toolkit is to create the training and test files. Even though some tools provided with the toolkit help with file creation, most preprocessing is left to the user. No tokenization or end-of-sentence detection is performed (though this may change, as it is easy to train a TBL system to perform EOS detection).

Once the corpus is in the required format (one word per line), the tools provided can augment it with the most likely tag given some features (for instance, given the word), and can construct the constraint files. In the POS tagging case, an almost complete solution for creating the initial files is provided (see the POS test case).
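The most-likely-tag initialization can be sketched as follows. This is a minimal illustration of the idea, not the toolkit's actual code; the corpus representation and the function name are hypothetical:

```python
from collections import Counter, defaultdict

def most_likely_tags(tagged_corpus):
    """Map each word to its most frequent tag in the training corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

# Toy corpus: "will" is tagged MD twice and NN once, so MD wins.
corpus = [("the", "DT"), ("will", "MD"), ("will", "MD"), ("will", "NN")]
baseline = most_likely_tags(corpus)
# baseline == {"the": "DT", "will": "MD"}
```

A real initializer would also need a policy for unseen words (e.g. a default tag), which is omitted here.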

In this initial release, the probability model does not work properly, and it should not be used. This problem will be fixed in the next release.


4.2 Training using fnTBL

Once the training file is in place, the rule file can be generated. This is the most time-intensive step of the process, as the main search is performed during training. To train from a corpus, the user should use the command:

fnTBL-train <train_file> <rule_output_file> [-F <param_file>] [-threshold <stop_thr>] [-pv] [-t <tree_prob_file>] [other options]

where:

The other options are:

While the program is running, it will output the rules as they are selected, along with the time it took to compute each rule. At the end, the total running time is also printed. If you don't want this output, you can redirect stderr to /dev/null with a command like

<command> >& /dev/null

if you're using tcsh or

<command> 2> /dev/null

if you're using bash/ksh etc.

Also, the <param_file> parameter can be specified as the value of the shell variable DDINF. So, if you don't want to specify it as a command-line parameter, you can set the shell variable to point to the appropriate file, and it will be used just the same.


4.3 Classification using fnTBL

Once the rule file has been generated (or obtained from another source) and the test file is in the appropriate format, the fnTBL program can be used to apply the rules, with the command:

fnTBL <test_file> <rule_file> [-o <out_file>] [-F <param_file>] [-printRuleTrace]
where:

Important observation: the initial set-up of the test data should be equivalent to the initial set-up of the training data (i.e. using the same most-likely-tag distribution); otherwise the program will not behave properly, as is to be expected.

A scoring program is also provided, as described in Section A.2.

4.4 Warnings and Errors

One possible warning that can be output by the program has the following form:

!! pos_0=VB pos_1=NN => pos=NN does not have 0 goods and 0 bads (good:2 bad:0)

This message warns the user that after applying a rule, its counts are not 0. Normally, after a rule is applied to the corpus, it no longer applies anywhere, and therefore its good and bad counts should both be 0. However, there are cases where this is not true: consider the case of a ``recursive'' rule such as the one mentioned above. After applying the rule in the following context:
  the      DT   DT
  will     VB   NN
  reading  VB   NN
  session  NN   NN

(on line 3), the rule is applicable again on line 2, while it was not before the application. If the rule is not recursive, then you probably found a bug in the fnTBL code.
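The recursive behavior can be reproduced with a small simulation. This is a hedged sketch, not fnTBL code, assuming the recursive rule has the form pos_0=VB pos_1=NN => pos=NN (consistent with the context above); a single left-to-right pass leaves behind a newly created match, so the rule's counts are not 0 after application:

```python
def apply_rule(tags, cur, nxt, new_tag):
    """One left-to-right pass of a rule 'pos_0=cur pos_1=nxt => pos=new_tag'.

    Returns how many positions were changed in this pass."""
    changed = 0
    for i in range(len(tags) - 1):
        if tags[i] == cur and tags[i + 1] == nxt:
            tags[i] = new_tag
            changed += 1
    return changed

tags = ["DT", "VB", "VB", "NN"]               # the / will / reading / session
first = apply_rule(tags, "VB", "NN", "NN")    # fires on "reading" only
second = apply_rule(tags, "VB", "NN", "NN")   # now fires on "will" as well
# first == 1, second == 1, tags == ["DT", "NN", "NN", "NN"]
```

Changing "reading" to NN creates the context VB NN at "will", which the first pass has already moved past, so the rule still has a nonzero good count afterwards.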

A definite error message may appear:

Oups: you found a bug in the TBL code

This message appears if the same rule has been selected 5 times in a row - it usually happens when there is a bug in the score-updating code of the system. Since the algorithm needs to keep track of the exact counts for each rule at every step, any mistake in count-keeping will result in this error, sooner or later. If you manage to obtain this message, please contact the authors and submit a bug report - see Section 4.8 for details.

4.5 Rule Interaction

As mentioned in Section 2.1, the algorithm selects the rules in decreasing order of their score. At some point, however, it will have to choose between rules that have identical scores. One option would be to choose randomly; this is not desirable behavior, since the results would no longer be entirely replicable. Therefore, we made this decision-making completely deterministic, as follows: given two rules $ r_{1} $ and $ r_{2} $, we choose $ r_{1} $ over $ r_{2} $ if and only if the following condition holds:

  1. score(r_1) > score(r_2), or
  2. score(r_1) = score(r_2) and

    1. r_1 has more tokens than r_2 (more atomic predicates), or
    2. r_1 and r_2 have the same number of atomic predicates and

      1. the template of r_1 was declared before the template of r_2, or
      2. r_1 and r_2 have the same template and the target of r_1 has a lower index in the vocabulary than the target of r_2.
The procedure is rather complicated, but the choices are based on the experience the authors gained while developing the toolkit. Choice 2a may seem a little strange, since it is the opposite of Occam's razor; however, we think it is better to make finer decisions than more general ones, as the finer ones result in errors less often, at the expense of the rule applying less often - basically, we prefer precision to recall.

To leave more room for experimentation, we have also implemented the rule ordering such that it allows the user to specify the method by which ties are broken. The parameter ORDER_BASED_ON_SIZE (see Appendix A) chooses among the following options:

  1. The method described above - ORDER_BASED_ON_SIZE=0;
  2. The method described above, but with 2a reversed (i.e. choose the rule with fewer atomic predicates) - ORDER_BASED_ON_SIZE=1;
  3. The size is not considered when comparing predicates - the comparison is based on the order in which they appear in the rule template file - ORDER_BASED_ON_SIZE=2.
While choice 3 gives the most freedom, it can be a little annoying to think through the relationships between the rule templates; if you don't feel like doing that, just select one of the other options. The default is 0.
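The default ordering (ORDER_BASED_ON_SIZE=0) can be sketched as a comparison function. The field names below are hypothetical, chosen only for illustration; the real rule representation is internal to the toolkit:

```python
def prefer(r1, r2):
    """True if rule r1 is chosen over rule r2 under the default ordering."""
    if r1["score"] != r2["score"]:
        return r1["score"] > r2["score"]                # 1. higher score wins
    if r1["n_predicates"] != r2["n_predicates"]:
        return r1["n_predicates"] > r2["n_predicates"]  # 2a. more atomic predicates
    if r1["template_idx"] != r2["template_idx"]:
        return r1["template_idx"] < r2["template_idx"]  # 2b(1). earlier-declared template
    return r1["target_idx"] < r2["target_idx"]          # 2b(2). lower vocabulary index

a = {"score": 10, "n_predicates": 2, "template_idx": 0, "target_idx": 5}
b = {"score": 10, "n_predicates": 1, "template_idx": 0, "target_idx": 3}
# prefer(a, b) is True: equal scores, but a is the more specific rule
```

Because every tie is eventually broken by a fixed criterion, two runs over the same data select rules in exactly the same order.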


4.6 Termination Conditions

The training phase finishes when either of the following conditions is met:

The transformation-based algorithm suffers from a serious drawback when the training data is small: it does not allow for redundancy. If there are two good explanations of a phenomenon (observed through two rules that have similar scores), only one will be selected by the process, while the second will be completely discarded. If the training data is sparse, it can happen that the first rule does not apply to a particular sample while the second one does. The authors observed this phenomenon especially when the training samples are independent; Section 5.3 presents such a case.

To help alleviate this problem, the algorithm can be amended as follows:

This small change has some advantages:

One important observation: if one wants to use this feature, one should define rule templates that do not depend on the classification. For instance, besides having the rule

word_-1 pos_0 => pos

you should also have the rule

word_-1 => pos

While this might seem like a strange condition, it is made necessary by the fact that if all the rule templates depend on the current classification, then the ``positive'' rules generated at the end will either have a score lower than the best rule (otherwise they would have been the best rule) or will have a non-useful form (e.g. word_-1 pos_0=NN => pos=NN).

Finally, the rules are output in increasing order of their good counts, such that better rules have the final say in deciding the classification of a sample.
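The effect of this ordering can be sketched as follows. This is a simplified illustration under assumed representations (rules as predicate/target pairs, samples as dictionaries), not the toolkit's classifier:

```python
def classify(sample, rules):
    """Apply rules in the order they were output (ascending good counts);
    the last rule whose predicate matches decides the label."""
    label = None
    for predicate, target in rules:
        if predicate(sample):
            label = target
    return label

rules = [
    (lambda s: s["word"].endswith("ing"), "VBG"),  # lower good count (assumed)
    (lambda s: s["prev_tag"] == "DT", "NN"),       # higher good count (assumed)
]
# Both rules match "building" after "DT"; the better rule, applied last, wins:
# classify({"word": "building", "prev_tag": "DT"}, rules) == "NN"
```

Because the better-scoring rule is applied last, it overrides any weaker rule that matched the same sample.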

4.7 Data Representation Size

The fnTBL toolkit tries to keep the amount of memory used to a minimum. To do this, it is set up by default to use the following data types:

The user can adjust the sizes of these types, making the approach usable when the data requires it, as follows:
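As an illustration of why the width of the index types matters (this is not fnTBL code; it only shows the memory arithmetic), Python's array module makes the trade-off visible:

```python
from array import array

n = 1_000_000  # one million word/tag indices

# Unsigned 16-bit indices: enough for up to 65,535 distinct values.
two_byte = array("H", bytes(2 * n))
# Unsigned int indices (typically 32-bit on common platforms).
four_byte = array("I", bytes(4 * n))

# Halving the index width halves the storage for the same number of items:
# two_byte  occupies 2 * n bytes
# four_byte occupies 4 * n bytes
```

The same arithmetic applies to the toolkit's internal word and tag indices: picking the narrowest type that can hold the vocabulary size directly reduces memory use.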


4.8 Bug Reports

The fnTBL toolkit is still in its infancy, and it is possible that there are still bugs in the code. We haven't observed any for a few months now, but the toolkit was built by us, and it is possible that we simply never did anything to break it. If you manage to break it in any way (this includes the two executables and the scripts that come with the toolkit), you are invited to report the bug to the authors, either by e-mail at one of the following addresses:

rflorian@cs.jhu.edu
gyn@cs.jhu.edu
fnTBLtk@nlp.cs.jhu.edu

or using one of the links from the main fnTBL toolkit web page:

http://nlp.cs.jhu.edu/~tbl-toolkit.html

When submitting a bug report, please include the following information:

  1. A brief description of how the bug was obtained;
  2. The configuration files used when you obtained the bug;
  3. The training data file and/or the rule files that were used, if possible.
The data files will make the debugging process a lot easier. Since we can imagine that giving the authors access to your data files might make you uncomfortable, we promise not to use the data in any way other than for debugging, and to delete it as soon as the bug is fixed; we are even willing to sign NDAs, if required.


Radu Florian 2001-09-12