3. System Description

The fnTBL system has 2 executables:

fnTBL-train - that learns the rules that are to be applied
fnTBL - that applies the rules to the new test data

Both accept a set of command-line parameters, discussed in the following subsections, and also read a parameter file that specifies the settings which may change from problem to problem.

3.1 The Parameter File

The parameter file stores all the parameters that do not change often during the development of a particular task. We found it useful because it significantly reduces the number of command-line parameters that need to be specified, and because modifying a parameter does not require the program to be recompiled (as it would if the parameters were hard-wired in the program).

Its format is as follows:

$\displaystyle <parameter\_name><space*>=<space*><parameter\_value>;$

where:

parameter_name represents the name of the parameter
parameter_value represents the value of the parameter

The parameter value can also refer to previously defined parameters by prefixing them with the dollar sign $\$$. For instance, $\$\{\textrm{MAIN}\}$ is a legal value if the variable MAIN was previously defined. We found this useful for defining a main directory in which all the other files reside; when the location of the main directory changes, only one line needs to be modified.

The following are legal examples of lines in the parameter file:

MAIN = /remote/bigram/usr/home/rflorian/workdir/research/data/mtbl-toolkit/text_chunking;
FILE_TEMPLATE = ${MAIN}/file.templ;
RULE_TEMPLATES = ${MAIN}/lexical_predicates.richer.templ;
REASONABLE_SPLIT = 10;
REASONABLE_DT_SPLIT = 5;
EMPTY_LINES_ARE_SEPARATORS = 0;
ELIMINATION_THRESHOLD = 0;

The example is extracted from a parameter file associated with the fnTBL POS-tagger. A complete set of valid parameters is provided in the Appendix. The most important ones are:

FILE_TEMPLATE - contains the name of the file with the sample feature names (see 3.2.2);
RULE_TEMPLATES - contains the name of the file with the rule description (see 3.2.3);
CONSTRAINTS_FILE - contains the name of the constraints file (see 3.2.5);
LOG_FILE - contains the name of a file to which all the commands will be written, if the parameter is defined; this is useful for keeping track of which commands were used;
EMPTY_LINES_ARE_SEPARATORS - specifies whether empty lines are to be considered separators. If the samples are interdependent (e.g. POS tagging), the value should be 1 (they are separators). If the samples are independent, it should be 0. The parameter exists because, even when the samples are independent, one might still want to group them into sentences (for instance, in lexical POS tagging, Section 5.2).
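
To make the format concrete, the following is a minimal Python sketch of a reader for such a parameter file. It is not part of the fnTBL distribution: the function name parse_parameter_file and the /tmp path are hypothetical, and skipping '#' comment lines in the parameter file is an assumption. It only illustrates how a ${NAME} reference can be resolved against parameters defined on earlier lines.

```python
import re

def parse_parameter_file(text):
    """Parse 'NAME = value;' lines, expanding ${NAME} references to earlier parameters."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' comments are skipped
        match = re.fullmatch(r"(\w+)\s*=\s*(.*?)\s*;", line)
        if match is None:
            raise ValueError(f"malformed parameter line: {raw!r}")
        name, value = match.groups()
        # Replace each ${VAR} with the value of a previously defined parameter;
        # a reference to an undefined parameter raises KeyError.
        value = re.sub(r"\$\{(\w+)\}", lambda m: params[m.group(1)], value)
        params[name] = value
    return params

# Hypothetical parameter file content (the directory is made up for illustration).
example = """
MAIN = /tmp/text_chunking;
FILE_TEMPLATE = ${MAIN}/file.templ;
REASONABLE_SPLIT = 10;
"""
params = parse_parameter_file(example)
print(params["FILE_TEMPLATE"])
```

All values are kept as strings in this sketch; converting numeric parameters such as REASONABLE_SPLIT is left to the caller.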

3.2 File Formats

This section will briefly describe the file formats used with the fnTBL toolkit. In all the following files, the lines starting with a '#' sign are considered as comments and, therefore, are ignored.

3.2.1 Training and Test File Formats

Both the training file and the test file need to be in a particular format for the fnTBL tools to work properly. The format requires that each sample be on a separate line, with the features separated by white space (spaces or tab characters). In addition, if the samples are interdependent (as in the POS tagging case) and organized into ``blocks'' (i.e. sentences), the blocks need to be separated by a blank line.

Here's an example of such a file:

Revenue NN NN
rose VBD VBD
5 CD CD
% NN NN
to TO TO
$ $ $
282 NN CD
million CD CD
from IN IN
$ $ $
268.3 NN CD
million CD CD
. . .

The DT DT
drop NN NN
in IN IN
earnings NNS NNS
had VBD VBD
been VBN VBN
anticipated VBN VBN
by IN IN
most JJS JJS
Wall NNP NNP
Street NNP NNP
analysts NNS NNS
, , ,
but CC CC
the DT DT
results NNS NNS
were VBD VBD
reported VBD VBN
after IN IN
the DT DT
market NN NN
closed VBD VBD
. . .

There are 3 fields for each word (in this case):

The word itself (e.g. reported);
The most likely part-of-speech associated with the word (e.g. VBD);
The true POS of the word (e.g. VBN).

As presented in Section 2.1, the system will learn transformations that correct the second field to match, as much as possible, the third field.

Both the training and test data should have this format. Some tools that compute the most likely tag given the word (for instance) are provided with the fnTBL distribution and are described in the Appendix section.
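
As an illustration of the format, here is a minimal Python sketch that splits such a file into blocks of samples. The function name read_samples and its empty_lines_are_separators argument are hypothetical (the argument mirrors the EMPTY_LINES_ARE_SEPARATORS parameter), and this is not the toolkit's own reader. Note that the sketch treats every line starting with '#' as a comment, which would also drop a sample whose first field is '#'.

```python
def read_samples(text, empty_lines_are_separators=True):
    """Split file content into blocks (e.g. sentences); each sample is a list of fields."""
    blocks, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # comment line
        if not line:
            # A blank line closes the current block only when it acts as a separator.
            if empty_lines_are_separators and current:
                blocks.append(current)
                current = []
            continue
        current.append(line.split())  # fields are separated by spaces or tabs
    if current:
        blocks.append(current)
    return blocks

data = """Revenue NN NN
rose VBD VBD

The DT DT
drop NN NN
"""
blocks = read_samples(data)
print(len(blocks), blocks[0][1])
```

With empty_lines_are_separators set to False, the same input is returned as a single block of four samples.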

3.2.2 FILE_TEMPLATE

The names of the fields associated with the samples are defined in a file whose name is specified in the parameter file. The name of the parameter is FILE_TEMPLATE, and the file it names should have the following format:

$\displaystyle \left\langle feature_{1}\right\rangle \ldots \left\langle feature_{n}\right\rangle \left\langle class_{1}\right\rangle \ldots \left\langle class_{m}\right\rangle =>\left\langle truth_{1}\right\rangle \ldots \left\langle truth_{m}\right\rangle$

where

$\left\langle feature_{i}\right\rangle$ is the name of the $i^{\textrm{th}}$ feature associated with the sample;
$\left\langle class_{j}\right\rangle$ is the name of the $j^{\textrm{th}}$ ``guess'' associated with the sample;
$\left\langle truth_{j}\right\rangle$ is the name of the $j^{\textrm{th}}$ ``truth'' associated with the sample; there are the same number of $\left\langle class\right\rangle$ and $\left\langle truth\right\rangle$ features.

To make this clearer, here's an example of a template file, taken from POS tagging:

$\displaystyle \textrm{word pos }=>\textrm{tpos}$

In this example, the feature is word, pos is the name of the ``guess'' of the system at some point in time, and tpos is the name of the truth. If one wants to perform a simultaneous classification of POS tagging and text chunking (see [FN01]), then the corresponding file should be:

$\displaystyle \textrm{word pos chunk }=>\textrm{ tpos tchunk}$

where the only feature is again word, the guesses are named pos and chunk, and the truths (there are 2 classifications in this case) are named tpos and tchunk.
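
The way the field names split into features, guesses and truths can be sketched in a few lines of Python. This is only an illustration under the assumption stated above, that there are as many guesses as truths, so the last m names on the left of => are the guesses; the function name parse_file_template is hypothetical.

```python
def parse_file_template(line):
    """Split 'f1 ... fn c1 ... cm => t1 ... tm' into features, guesses and truths."""
    left, right = line.split("=>")
    names, truths = left.split(), right.split()
    m = len(truths)
    if m == 0 or len(names) < m:
        raise ValueError(f"malformed file template: {line!r}")
    # The last m names before '=>' are the guesses; everything before them is a feature.
    return {"features": names[:-m], "guesses": names[-m:], "truths": truths}

template = parse_file_template("word pos chunk => tpos tchunk")
print(template)
```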

3.2.3 RULE_TEMPLATES

Because fnTBL is a general-purpose tool, the user needs to specify the rule templates, in other words to describe the kind of rules the system will try to learn. A template defines which features are checked when a transformation is proposed. For instance, to implement a bigram look-up, the template consists of the previous word and the current word.

The predicate of a rule is a conjunction of smaller, atomic predicates. These can be as simple as feature identities; other types check whether the word begins or ends with a specific prefix or suffix, and yet others check the identity of one of the previous words. Conjunction was chosen for:

Simplicity - it is easier to represent the rules and templates this way, both internally and externally (as text);
Speed-up - the algorithm heavily exploits this choice; if disjunction had been allowed between atomic predicates, the algorithm would have been much more complicated and no significant speed-up would have been obtained.

The notation for the atomic predicates is listed after the rule description.

The format of a rule template is

$\displaystyle \left\langle p_{1}\right\rangle \ldots \left\langle p_{k}\right\rangle =>\left\langle c_{1}\right\rangle \ldots \left\langle c_{l}\right\rangle$

where:

$\left\langle p_{i}\right\rangle$ are the atomic predicates, with $\left\langle feature_{1}\right\rangle \ldots \left\langle feature_{n}\right\rangle \left\langle class_{1}\right\rangle \ldots \left\langle class_{m}\right\rangle$ as arguments
$\left\langle c_{j}\right\rangle$ are the classifications that are to be changed by the rule, and are chosen from $\left\langle class_{1}\right\rangle \ldots \left\langle class_{m}\right\rangle$

For example

$\displaystyle \textrm{word}\_-1\textrm{ word}\_0=>\textrm{pos}$

defines a rule that changes the POS feature based on the previous and current words, while

$\displaystyle \textrm{pos}\_-2\textrm{ pos}\_-1\textrm{ pos}\_0=>\textrm{ pos}$

will change the POS of a word based on a POS trigram ending on the current position.

Motivated by some of the problems TBL has been applied to, several types of atomic predicates have been implemented. They are specified as follows:

Feature identity:

$\displaystyle \left\langle feature\right\rangle$

will check the identity of a particular feature.
A feature of a neighboring sample (in the case of interdependent samples):

$\displaystyle \left\langle feature\right\rangle \_\left\langle index\right\rangle$

e.g. word_1, pos_-1, chunk_0. $\left\langle index\right\rangle$ can be negative as well; there is a restriction that the number has to be an integer belonging to the range $\left[ -128,127\right]$ ;
A predicate that checks for the presence of a particular feature value in a sequence of samples:

$\displaystyle \left\langle feature\right\rangle :\left[ \left\langle index\_start\right\rangle ,\left\langle index\_end\right\rangle \right]$

e.g. word:[1,3] checks the presence of the particular word in the samples on positions +1,+2 or +3, that is:

$\displaystyle word:\left[ 1,3\right] =the\textrm{ returns true on a sample }s_{i}\textrm{ iff }s_{i+1}\textrm{ or }s_{i+2}\textrm{ or }s_{i+3}\textrm{ contains the word }the$
A predicate that checks whether one of the features in an enumeration has a particular value (useful for independent samples, see the PP attachment case study):

$\displaystyle \left\{ \left\langle feature_{1}\right\rangle ,\ldots ,\left\langle feature_{n}\right\rangle \right\}$

e.g. $\left\{ class_{1},class_{2}\right\} =\textrm{use}\%2:34:01::$ will return true on a given sample if and only if one of the features $class_{1}$ or $class_{2}$ of the sample is $\textrm{use}\%2:34:01::$.
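
The four predicate types listed above can be illustrated with a small Python sketch. It is not the toolkit's implementation: the function name predicate_holds and the representation of a sample as a dictionary are assumptions made for the example, positions outside the block are assumed never to match, and the plain-text spellings (word_-1, word:[1,3], {f1,f2}) are taken from the examples in this section. In this sketch a feature name that itself ends in an underscore followed by digits would be read as a neighbor reference unless it appears inside an enumeration.

```python
import re

def predicate_holds(spec, value, samples, i):
    """Check whether atomic predicate `spec` has `value` at position i of `samples` (list of dicts)."""
    def field(pos, name):
        # Assumption: positions outside the block never match.
        return samples[pos].get(name) if 0 <= pos < len(samples) else None

    m = re.fullmatch(r"\{(.+)\}", spec)  # {f1,f2,...}: any listed feature has the value
    if m:
        return any(field(i, f.strip()) == value for f in m.group(1).split(","))
    m = re.fullmatch(r"(\w+):\[(-?\d+),(-?\d+)\]", spec)  # f:[a,b]: value occurs in a window
    if m:
        name, start, end = m.group(1), int(m.group(2)), int(m.group(3))
        return any(field(i + k, name) == value for k in range(start, end + 1))
    m = re.fullmatch(r"(\w+)_(-?\d+)", spec)  # f_k: feature of a neighboring sample
    if m:
        return field(i + int(m.group(2)), m.group(1)) == value
    return field(i, spec) == value  # plain feature identity

samples = [
    {"word": "I", "pos": "PRP"},
    {"word": "saw", "pos": "VBD"},
    {"word": "the", "pos": "DT"},
    {"word": "dog", "pos": "NN"},
]
print(predicate_holds("word:[1,3]", "the", samples, 0))
print(predicate_holds("pos_-1", "DT", samples, 3))
```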

One of the big advantages of the fnTBL toolkit is that the predicate structure is modular, and adding a new atomic predicate type to the toolkit should not present a big programming challenge; see the section about the code design for more information.

To make life easier, an additional construction is allowed in the rule file format - definitions of a rule placeholder:

$\displaystyle \left\langle variable\_name\right\rangle =\left\langle rule\_definition\right\rangle$

Once a variable is defined, it can be recalled as $\$\left\langle variable\_name\right\rangle$ in the rule template file. This construction helps mostly with ``set'' atomic predicates (the predicate will return true if any one of a set of features has a given value). The following is a rule template example taken from the PP-attachment problem:

$\displaystyle \begin{array}{l} verb\_parent=\left\{ vp1,vp2,vp3,vp4\right\} \\ \ldots \\ \$verb\_parent\, \$noun\_parent\Rightarrow attachment \end{array}$

Here, a predicate-set placeholder is defined: $\$verb\_parent$ can replace the predicate $\left\{ vp1,vp2,vp3,vp4\right\}$ in subsequent rules. The rule template $\$verb\_parent\, \$noun\_parent\Rightarrow attachment$ will try to predict the attachment of a preposition using the WordNet parents of the verb and of the first noun of the sample.
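
Placeholder expansion can be sketched in Python as a simple textual substitution. This is an illustration only: the function name expand_templates is hypothetical, the contents of the noun_parent definition ({np1,np2}) are invented for the example, and telling definitions from templates by the presence of => is an assumption rather than documented toolkit behavior.

```python
import re

def expand_templates(lines):
    """Return the rule templates with $name placeholders replaced by their definitions."""
    definitions, templates = {}, []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Substitute every $name with its earlier definition (KeyError if undefined).
        expanded = re.sub(r"\$(\w+)", lambda m: definitions[m.group(1)], line)
        if "=>" in expanded:
            templates.append(expanded)  # assumption: a line containing '=>' is a template
        else:
            name, definition = expanded.split("=", 1)
            definitions[name.strip()] = definition.strip()
    return templates

lines = [
    "verb_parent={vp1,vp2,vp3,vp4}",
    "noun_parent={np1,np2}",  # hypothetical definition, made up for this example
    "$verb_parent $noun_parent => attachment",
]
templates = expand_templates(lines)
print(templates)
```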

3.2.4 The Rule File Format

Both main programs of the fnTBL toolkit (fnTBL and fnTBL-train) use rule files, which list the rules learned by the system. These rules carry meaningful linguistic information and can be easily read and understood. Their format is similar to that of the template files, since rules are simply instantiated templates. The format is:

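$\displaystyle \left\langle pred_{1}\right\rangle =\left\langle value_{1}\right\rangle \ldots \left\langle pred_{k}\right\rangle =\left\langle value_{k}\right\rangle =>\left\langle class_{1}\right\rangle =\left\langle cvalue_{1}\right\rangle \ldots \left\langle class_{p}\right\rangle =\left\langle cvalue_{p}\right\rangle$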

where $\left\langle pred_{i}\right\rangle$ is an atomic predicate, $\left\langle value_{k}\right\rangle ,\left\langle cvalue_{k}\right\rangle$ are valid values for the respective predicates and $\left\langle class_{p}\right\rangle$ are the classes to be changed. For instance, again for the POS tagging problem, the rule

$\displaystyle POS_{-1}=DT\, \, POS_{0}=VB=>POS=NN$

will change all the occurrences of infinitive verbs to nouns if the previous part-of-speech is a determiner; similarly, the rule

$\displaystyle preposition=\textrm{at}=>attachment=\textrm{v}$

will change the attachment from noun to verb if the preposition is at, in the case of prepositional phrase attachment.
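
To show how an instantiated rule acts on data, here is a minimal Python sketch that parses a rule and applies it to one block of samples. It is not the toolkit's code: the names parse_rule and apply_rule are hypothetical, the plain-text spelling pos_-1=DT pos_0=VB => pos=NN is assumed from the template notation, only feature identity and neighbor predicates are handled, and the sketch assumes that all predicates are evaluated on the state of the block before the rule changes anything.

```python
import re

def parse_rule(text):
    """Parse 'pos_-1=DT pos_0=VB => pos=NN' into (conditions, changes)."""
    left, right = text.split("=>")
    conditions = []
    for item in left.split():
        spec, value = item.split("=", 1)
        m = re.fullmatch(r"(\w+?)_(-?\d+)", spec)
        # A bare feature name is treated as referring to the current sample (offset 0).
        name, offset = (m.group(1), int(m.group(2))) if m else (spec, 0)
        conditions.append((name, offset, value))
    changes = dict(item.split("=", 1) for item in right.split())
    return conditions, changes

def apply_rule(rule, block):
    """Apply the rule to every matching position of `block` (a list of dicts); return the positions."""
    conditions, changes = rule

    def matches(i):
        for name, offset, value in conditions:
            j = i + offset
            if not (0 <= j < len(block)) or block[j].get(name) != value:
                return False
        return True

    # Assumption: matches are collected first, so one change cannot trigger another.
    hits = [i for i in range(len(block)) if matches(i)]
    for i in hits:
        block[i].update(changes)
    return hits

block = [
    {"word": "the", "pos": "DT"},
    {"word": "drop", "pos": "VB"},
    {"word": "fell", "pos": "VBD"},
]
hits = apply_rule(parse_rule("pos_-1=DT pos_0=VB => pos=NN"), block)
print(hits, block[1]["pos"])
```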

3.2.5 The Constraints File

One less common parameter of the fnTBL algorithm is the constraints file. The TBL algorithm suffers from one relatively major problem: since its output is not probabilistic, it has no mechanism for estimating the posterior probability of a class given a sample

$\displaystyle P\left( c\vert s\right)$

and therefore it might assign classes to samples in ways that make no sense at all. For instance, it might assign the POS tag VB to the word ``the'', which is, of course, very unlikely to be correct. Most probabilistic classifiers do not suffer from this problem, as they have appropriate ways of computing the posterior probability $P\left( c\vert s\right)$ . To avoid this problem, the TBL algorithm originally proposed by Eric Brill enforced that no new classes be assigned to samples: if a sample has been seen in training, then a rule that would change its classification to one never seen with that particular sample is not allowed to apply.

We relax this constraint here by allowing the user to specify, through constraint files, the set of allowable feature-class associations for a set of examples. If a particular feature value is not present in these constraint files, then a rule can change its classification as it sees fit. However, if the feature value is present in a file, only rules that change the classification to one of the classes specified for it are allowed to apply. This mechanism allows the user to specify, for instance, that only words seen often enough in the data are barred from being associated with new classes.

The format of the constraint file is:

$\displaystyle \left\langle feature_{1}\right\rangle \, \ldots \, \left\langle feature_{k}\right\rangle \, \left\langle classification_{j}\right\rangle \, \left\langle file\_name\right\rangle$

where $\left\langle feature_{i}\right\rangle$ are the features on which the conditioning is done, $\left\langle classification_{j}\right\rangle$ is the classification that is constrained and $\left\langle file\_name\right\rangle$ is the file containing the constraints, that is, a list of the pairs that need to be constrained. For instance, in the POS tagging example, a constraint in the constraint file could be

$\displaystyle word\textrm{ }pos\textrm{ constraints}.\textrm{dat}$

and the file constraints.dat might contain

saying NN VBG
the DT
their PRP$
them PRP

Any combination of features that does not appear in the constraint file is not constrained at all, and those samples can be assigned any classification. It is also useful to note that the constraints are stackable; in other words, a sample is first checked against all the constraints, and the class change is performed only if it passes all of them.
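
The constraint check can be sketched in Python as a table look-up. This is an illustration only, restricted to a single conditioning feature (word) as in the example above; the function names load_constraints and change_allowed are hypothetical and do not come from the toolkit.

```python
def load_constraints(text):
    """Read lines of the form '<feature value> <class> ...' into a dict of allowed classes."""
    allowed = {}
    for raw in text.splitlines():
        fields = raw.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        allowed.setdefault(fields[0], set()).update(fields[1:])
    return allowed

def change_allowed(allowed, feature_value, new_class):
    """A change is permitted if the value is unconstrained or the new class is listed for it."""
    classes = allowed.get(feature_value)
    return classes is None or new_class in classes

constraints = load_constraints("""saying NN VBG
the DT
their PRP$
them PRP
""")
print(change_allowed(constraints, "the", "VB"))
print(change_allowed(constraints, "saying", "VBG"))
```

Stacking several constraints then amounts to requiring that change_allowed holds for every loaded table before a rule is allowed to change the class.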

Radu Florian 2001-09-12