4.7 File formats
Figure 4.1 Structure of the rules file
Figure 4.2 A sample rules file
Figure 4.3 Structure of the main lexicon file
Figure 4.4 A sample main lexicon file
Figure 4.5 Structure of a lexical entry
Figure 4.6 A sample lexical entry
Figure 4.7 Structure of the grammar file
Figure 4.8A A lexical rule example
Figure 4.8B Feature structure before application of lexical rule
Figure 4.8C Feature structure after application of lexical rule
Figure 4.9 A sample grammar file
Figure 4.10 A sample generation comparison file
Figure 4.11 A sample recognition comparison file
Figure 4.12 A sample pairs comparison file
Figure 4.12A A sample synthesis comparison file
Figure 4.13 A sample generation file
Figure 4.14A A sample synthesis file
Figure 4.15 Default file names and extensions
This section describes the formats for the files that are used as input to
PC-KIMMO. In any of the files, comments can be added to any line by preceding
the comment with the comment character. This character is normally a semicolon
(;), but can be changed with the COMMENT keyword in the rules file. Anything
following a comment character (until the end of the line) is considered part of
the comment and is ignored by PC-KIMMO.
In the descriptions below, reference to the use of a space character implies
any whitespace character (that is, any character treated like a space
character). The following control characters when used in a file are whitespace
characters: ^I (ASCII 9, tab), ^J (ASCII 10, line feed), ^K (ASCII 11, vertical
tab), ^L (ASCII 12, form feed), and ^M (ASCII 13, carriage return).
The control character ^Z (ASCII 26) cannot be used because MS-DOS interprets
it as marking the end of a file. Also the control character ^@ (ASCII 0, null)
cannot be used.
Examples of each of the following file types are found on the release
diskette as part of the English description.
4.7.1 The rules file
The general structure of the rules file is a list of keyword declarations.
Figure 4.1 shows the conventional structure of the rules file. Note that the
notation {x | y} means either x or y (but not both). The following
specifications apply to the rules file.
Figure 4.1 Structure of the rules file
COMMENT <character>
ALPHABET <symbol list>
NULL <character>
ANY <character>
BOUNDARY <character>
SUBSET <subset name> <symbol list>
. (more subsets)
.
.
RULE <rule name> <number of states> <number of columns>
<lexical symbol list>
<surface symbol list>
<state number>{: | .} <state number list>
. (more states)
.
.
. (more rules)
.
.
END
- Extra spaces, blank lines, and comment lines are ignored.
- Comments may be placed anywhere in the file. All data following a comment
character to the end of the line is ignored. (See below on the COMMENT
declaration.)
- The set of valid keywords used to form declarations includes COMMENT,
ALPHABET, NULL, ANY, BOUNDARY, SUBSET, RULE, and END.
- These declarations are obligatory and can occur only once in a file:
ALPHABET, NULL, ANY, BOUNDARY.
- These declarations are optional and can occur one or more times in a file:
COMMENT, SUBSET, and RULE.
- The COMMENT declaration sets the comment character used in the rules file,
lexicon files, and grammar file. The COMMENT declaration can only be used in
the rules file, not in the lexicon or grammar file. The COMMENT declaration is
optional. If it is not used, the comment character is set to ; (semicolon) as
a default.
- The COMMENT declaration can be used anywhere in the rules file and can be
used more than once. That is, different parts of the rules file can use
different comment characters. The COMMENT declaration can (and in practice
usually does) occur as the first keyword in the rules file, followed by either
one or more COMMENT declarations or the ALPHABET declaration.
- Note that if you use the COMMENT declaration to declare the character that
is already in use as the comment character, an error will result. For
instance, if semicolon is the current comment character, the declaration
COMMENT ; will result in an error.
- The comment character can no longer be set using a command line option or
with a command in the user interface, as was the case in version 1 of
PC-KIMMO.
- The ALPHABET declaration must occur first in the file, preceded only by
COMMENT declarations (if any). The other declarations can appear in any
order. The COMMENT, NULL, ANY, BOUNDARY, and SUBSET declarations can even be
interspersed among the rules. However, these declarations must appear before
any rule that uses them or an error will result.
- The ALPHABET declaration defines the set of symbols used in either lexical
or surface representations. The keyword ALPHABET is followed by a
<symbol list> of all alphabetic symbols. Each symbol must be
separated from the others by at least one space. The list can span multiple
lines, but ends with the next valid keyword. All alphanumeric characters (such
as a, B, and 2), symbols (such as $ and +),
and punctuation characters (such as . and ?) are available as
alphabet members. The characters in the IBM extended character set (above
ASCII 127) are also available. Control characters (below ASCII 32) can also be
used, with the exception of whitespace characters (see above), ^Z (end of
file), and ^@ (null). The alphabet can contain a maximum of 255 symbols. An
alphabetic symbol can also be a multigraph, that is, a sequence of two or more
characters. The individual characters composing a multigraph do not
necessarily have to also be declared as alphabetic characters. For example, an
alphabet could include the characters s and z and the multigraph
sz%, but not include % as an alphabetic character. Note that a
multigraph cannot also be interpreted as a sequence of the individual
characters that comprise it.
- The keyword NULL is followed by a single <character> that
represents a null (empty, zero) element. The NULL symbol is considered to be
an alphabetic character, but cannot also be listed in the ALPHABET
declaration. The NULL symbol declared in the rules file is also used in the
lexicon file to represent a null lexical entry.
- The keyword ANY is followed by a single "wildcard"
<character> that represents a match of any character in the
alphabet. The ANY symbol is not considered to be an alphabetic character,
though it is used in the column headers of state tables. It cannot be listed
in the ALPHABET declaration. It is not used in the lexicon file.
- The keyword BOUNDARY is followed by a single <character> that
represents an initial or final word boundary. The BOUNDARY
symbol is considered to be an alphabetic character, but cannot also be listed
in the ALPHABET declaration. When used in the column header of a state table,
it can only appear as the pair #:# (where, for instance, # has
been declared as the BOUNDARY symbol). The BOUNDARY symbol is also used in the
lexicon file in the continuation class field of a lexical entry to indicate
the end of a word (that is, no continuation class).
- The SUBSET declaration defines a set of characters that are referred to in
the column headers of rules. The keyword SUBSET is followed by the
<subset name> and <symbol list>. <subset
name> is a single word (one or more characters) that names the list of
characters that follows it. The subset name must be unique (that is, if it is
a single character it cannot also be in the alphabet or be any other declared
symbol). It can be composed of any characters (except space); that is, it is
not limited to the characters declared in the ALPHABET section. It must not be
identical to any keyword used in the rules file. The subset name is used in
rules to represent all members of the subset of the alphabet that it defines.
Note that SUBSET declarations can be interspersed among the rules. This allows
subsets to be placed near the rule that uses them if such a style is desired.
However, a subset must be declared before a rule that uses it.
- The <symbol list> following a <subset name> is a
list of single symbols, each of which is separated by at least one space. The
list can span multiple lines. Each symbol in the list must be a member of the
previously defined ALPHABET, with the exception of the NULL symbol, which can
appear in a subset list but is not included in the ALPHABET declaration.
Neither the ANY symbol nor the BOUNDARY symbol can appear in a subset symbol
list.
- The keyword RULE signals that a state table immediately follows.
- <rule name> is the name or description of the rule which the
state table encodes. It functions as an annotation to the state table and has
no effect on the computational operation of the table. It is displayed by the
list rules and show rule commands and is also displayed in
traces. The rule name must be surrounded by a pair of identical delimiter
characters. Any material can be used between the delimiters of the rule name
with the exception of the current comment character and of course the rule
name delimiter character of the rule itself. Each rule in the file can use a
different pair of delimiters. The rule name must be all on one line, but it
does not have to be on the same line as the RULE keyword.
- <number of states> is the number of states (rows in the
table) that will be defined for this table. The states must begin at 1 and go
in sequence through the number defined here (that is, gaps in state numbers
are not allowed).
- <number of columns> is the number of state transitions
(columns in the table) that will be defined for each state.
- <lexical symbol list> is a list of elements separated by one
or more spaces. Each element represents the lexical half of a lexical:surface
correspondence which, when matched, defines a state transition. Each element
in the list must be either a member of the alphabet, a subset name, the NULL
symbol, the ANY symbol, or the BOUNDARY symbol (in which case the
corresponding surface character must also be the BOUNDARY symbol). The list
can span multiple lines, but the number of elements in the list must be equal
to the number of columns defined for the rule.
- <surface symbol list> is a list of elements separated by one
or more spaces. Each element represents the surface half of a lexical:surface
correspondence which, when matched, defines a state transition. Each element
in the list must be either a member of the alphabet, a subset name, the NULL
symbol, the ANY symbol, or the BOUNDARY symbol (in which case the
corresponding lexical character must also be the BOUNDARY symbol). The list
can span multiple lines, but the number of characters in the list must be
equal to the number of columns defined for the rule.
- <state number> is the number of the state or row of the
table. The first state number must be 1, and subsequent state numbers must
follow in numerical sequence without any gaps.
- {: | .} is the final or nonfinal state indicator. This should be a colon
(:) if the state is a final state and a period (.) if it is a nonfinal state.
It must follow the <state number> with no intervening space.
- <state number list> is a list of state transition numbers for
a particular state. Each number must be between 1 and the number of states
(inclusive) declared for the table. The list can span multiple lines, but the
number of elements in the list must be equal to the number of columns declared
for this rule.
- The keyword END follows all other declarations and indicates the end of
the rules file. Any material in the file thereafter is ignored by PC-KIMMO.
The END keyword is optional; the physical end of the file also terminates the
rules file.
Figure
4.2 shows a sample rules file.
Figure 4.2 A sample rules file
ALPHABET
b c d f g h j k l m n p q r s t v w x y z + ; + is morpheme boundary
a e i o u
NULL 0
ANY @
BOUNDARY #
SUBSET C b c d f g h j k l m n p q r s t v w x y z
SUBSET V a e i o u
; more subsets
RULE "Consonant defaults" 1 23
b c d f g h j k l m n p q r s t v w x y z + @
b c d f g h j k l m n p q r s t v w x y z 0 @
1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
RULE "Vowel defaults" 1 6
a e i o u @
a e i o u @
1: 1 1 1 1 1 1
RULE "Voicing s:z <=> V___V" 4 4
V s s @
V z @ @
1: 2 0 1 1
2: 2 4 3 1
3: 0 0 1 1
4. 2 0 0 0
; more rules
END
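The COMMENT declaration and multigraphs described above can be illustrated
with a hypothetical opening of a rules file; the symbols here are invented
for illustration and are not taken from the English description.
COMMENT |
| from this point on, | rather than ; introduces comments
ALPHABET
s z a e i o u sz%     | sz% is a multigraph; % alone need not be declared
NULL 0
ANY @
BOUNDARY #
Note that sz% is matched only as a unit; it is never interpreted as the
sequence of characters s, z, and %.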
4.7.2 The lexicon files
A lexicon consists of one main lexicon file plus one or more files of lexical
entries. The general structure of
the main lexicon file is a list of keyword declarations. The set of valid
keywords is ALTERNATION, FEATURES, FIELDCODE, INCLUDE, and END. Figure 4.3
shows the conventional structure of the lexicon file. The following
specifications apply to the main lexicon file.
Figure 4.3 Structure of the main lexicon file
ALTERNATION <alternation name> <sublexicon name list>
. (more ALTERNATIONs)
.
.
FEATURES <feature abbreviation list>
FIELDCODE <lexical item code> U
FIELDCODE <sublexicon code> L
FIELDCODE <alternation code> A
FIELDCODE <features code> F
FIELDCODE <gloss code> G
INCLUDE <filespec>
. (more INCLUDEd files)
.
.
END
- Extra spaces, blank lines, and comment lines are ignored.
- The comment character declared in the rules file is operative in the main
lexicon file. Comments may be placed anywhere in the file. All data following
a comment character to the end of the line is ignored.
- The set of valid keywords used to form declarations includes ALTERNATION,
FEATURES, FIELDCODE, INCLUDE, and END.
- The declarations can appear in any order with the proviso that any
alternation name, feature name, or fieldcode used in a lexical entry must be
declared before the lexical entry is read. In practice, this means that the
INCLUDE declarations should appear last, but the ALTERNATION, FEATURES, and
FIELDCODE declarations can appear in any order.
- The ALTERNATION declaration defines a set of sublexicon names that serve
as the continuation class of a lexical item. The ALTERNATION keyword is
followed by an <alternation name> and a <sublexicon name
list>. ALTERNATION declarations are optional (but nearly always used in
practice) and can occur as many times as needed.
- <alternation name> is a name associated with the following
<sublexicon name list>. It is a word composed of one or more
characters, not limited to the ALPHABET characters declared in the rules file.
An alternation name can be any word other than a keyword used in the lexicon
file. The program does not check to see if an alternation name is actually
used in the lexicon file.
- <sublexicon name list> is a list of sublexicon names. It can
span multiple lines until the next valid keyword is encountered. Each
sublexicon name in the list must be used in the sublexicon field of a lexical
entry. Although it is not enforced at the time the lexicon file is loaded, an
undeclared sublexicon named in a sublexicon name list will cause an error when
the recognizer tries to use it.
- The FEATURES keyword is followed by a <feature abbreviation list>.
A <feature abbreviation list> is a list of words, each of
which is expanded into feature structures by the word grammar.
- The FIELDCODE declaration is used to define what fieldcode will be used to
mark each type of field in a lexical entry. The FIELDCODE keyword is followed
by a <code> and one of five possible internal codes: U, L, A, F,
or G. There must be five FIELDCODE declarations, one for each of these
internal codes, where U indicates the lexical item field, L indicates the
sublexicon field, A indicates the alternation field, F indicates the features
field, and G indicates the gloss field.
- The INCLUDE keyword is followed by a <filespec> that names a
file containing lexical entries to be loaded. An INCLUDEd file cannot contain
any declarations (such as a FIELDCODE or an INCLUDE declaration), only lexical
entries and comment lines.
- The keyword END follows all other declarations and indicates the end of
the main lexicon file. Any material in the file thereafter is ignored by
PC-KIMMO. The END keyword is optional; the physical end of the file also
terminates the main lexicon file.
Figure
4.4 shows a sample main lexicon file.
Figure 4.4 A sample main lexicon file
ALTERNATION Begin PREF
ALTERNATION Pref N AJ V AV
ALTERNATION Stem SUFFIX
FEATURES sg pl reg irreg
FIELDCODE lf U ;lexical item
FIELDCODE lx L ;sublexicon
FIELDCODE alt A ;alternation
FIELDCODE fea F ;features
FIELDCODE gl G ;gloss
INCLUDE affix.lex ;file of affixes
INCLUDE noun.lex ;file of nouns
INCLUDE verb.lex ;file of verbs
INCLUDE adjectiv.lex ;file of adjectives
INCLUDE adverb.lex ;file of adverbs
END
Figure
4.5 shows the structure of a lexical entry. Lexical entries are encoded in
"field-oriented standard format." Standard format is an information interchange
convention developed by the Summer Institute of Linguistics. It tags the kinds
of information in ASCII text files by means of markers which begin with
backslash. Field-oriented standard format (FOSF) is a refinement of standard
format geared toward representing data which has a database-like record and
field structure. The following points provide an informal description of the
syntax of FOSF files.
Figure 4.5 Structure of a lexical entry
\<lexical item code> <lexical item>
\<sublexicon code> <sublexicon name>
\<alternation code> {<alternation name> | <BOUNDARY symbol>}
\<features code> <features list>
\<gloss code> <gloss string>
- A field-oriented standard format (FOSF) file consists of a sequence
of records.
- A record consists of a sequence of fields.
- A field consists of a field marker and a field value.
- A field marker consists of a backslash character at the beginning
of a line, followed by an alphabetic or numeric character, followed by zero or
more printable characters, and terminated by a space, tab, or the end of a
line. A field marker without its initial backslash character is termed a
field code.
- A field marker must begin in the first position of a line. Backslash
characters occurring elsewhere in the file are not interpreted as field
markers.
- The first field marker of the record is considered the record marker, and
thus the same field must occur first in every record of the file.
- Each field marker is separated from the field value by one or more
spaces, tabs, or newlines. The field value continues up to the next field
marker.
- Any line that is empty or contains only whitespace characters is
considered a comment line and is ignored. Comment lines may occur between or
within fields.
- Fields and lines in an FOSF file can be arbitrarily long.
- There are two basic types of fields in FOSF files: nonrepeating and
repeating. Repeating fields are multiple consecutive occurrences of
fields marked by the same marker. Individual fields within a repeating field
can be called subfields.
The following specifications apply to how FOSF is implemented in PC-KIMMO.
- Lexical entries are encoded as records in a FOSF file.
- Only those fields whose field codes are declared in the main lexicon file
are recognized (see above on the FIELDCODE declaration). All other fields are
considered to be extraneous and are ignored.
- The first field of each lexical entry must be the lexical item field. The
lexical item field code is assigned to the internal code U by a FIELDCODE
declaration in the main lexicon file.
- Only nonrepeating fields are permitted.
- The comment character declared in the rules file is operative in included
files of lexical entries. All data following a comment character to the end of
the line is ignored.
A file of lexical entries is loaded by using an INCLUDE declaration in the
main lexicon file (see above). An INCLUDEd file of lexical entries cannot
contain any declarations (such as a FIELDCODE or an INCLUDE declaration), only
lexical entries and comment lines.
The following specifications apply to lexical entries.
- A lexical entry is composed of five fields: lexical item, sublexicon,
alternation, features, and gloss. The lexical item, sublexicon, and
alternation fields are obligatory; the features and gloss fields are
optional. The first field of the entry must always be the lexical item. The
other fields can appear in any order, even differing from one entry to
another.
- Although the gloss field is optional, if a lexical entry does not include
one, a warning message to that effect will be displayed when the entry is
loaded. To suppress this warning message, use the command set warnings off
(see section 4.5.6.1)
before loading the lexicon.
- If an entry has an empty gloss field (that is, the field marker for the
gloss field is present but there is no data after it), then the contents of
the lexical form field will also be used as the gloss for that entry.
- A lexical item field consists of a <lexical item code> and a
<lexical item>.
- A <lexical item code> is a field code assigned to the
internal code U by a FIELDCODE declaration in the main lexicon file.
- A <lexical item> is one or more characters that represent an
element (typically a morpheme or word) of the lexicon. Each character (or
multigraph) must be in the alphabet defined for the language. The lexical item
uses only the lexical subset of the alphabet.
- A sublexicon field consists of a <sublexicon code> and a
<sublexicon name>.
- A <sublexicon code> is a field code assigned to the internal
code L by a FIELDCODE declaration in the main lexicon file.
- A <sublexicon name> is the name associated with a sublexicon.
It is a word composed of one or more characters, not limited to the alphabetic
characters declared in the rules file. Every lexical item must belong to a
sublexicon. Every lexicon must include a special sublexicon named INITIAL
(that is, there must be at least one lexical entry that belongs to the INITIAL
sublexicon).
- Lexical entries belonging to a sublexicon do not have to be listed
consecutively in a single file (as was the case for PC-KIMMO version 1);
rather, lexical entries in a file can occur in any order, regardless of what
sublexicon they belong to. Lexical entries of a sublexicon can even be placed
in two or more separate files.
- An alternation field consists of an <alternation code>
followed by either an <alternation name> or the <BOUNDARY
symbol>.
- An <alternation code> is a field code assigned to the internal
code A by a FIELDCODE declaration in the main lexicon file.
- An <alternation name> is declared in an ALTERNATION
declaration in the main lexicon file. The <BOUNDARY symbol> is
declared in the rules file and indicates the end of all possible continuations
in the lexicon.
- A features field consists of a <features code> and a
<features list>.
- A <features code> is a field code assigned to the internal
code F by a FIELDCODE declaration in the main lexicon file.
- A <features list> is a list of feature abbreviations. Each
abbreviation is a single word consisting of alphanumeric characters or other
characters except (){}[]<>=:$! (these are used for special
purposes in the grammar file). The character \ should not be used as
the first character of an abbreviation because that is how fields are marked
in the lexicon file. Upper and lower case letters used in feature abbreviations are
considered different. For example, "PLURAL" is not the same as "Plural" or
"plural." Feature abbreviations are expanded into full feature structures by
the word grammar (see section 4.7.3).
- A gloss field consists of a <gloss code> and a <gloss
string>.
- A <gloss code> is a field code assigned to the internal code
G by a FIELDCODE declaration in the main lexicon file.
- A <gloss string> is a string of text. Any material can be
used in the gloss field with the exception of the comment character.
Figure
4.6 shows a sample lexical entry.
Figure 4.6 A sample lexical entry
\lf `knives
\lx N
\alt Infl
\fea pl irreg
\gl N(`knife)+PL
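For comparison, here is a hypothetical entry for a plural suffix whose
alternation field contains the BOUNDARY symbol (here #, as declared in the
rules file) rather than an alternation name, indicating that no continuation
class follows. The entry contents are invented for illustration.
\lf +s
\lx SUFFIX
\alt #
\fea pl reg
\gl +PL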
4.7.3 The grammar file
The grammar file consists of feature templates, context-free rules, and
feature constraints. Figure 4.7
shows the conventional structure of the grammar file.
Figure 4.7 Structure of the grammar file
LET <abbreviation | category> be <feature definition>
. (more feature templates)
.
.
DEFINE <lexical rule name> as <mappings>
. (more lexical rules)
.
.
PARAMETER <parameter name> is <parameter value>
. (more parameter settings)
.
.
RULE <rule>
<feature constraint>
. (more constraints)
.
.
(more rules)
.
.
.
END
The following specifications apply generally to the grammar file.
Rules
The following specifications apply to rules.
A grammar rule has these parts, in the order listed:
- the keyword Rule
- an optional rule identifier enclosed in braces ({})
- the nonterminal symbol to be expanded
- an arrow (->) or equal sign (=)
- zero or more terminal or nonterminal symbols, possibly marked for
alternation or optionality
- an optional colon (:)
- zero or more feature constraints, possibly marked for alternation
- an optional period (.)
The optional rule identifier (item 2) consists of one or more words enclosed
in braces. Its current utility is only as a special form of comment describing
the intent of the rule. (Eventually it may be used as a tag for interactively
adding and removing rules.) The only limits on the rule identifier are that it
not contain the comment character and that it appear entirely on one line in
the grammar file.
The terminal and nonterminal symbols in the rule have the following
characteristics:
- Blank lines, spaces, and tabs separate symbols from one another, but
otherwise are ignored.
- Upper and lower case letters used in symbols are considered different. For
example, STEM is not the same as Stem, and neither is the
same as stem.
- Index numbers are used to distinguish instances of a symbol that is used
more than once in a rule. They are added to the end of a symbol following an
underscore character (_). For example,
Stem_1 = Stem_2 SUFFIX
- The symbol X may be used to stand for any terminal or nonterminal
category. For example, this rule says that an N expands into an NStem plus any
category.
N = NStem X
The symbol X can be useful for capturing generalities. Care must be
taken, since it can be replaced by anything.
- The characters (){}[]<>=:/ cannot be used in terminal or
nonterminal symbols since they are used for special purposes in the grammar
file. The character _ can be used only for attaching an
index number to a symbol.
- By default, the left hand symbol of the first rule in the grammar file is
the start symbol of the grammar.
- There can be multiple rules for the same symbol, but all rules for a
symbol must be contiguous in the file.
The symbols on the right hand side of a context-free rule may be marked or
grouped in various ways:
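As an illustration only (assuming the parenthesis-for-optionality and
brace-and-slash-for-alternation conventions of PATR-style formalisms, which
this section does not itself spell out), a rule using both kinds of marking
might look like this:
Word = (PREFIX) Stem {SUFFIX / INFL}
Here PREFIX is optional, and exactly one of SUFFIX or INFL must occur.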
Feature structures
The grammar formalism uses a basic element called a
feature structure. A feature structure consists of a feature name and a
value. The notation used for feature structures looks like this:
[number: singular]
where number is the feature name and singular is the value,
separated by a colon. Feature names and values are single words consisting of
alphanumeric characters or other characters except (){}[]<>=:$!
(these are used for special purposes in the grammar file). Upper and lower case
letters used in feature names and values are considered different. For example,
"NUMBER" is not the same as "Number" or "number."
A structure containing more than one feature uses square brackets around the
entire structure:
[number: singular
case: nominative]
Extra spaces and line breaks are optional.
Feature structures can have either simple values, such as the example above,
or complex values, such as this:
[agreement: [number: singular
case: nominative]]
where the value of the agreement feature is another feature
structure. Feature structures can be infinitely nested in this manner.
Features can share values. This is not the same thing as two features having
identical values. In the first example below, the features a and b have
identical values; but in the second example, they share the same value:
[a: [p:q]
b: [p:q]]
[a: $1[p:q]
b: $1]
Shared values are indicated by coindexing them with the prefix $1, $2, and
so on.
Portions of a feature structure can be referred to using the "path" notation.
A path is a sequence of feature names (minimally one) enclosed in angled
brackets (<>). For example, consider this feature structure:
[agreement: [number: singular
case: nominative]]
These are feature paths based on this structure:
<number>
<case>
<agreement number>
<agreement case>
Paths are used in feature templates and feature constraints, described
below.
All lexical items used by the grammar are assigned three features:
cat, lex, and gloss. These should be treated as reserved names and
not used for other purposes.
- The value of the cat feature is the name of the sublexicon to which
the lexical item belongs, taken from the sublexicon field of the item's
lexical entry.
- The value of the lex feature is the lexical form of the item, taken
from the lexical form field of the item's lexical entry.
- The value of the gloss feature is the gloss of the item, taken from
the gloss field of the item's lexical entry.
For example, here is a lexical entry for the word fox:
\lf `fox
\lx N
\alt Stem
\gl N(fox)
When this entry is used by the grammar, it is represented as this feature
structure:
[cat: N
lex: `fox
gloss: N(fox)]
Feature constraints
A rule is followed by zero or more feature
constraints, which refer to symbols used in the rule. The following
specifications apply to feature constraints.
A feature constraint has these parts, in the order listed:
- a feature path that begins with one of the symbols from the context-free
rule
- an equal sign
- either another path or a value
A feature constraint that refers only to symbols on the right hand side of
the rule constrains their co-occurrence. In the following rule and constraint,
the value of the Stem's head pos feature must unify with the value of
the INFL's from_pos feature:
Word -> Stem INFL
<Stem head pos> = <INFL from_pos>
If a feature constraint refers to a symbol on the right hand side of the
rule, and has an atomic value on its right hand side, then the designated
feature must not have a different value. In the following rule and constraint,
the head case feature for the PRONOUN node of the parse tree must
either be originally undefined or equal to NOM:
Word -> PRONOUN
<PRONOUN head case> = NOM
(If the head case feature of the PRONOUN node was originally
undefined, then, after unification succeeds, it will be equal to NOM.)
A feature constraint that refers to the symbol on the left hand side of the
rule passes information up the parse tree. In the following rule and constraint,
the value of the head feature is passed from the INFL node up to the
Word node:
Word -> Stem INFL
<Word head> = <INFL head>
PC-KIMMO allows disjunctive feature constraints with its phrase structure
rules. Consider these two rules:
Stem_1 -> PREFIX Stem_2
<PREFIX from_pos> = <Stem_2 head pos>
<PREFIX change_pos> = +
<Stem_1 head> = <PREFIX head>
Stem_1 -> PREFIX Stem_2
<PREFIX from_pos> = <Stem_2 head pos>
<PREFIX change_pos> = -
<Stem_1 head> = <Stem_2 head>
These rules have the same context-free rule part. They can therefore be
collapsed into this single rule, which has a disjunction in its feature
constraints:
Stem_1 -> PREFIX Stem_2
<PREFIX from_pos> = <Stem_2 head pos>
{
<PREFIX change_pos> = +
<Stem_1 head> = <PREFIX head>
/
<PREFIX change_pos> = -
<Stem_1 head> = <Stem_2 head>
}
Disjunctive feature constraints may be nested up to eight levels deep.
Feature templates
The following specifications apply to feature
templates.
A feature template has these parts, in the order listed:
- the keyword Let
- the template name
- the keyword be
- a feature definition
- an optional period (.)
If the template name is a terminal
category (a terminal symbol in one of the context-free rules), the template
defines the default features for that category. Otherwise the template name
serves as an abbreviation for the associated feature structure. Templates may
occur anywhere in the file (interspersed among the rules), but a template must
occur before any rule or other template that uses the abbreviation it defines.
Template names are single words consisting of alphanumeric characters or
other characters except (){}[]<>=:$! (these are used for special
purposes in the grammar file). The character \ should not be used as
the first character of a template name because that is how fields are marked in
the lexicon file. Upper and lower case letters used in template names are
considered different. For example, "PLURAL" is not the same as "Plural" or
"plural."
The abbreviations defined by templates are usually used in the feature field
of entries in the lexicon file. For example, the lexical entry for the irregular
plural form feet may have the abbreviation pl in its features
field. The grammar file would define this abbreviation with a template like
this:
Let pl be [number: PL]
The path notation may also be used:
Let pl be <number> = PL
More complicated feature structures may be defined in templates. For
example,
Let 3sg be [tense: PRES
agr: 3SG
finite: +
vform: S]
which is equivalent to:
Let 3sg be [<tense> = PRES
<agr> = 3SG
<finite> = +
<vform> = S]
In the following example, the abbreviation irreg is defined using
another abbreviation:
Let irreg be <reg> = -
pl
The abbreviation pl must be defined previously in the grammar
file or an error will result. A subsequent template could also use the
abbreviation irreg in its definition. In this way, an inheritance
hierarchy of features may be constructed.
Feature templates permit disjunctive definitions. For example, the lexical
entry for the word deer may specify the feature abbreviation
sg/pl. The grammar file would define this as a disjunction of feature
structures reflecting the fact that the word can be either singular or plural:
Let sg/pl be {[number:SG]
[number:PL]}
This has the effect of creating two entries for deer, one with
singular number and another with plural. Note that there is no limit to the
number of disjunct structures listed between the braces. Also, there is no slash
(/) between the elements of the disjunction as there is between the
elements of a disjunction in the rules. A shorter version of the above template
using the path notation looks like this:
Let sg/pl be <number> = {SG PL}
Abbreviations can also be used in disjunctions, provided that they have
previously been defined:
Let sg be <number> = SG
Let pl be <number> = PL
Let sg/pl be {[sg] [pl]}
Note the square brackets around the abbreviations sg and
pl; without the square brackets they would be interpreted as simple values
instead.
Feature templates can assign default atomic feature values, indicated by
prefixing an exclamation point (!). A default value can be overridden by an
explicit feature assignment. This template says that all members of category N
have singular number as a default value:
Let N be <number> = !SG
The effect of this template is to make all nouns singular unless they are
explicitly marked as plural. For example, regular nouns such as book do
not need any feature in their lexical entries to signal that they are singular;
but an irregular noun such as feet would have a feature abbreviation
such as pl in its lexical entry. This would be defined in the grammar
as [number: PL], and would override the default value for the feature
number specified by the template above. If the N template above used SG
instead of !SG, then the word feet would fail to parse, since
its number feature would have an internal conflict between SG
and PL.
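The difference between a default value (!SG) and a plain value (SG) can be sketched in Python (a simplification for illustration, not PC-KIMMO's code, treating features as plain atomic values): a default fills a feature only when it is still undefined, whereas a plain template value must unify and therefore clashes with an explicit conflicting value.

```python
def apply_template(fs, feature, value, default=False):
    """Merge one template assignment into feature structure fs.
    Returns False on a unification failure."""
    if feature not in fs:
        fs[feature] = value       # undefined: both kinds may fill it in
        return True
    if default:
        return True               # a default never overrides an existing value
    return fs[feature] == value   # a plain value must match exactly

book = {}                         # regular noun: no number feature listed
apply_template(book, "number", "SG", default=True)
# book["number"] is now "SG"

feet = {"number": "PL"}           # irregular noun explicitly marked plural
apply_template(feet, "number", "SG", default=True)
# the default is overridden: feet["number"] stays "PL"

# With a non-default SG (the "if the N template used SG" case), the
# explicit PL clashes and the parse would fail:
ok = apply_template({"number": "PL"}, "number", "SG", default=False)
# ok is False
```

This mirrors the book/feet contrast above: the default supplies SG only where nothing else has been said about number.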
Parameter settings are used to
override various default settings assumed in the grammar file. Parameter
settings are optional. In the absence of a parameter setting, a default value is
used. A parameter setting has these parts, in the order listed:
- the keyword Parameter
- an optional colon (:)
- one or more keywords identifying the parameter
- the keyword is
- the parameter value
- an optional period (.)
PC-KIMMO recognizes the following parameters:
- Start symbol defines the start symbol of the grammar. For
example,
Parameter Start symbol is Word
declares that the parse goal of the grammar is the nonterminal category
Word. The default start symbol is the left hand symbol of the first
context-free rule in the grammar file.
- Attribute order specifies the order in which feature attributes
are displayed. For example,
Parameter Attribute order is cat head root root_pos
declares that the cat attribute should be the first one shown
in any output from PC-KIMMO and that the other attributes should be shown in
the relative order shown, with the root_pos attribute shown last
among those listed, but ahead of any attributes that are not listed above.
Attributes that are not listed are ordered according to their character code
sort order. If the attribute order is not specified, then the category feature
cat is shown first, with all other attributes sorted according to
their character codes.
- Category feature defines the label for the category attribute.
For example,
Parameter Category feature is Categ
declares that Categ is the name of the category attribute. The
default name for this attribute is cat.
- Lexical feature defines the label for the lexical attribute. For
example,
Parameter Lexical feature is Lex
declares that Lex is the name of the lexical attribute. The
default name for this attribute is lex.
- Gloss feature defines the label for the gloss attribute. For
example,
Parameter Gloss feature is Gloss
declares that Gloss is the name of the gloss attribute. The
default name for this attribute is gloss.
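The Attribute order rule described above (listed attributes first, in the order given; unlisted attributes after them, sorted by character code) can be sketched as a Python sort key. This is an illustration of the ordering rule only, not PC-KIMMO's display code.

```python
# The order declared by: Parameter Attribute order is cat head root root_pos
order = ["cat", "head", "root", "root_pos"]

def attr_key(name):
    """Listed attributes sort first, in declared order; the rest
    follow, ordered by character code."""
    if name in order:
        return (0, order.index(name))
    return (1, name)

attrs = ["root_pos", "lex", "cat", "gloss", "head"]
ordered = sorted(attrs, key=attr_key)
# ['cat', 'head', 'root_pos', 'gloss', 'lex']
```

Note that root_pos, though declared last among the listed attributes, still precedes the unlisted gloss and lex, exactly as the parameter description states.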
Lexical rules are used to modify
the feature structures of lexical entries. As noted in Shieber
1985, something more powerful than just abbreviations for common feature
elements is sometimes needed to represent systematic relationships among the
elements of a lexicon. This need is met by lexical rules, which express
transformations rather than mere abbreviations.
Lexical rules are similar to feature templates, but are more powerful. While
feature templates assign a feature structure to lexical items by means of
unification, lexical rules map one feature structure to another, thus
transforming it. The name of a lexical rule is included in the features field of
lexical entries, similar to feature abbreviations.
A lexical rule has these parts, in the order listed:
- the keyword Define
- the name of the lexical rule
- the keyword as
- the rule definition
- an optional period (.)
The rule definition consists of
one or more mappings. Each mapping has three parts: an output feature path, an
assignment operator, and the value assigned, either an input feature path or an
atomic value. Every output path begins with the feature name out and
every input path begins with the feature name in. The assignment
operator is either an equal sign (=) or an equal sign followed by a
"greater than" sign (=>). (These two operators are equivalent in
PC-KIMMO, since the implementation treats each lexical rule as an ordered list
of assignments rather than using unification for the mappings that have an equal
sign operator.) Consider the information shown in figure 4.8A.
Figure 4.8A A lexical rule example
;lexical item
\lf `mice
\fea pl irreg POS_Gloss
\gl `mouse
;feature template
LET irreg be <reg> = -
;lexical rule
DEFINE POS_Gloss as
<out cat> = <in cat>
<out head> = <in head>
<out lex> = <in lex>
<out gloss> = <in head pos> .
The feature field (\fea) of the lexical entry contains two
labels: irreg is a feature abbreviation and is defined by a feature
template (the LET statement), while POS_Gloss is the name of a
lexical rule which is defined by the DEFINE statement.
Figure 4.8B Feature structure before application of
lexical rule
[ cat: ROOT
head: [ agr: [ 3sg:- ]
number:PL
pos: N
proper:-
verbal:- ]
reg: -
lex: `mice
gloss: `mouse ]
Figure 4.8C Feature structure after application of
lexical rule
[ cat: ROOT
head: [ agr: [ 3sg:- ]
number:PL
pos: N
proper:-
verbal:- ]
lex: `mice
gloss: N ]
When the lexicon entry is loaded, it is initially assigned the feature
structure shown in figure 4.8B, which is
the unification of the information given in the various fields of the lexicon
entry, including the feature abbreviation pl. After the complete feature
structure has been built, the lexical rule named POS_Gloss is applied,
producing the feature structure shown in figure 4.8C. Note that
the change in the value of the gloss feature from "`mouse" to "N" is done by
direct mapping, not unification.
There are two important points about using lexical rules. First, the feature
structure of a lexical item that has undergone a lexical rule is entirely
determined by the mappings in the lexical rule. In the lexical rule in figure 4.8A,
the first three mappings (for cat, head, and lex), though
they seem redundant, are needed to carry over these feature values from the
input feature structure to the output feature structure. Notice that the feature
reg which is present in the input feature structure in figure 4.8B is absent
from the output feature structure in figure 4.8C; this is due
to the fact that the lexical rule which applied to the feature structure did not
include a mapping for the reg feature.
Second, lexical rules apply sequentially in the order in which they are given
in the grammar file.
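The mapping behavior just described can be sketched in Python (a hypothetical illustration, not PC-KIMMO's implementation): a lexical rule is an ordered list of assignments that builds the output feature structure from scratch, so features without a mapping, like reg here, simply drop out.

```python
def get_path(fs, path):
    """Follow a feature path through nested dicts."""
    for key in path:
        fs = fs[key]
    return fs

def apply_lexical_rule(fs_in, mappings):
    """Build the output feature structure by direct assignment, not
    unification; only mapped features survive."""
    fs_out = {}
    for out_path, source in mappings:
        value = get_path(fs_in, source) if isinstance(source, list) else source
        node = fs_out
        for key in out_path[:-1]:
            node = node.setdefault(key, {})
        node[out_path[-1]] = value
    return fs_out

# The POS_Gloss rule of figure 4.8A: carry over cat, head, and lex,
# and replace the gloss with the input's head pos.
pos_gloss = [
    (["cat"],   ["cat"]),
    (["head"],  ["head"]),
    (["lex"],   ["lex"]),
    (["gloss"], ["head", "pos"]),
]

mice = {                      # figure 4.8B, abbreviated
    "cat": "ROOT",
    "head": {"pos": "N", "number": "PL"},
    "reg": "-",
    "lex": "`mice",
    "gloss": "`mouse",
}
out = apply_lexical_rule(mice, pos_gloss)
# out["gloss"] is "N" and "reg" is absent, as in figure 4.8C
```

Running the rule over the abbreviated figure 4.8B structure reproduces the figure 4.8C result: the gloss becomes "N" by direct mapping, and reg disappears because no mapping mentions it.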
Figure
4.9 shows a sample grammar file.
Figure 4.9 A sample grammar file
;FEATURE TEMPLATES (optional)
;Feature definitions
Let pl be <head number> = PL
LET v/n be <from_pos> = V
<head pos> = N
<head number> = !SG
LET v\aj be <from_pos> = AJ
<head pos> = V
;Category definitions
Let N be <cat> = ROOT
<head pos> = N
<head number> = !SG
Let V be <cat> = ROOT
<head pos> = V
Let AJ be <cat> = ROOT
<head pos> = AJ
;PARAMETER SETTINGS (optional)
PARAMETER Start symbol is Word
;RULES
RULE
Word = Stem INFL
<Stem head pos> = <INFL from_pos>
<Word head> = <INFL head>
RULE
Stem_1 = PREFIX Stem_2
<PREFIX from_pos> = <Stem_2 head pos>
<Stem_1 head> = <PREFIX head>
RULE
Stem_1 = Stem_2 SUFFIX
<Stem_2 head pos> = <SUFFIX from_pos>
<Stem_1 head> = <SUFFIX head>
RULE
Stem = ROOT
<Stem head> = <ROOT head>
The generation
comparison file serves as input to the compare generate command (see
section 4.5.12).
It consists of groupings of a lexical form followed by one or more surface forms
that are expected to be generated from the lexical form. The following
specifications apply to the generation comparison file.
- Each form must be on a separate line.
- Leading spaces are ignored.
- A blank line (or end of file) indicates the end of a grouping. Extra blank
lines are ignored.
- The first form in each grouping is the lexical form to be input to the
generator. Its gloss does not have to be included, since the generator does
not use the lexicon; however, including a gloss with the lexical form does no
harm--it is simply ignored.
- Succeeding forms in each grouping are surface forms that are the expected
output of the generator.
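A reader for this format can be sketched in a few lines of Python, following the rules just listed: comments are stripped at the comment character, blank lines (or end of file) close a grouping, and the first form of each grouping is the lexical form. This is an illustrative sketch, not part of PC-KIMMO.

```python
def read_comparison(text, comment_char=";"):
    """Parse comparison-file text into (first_form, other_forms) pairs."""
    groupings, current = [], []
    for raw in text.splitlines():
        line = raw.split(comment_char, 1)[0].strip()
        if not line:                  # blank line ends a grouping
            if current:
                groupings.append((current[0], current[1:]))
                current = []
        else:
            current.append(line)
    if current:                       # end of file also ends a grouping
        groupings.append((current[0], current[1:]))
    return groupings

sample = """`trace+ed        ; lexical form
traced

re-+`trace
re-trace
retrace
"""
pairs = read_comparison(sample)
# [('`trace+ed', ['traced']), ('re-+`trace', ['re-trace', 'retrace'])]
```

The same skeleton covers the recognition, pairs, and synthesis comparison files, since all four share the grouping conventions; only the interpretation of the first form differs.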
Figure
4.10 shows a sample generation comparison file.
Figure 4.10 A sample generation comparison file
`trace+ed
traced
`trace+able
traceable
re-+`trace
re-trace
retrace
The recognition
comparison file serves as input to the compare recognize command (see
section 4.5.12).
It consists of groupings of a surface form followed by one or more lexical forms
that are expected to be recognized from the surface form. The following
specifications apply to the recognition comparison file.
- Each form must be on a separate line.
- Leading spaces are ignored.
- A blank line (or end of file) indicates the end of a grouping. Extra blank
lines are ignored.
- The first form in each grouping is the surface form to be input to the
recognizer.
- Succeeding forms in each grouping are lexical forms that are the expected
output of the recognizer. The gloss of a form follows it on the same line,
separated by one or more spaces. The gloss must match exactly (including
spaces) the way it is output from the recognizer.
Figure
4.11 shows a sample recognition comparison file.
Figure 4.11 A sample recognition comparison
file
traced
`trace+ed [ V(trace)+PAST ]
`trace+ed [ V(trace)+PAST.PRTC ]
traceable
`trace+able [ V(trace)+ADJR ]
retrace
re-+`trace [ REP+V(trace).INF ]
The pairs comparison
file serves as input to the compare pairs command (see section 4.5.12).
It consists of pairs of lexical and surface forms; that is, a lexical form
followed by exactly one surface form. It is expected that the surface form will
be recognized from the lexical form and that the lexical form will be generated
from the surface form. Glosses do not have to be included with lexical forms,
since the generator does not use the lexicon; however, including a gloss with
the lexical form does no harm--it is simply ignored. When recognizing a surface
form, the lexicon is used to identify the constituent morphemes and verify that
they occur in the correct order, but the gloss part of a lexical entry is not
used. The following specifications apply to the pairs comparison file.
- Each form must be on a separate line.
- Leading spaces are ignored.
- A blank line (or end of file) indicates the end of a grouping. Extra blank
lines are ignored.
- The first form of a pair is the lexical form, which is input to the
generator. It is the expected output on inputting the second (surface) form to
the recognizer. The gloss is not included with the lexical form.
- The second form of a pair is the surface form, which is input to the
recognizer. It is the expected output on inputting the first (lexical) form to
the generator.
Figure
4.12 shows a sample pairs comparison file.
Figure 4.12 A sample pairs comparison file
`trace+ed
traced
`trace+able
traceable
re-+`trace
re-trace
re-+`trace
retrace
The synthesis
comparison file serves as input to the compare synthesize command (see
section 4.5.12).
It consists of groupings of a morphological form followed by one or more surface
forms that are expected to be synthesized from the morphological form. The
following specifications apply to the synthesis comparison file.
- Each form must be on a separate line.
- Leading spaces are ignored.
- A blank line (or end of file) indicates the end of a grouping. Extra blank
lines are ignored.
- The first form in each grouping is the morphological form to be input to
the synthesizer. A morphological form is a sequence of morpheme glosses
separated by spaces.
- Succeeding forms in each grouping are surface forms that are the expected
output of the synthesizer.
Figure
4.12A shows a sample synthesis comparison file.
Figure 4.12A A sample synthesis comparison
file
`trace +ED
traced
`trace +EN
traced
`trace +AJR25a
traceable
ORD5+ `trace
retrace
The generation file consists
of a list of lexical forms. It serves as input to the file generate
command (see section 4.5.13),
which returns a file (or screen display) whose format is identical to the
generation comparison file. The following specifications apply to the generation
file.
- Each form must be on a separate line.
- Extra white space, blank lines, and comment lines are ignored.
- Each form is assumed to be a lexical form. If a gloss is included, it is
ignored.
Figure
4.13 shows a sample generation file.
Figure 4.13 A sample generation file
`cat
`cat+s
`cat+'s
`cat+s+'s
`fox
`fox+s
`fox+'s
`fox+s+'s
The recognition file
consists of a list of surface forms. It serves as input to the file
recognize command (see section 4.5.14),
which returns a file (or screen display) whose format is identical to the
recognition comparison file. The following specifications apply to the
recognition file.
- Each form must be on a separate line.
- Extra spaces, blank lines, and comment lines are ignored.
- Each form is assumed to be a surface form.
Figure
4.14 shows a sample recognition file.
Figure 4.14 A sample recognition file
cat
cats
cat's
cats'
fox
foxes
fox's
foxes'
The synthesis file consists
of a list of morphological forms. A morphological form is a sequence of morpheme
glosses separated by spaces. A synthesis file serves as input to the file
synthesize command (see section 4.5.13),
which returns a file (or screen display) whose format is identical to the
synthesis comparison file. The following specifications apply to the synthesis
file.
- Each form must be on a separate line.
- Extra white space, blank lines, and comment lines are ignored.
- Each form is assumed to be a morphological form.
Figure
4.14A shows a sample synthesis file.
Figure 4.14A A sample synthesis file
`cat
`cat +PL
`cat +GEN
`cat +PL +GEN
`fox
`fox +PL
`fox +GEN
`fox +PL +GEN
Figure 4.15
summarizes the default file names and extensions assumed by PC-KIMMO. Two
entries are given for the different kinds of files. The first is the name
PC-KIMMO will assume if no file name at all is given to a command that expects
that kind of file. The second entry (with the *) shows what extension PC-KIMMO
will add if a file name without an extension is given.
Figure 4.15 Default file names and extensions
Rules file:                  RULES.RUL    *.RUL
Lexicon file:                LEXICON.LEX  *.LEX
Grammar file:                GRAMMAR.GRM  *.GRM
Generation comparison file:  DATA.GEN     *.GEN
Recognition comparison file: DATA.REC     *.REC
Pairs comparison file:       DATA.PAI     *.PAI
Synthesis comparison file:   DATA.SYN     *.SYN
Take file:                   PCKIMMO.TAK  *.TAK
Log file:                    PCKIMMO.LOG  *.LOG