Text Encoding and Annotation
If a corpus is said to be unannotated, it appears in its existing raw state of
plain text, whereas an annotated corpus has been enhanced with various types of
linguistic information. Unsurprisingly, annotation increases the utility of a
corpus: it is no longer a body of text in which linguistic information is only
implicitly present, but one which may be considered a repository of linguistic
information. The implicit information has been made explicit through the
process of concrete annotation.
For example, the form "gives" contains the implicit part-of-speech
information "third person singular present tense verb" but it is only retrieved
in normal reading by recourse to our pre-existing knowledge of the grammar of
English. However, in an annotated corpus the form "gives" might appear as
"gives_VVZ", with the code VVZ indicating that it is a third person singular
present tense (Z) form of a lexical verb (VV). Such annotation makes it quicker
and easier to retrieve and analyse information about the language contained in
the corpus.
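To illustrate why such annotation eases retrieval, here is a minimal Python
sketch. It assumes the corpus stores each token as word_TAG separated by
whitespace, as in the "gives_VVZ" example above. The function name
parse_tagged, the sample sentence, and every tag other than VVZ are
illustrative assumptions, not taken from any particular corpus or tool.

```python
from collections import Counter


def parse_tagged(text, sep="_"):
    """Split whitespace-separated word_TAG tokens into (word, tag) pairs.

    Hypothetical helper: assumes the tag follows the LAST separator in
    each token. Tokens with no separator get a tag of None.
    """
    pairs = []
    for token in text.split():
        word, found, tag = token.rpartition(sep)
        if found:
            pairs.append((word, tag))
        else:
            # No separator: rpartition leaves the whole token in `tag`.
            pairs.append((tag, None))
    return pairs


# Illustrative sample; only the VVZ tag comes from the text above.
sample = "she_PPHS1 gives_VVZ him_PPHO1 books_NN2"
pairs = parse_tagged(sample)

# Retrieve every third person singular present tense lexical verb.
verbs = [word for word, tag in pairs if tag == "VVZ"]
print(verbs)  # ['gives']

# Count how often each tag occurs in the sample.
tag_counts = Counter(tag for _, tag in pairs)
print(tag_counts["VVZ"])  # 1
```

With the tag made explicit, finding every form of a given category becomes a
simple filter on the tag, rather than something that depends on the reader's
knowledge of English grammar.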
Leech (1993) describes seven maxims which should apply in the annotation of
text corpora.
Before reading about types of annotation, you might like to learn something
about the formats they can take.