XML Corpus Encoding Standard  Document XCES 0.2. Last Modified 16 August 2000.

Department of Computer Science
Vassar College
Poughkeepsie NY 
USA

  


Equipe Langue et Dialogue
LORIA/CNRS 
Vandœuvre-lès-Nancy
FRANCE


****XCES is now an EAGLES standard ****


 
 
 
 

XCES

Corpus Encoding Standard for XML

XML version of the CES DTDs

BETA RELEASE




This is a Beta release of XCES, which instantiates the EAGLES Corpus Encoding Standard (CES) DTDs for linguistic corpora, developed by the Department of Computer Science, Vassar College, and Equipe Langue et Dialogue, LORIA/CNRS. XCES is under development and subject to change.

We are developing documentation to support XCES. In the meantime, the existing CES documentation on general encoding practices for linguistic corpora and on tag usage remains largely relevant to the XCES instantiation and should be consulted.

XCES is under development. Because the XML framework provides us with means to go well beyond the capabilities of SGML, this development is taking several forms: (1) XML support for additional types of annotation and resources, including discourse/dialogue, lexicons, and speech; (2) creation of additional XSLT scripts to perform common operations and transduce among formats (including different annotation formats); (3) development of a set of XML schemas instantiating an abstract data model for linguistic annotations, together with a hierarchy of derived types for a broad range of annotation types; and (4) creation of a repository of annotation formats for "off the shelf" use or easy modification via the XCES schemas.


DTDs for XCES

Download all DTDs: xces-dtd-0_2.zip

Usage

cesDoc resources:
        <?xml version="1.0"?>
        <!DOCTYPE cesDoc PUBLIC "-//CES//DTD XML cesDoc//EN"
                                "dtd/xcesDoc.dtd" [
        ]>...
cesAna resources:
        <?xml version="1.0"?>
        <!DOCTYPE cesAna PUBLIC "-//CES//DTD XML cesAna//EN"
                                "dtd/xcesAna.dtd" [
        ]>...
cesAlign resources:
        <?xml version="1.0"?>
        <!DOCTYPE cesAlign PUBLIC "-//CES//DTD XML cesAlign//EN"
                                "dtd/xcesAlign.dtd" [
        ]>...
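
As a quick sanity check on a prolog like those above, the following Python sketch reads the document type declaration back with the standard-library xml.dom.minidom parser. The helper name describe_doctype and the empty root element are illustrative assumptions, not part of XCES. minidom does not fetch the DTD or validate against it, so this only inspects the declaration itself.

```python
from xml.dom import minidom

# Prolog copied from the cesDoc usage above; the empty <cesDoc/> root is a
# placeholder for illustration and is not a valid cesDoc instance.
SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE cesDoc PUBLIC "-//CES//DTD XML cesDoc//EN"
                        "dtd/xcesDoc.dtd" [
]>
<cesDoc/>
"""


def describe_doctype(xml_text):
    """Return the DOCTYPE name, public id, system id and root element name."""
    doc = minidom.parseString(xml_text)
    doctype = doc.doctype
    return {
        "name": doctype.name,
        "public_id": doctype.publicId,
        "system_id": doctype.systemId,
        "root": doc.documentElement.tagName,
    }


info = describe_doctype(SAMPLE)
print(info)
```

In a valid document the DOCTYPE name and the root element name must agree, so comparing info["name"] with info["root"] is a cheap check to run before handing the file to a validating parser.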

Notes

Language Identification

The attribute xml:lang (CDATA) has been added to the global attributes alongside the existing attribute lang (IDREF), so that documents conform both to the XML recommendation (see http://www.w3.org/TR/REC-xml#sec-lang-tag) and to the SGML CES DTDs.
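
To illustrate how the two attributes sit side by side, here is a minimal Python sketch using the standard-library ElementTree parser. The fragment, its element names, and the "fr" value are hypothetical and are not taken from a real XCES document. Because nothing is validated here, the IDREF constraint on lang (that it point to an ID declared elsewhere in the document) is not checked.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is bound to this namespace by the XML Namespaces spec,
# so ElementTree reports xml:lang under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical fragment: element names and the "fr" identifier are
# illustrative only.
SAMPLE = '<text><p xml:lang="fr" lang="fr">Bonjour</p><p>Hello</p></text>'


def languages(xml_text):
    """Yield (tag, xml:lang, lang) for every element in the fragment."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        yield elem.tag, elem.get(XML_LANG), elem.get("lang")


rows = list(languages(SAMPLE))
for row in rows:
    print(row)
```

Elements that carry neither attribute report None for both, which makes it easy to spot where a language has to be inherited from an ancestor.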

XLink support

See http://www.w3.org/TR/xlink

Support for the XLink specification is under development. It will be provided by including the sub-DTD xlink.ent, which covers simple, extended, locator, and arc elements.

XPointer/XPath support

See http://www.w3.org/TR/xptr and http://www.w3.org/TR/xpath

We are currently implementing the use of XPointer and XPath expressions for locator element types.

Samples will be available in the near future.

XSL Stylesheets

We have also developed a set of XSL stylesheets for cesDoc documents. To apply these stylesheets, use XT, the XSL engine developed by James Clark.

Usage:

        java -mx64m \
                -Dcom.jclark.xsl.sax.parser=com.jclark.xml.sax.Driver \
                com.jclark.xsl.sax.Driver your-xces-doc.xml xsl/html/cesDoc.xsl

Only HTML output is supported by the stylesheets. Output produced with the stylesheets can be customized by setting or overriding variables within the xsl/html/config.xsl file. If you do not want to modify the XSL source files, you can use a driver; see xsl/html/driver.xsl
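
For batch processing, the command line above can be wrapped in a small script. The following Python sketch is one way to do that. The function names xt_command and transform are hypothetical, the flags and class names are copied from the usage line, and it assumes XT writes its result to standard output when no result file is named, as the usage line implies.

```python
import subprocess

# Flags and class names are copied from the usage line above.
XT_PARSER_PROPERTY = "-Dcom.jclark.xsl.sax.parser=com.jclark.xml.sax.Driver"
XT_MAIN_CLASS = "com.jclark.xsl.sax.Driver"


def xt_command(source, stylesheet="xsl/html/cesDoc.xsl", heap="64m"):
    """Build the argument list for running XT on one document."""
    return ["java", "-mx" + heap, XT_PARSER_PROPERTY, XT_MAIN_CLASS,
            source, stylesheet]


def transform(source, output, stylesheet="xsl/html/cesDoc.xsl"):
    """Run XT and save its standard output to `output`.

    Requires java and the XT classes on the class path; raises
    subprocess.CalledProcessError if the transformation fails.
    """
    with open(output, "wb") as out:
        subprocess.run(xt_command(source, stylesheet), stdout=out, check=True)


# Show the command that would be run, without invoking java.
print(" ".join(xt_command("your-xces-doc.xml")))
```

Passing xsl/html/driver.xsl as the stylesheet argument would select the driver instead of the default stylesheet.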

Samples

Download all stylesheets: xces-xsl-0_2.zip

We are currently working on a set of stylesheets to support the cesAna and cesAlign DTDs.


Questions/comments to Nancy Ide (ide@cs.vassar.edu) or Patrice Bonhomme (bonhomme@loria.fr)