Parallel Corpora

Introduction
Projects and People
Sources of texts
Software
References

Introduction

The term "parallel corpora" covers a variety of corpus types, but in general it refers to texts that are translations of each other (or that at least deal with the same topic). The corpora may be aligned in some way, and many researchers are investigating methods of aligning texts automatically.

This page gives sources of information concerning tools, texts, and research related to parallel corpora. Thanks to Derek Lewis, Raphael Salkie, Knut Hofland, and Hans Paulussen for providing the initial information to produce this page.

Projects and People

1. INTERSECT: a Parallel Corpus Project
Raphael Salkie,
The Language Centre,
University of Brighton
Falmer, Brighton, BN1 9PH
England.
Email: RMS3@BRIGHTON.AC.UK

The INTERSECT (International Sample of English Contrastive Texts) Project at Brighton University began in the spring of 1994. The aim is to construct and analyse a parallel bilingual corpus of French and English written texts, adding other languages later if resources permit.

2. CONTRAGRAM
Contragram Newsletter. The Contrastive Grammar Research Group. University of Gent.

3. LINGUA PROJECT
A project involving the construction of multilingual corpora for English, French, Greek and some others, for use in language pedagogy.

Information is also available on the associated Windows software and on some associated teaching materials.

4. MULTEXT PROJECT
The project is building tools for multilingual corpus access, together with a number of sample corpora. Contact: veronis@lpl.univ-aix.fr

MULTEXT-EAST
Parallel and comparable corpora in Eastern European languages.

5. PARACONC USERS
The Macintosh version of ParaConc, a parallel concordancer, is available free of charge for non-commercial research. The Windows version is in beta test and is also freely available.

Researchers interested in using ParaConc should contact Michael Barlow (barlow@ruf.rice.edu) and/or look at the ParaConc page.

6. A Scandinavian project to build multilingual (English/Swedish/Norwegian/Finnish) parallel corpora. Contact: stig.johansson@iba.uio.no

7. English-Norwegian Parallel Corpus Project
ENPC: information on the English-Norwegian Parallel Corpus (University of Oslo), including an on-line search facility.

Knut Hofland has also set up an interesting web-based search engine for some English-French texts.

8. TRIPTIC: TRIlingual Parallel Text Information Corpus

TRIPTIC is a trilingual corpus developed for the analysis of prepositions in English, French and Dutch.

The corpus forms part of the empirical data used for research on the contrastive analysis of prepositions (PhD thesis). The aim of the study, which adopts a cognitive linguistic framework, is to examine how languages converge and diverge in the semantic structure of so-called function words.

The corpus consists of 2,000,000 words, one half fiction, the other half non-fiction material. All paragraphs are aligned, allowing automatic selection of the n-th paragraph in the 3 languages.
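Paragraph-level alignment of this kind can be illustrated with a short Python sketch. It is not the TRIPTIC software: the function names, the language codes, and the sample sentences are all invented for illustration, and it assumes each text is plain text with paragraphs separated by blank lines.

```python
import re
from typing import Dict, List


def split_paragraphs(text: str) -> List[str]:
    """Split plain text into paragraphs, using blank lines as separators."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def nth_paragraph(corpus: Dict[str, List[str]], n: int) -> Dict[str, str]:
    """Return the n-th paragraph (counting from 1) in every language.

    Assumes the texts are already paragraph-aligned, so the n-th paragraph
    of one text corresponds to the n-th paragraph of the others.
    """
    lengths = {len(paragraphs) for paragraphs in corpus.values()}
    if len(lengths) != 1:
        raise ValueError("paragraph counts differ; the texts are not aligned")
    total = lengths.pop()
    if not 1 <= n <= total:
        raise IndexError(f"paragraph {n} is out of range (1 to {total})")
    return {lang: paragraphs[n - 1] for lang, paragraphs in corpus.items()}


# Invented sample texts (hypothetical, not taken from the TRIPTIC corpus).
english = "The cat is on the table.\n\nShe walked through the park."
french = "Le chat est sur la table.\n\nElle a traversé le parc."
dutch = "De kat zit op de tafel.\n\nZe liep door het park."

corpus = {
    "en": split_paragraphs(english),
    "fr": split_paragraphs(french),
    "nl": split_paragraphs(dutch),
}

second = nth_paragraph(corpus, 2)
for lang, paragraph in second.items():
    print(f"{lang}: {paragraph}")
```

The check on paragraph counts is a deliberately crude safeguard: equal counts do not prove that the texts are correctly aligned, but unequal counts show that they cannot be selected by position alone.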

The original text files are now being converted into a database structure (4th Dimension on Macintosh), in order to facilitate the description of the prepositions under study.

For further information, contact Hans Paulussen (hpa@elv.fundp.ac.be). See also the CONTRAGRAM article on TRIPTIC.

9. Translation Corpus of English and German

Prof. Schmied at the Technical University of Chemnitz-Zwickau is compiling a translation corpus of English and German.

The corpus at present includes EC material, academic textbooks, modern fiction and tourist brochures (approx. 500,000 words altogether). The researchers are currently looking at aspects such as culture-specific problems in translation and translationese.
Contact:
hildegard.schaeffler@phil.tu-chemnitz.de or
josef.schmied@phil.tu-chemnitz.de

10. Corpora projects Språkteknologi, University of Uppsala

Erik Tjong Kim Sang reports on a project in Sweden that is structuring two multilingual text corpora and integrating them with the lexical resources already available to the group. The primary goal is to use the resulting corpus for research in machine translation. Anna Sågvall-Hein is the project leader.

11. Thai On-Line Library (TOLL) of parallel Thai/English texts.

Sources of texts

The European Language Resources Association (ELRA) and ELDA have a variety of resources, including parallel texts.

Canadian Hansard: Web-searchable.

Canadian Embassy English texts
Canadian Embassy French texts

LDC material: available at ftp.cis.upenn.edu:/pub/ldc. It tends to be expensive when corpora are purchased individually. Includes Canadian Hansard and EC materials.

The European Corpus Initiative (ECI) has produced an inexpensive CD-ROM containing a wide variety of corpora, including some non-aligned parallel texts.

A parallel GERMAN-NORWEGIAN CORPUS. Contact:
Cathrine Fabricius-Hansen
Germanistisk institutt
P.b. 1004, Blindern
N-0315 Oslo
e-mail: c.f.hansen@german.uio.no

Software

ParaConc can be downloaded from the ParaConc page; alternatively, contact Michael Barlow (barlow@ruf.rice.edu).

For information on a PC-based alignment program, contact Knut Hofland.

For information on alignment research, contact the University of Lancaster or IBM France.

WordSmith Tools, produced by Mike Scott and web-published by OUP, includes a text aligner as well as a variety of other tools. See also Mike Scott's WordSmith page.

References

Pernilla Danielsson and Daniel Ridings. Parallel Texts in Göteborg

See also the Parallel Corpora bibliography in the Corpus Linguistics References.

Send comments, suggestions, and additions to Michael Barlow (barlow@ruf.rice.edu)
