Corpora Available from The Linguistic Data Consortium

September 1, 1995


  1. Introduction.

  2. Prices and Conditions of Purchase

  3. Listing of corpora by year of release

  4. Speech Corpora: Descriptions and Ordering information
    1. TIMIT Acoustic-Phonetic Continuous Speech Corpora
      1. Original ARPA-sponsored Version
      2. NYNEX Telephone Version of TIMIT Corpus (NTIMIT)
    2. The Resource Management Corpora
    3. Air Travel Information System (ATIS) Corpora
      1. ATIS0 Spontaneous Speech Pilot Corpus and Relational Database
      2. ATIS2
      3. ATIS3
    4. Continuous Speech Recognition (CSR) Corpora sponsored by ARPA
      1. ARPA Continuous Speech Recognition Corpus I: Wall Street Journal Sentences (WSJ0, or CSR-I)
      2. ARPA Continuous Speech Recognition Corpus II: Wall Street Journal Sentences (WSJ1, or CSR-II)
      3. ARPA Continuous Speech Recognition Corpus III. Financial News Corpus (with language model)
      4. DARPA Continuous Speech Recognition Corpus-IV: Radio Broadcast News Corpus (hub-4)
      5. DARPA Continuous Speech Recognition Corpus-IV: Multi-Microphone (hub-3)
    5. Switchboard Corpus of Recorded Telephone Conversations
    6. Switchboard Corpus Excerpts, Credit Card Conversations.
    7. Texas Instruments 46-Word Speaker-Dependent Isolated Word Corpus (TI46)
    8. Texas Instruments Speaker-Independent Connected-Digit Corpus (TIDIGITS)
    9. The HCRC Map Task Corpus
    10. Air Traffic Control Corpus (ATC0)
    11. SPIDRE Speaker Identification Corpus
    12. YOHO Speaker Verification Corpus
    13. OGI Multi-Language Corpus
    14. OGI Spelled and Spoken Telephone Corpus
    17. King Corpus for Speaker Verification Research
    18. WSJCAM0: Cambridge Read News Corpus.
    19. TRAINS Spoken dialog corpus.
    20. NYNEX PhoneBook Database
    21. LATINO-40 Spanish Read News Corpus
    22. Frontiers in Speech Processing.
    23. CTIMIT
    24. FFMTIMIT

  5. Text Corpora: Descriptions and Ordering Information
    1. Association for Computational Linguistics Data Collection Initiative (ACL/DCI)
    2. The Penn Treebank Project - Release 2.
    3. TIPSTER Information Retrieval Text Research Collection
      1. TIPSTER Volume 1, March 1992
      2. TIPSTER Volume 2, July 1992
      3. TIPSTER Volume 3, April 1993
    4. United Nations Parallel Text Corpus (English, French, Spanish)
    5. Japanese Business News Text
    6. Spanish News Text Collection
    7. European Corpus Initiative-1

  6. Lexical Databases: Descriptions and Ordering Information
    1. CELEX-2 Lexical Database
    2. COMLEX: COMmon LEXical Database of English
      1. COMLEX English Syntax
      2. COMLEX English Pronunciation Version 0.0
      3. COMLEX English Pronunciation Version 0.2


Each section below describes a corpus or a set of corpora. They are listed by type (speech, text, lexicons, other) and within type by the membership year (MY) in which they were released by LDC. In the case of series or sets of corpora, each individual corpus or segment is described in a separate subsection. The descriptions are brief rather than complete; more information can be found in the "README" or "Documentation" files noted under each entry.

In the catalog, each description is followed by six items of information:

  1. The name by which the corpus is generally known
  2. The LDC catalog order number(s), explained below
  3. The NIST Catalog numbers assigned to the discs, if they have ever been available through NIST or NTIS, otherwise ``NA''
  4. The date and membership year of official release by LDC
  5. The current price for nonmembers, if available
  6. Whether a separate license (User Agreement) is required
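
The six items appear in the entries below as "Label: value" lines. As a minimal sketch (not an LDC tool), such a block can be read into a dictionary. The function name parse_entry is hypothetical, and the sample text is copied from the TIMIT entry later in this catalog.

```python
def parse_entry(block: str) -> dict[str, str]:
    """Split the 'Label: value' lines of one catalog entry into a dictionary."""
    fields = {}
    for line in block.splitlines():
        if ":" not in line:
            continue  # skip blank lines and free text
        label, value = line.split(":", 1)  # split on the first colon only
        fields[label.strip()] = value.strip()
    return fields


timit = parse_entry("""
  Item Name:          TIMIT
  LDC Catalog No.:    LDC93S1
  NIST Catalog No.:   1-1
  Release date:       10/90 (MY93)
  Nonmember price:    $100
  Special license:    NO
""")
print(timit["LDC Catalog No."])  # LDC93S1
```

Splitting on the first colon only keeps values intact even if they contain a colon themselves.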
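
The LDC catalog order number is a unique identifier for convenience in referring to corpora, parts of corpora, and individual discs as needed. As the examples below show, it is made up of the prefix LDC, the two-digit membership year of release, a letter for the type of corpus (for example, T for text), a corpus number, and, where needed, a configuration letter, a disc number, and a disc version number. Here are two examples to illustrate both why this system was adopted and how it works.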

Example 1: The United Nations Parallel Text Corpus was published in March of 1994, thus in membership year (MY) 94, consists entirely of text (T), and was assigned text corpus number 4. It comprises three discs: the first contains English texts, the second the corresponding French texts, and the third the corresponding Spanish. They are available either (A) as a set of three or (B) separately. Thus LDC94T4A refers to the UN corpus as a whole, LDC94T4B-1 to the English disc alone, LDC94T4B-2 to the French alone, and LDC94T4B-3 to the Spanish alone.

Shortly after release, the Spanish disc was found to have a manufacturing defect and was replaced with a new one, so if there is need to refer to them individually, the original is now called 3.0 and the replacement 3.1.
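
The numbering in Example 1 can be taken apart mechanically. The following is a minimal sketch, not an official LDC format definition: the regular expression is inferred from the order numbers shown in this catalog, and the names CatalogNumber and parse_catalog_number are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class CatalogNumber(NamedTuple):
    year: int                # two-digit membership year, e.g. 94
    kind: str                # corpus type letter, e.g. "T" for text
    corpus: int              # corpus number within that year and type
    config: Optional[str]    # configuration letter, e.g. "A" or "B"
    disc: Optional[int]      # disc number, e.g. 3 in "LDC94T4B-3.1"
    version: Optional[int]   # disc version, e.g. 1 in "LDC94T4B-3.1"


# Assumed pattern: LDC + year + type letter + corpus number,
# then an optional configuration letter and an optional "-disc[.version]".
_PATTERN = re.compile(r"LDC(\d{2})([A-Z])(\d+)([A-Z])?(?:-(\d+)(?:\.(\d+))?)?")


def parse_catalog_number(text: str) -> CatalogNumber:
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a recognizable LDC catalog number: {text!r}")
    year, kind, corpus, config, disc, version = match.groups()
    return CatalogNumber(
        int(year),
        kind,
        int(corpus),
        config,
        int(disc) if disc is not None else None,
        int(version) if version is not None else None,
    )


print(parse_catalog_number("LDC94T4A"))      # the UN corpus as a whole
print(parse_catalog_number("LDC94T4B-3.1"))  # the replacement Spanish disc
```

A number that leaves out a required part, such as the corpus number, raises ValueError rather than being guessed at.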

Example 2: The second Continuous Speech Recognition corpus, collected in 1993 and distributed in early 1994, was assigned corpus number 13. It contains 14 discs of speech recorded over a Sennheiser HMD414 microphone (a de facto standard in ARPA evaluations); 15 discs with the same speech recorded over another microphone; and 5 discs containing unique (unpaired) data: speech recorded only once, transcriptions, test or evaluation data, etc., much of which is also needed to make full use of the paired speech recordings. To satisfy customer preferences, the corpus is offered by LDC in three configurations: (A) the complete corpus of 34 discs; (B) the ``Sennheiser corpus,'' i.e., the whole corpus minus the ``other microphone'' data, on 19 discs; and (C) the ``other microphone'' corpus, i.e., the whole corpus minus the Sennheiser data, 20 discs. These are designated as follows:

CSR-II Complete: LDC94S13A, consisting of LDC94S13A-1 through LDC94S13A-34

CSR-II Sennheiser: 19 discs, LDC94S13B, consisting of LDC94S13B-1 through S13B-7, S13B-11, S13B-13 through S13B-16, S13B-18 through S13B-21, and S13B-32 through S13B-34

CSR-II Other: 20 discs, LDC94S13C, consisting of LDC94S13C-8 through S13C-10, S13C-12 through S13C-14, S13C-17, S13C-22 through S13C-34
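
The three disc lists above can be cross-checked against the counts given in Example 2. This is a small illustrative check rather than anything supplied by LDC; the helper name expand is hypothetical, and the disc ranges are copied from the three designations.

```python
def expand(ranges):
    """Turn inclusive (first, last) pairs into a set of disc numbers."""
    discs = set()
    for first, last in ranges:
        discs.update(range(first, last + 1))
    return discs


complete = expand([(1, 34)])
sennheiser = expand([(1, 7), (11, 11), (13, 16), (18, 21), (32, 34)])
other = expand([(8, 10), (12, 14), (17, 17), (22, 34)])

print(len(complete), len(sennheiser), len(other))  # 34 19 20
print(sorted(sennheiser & other))                  # [13, 14, 32, 33, 34]
print(sennheiser | other == complete)              # True
```

The five discs common to both configurations correspond to the five discs of unpaired data, which leaves 14 Sennheiser-only and 15 other-microphone-only discs, as stated in the example.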

Prices and Conditions of Purchase

The following are the procedures and conditions for obtaining corpora from the LDC:

For LDC Members:

LDC membership is annual, with the membership year (MY) running from 1 September to 31 August. Each LDC corpus is identified by the MY of its release and membership fees purchase a paid-up license to that MY's LDC corpora.
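
As a rough sketch, a calendar date can be mapped to its membership year as follows. It assumes that an MY is named for the calendar year in which it ends on 31 August, which is inferred from Example 1 above (March 1994 falls in MY94). This is not a stated LDC rule, and some entries below list an MY that it would not predict. The function name membership_year is hypothetical.

```python
from datetime import date


def membership_year(day: date) -> int:
    """Return the membership year containing the given date.

    Assumption: the MY running 1 September to 31 August is named
    for the calendar year in which it ends.
    """
    return day.year + 1 if day.month >= 9 else day.year


print(membership_year(date(1994, 3, 15)))  # 1994
print(membership_year(date(1995, 9, 1)))   # 1996
```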

Members receive one copy of each requested LDC corpus at no charge; there may be charges for corpora owned or produced by others and distributed by LDC.

Members may also purchase extra "convenience copies" of LDC corpora, at $100 per disk or the catalog price, for use at approved sites. These convenience copies are subject to the same restrictions and covered by the same license, if any, as the primary copies.

Notices will be mailed to all members when new data sets are available. When corpora are re-issued in revised, enhanced, or supplemented form, unless the reason is defective materials, they will be distributed only to those whose LDC membership is current in the MY of re-issue. Nonmembers who wish to receive upgrades must pay the nonmember price for the re-issue.

At this time it is no longer possible to purchase a 1993 or 1994 membership. Members who are in good standing (i.e. current members) may purchase corpora from these membership years at the rate of $100 per CD-ROM.

As an incentive to purchasers of 1996 memberships, the following corpora from MY 93 and 94 are being offered in the 96 MY: Resource Management Complete, CSR I, HCRC Map Task Corpus, and BRAMSHILL. These corpora will be distributed on a first-come, first-served basis to those paying for 1996 memberships. Corpora that are in limited quantity are indicated by a number in parentheses following the LDC catalog number, which represents the quantity on hand (as of March 29, 1996).

The cost of membership is $2,000 for nonprofit and government institutions and $20,000 for commercial organizations. To request two copies of the required membership agreement, please call 1 (215) 898-0464 or send email to the LDC.

For Nonmembers:

With the exception of a few corpora marked "Members Only" (MO) due to restrictions from the copyright owners, nonmembers may purchase single copies of the listed items. Prices are set by the LDC from time to time and normally include a permanent "research-only" license (i.e. no commercial use). Payment may be made by check drawn from a bank with branches in the United States or payment may be wired to: Mellon Bank East, ABA NO. 031000037, Philadelphia, PA, for credit to The Trustees of the University of Pennsylvania, Account No 2945020, Attn: Sarah Parnum, 215-898-0464. For inquiries, contact the LDC by e-mail.

Prices are subject to change; the prices below are effective until December 31, 1996. Nonmembers add a shipping charge for each order: $30 US and Canada, $50 overseas.
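
For illustration only, the total for a nonmember order can be worked out as below. The shipping amounts come from the paragraph above, and the two sample prices are the TIMIT and NTIMIT prices from the listing that follows. The names SHIPPING and nonmember_total are hypothetical.

```python
# Per-order shipping charge in US dollars, as stated above.
SHIPPING = {"US/Canada": 30, "overseas": 50}


def nonmember_total(prices, destination):
    """Sum the catalog prices and add one shipping charge for the order."""
    return sum(prices) + SHIPPING[destination]


print(nonmember_total([100, 250], "US/Canada"))  # 380
```

The shipping charge is added once per order, not once per item.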

Listing of corpora by year of release


Price  Set-of  Description                           Release Date or Catalog #
  TBA      14  Corpus of Spoken American English     Spring 1997
  TBA       1  English Language Internat. News       Fall 1996
  TBA       3  JURIS: Legal Text (500 M words)       Fall 1996
  TBA      15  SWITCHBOARD (Revised)                 Fall 1996


Price  Set-of  Description                           Release Date or Catalog #
  750       6  Resource Management Complete Set      LDC93S3A (28)
 1000      15  CSR-I (WSJ0) Complete                 LDC93S6A
  200       8  HCRC Map Task Corpus                  LDC93S12 (47)
  500       9  BRAMSHILL                             LDC94S20 (12)
10000       1  COMLEX English Syntax Lexicon 111     LDC96L6
10000       1  COMLEX Pronunciation Dictionary       LDC96L7
    0       1  Frontiers in Speech Processing        LDC96S29 (45)
  150       1  CELEX-2                               LDC96L19*
   MO       1  Spanish Text Collection               LDC95T9*
  100       1  CTIMIT                                LDC96S30
  100       1  FFMTIMIT                              LDC96S32
 2500       3  CSRIV Radio Broadcast News: Hub4      LDC96S31
   MO       3  CSRIV: Hub 3                          LDC96S33
   MO       2  N. American Business News Text        Summer 1996*
 2500       1  Mandarin Business News Text           Summer 1996*
   MO       1  European Language Newspaper Text      Summer 1996*
 5000       2  Hansard                               Summer 1996*
  TBA      14  JEIDA Japanese Speech Data            Fall 1996
  TBA       1  Mandarin Lexicon                      Fall 1996
  TBA       1  Spanish Lexicon                       Fall 1996
  TBA       6  POLYPHONE-II (American Spanish)       Summer 1996
  TBA       2  Mandarin Telephone Speech             Summer 1996
  TBA       2  Spanish Telephone Speech              Summer 1996
  TBA       6  CALLFRIEND Language ID Corpus         Fall 1996
  TBA       2  Speaker ID Evaluation Test (SWB)      Summer 1996
  TBA          DCIEM/HCRC Map Task                   Spring 1996
  TBA       1  WBUR Radio Speech Corpus              Summer 1996


Price  Set-of  Description                           Release Date or Catalog #
 2500       1  KING Speaker Verification             LDC95S22
   MO       3  CSR-III Speech: Dev and Eval Data     LDC95S23
   MO       4  CSR-III Text: Language Model          LDC95T6
 2000       6  WSJCAM0: Cambridge Read News          LDC95S24
 2500       1  TRAINS dialog corpus                  LDC95S25
 2000       2  ATIS3: Test Data                      LDC95S26
 5000       3  PHONEBOOK: NYNEX Isolated Words       LDC95S27
 2500       1  Treebank-2                            LDC95T7
   MO       1  Japanese Business News Text           LDC95T8
 2000       2  LATINO-40 Spanish Read News Corpus    LDC95S28
   MO       1  Spanish Text Collection               LDC95T9*
   MO       2  N. American Business News Text        Spring 1996*
 2500       1  Mandarin Business News Text           Spring 1996*
   MO       1  European Language Newspaper Text      Spring 1996*
 5000       2  Hansard French/English                Spring 1996*
10000       1  COMLEX English Syntax Lexicon 111     LDC95L4
10000       1  COMLEX Pronunciation Dictionary       LDC95L5
  150       1  CELEX-2                               LDC96L19*

*Available to 1995 and 1996 members


Price  Set-of  Description                           LDC Catalog #
 1500      34  CSR-II (WSJ1) Complete                LDC94S13A
 1250       8  Air Traffic Control                   LDC94S14 (23)
 2000       2  SPIDRE                                LDC94S15 (35)
  750       1  YOHO Speaker Verification             LDC94S16
  200       1  OGI Multilanguage Corpus              LDC94S17
  100       1  OGI Spelled and Spoken Word           LDC94S18
 2500       3  ATIS3                                 LDC94S19
  500       9  BRAMSHILL                             LDC94S20 (12)
10000       8  MACROPHONE (American English)         LDC94S21
 2500       3  UN Parallel Text (Complete)           LDC94T4A
 1000       1  UN Parallel Text (English)            LDC94T4B-1
 1000       1  UN Parallel Text (French)             LDC94T4B-2
 1000       1  UN Parallel Text (Spanish)            LDC94T4B-3.1
   35       1  ECI Multilingual Text                 LDC94T5
10000       1  COMLEX English Syntax Lexicon, V.0    LDC94L2
10000       1  COMLEX Pronouncing Dictionary, V.0    LDC94L3


Price  Set-of  Description                           LDC Catalog #
  100       1  TIMIT                                 LDC93S1 (13)
  250       2  NTIMIT                                LDC93S2
  750       6  Resource Management Complete Set      LDC93S3A (28)
  500       6  ATIS0 Complete Corpora Set            LDC93S4A (31)
 1000       4  ATIS2                                 LDC93S5 (46)
 1000      15  CSR-I (WSJ0) Complete                 LDC93S6A
10000      28  SWITCHBOARD                           LDC93S7 (7)
 1000       1  SWITCHBOARD Credit Card               LDC93S8 (3)
  125       1  TI 46-Word                            LDC93S9 (26)
  250       3  TIDIGITS                              LDC93S10 (8)
  200       8  HCRC Map Task Corpus                  LDC93S12 (47)
  100       1  ACL/DCI                               LDC93T1
 2500       1  TIPSTER Complete                      LDC93T3
 1000       1  TIPSTER Volume 1                      LDC93T3-1.1
 1000       1  TIPSTER Volume 2                      LDC93T3-2.1
 1000       1  TIPSTER Volume 3                      LDC93T3-3.1

Speech Corpora: Descriptions and Ordering Information

TIMIT Acoustic-Phonetic Continuous Speech Corpora

The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of 8 major dialects of American English, each reading 10 phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic, and word transcriptions as well as a 16-bit, 16kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI), and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT, and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST).

The TIMIT corpus transcriptions have been hand verified. Test and training subsets, balanced for phonetic and dialectal coverage, are specified. Tabular computer-searchable information is included as well as written documentation.

Original ARPA-sponsored Version (TIMIT)

This is the original 16 kHz version, recorded over a high quality microphone in studio conditions.

README file is available.

  Item Name:          TIMIT
  LDC Catalog No.:    LDC93S1
  NIST Catalog No.:   1-1
  Release date:       10/90 (MY93)
  Nonmember price:    $100
  Special license:    NO

NYNEX Telephone Version of TIMIT Corpus (NTIMIT)

The NTIMIT corpus was developed by the NYNEX Science and Technology Speech Communication Group to provide a telephone bandwidth adjunct to TIMIT.

NTIMIT was collected by transmitting all 6300 original TIMIT recordings through a telephone handset and over various channels in the NYNEX telephone network and redigitizing them. The recordings were transmitted through ten Local Access and Transport Areas, half of which required the use of long-distance carriers.

In order to calibrate the transmission characteristics of the various channels, stationary 1 kHz and frequency-sweeping tones were also recorded for each of the transmission channels. These are found on disc 2.

The re-recorded waveforms were time-aligned with the original TIMIT waveforms so that the TIMIT time-aligned transcriptions can be used with the NTIMIT corpus as well. In addition to the documentation on the disc, see Jankowski et al., "NTIMIT: A Phonetically Balanced, Continuous Speech, Telephone Bandwidth Speech Database," Proc. ICASSP-90, April 1990. NYNEX retains full copyright on the corpus and all associated materials.

README file is available.

  Item Name:          NTIMIT
  LDC Catalog No.:    LDC93S2
  NIST Catalog No.:   10-1.1, 10-2.1
  LDC Release date:   8/92 (MY93)
  Nonmember price:    $250
  Special license:    NO


The Resource Management Corpora

The DARPA Resource Management Continuous Speech Corpora (RM) consist of digitized and transcribed speech for use in designing and evaluating continuous speech recognition systems. There are two main sections, often referred to as RM1 and RM2. RM1 contains four CD-ROMs, two with Speaker-Dependent (SD) training data, one with Speaker-Independent (SI) training data, and one with test and evaluation data. RM2 has 2 CD-ROMs with an additional and larger SD data set, including test material.

All RM material consists of read sentences modeled after a naval resource management task. The complete corpus contains over 25,000 utterances from more than 160 speakers representing a variety of American dialects. The material was recorded at 16 kHz, with 16-bit resolution, using a Sennheiser HMD-414 headset microphone. All discs conform to the ISO-9660 data format.

RM sentences are consistent with a limited language model with a 1000 word vocabulary that allows queries about ships, ports, etc., along with commands to control a graphics display system, but little else. There is no "official" language model, but a simple non-probabilistic word-pair grammar that provides complete coverage of the sentences in this corpus is provided.

The Resource Management text corpus was designed at BBN Laboratories, Inc. and SRI International. BBN also developed and made available the "Word-Pair" grammar that has been used in the benchmark tests. Texas Instruments, Inc. recruited the subjects and recorded and digitized the speech. For more information about the design and collection of this corpus see: P. Price, W.M. Fisher, J. Bernstein and D.S. Pallett, "The DARPA 1000-Word Resource Management Database for Continuous Speech Recognition", Proceedings of the 1988 International Conference on Acoustics, Speech and Signal Processing (Paper S.13.21, pp. 651- 654).

A series of benchmark speech recognition performance assessment tests were conducted beginning in March 1987 using this corpus in conjunction with standardized scoring software. For more information see D.S. Pallett, "Benchmark Tests for DARPA Resource Management Database Performance Evaluations", in Proceedings of the 1989 International Conference on Acoustics, Speech and Signal Processing (Paper S10.b.6, pp. 536-539) and related papers in the Proceedings of the February 1989, October 1989, June 1990, and February 1991 DARPA Speech and Natural Language Workshops.

Resource Management SD and SI Training and Test Data (RM1)

The first two CD-ROMs contain Speaker-Dependent (SD) Training Data: 12 subjects, each reading a set of 600 "training sentences", 2 "dialect" sentences, and 10 "rapid adaptation" sentences, for a total of 7344 recorded sentence utterances. The 600 sentences designated as training cover 97% of the lexical items in the corpus.

The third CD-ROM contains the Speaker-Independent (SI) Training Data: 80 speakers each read the 2 "dialect" sentences plus 40 sentences from the Resource Management text corpus, for a total of 3360 recorded sentence utterances. Any given sentence from a set of 1600 Resource Management sentence texts was recorded by two subjects, while no sentence was read twice by the same subject.
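
The utterance totals quoted for the SD and SI training discs follow directly from the speaker and sentence counts. The short check below only illustrates that arithmetic; the variable names are illustrative.

```python
# Speaker-dependent training data: 12 subjects, each reading
# 600 training + 2 dialect + 10 rapid adaptation sentences.
sd_total = 12 * (600 + 2 + 10)

# Speaker-independent training data: 80 speakers, each reading
# 2 dialect sentences + 40 Resource Management sentences.
si_total = 80 * (2 + 40)

# Each of the 1600 sentence texts is read by this many SI speakers.
readers_per_text = (80 * 40) // 1600

print(sd_total, si_total, readers_per_text)  # 7344 3360 2
```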

The fourth CD-ROM contains all SD and SI system test material used in 5 DARPA benchmark tests conducted in March and October of 1987, June 1988, and February and October 1989, along with scoring and diagnostic software and documentation for those tests. Documentation is also provided outlining use of the Resource Management training and test material at CMU in development of the SPHINX system. Example output and scored results for state-of-the-art speaker-dependent and speaker-independent systems (i.e., the BBN BYBLOS and CMU SPHINX systems) for the October 1989 benchmark tests are included, as well as SPeech HEader REsources (SPHERE) software and SPHERE-to-SAM conversion software.

README file is available.

Extended Resource Management Speaker-Dependent Corpus (RM2)

This 2-disc set forms a speaker-dependent extension to the Resource Management (RM1) corpus. The corpus consists of a total of 10,508 sentence utterances (2 male and 2 female speakers each speaking 2,652 sentence texts). These include the 600 "standard" Resource Management speaker-dependent training sentences, 2 dialect calibration sentences, 10 rapid adaptation sentences, 1800 newly-generated extended training sentences, 120 newly-generated development-test sentences, and 120 newly-generated evaluation-test sentences. The evaluation-test material on the discs was used as the test set for the June 1990 DARPA SLS Resource Management Benchmark Tests (see the Proceedings.)

The RM2 corpus was recorded at Texas Instruments. The NIST speech recognition scoring software originally distributed on the RM1 "Test" Disc was adapted for RM2 sentences, and is included on these discs as well as the SPHERE speech file header manipulation software.

README file is available.

  Item Name:          RM Complete
  LDC Catalog No.:    LDC93S3A
  NIST Catalog No.:   2-1.1 through 2-4.2, 3-1.2 and 3-2.2
  LDC Release date:   MY93
  Nonmember price:    $750
  Special license:    NO

Air Travel Information System (ATIS) Corpora

During 1989 and 1990, the DARPA Spoken Language Systems (SLS) Program initiated plans for development of a "common corpus" for both speech recognition and natural language research, using "spontaneous goal-directed" speech, rather than "read speech." The common task domain that was chosen is termed the "Air Travel Information System" (ATIS). The corpora developed to date in order to train and test systems in this domain are known as ATIS0, ATIS2, and ATIS3. (ATIS1 will not be published.)

In all the ATIS corpora, users make spoken inquiries to simulated (ATIS0) or prototypical (ATIS2, ATIS3) speech understanding systems to obtain air travel information. The system has the information in the form of a relational database derived from the Official Airline Guide; the initial ATIS0 relational database, for example, contains information relevant to travel among 9 major airports serving 11 cities. To measure performance, the system's answers to the spoken inquiries are expressed in a logical form known as the "canonical answer specification" (CAS) language, and compared with canonical answers reviewed by human experts. There are thus a number of auxiliary files associated with each utterance, including orthographic transcriptions and, for answerable queries, ``reference answers''.

Texas Instruments developed ATIS0, the pilot corpus for this program, using a "Wizard of Oz" technique to simulate an ATIS SLS. (See Hemphill, Godfrey and Doddington's paper ``The ATIS Spoken Language Systems Pilot Corpus'' in the Proceedings of the June 1990 DARPA Speech and Natural Language Workshop.)

Since 1991, the data for ATIS2 and ATIS3 have been collected at multiple sites and pooled for common use. The number of speakers and utterances, the coverage of the travel information database, the collection scenarios and platforms, have all changed as documented in each corpus section.

For further information on the ATIS domain, on the test paradigm, and on ATIS-domain benchmark tests, see the Proceedings of the DARPA Speech and Natural Language Workshops held in October 1989, June 1990 and February 1991. (Morgan Kaufmann Publishers, Inc., 2929 Campus Drive, San Mateo, CA 94403. ISBN numbers: 1-55860-112-0, 1-55860-157-0, and 1-55860-207-0.)

ATIS0 Spontaneous Speech Pilot Corpus and Relational Database

The ATIS0 Corpus totals 6 CD-ROMs: one with spontaneous data from 36 speakers; one with read versions of the data from 20 of those speakers, along with some adaptation material; and four with extensive speaker dependent material from the ATIS domain, read by 10 of the same speakers.

All ATIS speech data is recorded at a 16 kHz sample rate with 16-bit quantization, from two different microphones: a close-talking (Sennheiser HMD414) and a desk-top (Crown PCC-160) model.

The first disc (ATIS0 Pilot) contains spontaneous utterances elicited in a "Wizard-of-Oz" simulation, along with the relational database containing the travel information (excluding connecting flights). Thirty-six speakers produced a total of 912 utterances.

The second disc (ATIS0 Read) contains ``read'' versions of the spontaneous utterances for 20 of the 36 speakers above, for a total of 478 productions. This is supplemented by a set of 40 ``adaptation'' sentences read by each of the 20 speakers.

The third through the sixth discs (ATIS0 SD-Read) contain "read" speech in the ATIS domain for ten of the speakers on the first disc. They read a total of 3171 utterances, or approximately 317 utterances per speaker. This data was collected for the purpose of training speaker-dependent speech recognition systems for the ATIS0 domain. Two of these four discs contain the close-talking (Sennheiser) microphone data, and the other two contain corresponding data for the desk-top (Crown PCC-160) microphone. Thus there are 6342 waveform files on the four discs.

README file is available.

The entire ATIS0 set of six discs is now offered at a reduced price:

  Item Name:          ATIS0 Complete
  LDC Catalog No.:    LDC93S4A
  NIST Catalog No.:   5-1.1 through 5-6.1
  LDC Release date:   4/94 (MY93)
  Nonmember price:    $1000
  Special license:    NO


ATIS2

The ATIS2 corpus, on four CD-ROMs, contains approximately 15,000 utterances recorded from approximately 450 subjects at five sites: AT&T, BBN, CMU, MIT's Laboratory for Computer Science, and SRI. All utterances have been transcribed and almost 10,000 of them annotated with categorizations and canonical reference answers. Unlike the ATIS0 corpus, much of the data in ATIS2 was collected using partially or fully-automated data collection systems. The fully-automated data collection systems were, in fact, working ATIS prototypes.

For ATIS2, the 10-city relational database of ATIS0 was revised to accommodate connecting flights and fares and some table headings were renamed.

In addition to training data, the February and November '92 ATIS Benchmark Tests are included. Each contains approximately 1,000 utterances from the pool of data collected by the five sites.

Documentation is available.

  Item Name:          ATIS2
  LDC Catalog No.:    LDC93S5
  NIST Catalog No.:   12-1.1 through 12-4.1
  LDC Release date:   4/92 (MY93)
  Nonmember price:    $1000
  Special license:    NO

ATIS3 Training Data

The ATIS3 corpus, on three CD-ROMs, includes over 774 scenarios completed by 137 subjects, yielding a total of over 7,300 utterances. All utterances are transcribed and 2,900 of them have been categorized and annotated with canonical reference answers.

The relational database for this dataset included flight information for 46 cities and 52 airports. Data was collected at BBN, CMU, MIT, and SRI, using their own ATIS systems, and at NIST using systems provided by BBN and SRI.

Two 1000-utterance test sets were set aside from the data pooled by the collection sites. The first set was used in a December 1993 ARPA test, and is included in ATIS3. The second has been reserved for future testing.

Documentation is available.

  Item Name:          ATIS3-1
  LDC Catalog No.:    LDC94S19
  NIST Catalog No.:   17-1.1 through 17-3.1
  LDC Release date:   8/94 (MY94)
  Nonmember price:    $2500
  Special license:    NO

ATIS3-Test Data

This set of discs contains a corpus of speech and natural language data collected under the auspices of the Advanced Research Projects Agency Spoken Language Systems (ARPA-SLS) technology development program. The corpus, which contains data in the Air Travel Information Services (ATIS) domain, was designed by the ARPA-SLS Multi-Site ATIS Data COllection Working (MADCOW) group and was collected by five sites at locations across the U.S.:

BBN Systems & Technologies, Cambridge, MA
Carnegie Mellon University, Pittsburgh, PA
MIT Laboratory for Computer Science, Boston, MA
National Institute of Standards and Technology, Gaithersburg, MD
SRI International, Menlo Park, CA

The corpus on this set of discs is part of the third phase of collection of ATIS data (ATIS3) and comprises the development test (NIST Speech Disc 17-4.2) and evaluation test material (NIST Speech Disc 17-5.1) used in the December 1994 ARPA SLS Benchmark Tests. As in the previous ATIS corpora, the speech contained in this corpus was elicited by presenting subjects with various hypothetical travel planning scenarios to solve. The resulting spontaneous spoken queries were recorded as the subjects interacted with partially or completely automated ATIS systems to solve the scenarios. Note that the ATIS3 training data is available on NIST Speech Discs 17-1.1 through 17-3.1.

The recorded speech has been transcribed and annotated with categorizations and canonical reference answers. All of the utterances on these discs have been recorded using a close-talking, noise-canceling head-mounted Sennheiser microphone. For some subjects, secondary (noisier) microphone data was recorded simultaneously as well.

These discs also contain the ATIS3 46-city/52-airport relational database, a revised Principles of Interpretation, and test implementation and scoring instructions as well as other general documentation.

The ATIS3 corpus has been verified, collated, documented and produced on CD-ROM by the National Institute of Standards and Technology (NIST) in cooperation with MADCOW and distributed by the Linguistic Data Consortium (LDC).

Documentation is available.

  Item Name:          ATIS3-2
  LDC Catalog No.:    LDC95S26
  NIST Catalog No.:   17-4.2 through 17-5.1
  LDC Release date:   7/95
  Nonmember price:    $2000
  Special license:    NO

Continuous Speech Recognition (CSR) Corpora sponsored by ARPA

During 1991, the DARPA Spoken Language Program initiated efforts to build a new corpus to support research on large-vocabulary Continuous Speech Recognition (CSR) systems.

The first two CSR Corpora consist primarily of read speech with texts drawn from a machine-readable corpus of Wall Street Journal news text, and are thus often known as WSJ0 and WSJ1. (Later sections of the CSR set of corpora, however, will consist of read texts from other sources of North American business news, and eventually from other news domains.)

The texts to be read were selected to fall within either a 5,000-word or a 20,000-word subset of the WSJ text corpus. (See the documentation for details.) Some spontaneous dictation is included in addition to the read speech. The dictation portion was collected using journalists who dictated hypothetical news articles.

Two microphones are used throughout: a close-talking Sennheiser HMD414, and a secondary microphone, which may vary. The corpora are thus offered in three configurations: the speech from the Sennheiser, the speech from the other microphone, and the speech from both; all three sets include all transcriptions, tests, documentation, etc.

In general, transcriptions of the speech, test data from ARPA evaluations, scores achieved by various speech recognition systems, and software used in scoring are included on separate discs from the waveform data.

ARPA Continuous Speech Recognition Corpus I: Wall Street Journal Sentences (WSJ0, or CSR-I)

MIT's Laboratory for Computer Science, SRI International and Texas Instruments collected approximately 40 hours of speech and over 31,000 utterances. Prompts were taken from the Wall Street Journal.

Development and evaluation test sets are included and so marked.

Documentation is available.

  Item Name:		CSR-I Complete 

  LDC Catalog No.:  	LDC93S6A

  NIST Catalog No.: 	11-1.1 through 11-12.1, 11-14.1, 11-15.1  

  LDC Release date:	7/93 (MY93) 

  Nonmember price: 	1,000  

  Special license:	NO  

ARPA Continuous Speech Recognition Corpus II: Wall Street Journal Sentences (WSJ1, or CSR-II)

The complete WSJ1 corpus contains approximately 78,000 training utterances (73 hours of speech), 4,000 of which are the result of spontaneous dictation by journalists with varying degrees of experience in dictation. The corpus contains approximately 8,200 "conventional" development test utterances (8 hours of speech), 6,800 of which are from spontaneous dictation. As with the pilot corpus, the entire corpus was collected using two microphones, so the amount of speech in the entire corpus is about 162 hours.
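The 162-hour total follows from the per-microphone figures quoted above. A small arithmetic sketch (the function and variable names are illustrative only and do not come from the corpus documentation):

```python
# Approximate per-microphone hours, as quoted in the description above.
training_hours = 73
dev_test_hours = 8
microphones = 2  # every utterance was recorded on two microphones


def total_hours(training, dev_test, mics):
    """Total recorded speech across all microphone channels."""
    return (training + dev_test) * mics


print(total_hours(training_hours, dev_test_hours, microphones))
```

The figure counts each utterance once per microphone, so a single-microphone configuration would hold about half of it.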

In early 1993, a "Hub and Spoke" test paradigm was designed, calling for eleven test sets, each a specific variation of the basic or "hub" condition. The eleven Hub and Spoke Development and Evaluation Test sets each contain approximately 7500 waveforms (11 hours of speech).

WSJ1 waveforms have been compressed by about 2:1 using the SPHERE-embedded "Shorten" compression algorithm developed at Cambridge University.

Documentation is available.


  Item Name:		CSR-II Complete 

  LDC Catalog No.:  	LDC94S13A

  NIST Catalog No.: 	13-1.1 through 13-34.1  

  LDC Release date:	7/93 (MY94) 

  Nonmember price: 	1,500  

  Special license:	NO  

1994 Benchmark Speech Test Collection for the ARPA Continuous Speech Recognition Program (CSR-III Speech)

The third ARPA Continuous Speech Recognition (CSR) Benchmark Speech Test Collection is a three CD-ROM set that contains complete development test and evaluation test suites for speaker-independent, large-vocabulary speech recognition systems.

The development and evaluation tests share a common structure, consisting of two core test components ("hubs") and seven specialized test components ("spokes"). The hub tests, which were mandatory for all ARPA CSR participants in the November '94 evaluations, provide a baseline for ASR performance, while the spokes provide the means for assessing the impact of particular speaking conditions or processing strategies in relation to baseline performance. (Participants were free to take any combination of spoke tests according to their research interests.) Taken together, the collection encompasses 180 speakers, each producing twenty to forty sentences. These are organized into two complete development test sets and one evaluation set.

The collection also includes complete documentation on the test specifications, data collection procedures, transcriptions, and scoring protocols, together with the latest available version of NIST software for scoring ASR results and managing SPHERE waveform files. All speech data is accompanied by both the prompting texts and the detailed orthographic transcriptions of the utterances.

This was the first ARPA CSR Benchmark Test in which prompting texts were drawn from a variety of news sources. Whereas earlier benchmarks were based on Wall Street Journal excerpts (from the period 1987-89), CSR-III prompts come from a variety of North American Business News Services: Reuters News Service, New York Times, Washington Post and Los Angeles Times as well as WSJ; all texts are drawn from financial news articles written during the period of April through June, 1994. (NAB stands for "North American Business", in contrast to earlier benchmarks and training collections labeled "WSJ".)

An important companion to the 1994 Benchmark Speech data collection is the 4-disk CSR-III Text Collection, which includes the ARPA CSR 1994 Standard Language Model. The collection comprises both source text data (prepared by LDC and BBN) and derived statistical tables (compiled by CMU) of unigram, bigram and trigram word frequencies. The sources include all available WSJ texts, spanning 1987 through March 1994, and all AP and San Jose Mercury news data from the three TIPSTER volumes. (Some of the WSJ data, from 1992 through 1994, appears here for research use for the first time.) This corpus is also available from the LDC as a 1995 release.
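To illustrate what unigram, bigram and trigram frequency tables contain, here is a minimal sketch in Python. It is not the CMU tooling used to build the Standard Language Model; the function name and the toy sentence are assumptions made for the example.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy sentence, not taken from the corpus.
tokens = "stocks rose as bond prices rose".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(unigrams[("rose",)])          # occurrences of the single word "rose"
print(bigrams[("prices", "rose")])  # occurrences of the pair "prices rose"
print(len(trigrams))                # number of distinct three-word runs
```

A real language model would go on to turn such counts into smoothed probabilities; the sketch stops at the raw frequency tables.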

Because of restrictions imposed by the copyright holders of much of the NAB text, both the speech and text collections are available to LDC members only. For more information on how to join, send email to

README file is available.

  Item Name:            CSR-III Text:Language Model.

  LDC Catalog No.:      LDC95T6

  NIST Catalog No.:     NIST22-1.1-22-4.1,23-1-1,25-1.1-25-3.1

  LDC Release date:     2/95 (MY95)

  Nonmember price:      M/O

  Special license:      NO

  Item Name:		CSR-III Speech: Development and Evaluation Data.

  LDC Catalog No.:      LDC95S23

  NIST Catalog No.:     NIST22-1.1-22-4.1,23-1-1,25-1.1-25-3.1

  LDC Release date:     2/95 (MY95)

  Nonmember price:      M/O

  Special license:      NO

DARPA Continuous Speech Recognition Corpus-IV: Radio Broadcast News (CSRIV Hub-4)

This set of CD-ROMs contains all of the speech data provided to sites participating in the DARPA CSR November 1995 Hub-4 (Radio) Broadcast News tests. The data consists of digitized waveforms of MarketPlace (tm) business news radio shows provided by KUSC through an agreement with the Linguistic Data Consortium, and detailed transcriptions of those broadcasts. The software NIST used to process and score the output of the test systems is also included.

The data is organized as follows:

CD26-1: Training Data - Ten complete half-hour broadcasts with minimally-verified transcripts. The transcripts are time aligned with the waveforms at the story-boundary level.

CD26-2: Development-Test Data - Six complete half-hour broadcasts with verified transcripts. The transcripts are time aligned with the waveforms at the story- and turn-boundary level. Index files have been included which specify how the data may be partitioned into 2 test sets.

CD26-6: Evaluation-Test Data - Five complete half-hour broadcasts with verified/adjudicated transcripts. The transcripts are time aligned with the waveforms at the story-, turn-, and music-boundary level. An index file has been included which specifies how the data was partitioned into the test set used in the CSR 1995 Hub-4 tests.

  Item Name:		CSR-IV (Hub 4)

  LDC Catalog No.:      LDC96S31

  NIST Catalog No.:     NIST26-1.1-26-2.1,26-6-1

  LDC Release date:     5/96 (MY96)

  Nonmember price:      $2500

  Special license:      YES

DARPA Continuous Speech Recognition Corpus-IV: Multi-Microphone (CSR-IV Hub-3)

This set of CD-ROMs contains all of the speech data provided to sites participating in the DARPA CSR November 1995 Hub-3 Multi-Microphone tests. The data consists of digitized waveforms collected with eight different microphones simultaneously from 40 subjects reading 15-sentence articles drawn from various North American business news publications. The data is partitioned into development-test and evaluation-test sets. The test sets were collected with different subjects, prompts, and microphones. No training data was collected for this corpus since a substantial amount of NAB acoustic training data was already available. Index files have been included that specify the exact subset of the evaluation test recordings which were used in the November 1995 tests. The software NIST used to process and score the output of the test systems is also included.

The data is organized as follows:

CD26-3: Development-Test Data - Location 1, Adaptation and NAB recordings, Subjects: 703-705, 707-70a, 70c, 70f, 70g

CD26-4: Development-Test Data - Location 2, NAB recordings, Subjects: 70k, 70m, 70o, 70q-70s, 70u-70w

CD26-5: Development-Test Data - Location 2, Adaptation recordings, Subjects: 70k, 70m-70o, 70q-70s, 70u-70w

CD26-7: Evaluation-Test Data - NAB recordings, Subjects: 710-71j

  Item Name:		CSR-IV (Hub 3)

  LDC Catalog No.:      LDC96S33

  NIST Catalog No.:     NIST26-3.1, 26-4.1, 26-5-1, 26-7.1

  LDC Release date:     6/96 (MY96)

  Nonmember price:      M/O

  Special license:      YES

SWITCHBOARD Corpus of Recorded Telephone Conversations

SWITCHBOARD is a collection of about 2400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States. A computer-driven "robot operator" system handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person to take part in a conversation, introducing a topic for discussion, and recording the speech from the two subjects into separate channels until the conversation was finished. About 70 topics were provided, of which about 50 were used frequently. Selection of topics and callers was constrained so that: (1) no two speakers would converse together more than once, and (2) no one spoke more than once on a given topic.

Waveform files were recorded into two channels directly from the T1 digital telephone circuits, at an 8kHz sample rate and 8-bit mu-law quantization. Complete orthographic transcriptions were made for each conversation, with codes to identify overlapping portions (both speakers talking at the same time), certain non-speech events (laughter, coughs, etc), and interruptions/hesitations. Each conversation was also rated by transcribers for various quality factors (amount of cross-talk between channels, static and background noise, topicality, etc). In addition, each transcription was verified, and then used in a forced speech-recognition algorithm to establish timing marks for word and utterance boundaries; transcriptions are provided in the corpus in both "plain text" and "time-aligned" forms. A description is published in the 1993 ICASSP Proceedings: Godfrey, McDaniel, and Holliman, "SWITCHBOARD: A Telephone Speech Corpus for Research and Development."
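For readers unfamiliar with 8-bit mu-law quantization, the following Python sketch expands mu-law codes to linear samples. It assumes the standard G.711 mu-law encoding and, for the channel split, that the two sides alternate byte by byte after the file header has been removed; neither assumption is confirmed by this description, so the corpus documentation should be checked. The function names are hypothetical.

```python
BIAS = 0x84  # bias added during G.711 mu-law encoding


def mulaw_to_linear(byte):
    """Expand one 8-bit mu-law code to a signed linear sample."""
    u = ~byte & 0xFF              # mu-law bytes are stored bit-inverted
    exponent = (u >> 4) & 0x07    # 3-bit segment number
    mantissa = u & 0x0F           # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if u & 0x80 else magnitude


def split_channels(data):
    """Decode interleaved two-channel data (assumed order: A, B, A, B, ...)."""
    side_a = [mulaw_to_linear(b) for b in data[0::2]]
    side_b = [mulaw_to_linear(b) for b in data[1::2]]
    return side_a, side_b


# Four hand-picked codes: silence and full scale of each sign.
a, b = split_channels(bytes([0xFF, 0x80, 0x00, 0x7F]))
print(a, b)
```

Decoded values fit in a signed 16-bit range, so the result can be written out as ordinary 16-bit linear audio.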

The original issue of SWITCHBOARD in early 1993 lacked about 150 conversations which were intended for publication but omitted by error. They were published in May 1994 and distributed to all previous recipients of SWITCHBOARD.

The Switchboard Corpus was collected at Texas Instruments and produced on CD-ROM at the National Institute of Standards and Technology. It is distributed in a notebook-style binder with 28 CD-ROMs (27 containing speech data, and one containing all transcription data). Preparation of the data for CD-ROM production was done by NIST. The waveform files use the NIST SPHERE format.

README file is available.


  Item Name:		SWITCHBOARD 

  LDC Catalog No.:  	LDC93S7

  NIST Catalog No.: 	9-1.1, 9-3.1 through 9-29.1  

  LDC Release date:	4/92 (MY93) 

  Nonmember price: 	10000  

  Special license:	NO  

SWITCHBOARD Corpus Excerpts, Credit Card Conversations

This CD-ROM contains 35 conversations on the topic of "Credit Card Use". Most but not all can also be found in the Switchboard Corpus (see above). The conversations can be used in training and testing wordspotting systems. In addition to 2-channel mu-law encoded audio waveform files, the disc contains transcriptions, time-alignments, and wordspotting targets.

README file is available.


  Item Name:		SWITCHBOARD Credit Card  

  LDC Catalog No.:  	LDC93S8

  NIST Catalog No.: 	8-1.2  

  LDC Release date:	5/92 (MY93) 

  Nonmember price: 	1000  

  Special license:	NO  

Texas Instruments 46-Word Speaker-Dependent Isolated Word Corpus (TI46)

This CD-ROM contains a corpus of speech which was originally designed and collected at Texas Instruments, Inc. (TI) in 1980, and used initially in performance assessment tests of isolated-word speaker-dependent technology. (See ``Speech Recognition: Turning Theory to Practice'' by G. R. Doddington and T. B. Schalk, in IEEE Spectrum, Vol. 18, No. 9, September 1981.)

The 46-word vocabulary consists of two sub-vocabularies: (1) the TI 20-word vocabulary (consisting of the digits zero through nine plus the words "enter", "erase", "go", "help", "no", "rubout", "repeat", "stop", "start", and "yes"), and (2) the TI 26-word "alphabet set" (consisting of the letters "a" through "z").

The corpus contains read utterances from 16 speakers (8 males and 8 females) each speaking 26 utterances of the 46-word vocabulary: 16 tokens designated as training and 10 as test.

The corpus was collected at Texas Instruments in a quiet acoustic enclosure using an Electro-Voice RE-16 Dynamic Cardioid microphone at 12.5kHz sample rate with 12-bit quantization. The files are in NIST SPHERE format, and have a ".wav" filename extension.
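Because the ".wav" extension here denotes NIST SPHERE rather than RIFF/WAVE, generic audio tools may not read these files directly. The sketch below parses a SPHERE text header in Python, assuming the usual layout: a "NIST_1A" line, a header-size line, then "name type value" lines terminated by "end_head". The sample header is constructed for illustration and is not copied from a TI46 file; parse_sphere_header is a hypothetical helper, not part of the NIST SPHERE package.

```python
def parse_sphere_header(raw):
    """Parse the text header at the start of a NIST SPHERE file."""
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    header_size = int(lines[1].strip())  # total header length in bytes
    fields = {}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line:
            continue
        name, ftype, value = line.split(None, 2)
        if ftype == "-i":        # integer field
            fields[name] = int(value)
        elif ftype == "-r":      # real-valued field
            fields[name] = float(value)
        else:                    # "-sN": string of N characters
            fields[name] = value
    return header_size, fields


# Illustrative header, padded with spaces to the declared 1024 bytes.
sample = (
    b"NIST_1A\n   1024\n"
    b"sample_rate -i 12500\n"
    b"channel_count -i 1\n"
    b"sample_coding -s3 pcm\n"
    b"end_head\n"
).ljust(1024)

size, fields = parse_sphere_header(sample)
print(size, fields)
```

In a real file the waveform data begins at the byte offset given by the header size, so a reader would seek to that offset before decoding samples.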

README file is available.

  Item Name:		TI 46 Word  

  LDC Catalog No.:  	LDC93S9

  NIST Catalog No.: 	7-1.1  

  LDC Release date:	4/92 (MY93) 

  Nonmember price: 	125  

  Special license:	NO  

Texas Instruments Speaker-Independent Connected-Digit Corpus (TIDIGITS)

This three-disc set contains speech which was originally designed and collected at Texas Instruments, Inc. (TI) for the purpose of designing and evaluating algorithms for speaker-independent recognition of connected digit sequences. There are 326 speakers (111 men, 114 women, 50 boys, and 51 girls) each pronouncing 77 digit sequences. Each speaker group is partitioned into test and training subsets.

The corpus was collected at TI in 1982 in a quiet acoustic enclosure using an Electro-Voice RE-16 Dynamic Cardioid microphone, digitized at 20kHz. The waveform files are in the NIST SPHERE format.

README file is available.


  Item Name:		TIDIGITS  

  LDC Catalog No.:  	LDC93S10

  NIST Catalog No.: 	4-1, 4-2, 4-3  

  LDC Release date:	4/92 (MY93) 

  Nonmember price: 	250  

  Special license:	NO  

The HCRC Map Task Corpus

The Map Task Corpus is a set of 8 CD-ROMs containing a total of about 18 hours of spontaneous speech that was recorded from 128 two-person conversations, involving 64 different speakers (32 female, 32 male, all adults, each taking part in four conversations). The 64 speakers were all students at the University of Glasgow, 61 of them being native Scots. The conversations were carried out in an experimental setting, in which each participant has a schematic map in front of them, not visible to the other. Each map comprises an outline and roughly a dozen labelled features (e.g. "a white cottage", "an oak forest", "Green Bay", etc). Most features are common to the two maps, but not all. One map has a route drawn in, the other does not. The task is for the participant without the route to draw one on the basis of discussion with the participant with the route. In addition to the conversations, each speaker provides a wordlist reading, consisting of the major vocabulary items contained in the conversations.

The experimental design allows a number of different phonemic, syntactico-semantic and pragmatic contrasts to be explored in a controlled way. In particular, maps and feature names were designed to allow for controlled exploration of phonological reductions of various kinds in a number of different referential contexts, and to provide, via varying patterns of matches and mis-matches between the two maps, a range of different stimuli for referent negotiation. Also the conditions of the conversations were carefully balanced: In half of them the talkers were strangers, in half friends; in half of them the talkers could see each other's faces, in half they could not.

The waveform data are provided in "raw" (headerless) files (16-bit samples, 20 kHz sample rate, 2 channels per conversation), and alternative header files are provided for use with software based on either the NIST ``SPHERE'' header structure or the European ``SAM'' header structure. Text transcriptions are provided for each conversation, along with PostScript files of the map images used in the experiments. Additional materials include full documentation of the experimental design and data collection protocol, resources for using SGML tools on the transcriptions and other text materials, and an extensive set of source code for performing basic signal processing functions on the waveform data, such as down-sampling, de-multiplexing, channel summation, and D/A conversion for Sun workstations (including playback of segments selected via inspection of transcripts in Emacs).
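As a rough illustration of the de-multiplexing step mentioned above, the Python sketch below splits raw two-channel 16-bit data into separate channels. It assumes the samples are interleaved (channel 0, channel 1, channel 0, ...) and little-endian by default; the byte order and interleaving are assumptions to be checked against the corpus documentation, and the function names are hypothetical rather than taken from the supplied source code.

```python
import struct


def demultiplex(raw, byteorder="<"):
    """Split interleaved 16-bit two-channel samples into two lists.

    Pass byteorder=">" if the files turn out to be big-endian.
    """
    count = len(raw) // 2  # number of 16-bit samples in the buffer
    samples = struct.unpack(f"{byteorder}{count}h", raw[:count * 2])
    return list(samples[0::2]), list(samples[1::2])


def duration_seconds(samples_per_channel, sample_rate=20000):
    """Length of one channel at the 20 kHz rate quoted above."""
    return samples_per_channel / sample_rate


# Six synthetic samples: three per channel.
raw = struct.pack("<6h", 1, -1, 2, -2, 3, -3)
first, second = demultiplex(raw)
print(first, second, duration_seconds(len(first)))
```

Channel summation, also mentioned above, would then amount to adding the two lists element by element, with care taken to avoid 16-bit overflow.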

README file is available.

  Item Name:		HCRC MAP TASK  

  LDC Catalog No.:  	LDC93S12

  NIST Catalog No.: 	NA  

  LDC Release date:	4/92 (MY93) 

  Nonmember price: 	200  

  Special license:	NO  

Air Traffic Control Corpus (ATC0)

The Air Traffic Control Corpus (ATC0) is an eight-disc set of recorded speech for use in supporting research and development activities in the area of robust speech recognition in domains similar to air traffic control (several speakers, noisy channels, relatively small vocabulary, constrained language, etc.). The audio data on these discs is composed of voice communication traffic between various controllers and pilots.

The audio files are 8 kHz, 16-bit linear sampled data, representing continuous monitoring, without squelch or silence elimination, of a single FAA frequency for one to two hours. There are also files which indicate the amplitude of the received AM carrier signal at 10 msec. intervals.

Full transcripts, including the start and end times of each transmission, are provided for each audio file. Each flight is identified by its flight number.

ATC0 consists of three subcorpora, one for each airport in which the transmissions were collected -- Dallas Fort Worth (DFW), Logan International (BOS), and Washington National (DCA). The complete set contains approximately 70 hours of controller and pilot transmissions collected via antennas and radio receivers which were located in the vicinity of the respective airports.

Detailed information regarding the collection process and the equipment used can be found on each disc in the file "atc.doc" in the "doc" directory.

The ATC0 Corpus was collected by Texas Instruments under contract to ARPA. It was produced on CD-ROM by the National Institute of Standards and Technology for distribution by the Linguistic Data Consortium.

README file is available.


   LDC Catalog No.:  	LDC94S14

   NIST Catalog No.: 	16-1.1 through 16-8.1  

   LDC Release date:	3/94 (MY94) 

   Nonmember price: 	1250  

   Special license:	NO    

SPIDRE Speaker Identification Corpus

This is a 2-CD subset of the SWITCHBOARD collection (see above), selected for speaker ID research, and with special attention to telephone instrument variation. It contains training and testing data for experiments in closed or open set recognition or verification. Combining the two sides of the conversations also permits speaker change detection, or speaker monitoring, experiments.

There are 45 ``target'' speakers; four conversations from each target are included, of which two are from the same handset. There are also 100 calls in which no target appears. Since all conversations are two-sided, this results in 180 target sides and 180 + 200 = 380 nontarget sides.
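The side counts above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only; it assumes, as the quoted totals imply, that each target conversation pairs one target with one nontarget speaker. The function name is hypothetical.

```python
def spidre_side_counts(target_speakers=45, calls_per_target=4,
                       nontarget_calls=100, sides_per_call=2):
    """Return (target sides, nontarget sides) for the SPIDRE subset."""
    target_sides = target_speakers * calls_per_target
    # Each target conversation contributes one nontarget partner side.
    partner_sides = target_sides
    # Calls with no target contribute both of their sides.
    other_sides = nontarget_calls * sides_per_call
    return target_sides, partner_sides + other_sides


print(spidre_side_counts())
```

The two totals together give the number of single-sided recordings available for experiments.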

Except for truncations of a few longer calls at 5 minutes, the calls themselves are as described under SWITCHBOARD.

README file is available.

  Item Name:		SPIDRE  

  LDC Catalog No.:  	LDC94S15 

  NIST Catalog No.: 	18-1.1 and 18-2.1  

  LDC Release date:	4/94 (MY94) 

  Nonmember price: 	2000  

  Special license:	NO  

YOHO Speaker Verification Corpus

The YOHO database is a three-disc set containing a large scale, high-quality speech corpus to support text-dependent speaker authentication research, such as is used in "secure access" technology. The data was collected in 1989 by ITT under a US Government contract, but has not been available for public use before. Note that certain changes have been made to the corpus, mainly to ensure the privacy of the speakers, and some data has been withheld by the government for future use in testing.

YOHO contains:

The number of trials is thus sufficient to permit evaluation testing at high confidence levels. In each session, a speaker was prompted with a series of phrases to be read aloud; each phrase was a sequence of three two-digit numbers (e.g. ``35 - 72 - 41,'' pronounced ``thirty-five seventy-two forty-one''). The first four sessions for a given speaker were enrollment sessions of 24 phrases, and all additional sessions were verification trials of four phrases each. In all there are 552 enrollment sessions, and 1380 trial sessions, with a nominal time interval of three days between sessions.
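The session figures above imply a few further totals, which the following Python sketch works out. The derived numbers (speaker count, average verification sessions per speaker, phrase totals) are inferences from the quoted figures, not values stated in this description, and the variable names are illustrative.

```python
# Figures quoted in the description above.
enrollment_sessions = 552
verification_sessions = 1380
enrollment_sessions_per_speaker = 4
phrases_per_enrollment_session = 24
phrases_per_verification_session = 4

# Derived values (inferences, not stated in the source text).
speakers = enrollment_sessions // enrollment_sessions_per_speaker
avg_verification_per_speaker = verification_sessions / speakers
enrollment_phrases = enrollment_sessions * phrases_per_enrollment_session
verification_phrases = verification_sessions * phrases_per_verification_session

print(speakers, avg_verification_per_speaker)
print(enrollment_phrases, verification_phrases)
```

Since "all additional sessions" were verification trials, the per-speaker figure is an average; individual speakers may have completed more or fewer sessions.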

README file is available.

  Item Name:		YOHO  

  LDC Catalog No.:	LDC94S16 

  NIST Catalog No.:	NA  

  LDC Release date:	4/94 (MY94)  

  Nonmember price:	750 

  Special License: 	NO  

OGI Multi-Language Corpus

The corpus consists of responses to prompts spoken over commercial telephone lines by speakers of English, Farsi (Persian), French, German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and Vietnamese. It contains a total of 1927 calls, an average of 175 calls per language.

Speech was collected using an automated system that answered the telephone, played digitized prompts in the appropriate language to request the speech samples, and digitized the callers' responses for a designated period of time.

Log files are included that provide a set of automatic measurements made on each utterance. In addition, some utterances were automatically segmented into broad phonetic categories. The speech data are compressed, with NIST SPHERE headers.

README file is available.


  LDC Catalog No.:  	LDC94S17   

  NIST Catalog No.: 	NA  

  LDC Release date:	4/94 (MY94) 

  Nonmember price: 	200  

  Special license:	NO  

OGI Spelled and Spoken Telephone Corpus

The OGI Spelled and Spoken Telephone Corpus consists of speech recordings from over 3650 telephone calls, each made by a different speaker to an automated prompting/recording system installed at the Oregon Graduate Institute. Speakers were asked to say their name, where they were calling from, and where they grew up; they were asked to answer a couple of yes/no questions, and to spell their first and last names; many were also asked to repeat a few specific words, and to recite the letters of the alphabet.

Each response to a prompt is stored as a separate waveform file, and the files are organized according to prompt (response type); all responses from a given call have a unique caller-index number as part of the file name, so that responses can easily be sorted by speaker. Waveform data are stored in compressed form, using the NIST SPHERE 2.0 software package, which is available separately at no charge to users. SPHERE 2.0 provides the decompression software needed to extract the waveform data, as well as tools for accessing and modifying file headers.

Time-aligned phonetic transcriptions are provided for a subset of responses, and a complete log of each call (giving speaker sex, quality judgments, and orthographic transcriptions of all responses) is included in a form suitable for use as a relational data base.

README file is available.

  Item Name:            OGI SPELLED SPOKEN WORD  

  LDC Catalog No.:      LDC94S18  

  NIST Catalog No.:     NA  

  LDC Release date:     4/94 (MY94) 

  Nonmember price:      100  

  Special license:      NO  


The recordings on this nine-disc set were originally made in 1978-79 as part of a British Home Office study into speaker identification techniques. Subsequently, it was realised that a large body of unconstrained conversational material might be of interest to researchers working in other speech processing fields. The recordings were transcribed and the CD-ROMs prepared during 1993.

The recordings were made at the Police Staff College, Bramshill, Hampshire, England. The participants were police officers taking part in the various courses at the college. This provided a wide range of regional accents and a range of ages from late teens to early fifties. Each speaker is described by nine demographic attributes.

Three adjacent bedrooms were used. The two participants, each alone in their rooms, conversed by telephone. The third room was used as a monitoring and recording station.

In addition to the telephone recordings, reference recordings were made using a high quality dynamic microphone in each room. It is these higher quality recordings, not the telephone speech, which are provided on the BRAMSHILL CD-ROM set.

The recordings were made on a Sony Elcaset EL-7 cassette machine, chosen at the time because of its good speed stability. The microphone was a Shure SM-7 cardioid type. The speech data was sampled at 10 kHz, 16-bit resolution.

Some attempt was made to control the acoustic environment. It is evident from listening to the recordings that, while these measures produced a reasonable recording environment, the rooms were far from soundproof. A variety of external noises (engines, aircraft, etc) can be heard on some of the recordings.

Each speaker was given a pile of photographs. In response to a bleep signal, each speaker introduced himself by name and read a set of test sentences. After this, the main part of the conversation took place, in which participants were asked to determine which of each pair of photographs had been taken first (if indeed they were related at all). The conversations continued for 10 minutes until terminated by another bleep signal.

During the digitisation process, some periods of silence were removed, so some recordings now appear to be shorter than the original ten minutes. Furthermore, this means that recordings of two sides of a conversation are no longer time-aligned. In addition, to preserve the anonymity of the speakers, some passages (mainly the introductions) have been erased by replacing with binary zeroes. Finally the bleep signals have also been erased with binary zeroes. The transcriptions indicate where this has occurred.

The speech was transcribed verbatim. No attempt was made to correct grammar, fill in missing words etc. Transcription conventions are detailed in the documentation. Every lexical word from the transcriptions is contained in the dictionary supplied in the INDEX directory. There are about 6500 word types in the 600k words of the transcripts. Contractions, part-words, slang words, hesitation sounds and non-speech sounds are all treated as words in their own right in the dictionary.

  Item Name:            BRAMSHILL  

  LDC Catalog No.:      LDC94S20 

  NIST Catalog No.:     NA  

  LDC Release Date:     8/94 (MY94)  

  Nonmember price:      500  

  Special license:      NO  


MACROPHONE consists of approximately 200,000 utterances by 5000 speakers. It is designed to provide material sufficient and suitable for research, development, and evaluation of automatic speech recognition technology for common telephone applications, such as shopping, transportation, database access, and autodialing. In addition to application-oriented phrases and numerous digit strings, seven sentences are spoken by each talker to provide ensemble phoneme, diphone and triphone coverage of the language. The spoken material also refers to times, locations, monetary amounts, spellings, and interactive operations.

The utterances were collected automatically over the telephone network by recording directly from a T1 connection in 8 kHz, 8-bit mu-law format. The participants, roughly equal numbers of males and females, were solicited by a marketing firm from all regions of the United States. They ranged in age from the teens to the seventies, and represented a broad range of educations and incomes as well. Each recorded utterance is accompanied by an orthographic transcription which also notes any unusual acoustic events or anomalies. Macrophone is the American English contribution to an international database of telephone speech corpora called POLYPHONE. Similar data sets are expected for major languages of the world, and at least some of these will be made available through LDC. Prospects are currently good for American Spanish (by early 1995), Dutch, Standard French, Standard German, Japanese, Mandarin Chinese, Swiss French, and Danish versions of POLYPHONE, all with basically the same structure and methods of collection.

MACROPHONE was collected at SRI under LDC sponsorship. A paper describing it was presented at ICASSP-94: ``Macrophone: An American English Telephone Speech Corpus for the POLYPHONE Project,'' by Jared Bernstein, Kelsey Taussig, and Jack Godfrey.

README file is available.

  Item Name:		MACROPHONE  

  LDC Catalog No.:  	LDC94S21

  NIST Catalog No.: 	NA  

  LDC Release date:	8/94 (MY94) 

  Nonmember price: 	10000  

  Special license:	NO  

The KING-92 Corpus for Speaker Verification Research

The KING corpus was collected at ITT in 1987 under a US government research contract, and although other contractors have received it, it has not been officially available for public use before now. The version now available from LDC, referred to as KING-92, is based on a 1992 reprocessing of the original recordings (see below). It contains recorded speech from 51 male speakers in two versions, which differ in channel characteristics: one from a telephone handset and one from a high-quality microphone. The speakers are further subdivided into two groups, 25 in one and 26 in the other, who were recorded at different locations. For each speaker and channel there are ten files, corresponding to sessions of about 30 to 60 seconds' duration each. The interval between sessions varies from a week to a month. The transcripts contain about 54k word tokens (4.8k types).

KING is designed principally for closed set experiments in text-independent speaker identification or verification over toll-quality telephone lines, although the single-sided collection format does not permit simulation of real telephone traffic. The ten sessions allow for a variety of divisions into training and test data, with the possibility of multiple test sets. For example, one could examine the effects of the amount of training on performance, or examine the variability of performance over several test samples (sessions) given a fixed amount of training (but see below about the "Great Divide".)

The collection method used in KING was to establish a call from a laboratory location at ITT (either San Diego, CA or Nutley, NJ) over long distance lines and back to another phone at the same location. The phones used by the test subjects were equipped with an additional microphone, so two parallel recordings were made of that side of the conversation, while the interlocutor's side was not recorded. The two parties either spoke spontaneously or carried out a variety of tasks designed to elicit natural-sounding speech: interpreting a drawing, solving a problem, describing a picture, etc.

There were 25 speakers in Nutley and 26 in San Diego. Speech-to-noise ratios average about 10 dB worse for the Nutley telephone data than for San Diego; in fact it is less than 20 dB for over half the Nutley files. Users of this corpus therefore usually run separate experiments, or at least report results separately, according to site. A more subtle difference in the recordings, however, sometimes referred to as the ``Great Divide,'' cuts across the telephone data for the San Diego speakers. This was apparently due to a minor equipment change which was made during the collection; it results in a slight but consistent change in the average long term spectrum of the telephone data recorded after the fifth session. Training and testing on data from the same side of this divide gives significantly better results than across it. Since the discovery of this difference, investigators now generally report results on the first and last five sessions of the San Diego telephone KING data separately, or they report within vs. across this boundary. A detailed description of the spectral differences can be found in a report by Thomas Crystal and Ned Neuburg which accompanies the CD-ROM version.

Since there are a number of published papers with results based on the original KING corpus, and two versions of the data in existence, note that the new CD-ROM version, called KING-92, is based on a 1992 re-issue of the data from ITT. It differs from the original corpus in a few details:

README file is available.

  Item Name:            KING

  LDC Catalog No.:      LDC95S2

  NIST Catalog No.:     NA

  LDC Release date:     4/95 (MY95)

  Nonmember price:      $2500

  Special license:      NO


A British English Speech Corpus for Large Vocabulary Continuous Speech Recognition
(The Cambridge University Version of the ARPA CSR Corpus "WSJ0")

This release of WSJCAM0 on CD-ROM represents version 1.1 of the corpus, which was initially released on tape by Cambridge University as of 31 August, 1994. This collection is modelled directly on the initial ARPA CSR Corpus (WSJ0, a fifteen-disc corpus released by LDC in 1993): it uses the same dual-microphone recording paradigm and a subset of prompting texts drawn from the Wall Street Journal.

There are two key differences between WSJ0 and WSJCAM0: (1) the subjects in WSJCAM0 are native speakers of British English, and (2) in addition to standard orthographic transcripts, WSJCAM0 also has information on the time alignment between the sampled waveform and both the words and the phonetic segments.

The CD-ROM publication consists of six discs, with contents organized as follows:

There are 90 utterances from each of 92 speakers that are designated as training material for speech recognition algorithms. An additional 48 speakers each read 40 sentences containing only words from a fixed 5,000 word vocabulary, and another 40 sentences using a 64,000 word vocabulary, to be used as testing material. Each of the total of 140 speakers also recorded a common set of 18 adaptation sentences. Recordings were made from two microphones: a far-field desk microphone and a head-mounted close-talking microphone.

Within the train and test sets, speech data are organized by speaker; prompting texts, detailed transcriptions and speaker information are included in each speaker directory.

All waveform files have NIST SPHERE headers; waveform data are compressed using the "Shorten" algorithm developed by Tony Robinson at Cambridge University, as adapted for use in the NIST SPHERE software package. (This package is available via anonymous ftp from NIST, on ftp server "" in the "pub" directory.) Complete documentation is provided on each disc in the set.
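As a rough illustration of what a NIST SPHERE header looks like to a program, the following sketch reads the ASCII header fields from the start of a file. It assumes the usual SPHERE layout (a "NIST_1A" line, a header-size line, then "name -type value" lines up to "end_head"); the sample header is made up for the example, and the sketch does not decompress Shorten-coded waveform data, for which the NIST SPHERE package itself is needed.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header found at the start of a NIST SPHERE file."""
    magic, size_line = data[:16].decode("ascii").split("\n")[:2]
    if magic != "NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(size_line)  # total header length in bytes

    fields = {}
    body = data[16:header_size].decode("ascii", errors="replace")
    for line in body.split("\n"):
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(" ", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":                # integer field
            fields[name] = int(value)
        elif ftype == "-r":              # real-valued field
            fields[name] = float(value)
        elif ftype.startswith("-s"):     # string field of N characters
            fields[name] = value[: int(ftype[2:])]
    return fields


# Hypothetical header, padded to the declared 1024 bytes.
sample = (
    b"NIST_1A\n   1024\n"
    b"sample_rate -i 16000\n"
    b"channel_count -i 1\n"
    b"sample_coding -s26 pcm,embedded-shorten-v2.00\n"
    b"end_head\n"
).ljust(1024, b"\n")

header = parse_sphere_header(sample)
print(header)
```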

Documentation available.

  Item Name:		WSJCAM0

  LDC Catalog No.:  	LDC95S24  

  NIST Catalog No.: 	NA  

  LDC Release date:	February 1995 (MY95) 

  Nonmember price: 	2000  

  Special license:	NO 

The TRAINS Spoken Dialog Corpus

This CD-ROM contains a corpus of task-oriented spoken dialogs. These dialogs were collected as part of the TRAINS project, a project to develop a conversationally proficient planning assistant, which helps a user construct a plan to achieve some task involving the manufacturing and shipment of goods in a railroad freight system. The collection procedure was designed to make the setting as close to human-computer interaction as possible, but was not a "wizard" scenario, where one person pretends to be a computer. Thus these dialogs provide a snapshot into an ideal human-computer interface that would be able to engage in fluent conversations.

Altogether, this corpus includes 98 dialogs, collected using 20 different tasks and 34 different speakers. This amounts to six and a half hours of speech, about 5900 speaker turns, and 55000 transcribed words.

Documentation available.

Item Name:              TRAINS

LDC Catalog No.:        LDC95S25          

NIST Catalog No.:       NA  

LDC Release date:       5/95     

Nonmember price:        $2500     

Special license:        NO

The NYNEX PhoneBook Database

PhoneBook is a phonetically-rich, isolated-word, telephone-speech database, created because of (1) the lack of available large-vocabulary isolated-word data, (2) anticipated continued importance of isolated-word and keyword-spotting technology to speech-recognition-based applications over the telephone, and (3) findings that continuous-speech training data is inferior to isolated-word training for isolated-word recognition.

The goal of PhoneBook is to serve as a large database of American English word utterances incorporating all phonemes in as many segmental/stress contexts as are likely to produce coarticulatory variations, while also spanning a variety of talkers and telephone transmission characteristics. We anticipate that it will be useful in ways analogous to TIMIT/NTIMIT.

The core section of PhoneBook consists of a total of 93,667 isolated-word utterances, totalling 23 hours of speech. This breaks down to 7979 distinct words, each said by an average of 11.7 talkers, with 1358 talkers each saying up to 75 words. All data were collected in 8-bit mu-law digital form directly from a T1 telephone line. Talkers were adult native speakers of American English chosen to be demographically representative of the U.S.
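For readers unfamiliar with 8-bit mu-law coding, the following sketch expands mu-law bytes into 16-bit linear sample values using the standard G.711 expansion. It makes no assumption about how PhoneBook files are named or laid out on the disc; the input here is simply a sequence of raw mu-law bytes.

```python
def mulaw_to_linear(code: int) -> int:
    """Expand one 8-bit G.711 mu-law code into a signed linear sample."""
    code = ~code & 0xFF               # mu-law bytes are stored bit-inverted
    sign = code & 0x80
    exponent = (code >> 4) & 0x07     # segment number
    mantissa = code & 0x0F            # step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode(data: bytes) -> list:
    """Decode a buffer of mu-law bytes into a list of linear samples."""
    return [mulaw_to_linear(b) for b in data]


print(decode(b"\xff\x80\x00"))
```

The three bytes in the usage line are the codes for silence, the largest positive sample, and the largest negative sample.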

Given the large set of talkers being recruited for the PhoneBook database, it made sense to exploit the opportunity to collect additional utterances. We have chosen spontaneous numerical utterances, because of widespread interest in them and the need for very large numbers of talkers for research into spontaneous-speech effects. We restricted these to just three spontaneous digit sequences and one money amount, as the lists for the core of PhoneBook have been designed to approach the limit of reasonable duration for a caller's session. As a result, PhoneBook contains a total of 5105 spontaneous utterances.

Documentation available.

Item Name:              PHONEBOOK

LDC Catalog No.:        LDC95S27          

NIST Catalog No.:       NA  

LDC Release date:       7/95 

Nonmember price:        $5000  

Special license:        NO

LATINO-40 Spanish Read News Corpus.

This database provides a set of recordings for training speaker-independent systems that recognize Latin-American Spanish. It was recorded by the Entropic Research Laboratory in the period from July 11 through September 9, 1994 in Palo Alto, California. The database comprises about 5000 utterance files. These files include about 125 utterances from each of 40 different speakers, 20 male and 20 female.

The recordings were all made with a high-quality, head-mounted microphone (Shure SM10A) in an office environment, and the utterances were digitized in 16-bit samples at 16 kHz.

The Linguistic Data Consortium provided 13,000 sentences that had been selected from Latin American newspaper text by people working at Texas Instruments. The sentences are all shorter than 80 characters, and are not grouped into larger constituents like paragraphs or stories. The speech files have NIST SPHERE headers, and are presented in compressed format, using the "shorten" speech compression algorithm developed by Tony Robinson at Cambridge University, as implemented in the NIST SPHERE software package. This software is included on the CD-ROM with the data.

Documentation available.

Item Name:              LATINO40

LDC Catalog No.:        LDC95S28

NIST Catalog No.:       NA

LDC Release date:       11/95

Nonmember price:        $2000

Special license:        YES

Frontiers in Speech Processing

This CD reflects the cooperative efforts of 28 researchers who attended the 1993 summer workshop in speech processing hosted by the Center for Computer Aids for Industrial Productivity (CAIP) at Rutgers University and sponsored by the National Security Agency. The workshop was an outgrowth of summers at the Center for Communication Research in Princeton (CCR-P) and targeted problems concerning general-purpose speech recognition, with particular emphasis on front-end processing. The project was held from July 6th to August 13th and utilized extensive computational resources: both equipment native to CAIP and additional hardware acquired for the workshop.

Documentation available.

Item Name:              Frontiers in Speech Processing

LDC Catalog No.:        LDC96S29

NIST Catalog No.:       #15

LDC Release date:       9/95

Nonmember price:        0

Special license:        NO

CTIMIT: Cellular TIMIT Speech Corpus

The CTIMIT corpus is a cellular-bandwidth adjunct to the TIMIT Acoustic Phonetic Continuous Speech Corpus (NIST Speech Disc CD1-1.1/NTIS Pb91-505065, October 1990). The corpus was contributed by Lockheed-Martin Sanders to the LDC for distribution on CD-ROM media.

The CTIMIT read speech corpus has been designed to provide a large, phonetically labeled database for use in the design and evaluation of speech processing systems operating in diverse, often hostile, cellular telephone environments. CTIMIT was collected by members of the Voice Communication Initiative (VCI) at Lockheed-Martin Sanders' Signal Processing Center of Technology (SPCOT) as part of internal R&D efforts, with additional sponsorship from the Wireless Communications Group in the company's Advanced Engineering and Technology (AE&T) Division.

Like NTIMIT, CTIMIT is based on the original TIMIT recordings, which were passed through a sample of actual telephone circuits (cellular circuits in the case of CTIMIT). Thus the original phonetic segmentation and labeling of TIMIT continue to be applicable to CTIMIT as well as NTIMIT.

Documentation available.


Item Name:                      CTIMIT

LDC Catalog No.:                LDC96S30

NIST Catalog No.:               NA

LDC Release date:               3/96

Nonmember price:                $100

Special license:                NO

FFMTIMIT: Far Field Microphone Recordings of the TIMIT Speech Corpus

The FFMTIMIT corpus contains the previously-unreleased secondary microphone waveforms for the TIMIT Acoustic-Phonetic Continuous Speech corpus. The primary microphone waveforms, which were recorded using a close-talking noise-cancelling head-mounted Sennheiser microphone (model HMD-414), are available from the LDC on NIST Speech Disc 1-1.1 (LDC93S1). The secondary microphone used in the recording of the TIMIT corpus was a Bruel & Kjaer 1/2" free-field microphone (model 4165).

While the Sennheiser microphone recordings are relatively "clean" with respect to non-speech noise, the FFMTIMIT recordings include significant low-frequency noise, which was due to the HVAC system and mechanical vibration transmitted through the floor of the double-walled sound booth used in recording. Because it is noisier than its TIMIT counterpart, the data of FFMTIMIT may be used in the development of more noise-robust speech recognition systems. In addition, this data may be of value to researchers involved in vocal tract modeling because the B&K microphone has extremely flat free-field frequency response and calibration tones are provided.

Note that the B&K TIMIT data contained in this release has not been processed through any highpass filter (e.g., the 1581-point filter described in the paper "The DARPA Speech Recognition Research Database" by Fisher, Doddington and Goudie-Marshall in "DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM," NISTIR 4930 / NTIS Order No. PB93-173938).


Item Name:                      FFMTIMIT

LDC Catalog No.:                LDC96S32

NIST Catalog No.:               21-1.1

LDC Release date:               5/96

Nonmember price:                $100

Special license:                NO

Text Corpora: Descriptions and Ordering Information

Association for Computational Linguistics Data Collection Initiative (ACL/DCI)

The ACL Data Collection Initiative disc contains text from: Wall Street Journal, copyright 1987, 1988, 1989, provided by Dow Jones, Inc.; the Collins English Dictionary, Copyright 1979, William Collins Sons Co., Ltd.; scientific abstracts provided by the U.S. Department of Energy; and a variety of grammatically tagged and parsed materials from the Treebank project at the University of Pennsylvania, copyright 1990, 1991, University of Pennsylvania. The total amount of uncompressed text is 620 Mbytes.

The many formats in which the originals of these texts came have all, to one extent or another, been mapped into a markup language consistent with the SGML standard (ISO 8879).

The format of the material from the Wall Street Journal uses a labelled bracketing, expressed in the style of SGML, although no formal SGML DTD is provided. The tag set has been modified by turning the Dow Jones header categories into tags and by creating ad hoc tags such as ``.'' The original datelines are presented as separate text units; the text is divided and tagged into paragraphs and sentences with each sentence presented on a single line. Nothing has been done to modify the typographical methods used to subdivide headlines and stories into sections, nor are any of the text features within sentences (quotes, ellipsis, etc.) normalized.

The Collins English Dictionary is present in two forms. One form was approximately parsed into fielded records as an exercise in learning a language called "FIT", by a student working under the direction of Lloyd Nakatani at AT&T Bell Laboratories during the summer of 1990. The original digital image of the typographer's tape from which the database version was prepared had serious flaws that were not detected and corrected until later; the corrected version, a clean typographer's tape, is presented in a separate directory. A properly-analyzed database version will be provided in the future. The documentation includes notes developed during the new attempt to analyze the tape from scratch.

The Department of Energy abstracts reside in files that are approximately one megabyte each. The original 950 separators have been replaced with newlines, and space padding between articles was removed. An acronym dictionary that was extracted from the database as an indication of the material's topic areas has been included in a separate directory.

Provisional material from the Penn Treebank project is divided into two subdirectories on this disk. The subdirectory "postext" contains text with part-of-speech annotations; "parstext" contains text with syntactic bracketing.

README file is available.

Item Name:		ACL/DCI  

LDC Catalog No.:  	LDC93T1

NIST Catalog No.: 	NA  

LDC Release date:	4/92 (MY93) 

Nonmember price: 	100

Special license:	YES  

The Penn Treebank Project - Release 2.

Original treebank release

This CD-ROM contains over 1.6 million words of hand-parsed material from the Dow Jones News Service, plus an additional 1 million words tagged for part-of-speech. This material is a subset of the corpus for the current DARPA large-vocabulary speech recognition project.

It also contains the first fully parsed version of the Brown Corpus, which has also been completely retagged using the Penn Treebank tag set. Also included are tagged and parsed data from Department of Energy abstracts, IBM computer manuals, MUC-3, and ATIS.

In addition, the CD-ROM includes source code for several software packages, including tgrep, which permits the user to search for specific constituents in tree structures.

Release - 2

The Penn Treebank Project Release 2 CDROM features the new Penn Treebank II bracketing style, which is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied, along with a complete style manual explaining the bracketing, and new versions of tools for searching and treating bracketed data. This CDROM also contains all the annotated text material from the earlier Treebank Preliminary Release, including the Brown Corpus. While these materials have not all been converted to the newer bracketing style, they have been cleaned up to remove problems that had appeared in the earlier release.
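To give a feel for working with bracketed data, here is a minimal sketch that parses one labelled bracketing and collects every constituent with a given label. It is not tgrep and does not reproduce the tools on the disc; the example sentence is invented, and the sketch assumes a single tree whose every opening parenthesis is followed by a label.

```python
import re


def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()


def find(tree, label):
    """Yield every subtree whose label equals `label`, in left-to-right order."""
    if isinstance(tree, tuple):
        if tree[0] == label:
            yield tree
        for child in tree[1]:
            yield from find(child, label)


def words(tree):
    """Return the words dominated by a subtree."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in words(child)]


tree = parse_tree(
    "(S (NP (DT the) (NN train)) (VP (VBD left) (NP (DT the) (NN station))))"
)
noun_phrases = [" ".join(words(np)) for np in find(tree, "NP")]
print(noun_phrases)
```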

The contents of Treebank Release 2 are as follows:

In addition, the Penn Treebank Project will be providing updates, announcements and a discussion forum for users. A file of updates and further information is available via anonymous ftp from, in pub/treebank/doc/update.cd2. This file will also contain pointers to a gradually expanding body of relatively technical suggestions on how to extract certain information from the corpus.

Detailed questions about the corpus may be sent to, while questions and requests for obtaining Treebank Release 2 should be sent to Further information is available.

Item Name:              PENN TREEBANK - 2

LDC Catalog No.:        LDC95T7

NIST Catalog No.:       NA

LDC Release date:       2/95 (MY95)

Nonmember price:        2500

Special license:        NO

TIPSTER Information Retrieval Text Research Collection

The TIPSTER project is sponsored by the Software and Intelligent Systems Technology Office of the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections.

The detection data is comprised of a new test collection built at NIST to be used both for the TIPSTER project and the related TREC project. The TREC project has many other participating information retrieval research groups, working on the same task as the TIPSTER groups, but meeting once a year in a workshop to compare results (similar to MUC). The test collection built at NIST consists of 3 disks (gigabytes) of documents, 150 topics, and the answers (relevant documents) for those topics.

The documents in the test collection are varied in style, size, and subject domain. The first disk contains material from the Wall Street Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal Register (1989), information from Computer Select disks (Ziff-Davis Publishing), and short abstracts from the Department of Energy. The second disk contains information from the same sources, but from different years. The third disk contains more information from the Computer Select disks, plus material from the San Jose Mercury News (1991), more AP newswire (1990), and about 250 megabytes of formatted U.S. Patents. The format of all the documents is relatively clean and easy to use, with SGML-like tags separating documents and document fields. There is no part-of-speech tagging or breakdown into individual sentences or paragraphs as the purpose of this collection is to test retrieval against real-world data.

The three Tipster discs so far released have been re-issued with updates and corrections, and all recipients of the earlier versions should have received these replacements free of charge. If you think you have the unrevised original, contact LDC for confirmation.

README file is available.

Item Name:              TIPSTER Complete

LDC Catalog No.:        LDC93T3

NIST Catalog No.:       NA

LDC Release date:       4/92 (MY93)

Nonmember price:        2500

Special license:	YES  

TIPSTER Volume 1, March 1992

Directory Name & Description

/ap          Associated Press Newswire material, copyright 1989

/fr          Federal Register material, 1989

/wsj         Wall Street Journal, copyright 1987, 1988, 1989

/doe         Department of Energy abstracts

Item Name:		TIPSTER vol.1 

LDC Catalog No.:  	LDC93T3-1.1  

NIST Catalog No.: 	NA  

LDC Release date:	4/92 (MY93) 

Nonmember price: 	1000  

Special license:	YES  

TIPSTER Volume 2, July 1992

Directory Name & Description

/ap          Associated Press Newswire material, copyright 1988

/fr          Federal Register, 1988

/wsj         Wall Street Journal, copyright 1990, 1991, 1992

/ziff        Ziff-Davis Publishing, copyright 1989, 1990

/doe         Department of Energy abstracts

Item Name:		TIPSTER vol.2  

LDC Catalog No.:  	LDC93T3-2.1   

NIST Catalog No.: 	NA  

LDC Release date:	7/92 (MY93) 

Nonmember price: 	1000  

Special license:	YES  

TIPSTER Volume 3, April 1993

Directory Name & Description

/ap          Associated Press material, copyright 1990

/patents     U.S. Patent documents, 1983-1991

/sjm         San Jose Mercury News, copyright 1991

Item Name:		TIPSTER vol.3  

LDC Catalog No.:  	LDC93T3-3.1   

NIST Catalog No.: 	NA  

LDC Release date:	7/92 (MY93) 

Nonmember price: 	1000  

Special license:	YES  

United Nations Parallel Text Corpus (English, French, Spanish)

This set of three compact discs contains documents provided to the LDC by the United Nations, for use in research on machine translation technology. The documents come from the Office of Conference Services at the UN in New York, and are drawn from archives that span the period between 1988 and 1993.

This publication contains the English, French and Spanish archives, with data from each language stored on a separate disc in the set. Care has been taken to arrange the document files in a parallel directory structure for each language, so that corresponding translations of a document are found directly by means of the directory paths and file names.

All parallel files in this corpus are English-based: for every file on the English disc, there will be a corresponding file on either the French or Spanish disc, or both. Tables are included on all discs to assist in determining which parallels are present. Due to the nature and organization of UN translation services and the original electronic text archives, the process of finding and sorting out parallel documents yielded numerous gaps, with many files in each language having no parallel in other languages.

In preparing the text for publication, we have applied a fully-compliant SGML format (Standard Generalized Markup Language). For those researchers who use SGML, a working DTD (Document Type Definition) is provided on each disc. For those who do not need SGML markup, a simple script is included that can be used to filter out the SGML-specific material, and leave only the plain text. The character set used is the 8-bit ISO 8859-1 Latin1, in which accented letters and some other non-ASCII characters occupy the upper 128 entries of the character table.
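The kind of filtering described above can be sketched in a few lines. This is not the script shipped on the discs; it is a minimal stand-in that decodes ISO 8859-1 bytes and drops anything between angle brackets. The tag name in the sample is hypothetical, and entity references are left untouched.

```python
import re

TAG = re.compile(r"<[^>]+>")  # any SGML start or end tag


def strip_sgml(raw: bytes) -> str:
    """Decode Latin-1 bytes and remove SGML tags, leaving the plain text."""
    text = raw.decode("iso-8859-1")
    return TAG.sub("", text)


# Hypothetical sample line; \xe9 is "é" in ISO 8859-1.
sample = b"<p>Les Nations Unies ont adopt\xe9 la r\xe9solution.</p>"
plain = strip_sgml(sample)
print(plain)
```

Decoding with the correct character set before filtering keeps the accented letters in the upper half of the table intact.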

README file is available.


LDC Catalog No.:  	LDC94T4A

NIST Catalog No.: 	NA  

LDC Release date:	4/94 (MY94) 

Nonmember price: 	2500  

Special license: YES  


LDC Catalog No.:  	LDC94T4B-1    

NIST Catalog No.: 	NA  

LDC Release date:	4/94 (MY94) 

Nonmember price: 	1000  

Special license: YES


LDC Catalog No.:  	LDC94T4B-2    

NIST Catalog No.: 	NA  

LDC Release date:	4/94 (MY94) 

Nonmember price: 	1000  

Special license: YES


LDC Catalog No.:  	LDC94T4B-3.1    

NIST Catalog No.: 	NA  

LDC Release date:	4/94 (MY94) 

Nonmember price: 	1000  

Special license: YES

Japanese Business News Text

The Linguistic Data Consortium announces the availability of a Japanese language text corpus composed of business and financial news from two sources:

  1. Approximately 30 million words of text have been made available from the morning edition of Nihon Keizai Shimbun, the largest Japanese financial news daily newspaper; the release this year covers all text that was published during 1994.

    The data was received at the LDC on 9-track magnetic tape; the character encoding was EBCDIC, but was standardized to EUC, which the LDC has chosen as its standard for Japanese.

  2. A smaller part of the corpus comes from Dow Jones Telerate, which markets its Japanese Language Service. This is a financial newswire produced by Kyodo News Service; its recipients are primarily managers of Japanese owned corporations, or Japanese employees working in North American brokerage houses, banking, etc. The text is received at the LDC via a digital transmission service installed by Telerate; special software was written by the LDC to poll a central database and download articles individually. The character encoding is EUC.
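For users who need to move EUC-encoded Japanese text into or out of another encoding, a minimal sketch using Python's built-in codecs follows. It does not attempt the EBCDIC-to-EUC step described above; Shift JIS stands in as the source encoding purely for illustration, and the function name is an assumption of this example.

```python
def transcode(data: bytes, source: str, target: str = "euc_jp") -> bytes:
    """Re-encode Japanese text from one character encoding to another."""
    return data.decode(source).encode(target)


name = "日本経済新聞"
euc = name.encode("euc_jp")

# Each of these six kanji occupies two bytes in EUC.
print(len(name), "characters,", len(euc), "bytes")

# Round trip through a different encoding and back to EUC.
from_sjis = transcode(name.encode("shift_jis"), "shift_jis")
print(from_sjis == euc)
```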

The copyright holders of this text have requested that it be made available to LDC members only. Inquiries about the corpus or requests for it, or information about becoming members for the 1995 membership year should be directed to

Further information about the LDC and its available corpora can be accessed on the Linguistic Data Consortium WWW Home Page at URL Information is also available via ftp at under pub/ldc; for ftp access, please use "anonymous" as your login name, and give your email address when asked for password.


Item Name:              Japanese Business News Text

LDC Catalog No.:        LDC95T8

NIST Catalog No.:       NA

LDC Release date:       7/95

Nonmember price:        Members Only

Special license:        YES

Spanish News Text Collection

The Spanish News Corpus consists of journalistic text data from one newspaper (El Norte, Mexico) and from the Spanish-language services of three newswire sources: Agence France Presse, Associated Press Worldstream, and Reuters. (The Reuters collection comprises two distinct services: Reuters Spanish Language News Service and Reuters Latin American Business Report.)

All text data are stored on one CD-ROM, in a standard compressed form. The four sets of newswire data (AFP, APWS, and two Reuters services) are each organized as one data file per day of collection. The period covered by these collections runs from December 1993 (for APWS and Reuters) or May 1994 (AFP) through December 1995. (The El Norte data, provided to us by INFOSEL Mexico, are arbitrarily grouped into files of about 1 megabyte in size when uncompressed; date information is not available for individual articles, but the general period of the collection is 1993.)

The approximate amounts of data per source (when uncompressed) are indicated below (in total megabytes and millions of words of text):

       Source   MB      MW


        AFP     345     44

        APWS    253     33

        REUSL   333     41

        REULA   233     23

        INFOSEL 209     31
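The table does not give totals; a short sketch that adds up the figures above and derives a rough bytes-per-word ratio for each source is shown here. The numbers are copied from the table, and since they are approximate, the derived values are only indicative.

```python
# (megabytes, millions of words) per source, as listed in the table above
SOURCES = {
    "AFP": (345, 44),
    "APWS": (253, 33),
    "REUSL": (333, 41),
    "REULA": (233, 23),
    "INFOSEL": (209, 31),
}

total_mb = sum(mb for mb, _ in SOURCES.values())
total_mw = sum(mw for _, mw in SOURCES.values())

# MB / MW is (approximately) bytes per word, including markup and whitespace.
bytes_per_word = {name: mb / mw for name, (mb, mw) in SOURCES.items()}

print("total:", total_mb, "MB,", total_mw, "million words")
for name, ratio in bytes_per_word.items():
    print(f"{name:8s} {ratio:5.1f} bytes/word")
```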

The presentation of text data in these collections is modeled on the TIPSTER corpus. Within each data file, SGML tagging is used (1) to mark article boundaries, (2) to delimit the text portion within each article, and (3) to label various pieces of information about the article that are external to the text content (e.g. headlines, bylines, and so on).
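A minimal sketch of reading such a file follows. The tag names used here (DOC, HEADLINE, TEXT) are assumptions made for the example and should be checked against the documentation on the disc; the sample data is invented.

```python
import re

DOC = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)  # (1) article boundaries


def field(doc: str, tag: str):
    """Return the stripped content of the first <tag>...</tag> in doc, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", doc, re.DOTALL)
    return match.group(1).strip() if match else None


def articles(data: str):
    """Yield one dict per article with its headline and text portion."""
    for match in DOC.finditer(data):
        body = match.group(1)
        yield {
            "headline": field(body, "HEADLINE"),  # (3) information external to the text
            "text": field(body, "TEXT"),          # (2) the text portion
        }


sample = """<DOC>
<HEADLINE> Titular de ejemplo </HEADLINE>
<TEXT>
Primer artículo.
</TEXT>
</DOC>
<DOC>
<TEXT>
Segundo artículo.
</TEXT>
</DOC>"""

parsed = list(articles(sample))
print(parsed)
```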

The copyright holders of this text have requested that it be made available to LDC members only. Due to the release date this corpus is available to 1995 and 1996 members. In order to obtain this corpus, current LDC members must submit a signed User Agreement Form. Documentation available.

Item Name:              Spanish News Text Collection

LDC Catalog No.:        LDC95T9

NIST Catalog No.:       NA

LDC Release Date:       3/96

Nonmember price:        Members Only

Special license:        YES


The first release of the European Corpus Initiative, the Multilingual Corpus 1 (ECI/MCI), has 46 subcorpora in 27 (mainly European) languages. The total size of these is roughly 92 million (lexical) words. The corpora are marked up using TEI P2 conformant SGML (to varying levels of detail), with easy access to the source text without markup. Twelve of the component corpora are multilingual parallel corpora with from two to nine sub-corpora. All the alphabetic corpora (there is some Japanese and Chinese) are encoded in the ISO LATIN family of 8-bit character sets (ISO 8859-1, -5 and -7). The CD-ROM is in High Sierra format (ISO 9660), readable on UNIX, MSDOS and Apple systems at least.

The amount of material per language varies, from about 36 million words (German) to about 5 thousand words (Bulgarian). The majority of sources are journalistic in nature (newspapers, magazines, broadcasts); additional sources include dictionaries (Albanian, Gaelic, Turkish, Japanese/English), literature, technical reports, and proceedings or publications of international organizations. The table below lists the languages included, the subcorpus numbers for each language (in parentheses), and the amount of data per language in thousands of lexical words.

Language    (Subcorpus #) Kwords 			          Totals 

German      (70)         34291  (09)   191  (65)   20  (28)  187 

            (29)         59     (30)    76  (47)   24  (59)   50 

            (71)         21    (70A)   999  	                  35918

French      (31)         4775   (04)  4121  (28)  187  (29)   59 

            (30)         76     (47)    24  (51)    6  (59)   50 

            (71)         21     (32)  1667                        10986

Spanish     (31)         4500   (13)   830  (14) 1041  (15)  447 

            (47)         24     (32)  1667  (59)   50  (71)   21  8580

English     (31)         4222   (36)  1141  (74)  95   (28)  187 

            (47)         24     (51)     6  (56)  97   (59)   50 

            (71)         21     (32)  1667 			  7510

Dutch       (03)         5500   (02)   600  (47)  24   (71)   21  6145

Czech       (44)         4726 					  4726

Italian     (11)  	 3518   (42)   303  (58)  13   (29)   59 

            (30)         76     (47)    24  (71)  21        	  4014

Chinese     (78)         2895 					  2895

Greek       (10)         2515   (47)    24  (59)  50   (71)   21  2610

Norwegian   (41)         2226  					  2226

Swedish     (37)         1718					  1718

Serb/Croat/Slov(24)      700    (56)   289  	                  989

Tibetan     (76)         834  					  834

Portuguese  (60)         675    (47)    24  (71)  21 		  720

Malay       (80)         563					  563

Russian     (73)         364 					  364

Japanese    (57)         203 					  203

Turkish     (20)         173   (20A)  110			  283

Albanian    (82)         205 					  205

Gaelic      (55)         141 					  141

Estonian    (39)         100 					  100

Uzbek       (81)          88  					  88

Latin       (74)          75 					  75

Danish      (47)          24    (71)   21 			  45

Lithuanian  (89)          20					  20

Bulgarian   (84)           5					  5 

Total      							  91969
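Each language total in the table is the sum of the Kwords figures for its subcorpora. A small sketch for checking a row is given below; the row strings are retyped from the table with the language name and row total removed, and the function name is illustrative.

```python
import re

# A "(subcorpus) kwords" pair, e.g. "(70A) 999"
PAIR = re.compile(r"\((\w+)\)\s+(\d+)")


def row_total(row: str) -> int:
    """Sum the Kwords figures of all (subcorpus, Kwords) pairs in a table row."""
    return sum(int(kwords) for _, kwords in PAIR.findall(row))


dutch = "(03) 5500 (02) 600 (47) 24 (71) 21"
greek = "(10) 2515 (47) 24 (59) 50 (71) 21"

print("Dutch:", row_total(dutch))
print("Greek:", row_total(greek))
```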

README file is available.

Item Name:		ECI/MCI  

LDC Catalog No.:  	LDC94T5

NIST Catalog No.: 	NA  

LDC Release date:	6/94 (MY94) 

Nonmember price: 	35  

Special license:	YES  

Lexical Databases: Descriptions and Ordering Information

CELEX Lexical Database

This corpus contains ASCII versions of the CELEX lexical databases of English (version 2.5), Dutch (version 3.1) and German (version 2.0). CELEX was developed as a joint enterprise of the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institute for Psycholinguistics in Nijmegen, and the Institute for Perception Research in Eindhoven. Pre-mastering and CD-ROM production were done by the LDC.

For each language, this CD-ROM contains detailed information on :

The databases have not been tailored to fit any particular database management program. Instead, the information is in ASCII files in a UNIX directory tree that can be queried with tools such as AWK or ICON. Unique identity numbers allow the linking of information from different files. Some kinds of information have to be computed on-line; wherever necessary, AWK functions have been provided to recover this information. README files specify the details of their use.
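The linking of files through identity numbers can be sketched as follows. The sketch assumes one record per line, fields separated by a backslash, and the identity number in the first field; these layout details and the sample records are assumptions made for illustration and should be checked against the README files on the disc.

```python
def load(lines, sep="\\"):
    """Index records by the identity number in their first field."""
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        table[int(fields[0])] = fields[1:]
    return table


def link(left, right):
    """Join two indexed tables on the identity numbers they share."""
    return {key: (left[key], right[key]) for key in left.keys() & right.keys()}


# Invented sample records: spelling in one file, word class in another.
orthography = load([r"1\abandon", r"2\abbey", r"3\able"])
syntax = load([r"1\V", r"2\N"])

linked = link(orthography, syntax)
for ident in sorted(linked):
    print(ident, linked[ident])
```

Records present in only one file (number 3 here) are simply left out of the join.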

A detailed User Guide describing the various kinds of lexical information available is supplied. All sections of this guide are POSTSCRIPT files, except for some additional notes on the German lexicon in plain ASCII.


The second release of CELEX contains an enhanced, expanded version of the German lexical database (2.5), featuring approximately 1000 new lemma entries, revised morphological parses, verb argument structures, inflectional paradigm codes, and a corpus type lexicon. A complete PostScript version of the Germanic Linguistic Guide is also included, in both European A4 format and American Letter format. For German, the total number of lemmas included is now 51,728, while all their inflected forms number 365,530.

Moreover, phonetic syllable frequencies have been added for (British) English and Dutch. Apart from this, and the provision of frequency information alongside every lexical feature, no changes have been made to the Dutch and English lexicons.

Complete AWK scripts are now provided to compute representations not found in the (plain ASCII) lexical data files, corresponding to the features described in the CELEX User Guide, which is included on the CD as well.

For each language, i.e. English, German, and Dutch, the CD-ROM contains detailed information on the orthography (variations in spelling, hyphenation), the phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress), the morphology (derivational and compositional structure, inflectional paradigms), the syntax (word class, word-class specific subcategorisation, argument structures), and word frequency (summed word and lemma counts, based on recent and representative text corpora) of both wordforms and lemmas. Unique identity numbers allow the linking of information from different files with the aid of an efficient, index-based C program.

Like its predecessor, the CD-ROM is mastered using the ISO 9660 data format, with the Rock Ridge extensions, allowing it to be used in VMS, MS-DOS, Macintosh and UNIX environments. As the new release does not omit any data from the first edition, the current release will replace the old one.


Item Name:            CELEX-2

LDC Catalog No.:      LDC96L19

NIST Catalog No.:     NA  

LDC Release date:     12/95 (MY96) 

Nonmember price:      150  

Special license:      YES  

COMLEX: COMmon LEXical Database of English

This is a three-part project: COMLEX English Syntax, COMLEX English Pronunciation, and COMLEX English Semantics. The first two have resulted in electronic dictionaries, released by LDC as MY94 products and described below.

The Semantics project will result, in 1995, in a corpus annotated using WordNet, a public domain compendium of lexical semantic relations. Annotation of the same corpus using COMLEX Syntax is also planned for 1995.

For a description of WordNet, see George Miller (ed.), WordNet: An on-line lexical database, in International Journal of Lexicography (special issue), 3(4):235-312, 1990, or George Miller, Claudia Leacock, Randee Tengi, and Ross Bunker: A semantic concordance, in Proceedings of the Human Language Technology Workshop, pages 303-308, Princeton, NJ, March 1993.

These products are intended to provide a comprehensive set of lexical resources for research and development in computational linguistics. They will be revised and expanded continuously, with feedback from the community of users, and current members will receive all new versions.

The initial (MY94) versions of the electronic dictionaries are being distributed only by ftp. Contact LDC for instructions to obtain license forms and the dictionaries.

COMLEX English Syntax.

This is a moderately broad coverage English lexicon (with about 38,000 lemmas) developed at New York University under LDC sponsorship. It contains detailed information about the syntactic characteristics of each lexical item, and is particularly detailed in its treatment of subcategorization (complement structures). It includes 92 different subcategorization features for verbs, 14 for adjectives, and 9 for nouns. These features distinguish not only the different constituent structures which may appear in a complement, but also the different control features associated with a constituent structure.

Version 0, released in August 1994, is available by ftp to members who sign a license agreement, which is also found on the LDC ftp site.

Some references for the syntax and semantics work:

Ralph Grishman, Catherine Macleod, and Adam Meyers. Comlex syntax: Building a computational lexicon. To appear in Proc. 15th Int'l Conf. Computational Linguistics (COLING 94), Kyoto, Japan, August 1994.

Item Name:          COMLEX English Syntax Lexicon, Version 1.1.1

LDC Catalog No.:    LDC94L2, LDC95L4, LDC95L6

NIST Catalog No.:   NA  

LDC Release date:   6/94 (MY94) 

Nonmember price:    10,000 

Special license:    YES  

COMLEX English Pronunciation

The COMLEX English Pronunciation Dictionary, also known as PRONLEX, was first released in July 1994 as Version 0, and in revised form as Version 0.1 in February 1995. Version 0 contained 30,354 entries with representations of one or more citation pronunciations each, covering essentially the WSJ30K vocabulary. Version 0.1 contains 66,135 entries, adding coverage of WSJ64K and SWITCHBOARD. [WSJ30K and WSJ64K are word lists selected from several years of Wall Street Journal texts used in recent ARPA Continuous Speech Recognition corpora. SWITCHBOARD is a three million word corpus of telephone conversations on a variety of topics. All are available from LDC.]

The PRONLEX documentation, which is accessible by anonymous ftp, describes the principles observed for word transcription (see the file PRONUNCIATION). Although predictable variation in pronunciation due to dialect or variable reduction has not been notated, the documentation notes systematic dialectal variants which may be generated by rule. In addition, alternate pronunciations are given for words whose pronunciation varies by part of speech (e.g., abstrAct, Abstract), or in less systematic but salient ways (especially names). Classes of exceptions to the transcription principles, such as names, function words, and foreign words, are tagged, as described in the PRONUNCIATION file.

PRONLEX is a dynamic enterprise, intended to enhance the research capabilities of the entire LDC community with publicly accessible resources of high quality and broad utility at reasonable cost. Its success depends on members providing feedback in the form of corrections, additions, comments, and suggestions for improvement. Please see the README file for instructions. PRONLEX Version 0.1 was created under the direction of Cynthia McLemore at the Linguistic Data Consortium, with research assistant Paul Kingsbury coordinating transcription activities. License forms are available by ftp, in either PostScript or LaTeX form, in the directory pub/ldc/license_forms. LDC members receive PRONLEX free; nonmembers may purchase a research-use license only.

Item Name:          COMLEX Pronouncing Dictionary, Version 0.1

LDC Catalog No.:    LDC94L3, LDC95L5, LDC96L7

NIST Catalog No.:   NA  

LDC Release date:   6/94 (MY94) 

Nonmember price:    10,000 

Special license:    YES  

COMLEX English Pronunciation Version 0.2

The COMLEX English Pronunciation Dictionary Version 0.2, also known as PRONLEX Version 0.2, released in July 1995, is a 90,694 word pronouncing dictionary of English, including WSJ30K, WSJ64K, Switchboard, and additional lemmas from COMLEX syntax. (WSJ30K and WSJ64K are word lists selected from several years of Wall Street Journal texts used in recent ARPA Continuous Speech Recognition corpora. Switchboard is a three million word corpus of telephone conversations on a variety of topics.)

PRONLEX is available by ftp to members who sign a license agreement, which is also found on the LDC ftp site.

The PRONLEX documentation describes the principles observed for word transcription. Although predictable variation in pronunciation due to dialect or variable reduction has not been notated in the lexicon itself, the documentation notes systematic dialectal variants, which may be generated by rule. In addition, alternate pronunciations are given for words whose pronunciation varies by part of speech (e.g., abstrAct, Abstract), or in less systematic but salient ways (especially names). Classes of exceptions to the transcription principles, such as names, function words, and foreign words, are tagged.

PRONLEX Version 0.2 was created under the direction of Cynthia McLemore at the Linguistic Data Consortium, with research assistant Paul Kingsbury coordinating transcription activities.



Item Name:          COMLEX Pronouncing Lexicon, Version 0.2

LDC Catalog No.:    LDC95L3, LDC96L7

NIST Catalog No.:   NA

LDC Release date:   7/95 (MY95)

Nonmember price:    10,000

Special license:    YES