Each section below describes a corpus or a set of corpora. They are listed by type (speech, text, lexicons, other) and within type by the membership year (MY) in which they were released by LDC. In the case of series or sets of corpora, each individual corpus or segment is described in a separate subsection. The descriptions are brief rather than complete; more information can be found by clicking on "README" or "Documentation" iconets.
In the catalog, each description is followed by six items of information:
Example 1: The United Nations Parallel Text Corpus was published in March of 1994, thus in membership year (MY) 94, consists entirely of text (T), and was assigned text corpus number 4. It comprises three discs: the first contains English texts, the second the corresponding French texts, and the third the corresponding Spanish. They are available either (A) as a set of three or (B) separately. Thus LDC94T4A refers to the UN corpus as a whole, LDC94T4B-1 to the English disc alone, LDC94T4B-2 to the French alone, and LDC94T4B-3 to the Spanish alone.
Shortly after release, the Spanish disc was found to have a manufacturing defect and was replaced with a new one, so if there is need to refer to them individually, the original is now called 3.0 and the replacement 3.1.
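The naming scheme in Example 1 can be sketched as a small parser. This is a minimal illustration only: the pattern is inferred from the identifiers quoted in this catalog and is not an official LDC grammar, and the names `CATALOG_RE`, `CatalogId` and `parse_catalog_id` are hypothetical. Only the type letters S, T and L, which appear in this catalog, are accepted.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from identifiers such as LDC94T4A and LDC94T4B-3.1.
# It is an assumption made for illustration, not an official LDC grammar.
CATALOG_RE = re.compile(
    r"^LDC(?P<my>\d{2})"           # membership year, e.g. 94
    r"(?P<kind>[STL])"             # S = speech, T = text, L = lexicon
    r"(?P<number>\d+)"             # corpus number within that type
    r"(?P<config>[A-Z])?"          # optional configuration letter (A, B, ...)
    r"(?:-(?P<disc>\d+)"           # optional disc number
    r"(?:\.(?P<version>\d+))?)?$"  # optional disc version, as in 3.0 / 3.1
)


class CatalogId(NamedTuple):
    my: int
    kind: str
    number: int
    config: Optional[str]
    disc: Optional[int]
    version: Optional[int]


def parse_catalog_id(text: str) -> CatalogId:
    """Split a catalog number into its parts, or raise ValueError."""
    m = CATALOG_RE.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognized catalog number: {text!r}")

    def to_int(s: Optional[str]) -> Optional[int]:
        return int(s) if s is not None else None

    return CatalogId(
        int(m["my"]), m["kind"], int(m["number"]),
        m["config"], to_int(m["disc"]), to_int(m["version"]),
    )


whole_set = parse_catalog_id("LDC94T4A")          # the UN corpus as a set
spanish_fix = parse_catalog_id("LDC94T4B-3.1")    # replacement Spanish disc
```

Under these assumptions, `spanish_fix.disc` is 3 and `spanish_fix.version` is 1, while `whole_set.disc` is `None` because the identifier names the whole set.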
Example 2: The second Continuous Speech Recognition corpus, collected in 1993 and distributed in early 1994, was assigned corpus number 8. It contains 14 discs of speech recorded over a Sennheiser HMD414 microphone (a de facto standard in ARPA evaluations); 15 discs with the same speech recorded over another microphone; and 5 discs containing unique (unpaired) data: speech recorded only once, transcriptions, test or evaluation data, etc., much of which is also needed to make full use of the paired speech recordings. To satisfy customer preferences, the corpus is offered by LDC in three configurations: (A) the complete corpus of 34 discs; (B) the "Sennheiser corpus," i.e., the whole corpus minus the "other microphone" data, on 19 discs; and (C) the "other microphone" corpus, i.e., the whole corpus minus the Sennheiser data, 20 discs. These are designated as follows:
CSR-II Complete: LDC94S13A, consisting of LDC94S13A-1 through LDC94S13A-34
CSR-II Sennheiser: 19 discs, LDC94S13B, consisting of LDC94S13B-1 through S13B-7, S13B-11, S13B-13 through S13B-16, S13B-18 through S13B-21, and S13B-32 through S13B-34
CSR-II Other: 20 discs, LDC94S13C, consisting of LDC94S13C-8 through S13C-10, S13C-12 through S13C-14, S13C-17, S13C-22 through S13C-34
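The disc lists above can be checked mechanically. The sketch below expands the quoted disc ranges into sets and confirms the counts given in Example 2; the helper name `expand` and the compact range strings are illustrative conventions, not part of the catalog.

```python
def expand(spec: str) -> set:
    """Expand a list such as '1-7, 11, 13-16' into a set of disc numbers."""
    discs = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-")
            discs.update(range(int(first), int(last) + 1))
        else:
            discs.add(int(part))
    return discs


complete = expand("1-34")
sennheiser = expand("1-7, 11, 13-16, 18-21, 32-34")   # configuration B
other = expand("8-10, 12-14, 17, 22-34")              # configuration C

# Discs present in both B and C carry the unpaired (shared) material.
shared = sennheiser & other

print(len(sennheiser), len(other), sorted(shared))
# prints: 19 20 [13, 14, 32, 33, 34]
```

The two configurations together cover all 34 discs, and the five discs they share correspond to the five discs of unpaired data described in Example 2.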
The following are the procedures and conditions for obtaining corpora from the LDC:
For LDC Members:
LDC membership is annual, with the membership year (MY) running from 1 September to 31 August. Each LDC corpus is identified by the MY of its release and membership fees purchase a paid-up license to that MY's LDC corpora.
Members receive one copy of each requested LDC corpus at no charge; there may be charges for corpora owned or produced by others and distributed by LDC.
Members may also purchase extra "convenience copies" of LDC corpora, at $100 per disk or the catalog price, for use at approved sites. These convenience copies are subject to the same restrictions and covered by the same license, if any, as the primary copies.
Notices will be mailed to all members when new data sets are available. When corpora are re-issued in revised, enhanced, or supplemented form, unless the reason is defective materials, they will be distributed only to those whose LDC membership is current in the MY of re-issue. Nonmembers who wish to receive upgrades must pay the nonmember price for the re-issue.
At this time it is no longer possible to purchase a 1993 or 1994 membership. Members who are in good standing (i.e., current members) may purchase corpora from these membership years at the rate of $100 per CD-ROM.
As an incentive to purchasers of 1996 memberships, the following corpora from MY 93 and 94 are being offered in the 96 MY: Resource Management Complete, CSR I, HCRC Map Task Corpus, and BRAMSHILL. These corpora will be distributed on a first-come, first-served basis to those paying for 1996 memberships. Corpora that are in limited quantity are indicated by a number in parentheses following the LDC catalog number, which represents the quantity on hand (as of March 29, 1996).
The cost of membership is $2,000 for nonprofit and government institutions and $20,000 for commercial organizations. To request two copies of the required membership agreement please call 1 (215) 898-0464 or send email to ldc@unagi.cis.upenn.edu. You can also retrieve them here.
For Nonmembers:
Prices are subject to change; the prices below are effective until December 31, 1996. Nonmembers add a shipping charge for each order: $30 US and Canada, $50 overseas.
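As a worked illustration of the shipping rule, the sketch below totals a nonmember order. The function and table names are hypothetical; the two sample prices ($100 and $250) are the nonmember prices this catalog lists for TIMIT and NTIMIT.

```python
# Flat shipping charge per order, in US dollars, as stated above.
SHIPPING_USD = {"us_canada": 30, "overseas": 50}


def nonmember_order_total(prices_usd, destination="us_canada"):
    """Sum the catalog prices of one order and add the per-order shipping."""
    return sum(prices_usd) + SHIPPING_USD[destination]


# One order containing TIMIT ($100) and NTIMIT ($250).
domestic_total = nonmember_order_total([100, 250])
overseas_total = nonmember_order_total([100, 250], destination="overseas")
print(domestic_total, overseas_total)  # prints: 380 400
```

The charge is applied once per order, not once per corpus, so combining items into a single order keeps shipping at a single flat fee.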
PLANNED 1997 RELEASES (TENTATIVE)
Price Set-of Description Release Date or Catalog #
TBA 14 Corpus of Spoken American English Spring 1997
TBA 1 English Language Internat. News Fall 1996
TBA 3 JURIS: Legal Text (500 M words) Fall 1996
TBA 15 SWITCHBOARD (Revised) Fall 1996
PLANNED 1996 RELEASES (TENTATIVE)
Price Set-of Description Release Date or Catalog #
750 6 Resource Management Complete Set LDC93S3A (28)
1000 15 CSR-I (WSJ0) Complete LDC93S6A
200 8 HCRC Map Task Corpus LDC93S12 (47)
500 9 BRAMSHILL LDC94S20 (12)
10000 1 COMLEX English Syntax Lexicon 111 LDC96L6
10000 1 COMLEX Pronunciation Dictionary LDC96L7
0 1 Frontiers in Speech Processing LDC96S29 (45)
150 1 CELEX-2 LDC96L19*
MO 1 Spanish Text Collection LDC95T9*
100 1 CTIMIT LDC96S30
100 1 FFMTIMIT LDC96S32
2500 3 CSRIV Radio Broadcast News: Hub4 LDC96S31
MO 3 CSRIV: Hub 3 LDC96S33
MO 2 N. American Business News Text Summer 1996*
2500 1 Mandarin Business News Text Summer 1996*
MO 1 European Language Newspaper Text Summer 1996*
5000 2 Hansard Summer 1996*
TBA 14 JEIDA Japanese Speech Data Fall 1996
TBA 1 Mandarin Lexicon Fall 1996
TBA 1 Spanish Lexicon Fall 1996
TBA 6 POLYPHONE-II (American Spanish) Summer 1996
TBA 2 Mandarin Telephone Speech Summer 1996
TBA 2 Spanish Telephone Speech Summer 1996
TBA 6 CALLFRIEND Language ID Corpus Fall 1996
TBA 2 Speaker ID Evaluation Test(SWB) Summer 1996
TBA DCIEM/HCRC Map Task Spring 1996
TBA 1 WBUR Radio Speech Corpus Summer 1996
1995 RELEASES
Price Set-of Description Release Date or Catalog #
2500 1 KING Speaker Verification LDC95S22
MO 3 CSR-III Speech: Dev and Eval Data LDC95S23
MO 4 CSR-III Text: Language Model LDC95T6
2000 6 WSJCAM0: Cambridge Read News LDC95S24
2500 1 TRAINS dialog corpus LDC95S25
2000 2 ATIS3: Test Data LDC95S26
5000 3 PHONEBOOK: NYNEX Isolated Words LDC95S27
2500 1 Treebank-2 LDC95T7
MO 1 Japanese Business News Text LDC95T8
2000 2 LATINO-40 Spanish Read News Corp LDC95S28
MO 1 Spanish Text Collection LDC95T9*
MO 2 N. American Business News Text Spring 1996*
2500 1 Mandarin Business News Text Spring 1996*
MO 1 European Language Newspaper Text Spring 1996*
5000 2 Hansard French/English Spring 1996*
10000 1 COMLEX English Syntax Lexicon 111 LDC95L4
10000 1 COMLEX Pronunciation Dictionary LDC95L5
150 1 CELEX-2 LDC96L19*
*Available to 1995 and 1996 members
1994 RELEASES
Price Set-of Description LDC Catalog #
1500 34 CSR-II (WSJ1) Complete LDC94S13A
1250 8 Air Traffic Control LDC94S14 (23)
2000 2 SPIDRE LDC94S15 (35)
750 1 YOHO Speaker Verification LDC94S16
200 1 OGI Multilanguage Corpus LDC94S17
100 1 OGI Spelled Spoken Word LDC94S18
2500 3 ATIS3 LDC94S19
500 9 BRAMSHILL LDC94S20 (12)
10000 8 MACROPHONE (American English) LDC94S21
2500 3 UN Parallel Text (Complete) LDC94T4A
1000 1 UN Parallel Text (English) LDC94T4B-1
1000 1 UN Parallel Text (French) LDC94T4B-2
1000 1 UN Parallel Text (Spanish) LDC94T4B-3.1
35 1 ECI Multilingual Text LDC94T5
10000 1 COMLEX English Syntax Lexicon, V.0 LDC94L2
10000 1 COMLEX Pronouncing Dictionary, V.0 LDC94L3
1993 RELEASES
Price Set-of Description LDC Catalog #
100 1 TIMIT LDC93S1 (13)
250 2 NTIMIT LDC93S2
750 6 Resource Management Complete Set LDC93S3A (28)
500 6 ATIS0 Complete Corpora Set LDC93S4A (31)
1000 4 ATIS2 LDC93S5 (46)
1000 15 CSR-I (WSJ0) Complete LDC93S6A
10000 28 SWITCHBOARD LDC93S7 (7)
1000 1 SWITCHBOARD Credit Card LDC93S8 (3)
125 1 TI 46-Word LDC93S9 (26)
250 3 TIDIGITS LDC93S10 (8)
200 8 HCRC Map Task Corpus LDC93S12 (47)
100 1 ACL/DCI LDC93T1
2500 1 Tipster Complete LDC93T3
1000 1 TIPSTER Volume 1 LDC93T3-1.1
1000 1 TIPSTER Volume 2 LDC93T3-2.1
1000 1 TIPSTER Volume 3 LDC93T3-3.1
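A price-list row such as `750 6 Resource Management Complete Set LDC93S3A (28)` carries a price, a disc count, a description, a catalog number, and optionally a quantity on hand. The sketch below shows one way to split such a row into fields. The row layout is inferred from the lists in this catalog, the names `ROW_RE` and `parse_price_row` are hypothetical, and rows that end in a release date instead of a catalog number are deliberately left unmatched.

```python
import re
from typing import Optional

# Assumed row layout (inferred, not an official format):
#   <price> <set-of> <description> <catalog #>[*] [(<quantity on hand>)]
ROW_RE = re.compile(
    r"^(?P<price>\d+|MO|TBA)\s+"      # dollar amount, or the codes MO / TBA
    r"(?P<set_of>\d+)\s+"             # number of discs in the set
    r"(?P<description>.+?)\s+"        # free-text corpus name
    r"(?P<catalog>LDC\d{2}[A-Z]\d+[A-Z]?(?:-\d+(?:\.\d+)?)?)"
    r"(?P<starred>\*)?"               # footnote marker, if any
    r"(?:\s+\((?P<on_hand>\d+)\))?$"  # quantity on hand, if limited
)


def parse_price_row(line: str) -> Optional[dict]:
    """Return the fields of one price-list row, or None if it does not match."""
    m = ROW_RE.match(line.strip())
    if m is None:
        return None
    price = m["price"]
    return {
        "price_usd": int(price) if price.isdigit() else None,  # None for MO/TBA
        "price_code": price,
        "set_of": int(m["set_of"]),
        "description": m["description"],
        "catalog": m["catalog"],
        "starred": m["starred"] is not None,
        "on_hand": int(m["on_hand"]) if m["on_hand"] else None,
    }


row = parse_price_row("750 6 Resource Management Complete Set LDC93S3A (28)")
```

Here `row["catalog"]` is `"LDC93S3A"` and `row["on_hand"]` is 28. Keeping the raw `price_code` alongside the numeric price preserves the non-numeric entries (MO, TBA) without guessing a dollar value for them.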
TIMIT Acoustic-Phonetic Continuous Speech Corpora
The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of 8 major dialects of American English, each reading 10 phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic, and word transcriptions as well as a 16-bit, 16kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI), and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT, and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST).
The TIMIT corpus transcriptions have been hand verified. Test and training subsets, balanced for phonetic and dialectal coverage, are specified. Tabular computer-searchable information is included as well as written documentation.
Original ARPA-sponsored Version (TIMIT)
This is the original 16 kHz version, recorded over a high quality microphone in studio conditions.
README file is available.
Item Name: TIMIT LDC Catalog No.: LDC93S1 NIST Catalog No.: 1-1 Release date: 10/90 (MY93) Nonmember price: $100 Special license: NO
NYNEX Telephone Version of TIMIT Corpus (NTIMIT)
The NTIMIT corpus was developed by the NYNEX Science and Technology Speech Communication Group to provide a telephone bandwidth adjunct to TIMIT.
NTIMIT was collected by transmitting all 6300 original TIMIT recordings through a telephone handset and over various channels in the NYNEX telephone network and redigitizing them. The recordings were transmitted through ten Local Access and Transport Areas, half of which required the use of long-distance carriers.
In order to calibrate the transmission characteristics of the various channels, stationary 1 kHz and frequency-sweeping tones were also recorded for each of the transmission channels. These are found on disc 2.
The re-recorded waveforms were time-aligned with the original TIMIT waveforms so that the TIMIT time-aligned transcriptions can be used with the NTIMIT corpus as well. In addition to the documentation on the disc, see Jankowski et al., "NTIMIT: A Phonetically Balanced, Continuous Speech, Telephone Bandwidth Speech Database," Proc. ICASSP-90, April 1990. NYNEX retains full copyright on the corpus and all associated materials.
README file is available.
Item Name: NTIMIT LDC Catalog No.: LDC93S2 NIST Catalog No.: 10-1.1, 10-2.1 LDC Release date: 8/92 (MY93) Nonmember price: $250 Special license: NO
The DARPA Resource Management Continuous Speech Corpora (RM) consist of digitized and transcribed speech for use in designing and evaluating continuous speech recognition systems. There are two main sections, often referred to as RM1 and RM2. RM1 contains four CD-ROMs, two with Speaker-Dependent (SD) training data, one with Speaker-Independent (SI) training data, and one with test and evaluation data. RM2 has 2 CD-ROMs with an additional and larger SD data set, including test material.
All RM material consists of read sentences modeled after a naval resource management task. The complete corpus contains over 25,000 utterances from more than 160 speakers representing a variety of American dialects. The material was recorded at 16 kHz, with 16-bit resolution, using a Sennheiser HMD-414 headset microphone. All discs conform to the ISO-9660 data format.
RM sentences are consistent with a limited language model with a 1000 word vocabulary that allows queries about ships, ports, etc., along with commands to control a graphics display system, but little else. There is no "official" language model, but a simple non-probabilistic word-pair grammar that provides complete coverage of the sentences in this corpus is provided.
The Resource Management text corpus was designed at BBN Laboratories, Inc. and SRI International. BBN also developed and made available the "Word-Pair" grammar that has been used in the benchmark tests. Texas Instruments, Inc. recruited the subjects and recorded and digitized the speech. For more information about the design and collection of this corpus see: P. Price, W.M. Fisher, J. Bernstein and D.S. Pallett, "The DARPA 1000-Word Resource Management Database for Continuous Speech Recognition", Proceedings of the 1988 International Conference on Acoustics, Speech and Signal Processing (Paper S.13.21, pp. 651-654).
A series of benchmark speech recognition performance assessment tests were conducted beginning in March 1987 using this corpus in conjunction with standardized scoring software. For more information see D.S. Pallett, "Benchmark Tests for DARPA Resource Management Database Performance Evaluations", in Proceedings of the 1989 International Conference on Acoustics, Speech and Signal Processing (Paper S10.b.6, pp. 536-539) and related papers in the Proceedings of the February 1989, October 1989, June 1990, and February 1991 DARPA Speech and Natural Language Workshops.
Resource Management SD and SI Training and Test Data (RM1)
The first two CD-ROMs contain Speaker-Dependent (SD) Training Data: 12 subjects, each reading a set of 600 "training sentences", 2 "dialect" sentences, and 10 "rapid adaptation" sentences, for a total of 7344 recorded sentence utterances. The 600 sentences designated as training cover 97% of the lexical items in the corpus.
The third CD-ROM contains the Speaker-Independent (SI) Training Data: 80 speakers each read the 2 "dialect" sentences plus 40 sentences from the Resource Management text corpus, for a total of 3360 recorded sentence utterances. Any given sentence from a set of 1600 Resource Management sentence texts was recorded by two subjects, while no sentence was read twice by the same subject.
The fourth CD-ROM contains all SD and SI system test material used in 5 DARPA benchmark tests conducted in March and October of 1987, June 1988, and February and October 1989, along with scoring and diagnostic software and documentation for those tests. Documentation is also provided outlining use of the Resource Management training and test material at CMU in development of the SPHINX system. Example output and scored results for state-of-the-art speaker-dependent and speaker-independent systems (i.e., the BBN BYBLOS and CMU SPHINX systems) for the October 1989 benchmark tests are included, as well as SPeech HEader REsources (SPHERE) software and SPHERE-to-SAM conversion software.
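The utterance totals quoted for the SD and SI training discs follow directly from the per-speaker counts. A short check, using only numbers stated in this description (the variable names are illustrative):

```python
# Speaker-Dependent training data (discs 1 and 2).
sd_speakers = 12
sd_sentences_per_speaker = 600 + 2 + 10   # training + dialect + rapid adaptation
sd_total = sd_speakers * sd_sentences_per_speaker

# Speaker-Independent training data (disc 3).
si_speakers = 80
si_rm_sentences_per_speaker = 40
si_total = si_speakers * (2 + si_rm_sentences_per_speaker)  # plus 2 dialect sentences

# Each of the 1600 sentence texts was recorded by two subjects.
si_rm_readings = si_speakers * si_rm_sentences_per_speaker

print(sd_total, si_total, si_rm_readings)  # prints: 7344 3360 3200
```

The 3200 SI readings of Resource Management sentences equal 2 x 1600, which matches the statement that each sentence text was read by exactly two subjects.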
README file is available.
Extended Resource Management Speaker-Dependent Corpus (RM2)
This 2-disc set forms a speaker-dependent extension to the Resource Management (RM1) corpus. The corpus consists of a total of 10,508 sentence utterances (2 male and 2 female speakers each speaking 2,652 sentence texts). These include the 600 "standard" Resource Management speaker-dependent training sentences, 2 dialect calibration sentences, 10 rapid adaptation sentences, 1800 newly-generated extended training sentences, 120 newly-generated development-test sentences, and 120 newly-generated evaluation-test sentences. The evaluation-test material on the discs was used as the test set for the June 1990 DARPA SLS Resource Management Benchmark Tests (see the Proceedings.)
The RM2 corpus was recorded at Texas Instruments. The NIST speech recognition scoring software originally distributed on the RM1 "Test" Disc was adapted for RM2 sentences, and is included on these discs as well as the SPHERE speech file header manipulation software.
README file is available.
Item Name: RM Complete LDC Catalog No.: LDC93S3A NIST Catalog No.: 2-1.1 through 2-4.2, 3-1.2 and 3-2.2 LDC Release date: MY93 Nonmember price: 750 Special license: NO
Air Travel Information System (ATIS) Corpora
During 1989 and 1990, the DARPA Spoken Language Systems (SLS) Program initiated plans for development of a "common corpus" for both speech recognition and natural language research, using "spontaneous goal-directed" speech, rather than "read speech." The common task domain that was chosen is termed the "Air Travel Information System" (ATIS). The corpora developed to date in order to train and test systems in this domain are known as ATIS0, ATIS2, and ATIS3. (ATIS1 will not be published.)
In all the ATIS corpora, users make spoken inquiries to simulated (ATIS0) or prototypical (ATIS2, ATIS3) speech understanding systems to obtain air travel information. The system has the information in the form of a relational database derived from the Official Airline Guide; the initial ATIS0 relational database, for example, contains information relevant to travel among 9 major airports serving 11 cities. To measure performance, the system's answers to the spoken inquiries are expressed in a logical form known as the "canonical answer specification" (CAS) language, and compared with canonical answers reviewed by human experts. There are thus a number of auxiliary files associated with each utterance, including orthographic transcriptions and, for answerable queries, "reference answers".
Texas Instruments developed ATIS0, the pilot corpus for this program, using a "Wizard of Oz" technique to simulate an ATIS SLS. (See Hemphill, Godfrey and Doddington's paper "The ATIS Spoken Language Systems Pilot Corpus" in the Proceedings of the June 1990 DARPA Speech and Natural Language Workshop.)
Since 1991, the data for ATIS2 and ATIS3 have been collected at multiple sites and pooled for common use. The number of speakers and utterances, the coverage of the travel information database, and the collection scenarios and platforms have all changed, as documented in each corpus section.
For further information on the ATIS domain, on the test paradigm, and on ATIS-domain benchmark tests, see the Proceedings of the DARPA Speech and Natural Language Workshops held in October 1989, June 1990 and February 1991. (Morgan Kaufmann, Publishers, Inc., 2929 Campus Drive, San Mateo, CA 94403. ISBN numbers: 1-55860-112-0, 1-55860-157-0, and 1-55860-207-0.)
ATIS0 Spontaneous Speech Pilot Corpus and Relational Database
The ATIS0 Corpus totals 6 CD-ROMs: one with spontaneous data from 36 speakers; one with read versions of the data from 20 of those speakers, along with some adaptation material; and four with extensive speaker dependent material from the ATIS domain, read by 10 of the same speakers.
All ATIS speech data is recorded at 16kHz sample rate, 16 bit quantization, from two different microphones, a close-talking (Sennheiser HMD414) and a desk-top (Crown PCC-160) model.
The first disc (ATIS0 Pilot) contains spontaneous utterances elicited in a "Wizard-of-Oz" simulation, along with the relational database containing the travel information (excluding connecting flights). Thirty-six speakers produced a total of 912 utterances.
The second disc (ATIS0 Read) contains "read" versions of the spontaneous utterances for 20 of the 36 speakers above, for a total of 478 productions. This is supplemented by a set of 40 "adaptation" sentences read by each of the 20 speakers.
The third through the sixth discs (ATIS0 SD-Read) contain "read" speech in the ATIS domain for ten of the speakers on the first disc. They read a total of 3171 utterances, or approximately 317 utterances per speaker. This data was collected for the purpose of training speaker-dependent speech recognition systems for the ATIS0 domain. Two of these four discs contain the close-talking (Sennheiser) microphone data, and the other two contain corresponding data for the desk-top (Crown PCC-160) microphone. Thus there are 6342 waveform files on the four discs.
README file can be reached here.
The entire ATIS0 set of six discs is now offered at a reduced price:
Item Name: ATIS0 Complete LDC Catalog No.: LDC93S4A NIST Catalog No.: 5-1.1 through 5-6.1 LDC Release date: 4/94 (MY93) Nonmember price: 1000 Special license: NO
ATIS2
The ATIS2 corpus, on four CD-ROMs, contains approximately 15,000 utterances recorded from approximately 450 subjects at five sites: ATT, BBN, CMU, MIT's Laboratory for Computer Science, and SRI. All utterances have been transcribed and almost 10,000 of them annotated with categorizations and canonical reference answers. Unlike the ATIS0 corpus, much of the data in ATIS2 was collected using partially or fully-automated data collection systems. The fully-automated data collection systems were, in fact, working ATIS prototypes.
For ATIS2, the 10-city relational database of ATIS0 was revised to accommodate connecting flights and fares and some table headings were renamed.
In addition to training data, the February and November '92 ATIS Benchmark Tests are included as well. Each contains approximately 1,000 utterances from the pool of data collected by the five sites.
Documentation is available.
Item Name: ATIS2 LDC Catalog No.: LDC93S5 NIST Catalog No.: 12-1.1 through 12-4.1 LDC Release date: 4/92 (MY93) Nonmember price: 1000 Special license: NO
ATIS3 Training Data
The ATIS3 corpus, on three CD-ROMs, includes over 774 scenarios completed by 137 subjects, yielding a total of over 7,300 utterances. All utterances are transcribed and 2,900 of them have been categorized and annotated with canonical reference answers.
The relational database for this dataset included flight information for 46 cities and 52 airports. Data was collected at BBN, CMU, MIT, and SRI, using their own ATIS systems, and at NIST using systems provided by BBN and SRI.
Two 1000-utterance test sets were set aside from the data pooled by the collection sites. The first set was used in a December 1993 ARPA test, and is included in ATIS3. The second has been reserved for future testing.
Documentation is available.
Item Name: ATIS3-1 LDC Catalog No.: LDC94S19 NIST Catalog No.: 17-1.1 through 17-3.1 LDC Release date: 8/94 (MY94) Nonmember price: 2500 Special license: NO
ATIS3-Test Data
This set of discs contains a corpus of speech and natural language data collected under the auspices of the Advanced Research Projects Agency Spoken Language Systems (ARPA-SLS) technology development program. The corpus, which contains data in the Air Travel Information Services (ATIS) domain, was designed by the ARPA-SLS Multi-Site ATIS Data COllection Working (MADCOW) group and was collected by five sites at locations across the U.S.:
BBN Systems & Technologies, Cambridge, MA
Carnegie Mellon University, Pittsburgh, PA
MIT Laboratory for Computer Science, Boston, MA
National Institute of Standards and Technology, Gaithersburg, MD
SRI International, Menlo Park, CA
The corpus on this set of discs is part of the third phase of collection of ATIS data (ATIS3) and comprises the development test (NIST Speech Disc 17-4.2) and evaluation test material (NIST Speech Disc 17-5.1) used in the December 1994 ARPA SLS Benchmark Tests. As in the previous ATIS corpora, the speech contained in this corpus was elicited by presenting subjects with various hypothetical travel planning scenarios to solve. The resulting spontaneous spoken queries were recorded as the subjects interacted with partially or completely automated ATIS systems to solve the scenarios. Note that the ATIS3 training data is available on NIST Speech Discs 17-1.1 through 17-3.1.
The recorded speech has been transcribed and annotated with categorizations and canonical reference answers. All of the utterances on these discs have been recorded using a close-talking, noise-canceling head-mounted Sennheiser microphone. For some subjects, secondary (noisier) microphone data was recorded simultaneously as well.
These discs also contain the ATIS3 46 city/52 airport relational database, a revised Principles of Interpretation, and test implementation and scoring instructions as well as other general documentation.
The ATIS3 corpus has been verified, collated, documented and produced on CD-ROM by the National Institute of Standards and Technology (NIST) in cooperation with MADCOW and distributed by the Linguistic Data Consortium (LDC).
Documentation is available.
Item Name: ATIS3-2 LDC Catalog No.: LDC95S26 NIST Catalog No.: #17-4.2 through 17-5.1 LDC Release date: 7/95 Nonmember price: $2000 Special license: NO
Continuous Speech Recognition (CSR) Corpora sponsored by ARPA
During 1991, the DARPA Spoken Language Program initiated efforts to build a new corpus to support research on large-vocabulary Continuous Speech Recognition (CSR) systems.
The first two CSR Corpora consist primarily of read speech with texts drawn from a machine-readable corpus of Wall Street Journal news text, and are thus often known as WSJ0 and WSJ1. (Later sections of the CSR set of corpora, however, will consist of read texts from other sources of North American business news, and eventually from other news domains.)
The texts to be read were selected to fall within either a 5,000-word or a 20,000-word subset of the WSJ text corpus. (See the documentation for details.) Some spontaneous dictation is included in addition to the read speech. The dictation portion was collected using journalists who dictated hypothetical news articles.
Two microphones are used throughout: a close-talking Sennheiser HMD414, and a secondary microphone, which may vary. The corpora are thus offered in three configurations: the speech from the Sennheiser, the speech from the other microphone, and the speech from both; all three sets include all transcriptions, tests, documentation, etc.
In general, transcriptions of the speech, test data from ARPA evaluations, scores achieved by various speech recognition systems, and software used in scoring are included on separate discs from the waveform data.
ARPA Continuous Speech Recognition Corpus I: Wall Street Journal Sentences (WSJ0, or CSR-I)
MIT's Laboratory for Computer Science, SRI International and Texas Instruments collected approximately 40 hours of speech and over 31,000 utterances. Prompts were taken from the Wall Street Journal.
Development and evaluation test sets are included and so marked.
Documentation is available.
Item Name: CSR-I Complete LDC Catalog No.: LDC93S6A NIST Catalog No.: 11-1.1 through 11-12.1, 11-14.1, 11-15.1 LDC Release date: 7/93 (MY93) Nonmember price: 1,000 Special license: NO
ARPA Continuous Speech Recognition Corpus II: Wall Street Journal Sentences (WSJ1, or CSR-II)
The complete WSJ1 corpus contains approximately 78,000 training utterances (73 hours of speech), 4,000 of which are the result of spontaneous dictation by journalists with varying degrees of experience in dictation. The corpus contains approximately 8,200 "conventional" development test utterances (8 hours of speech), 6,800 of which are from spontaneous dictation. As with the pilot corpus, the entire corpus was collected using 2 microphones, so the amount of speech in the entire corpus is about 162 hours.
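The figure of about 162 hours can be reproduced from the numbers in the paragraph above, assuming that every utterance was captured once per microphone. The variable names are illustrative.

```python
# Hours of speech per microphone channel, as quoted above.
training_hours = 73
development_test_hours = 8
microphones = 2

# Assumption: both microphones recorded every training and development utterance.
total_hours = microphones * (training_hours + development_test_hours)
print(total_hours)  # prints: 162
```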
In early 1993, a "Hub and Spoke" test paradigm was designed, calling for eleven test sets, each a specific variation of the basic or "hub" condition. The eleven Hub and Spoke Development and Evaluation Test sets each contain approximately 7500 waveforms (11 hours of speech).
WSJ1 waveforms have been compressed by about 2:1 using the SPHERE-embedded "Shorten" compression algorithm developed at Cambridge University.
Documentation is available.
Item Name: CSR-II Complete LDC Catalog No.: LDC94S13A NIST Catalog No.: 13-1.1 through 13-34.1 LDC Release date: 7/93 (MY94) Nonmember price: 1,500 Special license: NO
1994 Benchmark Speech Test Collection for the ARPA Continuous Speech Recognition Program (CSR-III Speech)
The third ARPA Continuous Speech Recognition (CSR) Benchmark Speech Test Collection is a three CD-ROM set that contains complete development test and evaluation test suites for speaker-independent, large-vocabulary speech recognition systems.
The development and evaluation tests share a common structure, consisting of two core test components ("hubs") and seven specialized test components ("spokes"). The hub tests, which were mandatory for all ARPA CSR participants in the November '94 evaluations, provide a baseline for ASR performance, while the spokes provide the means for assessing the impact of particular speaking conditions or processing strategies in relation to baseline performance. Participants were free to take any combination of spoke tests according to their research interests. Taken together, the collection encompasses 180 speakers, each producing twenty to forty sentences. These are organized into two complete development test sets and one evaluation set.
The collection also includes complete documentation on the test specifications, data collection procedures, transcriptions, and scoring protocols, together with the latest available version of NIST software for scoring ASR results and managing SPHERE waveform files. All speech data is accompanied by both the prompting texts and the detailed orthographic transcriptions of the utterances.
This was the first ARPA CSR Benchmark Test in which prompting texts were drawn from a variety of news sources. Whereas earlier benchmarks were based on Wall Street Journal excerpts (from the period 1987-89), CSR-III prompts come from a variety of North American Business News Services: Reuters News Service, New York Times, Washington Post and Los Angeles Times as well as WSJ; all texts are drawn from financial news articles written during the period of April through June, 1994. (NAB stands for "North American Business", in contrast to earlier benchmarks and training collections labeled "WSJ".)
An important companion to the 1994 Benchmark Speech data collection is the 4-disk CSR-III Text Collection, which includes the ARPA CSR 1994 Standard Language Model. The collection comprises both source text data (prepared by LDC and BBN) and derived statistical tables (compiled by CMU) of unigram, bigram and trigram word frequencies. The sources include all available WSJ texts, spanning 1987 through March 1994, and all AP and San Jose Mercury news data from the three TIPSTER volumes. (Some of the WSJ data, from 1992 through 1994, appears here for research use for the first time.) This corpus is also available from the LDC as a 1995 release.
Because of restrictions imposed by the copyright holders of much of the NAB text, both the speech and text collections are available to LDC members only. For more information on how to join, send email to ldc@unagi.cis.upenn.edu.
README file is available.
Item Name: CSR-III Text: Language Model. LDC Catalog No.: LDC95T6 NIST Catalog No.: NIST22-1.1-22-4.1,23-1-1,25-1.1-25-3.1 LDC Release date: 2/95 (MY95) Nonmember price: M/O Special license: NO
Item Name: CSR-III Speech: Development and Evaluation Data. LDC Catalog No.: LDC95S23 NIST Catalog No.: NIST22-1.1-22-4.1,23-1-1,25-1.1-25-3.1 LDC Release date: 2/95 (MY95) Nonmember price: M/O Special license: NO
DARPA Continuous Speech Recognition Corpus-IV: Radio Broadcast News (CSRIV Hub-4)
This set of CD-ROMs contains all of the speech data provided to sites participating in the DARPA CSR November 1995 Hub-4 (Radio) Broadcast News tests. The data consists of digitized waveforms of MarketPlace (tm) business news radio shows provided by KUSC through an agreement with the Linguistic Data Consortium, and detailed transcriptions of those broadcasts. The software NIST used to process and score the output of the test systems is also included.
The data is organized as follows:
CD26-1: Training Data - Ten complete half-hour broadcasts with minimally-verified transcripts. The transcripts are time aligned with the waveforms at the story-boundary level.
CD26-2: Development-Test Data - Six complete half-hour broadcasts with verified transcripts. The transcripts are time aligned with the waveforms at the story- and turn-boundary level. Index files have been included which specify how the data may be partitioned into 2 test sets.
CD26-6: Evaluation-Test Data - Five complete half-hour broadcasts with verified/adjudicated transcripts. The transcripts are time aligned with the waveforms at the story-, turn-, and music-boundary level. An index file has been included which specifies how the data was partitioned into the test set used in the CSR 1995 Hub-4 tests.
Item Name: CSR-IV (Hub 4) LDC Catalog No.: LDC96S31 NIST Catalog No.: NIST26-1.1-26-2.1,26-6-1 LDC Release date: 5/96 (MY96) Nonmember price: $2500 Special license: YES
DARPA Continuous Speech Recognition Corpus IV: (CSR-IV Hub-3)
This set of CD-ROMs contains all of the speech data provided to sites participating in the DARPA CSR November 1995 Hub-3 Multi-Microphone tests. The data consists of digitized waveforms collected with eight different microphones simultaneously from 40 subjects reading 15-sentence articles drawn from various North American business news publications. The data is partitioned into development-test and evaluation-test sets. The test sets were collected with different subjects, prompts, and microphones. No training data was collected for this corpus since a substantial amount of NAB acoustic training data was already available. Index files have been included that specify the exact subset of the evaluation test recordings which were used in the November 1995 tests. The software NIST used to process and score the output of the test systems is also included.
The data is organized as follows:
CD26-3: Development-Test Data - Location 1, Adaptation and NAB recordings, Subjects: 703-705, 707-70a, 70c, 70f, 70g
CD26-4: Development-Test Data - Location 2, NAB recordings, Subjects: 70k, 70m, 70o, 70q-70s, 70u-70w
CD26-5: Development-Test Data - Location 2, Adaptation recordings, Subjects: 70k, 70m-70o, 70q-70s, 70u-70w
CD26-7: Evaluation-Test Data - NAB recordings, Subjects: 710-71j
Item Name: CSR-IV (Hub 3) LDC Catalog No.: LDC96S33 NIST Catalog No.: NIST26-3.1, 26-4.1, 26-5-1, 26-7.1 LDC Release date: 6/96 (MY96) Nonmember price: M/O Special license: YES
SWITCHBOARD Corpus of Recorded Telephone Conversations
SWITCHBOARD is a collection of about 2400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States. A computer-driven "robot operator" system handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person to take part in a conversation, introducing a topic for discussion, and recording the speech from the two subjects into separate channels until the conversation was finished. About 70 topics were provided, of which about 50 were used frequently. Selection of topics and callers was constrained so that: (1) no two speakers would converse together more than once, and (2) no one spoke more than once on a given topic.
Waveform files were recorded into two channels directly from the T1 digital telephone circuits, at an 8kHz sample rate and 8-bit mu-law quantization. Complete orthographic transcriptions were made for each conversation, with codes to identify overlapping portions (both speakers talking at the same time), certain non-speech events (laughter, coughs, etc), and interruptions/hesitations. Each conversation was also rated by transcribers for various quality factors (amount of cross-talk between channels, static and background noise, topicality, etc). In addition, each transcription was verified, and then used in a forced speech-recognition algorithm to establish timing marks for word and utterance boundaries; transcriptions are provided in the corpus in both "plain text" and "time-aligned" forms. A description is published in the 1993 ICASSP Proceedings: Godfrey, McDaniel, and Holliman, ``SWITCHBOARD: A Telephone Speech Corpus for Research and Development.''
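For readers unfamiliar with 8-bit mu-law quantization, the following is a minimal sketch of how one mu-law byte can be expanded to a linear sample. It assumes the standard G.711 mu-law encoding; the function names are illustrative only, and the sketch does not read NIST SPHERE headers or separate the two channels, which real SWITCHBOARD files would require.

```python
def mulaw_to_linear(byte):
    """Expand one G.711 mu-law byte to a signed linear sample (assumed encoding)."""
    u = ~byte & 0xFF              # mu-law bytes are stored bit-inverted
    sign = u & 0x80               # top bit: sign
    exponent = (u >> 4) & 0x07    # next three bits: segment number
    mantissa = u & 0x0F           # low four bits: step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_mulaw(data):
    """Decode a bytes object of mu-law samples into a list of linear samples."""
    return [mulaw_to_linear(b) for b in data]


if __name__ == "__main__":
    # Hypothetical input: three mu-law bytes, not taken from the corpus.
    print(decode_mulaw(bytes([0xFF, 0x80, 0x00])))
```

The decoded values span roughly the range of a 16-bit sample, which is why mu-law telephone data is often expanded to 16-bit linear form before signal processing.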
The original issue of SWITCHBOARD in early 1993 lacked about 150 conversations which were intended for publication but omitted by error. They were published in May 1994 and distributed to all previous recipients of SWITCHBOARD.
The Switchboard Corpus was collected at Texas Instruments and produced on CD-ROM at the National Institute of Standards and Technology. It is distributed in a notebook-style binder with 28 CD-ROMs (27 containing speech data, and one containing all transcription data). Preparation of the data for CD-ROM production was done by NIST. The waveform files use the NIST SPHERE format.
README file is available.
Item Name: SWITCHBOARD LDC Catalog No.: LDC93S7 NIST Catalog No.: 9-1.1, 9-3.1 through 9-29.1 LDC Release date: 4/92 (MY93) Nonmember price: 10000 Special license: NO
SWITCHBOARD Corpus Excerpts, Credit Card Conversations
This CD-ROM contains 35 conversations on the topic of ``Credit Card Use''. Most but not all can also be found in the Switchboard Corpus (see above). The conversations can be used in training and testing wordspotting systems. In addition to 2-channel mu-law encoded audio waveform files, the disc contains transcriptions, time-alignments, and wordspotting targets.
README file is available.
Item Name: SWITCHBOARD Credit Card LDC Catalog No.: LDC93S8 NIST Catalog No.: 8-1.2 LDC Release date: 5/92 (MY93) Nonmember price: 1000 Special license: NO
Texas Instruments 46-Word Speaker-Dependent Isolated Word Corpus (TI46)
This CD-ROM contains a corpus of speech which was originally designed and collected at Texas Instruments, Inc. (TI) in 1980, and used initially in performance assessment tests of isolated-word speaker-dependent technology. (See ``Speech Recognition: Turning Theory to Practice'' by G. R. Doddington and T. B. Schalk, in IEEE Spectrum, Vol. 18, No. 9, September 1981.)
The 46-word vocabulary consists of two sub-vocabularies: (1) the TI 20-word vocabulary (consisting of the digits zero through nine plus the words "enter", "erase", "go", "help", "no", "rubout", "repeat", "stop", "start", and "yes"), and (2) the TI 26-word "alphabet set" (consisting of the letters "a" through "z").
The corpus contains read utterances from 16 speakers (8 males and 8 females) each speaking 26 utterances of the 46-word vocabulary: 16 tokens designated as training and 10 as test.
The corpus was collected at Texas Instruments in a quiet acoustic enclosure using an Electro-Voice RE-16 Dynamic Cardioid microphone at 12.5kHz sample rate with 12-bit quantization. The files are in NIST SPHERE format, and have a ".wav" filename extension.
README file is available.
Item Name: TI 46 Word LDC Catalog No.: LDC93S9 NIST Catalog No.: 7-1.1 LDC Release date: 4/92 (MY93) Nonmember price: 125 Special license: NO
Texas Instruments Speaker-Independent Connected-Digit Corpus (TIDIGITS)
This three-disc set contains speech which was originally designed and collected at Texas Instruments, Inc. (TI) for the purpose of designing and evaluating algorithms for speaker-independent recognition of connected digit sequences. There are 326 speakers (111 men, 114 women, 50 boys, and 51 girls) each pronouncing 77 digit sequences. Each speaker group is partitioned into test and training subsets.
The corpus was collected at TI in 1982 in a quiet acoustic enclosure using an Electro-Voice RE-16 Dynamic Cardioid microphone, digitized at 20kHz. The waveform files are in the NIST SPHERE format.
README file is available.
Item Name: TIDIGITS LDC Catalog No.: LDC93S10 NIST Catalog No.: 4-1, 4-2, 4-3 LDC Release date: 4/92 (MY93) Nonmember price: 250 Special license: NO
The HCRC Map Task Corpus
The Map Task Corpus is a set of 8 CD-ROMs containing a total of about 18 hours of spontaneous speech that was recorded from 128 two-person conversations, involving 64 different speakers (32 female, 32 male, all adults, each taking part in four conversations). The 64 speakers were all students at the University of Glasgow, 61 of them being native Scots. The conversations were carried out in an experimental setting, in which each participant has a schematic map in front of them, not visible to the other. Each map is comprised of an outline and roughly a dozen labelled features (e.g. "a white cottage", "an oak forest", "Green Bay", etc). Most features are common to the two maps, but not all. One map has a route drawn in, the other does not. The task is for the participant without the route to draw one on the basis of discussion with the participant with the route. In addition to the conversations, each speaker provides a wordlist reading, consisting of the major vocabulary items contained in the conversations.
The experimental design allows a number of different phonemic, syntactico-semantic and pragmatic contrasts to be explored in a controlled way. In particular, maps and feature names were designed to allow for controlled exploration of phonological reductions of various kinds in a number of different referential contexts, and to provide, via varying patterns of matches and mis-matches between the two maps, a range of different stimuli for referent negotiation. Also the conditions of the conversations were carefully balanced: In half of them the talkers were strangers, in half friends; in half of them the talkers could see each other's faces, in half they could not.
The waveform data are provided in "raw" (headerless) files (16-bit samples, 20 kHz sample rate, 2 channels per conversation), and alternative header files are provided for use with software based on either the NIST ``SPHERE'' header structure or the European ``SAM'' header structure. Text transcriptions are provided for each conversation, along with PostScript files of the map images used in the experiments. Additional materials include full documentation of the experimental design and data collection protocol, resources for using SGML tools on the transcriptions and other text materials, and an extensive set of source code for performing basic signal processing functions on the waveform data, such as down-sampling, de-multiplexing, channel summation, and D/A conversion for Sun workstations (including playback of segments selected via inspection of transcripts in Emacs).
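As an illustration of the de-multiplexing step mentioned above, here is a minimal sketch that splits a headerless 16-bit, 2-channel buffer into its two channels. It assumes the channels are stored as interleaved samples and that the byte order is little-endian; neither detail is stated here, so check the corpus documentation and pass a different `byte_order` if needed. The function names are illustrative and are not those of the source code shipped with the corpus.

```python
import struct

SAMPLE_RATE_HZ = 20000   # from the corpus description
BYTES_PER_FRAME = 4      # 2 channels x 16-bit samples


def demultiplex(raw, byte_order="<"):
    """Split interleaved 16-bit stereo bytes into two lists of samples.

    byte_order is "<" (little-endian, assumed default) or ">" (big-endian).
    """
    frames = len(raw) // BYTES_PER_FRAME
    samples = struct.unpack(f"{byte_order}{2 * frames}h",
                            raw[:frames * BYTES_PER_FRAME])
    return list(samples[0::2]), list(samples[1::2])


def duration_seconds(raw):
    """Duration of a raw buffer at the corpus sample rate."""
    return (len(raw) // BYTES_PER_FRAME) / SAMPLE_RATE_HZ


if __name__ == "__main__":
    # Hypothetical two-frame buffer, not taken from the corpus.
    demo = struct.pack("<4h", 1, -1, 2, -2)
    print(demultiplex(demo), duration_seconds(demo))
```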
README file is available.
Item Name: HCRC MAP TASK LDC Catalog No.: LDC93S12 NIST Catalog No.: NA LDC Release date: 4/92 (MY93) Nonmember price: 200 Special license: NO
Air Traffic Control Corpus (ATC0)
The Air Traffic Control Corpus (ATC0) is an eight-disc set of recorded speech for use in supporting research and development activities in the area of robust speech recognition in domains similar to air traffic control (several speakers, noisy channels, relatively small vocabulary, constrained language, etc.). The audio data on these discs is composed of voice communication traffic between various controllers and pilots.
The audio files are 8 kHz, 16-bit linear sampled data, representing continuous monitoring, without squelch or silence elimination, of a single FAA frequency for one to two hours. There are also files which indicate the amplitude of the received AM carrier signal at 10 msec. intervals.
Full transcripts, including the start and end times of each transmission, are provided for each audio file. Each flight is identified by its flight number.
ATC0 consists of three subcorpora, one for each airport in which the transmissions were collected -- Dallas Fort Worth (DFW), Logan International (BOS), and Washington National (DCA). The complete set contains approximately 70 hours of controller and pilot transmissions collected via antennas and radio receivers which were located in the vicinity of the respective airports.
Detailed information regarding the collection process and the equipment used can be found on each disc in the file, ``atc.doc'' in the ``doc'' directory.
The ATC0 Corpus was collected by Texas Instruments under contract to ARPA. It was produced on CD-ROM by the National Institute of Standards and Technology for distribution by the Linguistic Data Consortium.
README file is available.
Item Name: AIR TRAFFIC CONTROL LDC Catalog No.: LDC94S14 NIST Catalog No.: 16-1.1 through 16-8.1 LDC Release date: 3/94 (MY94) Nonmember price: 1250 Special license: NO
SPIDRE Speaker Identification Corpus
This is a 2-CD subset of the SWITCHBOARD collection (see above), selected for speaker ID research, and with special attention to telephone instrument variation. It contains training and testing data for experiments in closed or open set recognition or verification. Combining the two sides of the conversations also permits speaker change detection, or speaker monitoring, experiments.
There are 45 ``target'' speakers; four conversations from each target are included, of which two are from the same handset. There are also 100 calls in which no target appears. Since all conversations are two-sided, this results in 180 target sides and 180 + 200 = 380 nontarget sides.
Except for truncations of a few longer calls at 5 minutes, the calls themselves are as described under SWITCHBOARD.
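The side counts quoted for SPIDRE can be reproduced with a short calculation. The sketch below assumes, as the arithmetic in the description implies, that each target conversation has exactly one target side and one nontarget side; the function and constant names are illustrative only.

```python
TARGET_SPEAKERS = 45
CONVERSATIONS_PER_TARGET = 4
NONTARGET_CALLS = 100
SIDES_PER_CALL = 2


def spidre_side_counts():
    """Return (target_sides, nontarget_sides) from the figures in the description."""
    target_calls = TARGET_SPEAKERS * CONVERSATIONS_PER_TARGET
    target_sides = target_calls                      # one target side per target call
    partner_sides = target_calls                     # the other side of each target call
    pure_nontarget_sides = NONTARGET_CALLS * SIDES_PER_CALL
    return target_sides, partner_sides + pure_nontarget_sides


if __name__ == "__main__":
    print(spidre_side_counts())
```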
Item Name: SPIDRE LDC Catalog No.: LDC94S15 NIST Catalog No.: 18-1.1 and 18-2.1 LDC Release date: 4/94 (MY94) Nonmember price: 2000 Special license: NO
YOHO Speaker Verification Corpus
The YOHO database is a three-disc set containing a large scale, high-quality speech corpus to support text-dependent speaker authentication research, such as is used in "secure access" technology. The data was collected in 1989 by ITT under a US Government contract, but has not been available for public use before. Note that certain changes have been made to the corpus, mainly to ensure the privacy of the speakers, and some data has been withheld by the government for future use in testing.
YOHO contains:
README file is available.
Item Name: YOHO LDC Catalog No.: LDC94S16 NIST Catalog No.: NA LDC Release date: 4/94 (MY94) Nonmember price: 750 Special license: NO
OGI Multi-Language Corpus
The corpus consists of responses to prompts spoken over commercial telephone lines by speakers of English, Farsi (Persian), French, German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and Vietnamese. It contains a total of 1927 calls, an average of 175 calls per language.
Speech was collected using an automated system that answered the telephone, played digitized prompts in the appropriate language to request the speech samples, and digitized the callers' responses for a designated period of time.
Log files are included that provide a set of automatic measurements made on each utterance. In addition, some utterances were automatically segmented into broad phonetic categories. The speech data are compressed, with NIST SPHERE headers.
README file is available.
Item Name: OGI MULTI-LANGUAGE TELEPHONE LDC Catalog No.: LDC94S17 NIST Catalog No.: NA LDC Release date: 4/94 (MY94) Nonmember price: 200 Special license: NO
OGI Spelled and Spoken Telephone Corpus
The OGI Spelled and Spoken Telephone Corpus consists of speech recordings from over 3650 telephone calls, each made by a different speaker to an automated prompting/recording system installed at the Oregon Graduate Institute. Speakers were asked to say their name, where they were calling from, and where they grew up; they were asked to answer a couple of yes/no questions, and to spell their first and last names; many were also asked to repeat a few specific words, and to recite the letters of the alphabet.
Each response to a prompt is stored as a separate waveform file, and the files are organized according to prompt (response type); all responses from a given call have a unique caller-index number as part of the file name, so that responses can easily be sorted by speaker. Waveform data are stored in compressed form, using the NIST SPHERE 2.0 software package, which is available separately at no charge to users. SPHERE 2.0 provides the decompression software needed to extract the waveform data, as well as tools for accessing and modifying file headers.
Time-aligned phonetic transcriptions are provided for a subset of responses, and a complete log of each call (giving speaker sex, quality judgments, and orthographic transcriptions of all responses) is included in a form suitable for use as a relational database.
README file is available.
Item Name: OGI SPELLED SPOKEN WORD LDC Catalog No.: LDC94S18 NIST Catalog No.: NA LDC Release date: 4/94 (MY94) Nonmember price: 100 Special license: NO
BRAMSHILL
The recordings on this nine-disc set were originally made in 1978-79 as part of a British Home Office study into speaker identification techniques. Subsequently, it was realised that a large body of unconstrained conversational material might be of interest to researchers working in other speech processing fields. The recordings were transcribed and the CD-ROMs prepared during 1993.
The recordings were made at the Police Staff College, Bramshill, Hampshire, England. The participants were police officers taking part in the various courses at the college. This provided a wide range of regional accents and a range of ages from late teens to early fifties. Each speaker is described by nine demographic attributes.
Three adjacent bedrooms were used. The two participants, each alone in their rooms, conversed by telephone. The third room was used as a monitoring and recording station.
In addition to the telephone recordings, reference recordings were made using a high quality dynamic microphone in each room. It is these higher quality recordings, not the telephone speech, which are provided on the BRAMSHILL CD-ROM set.
The recordings were made on a Sony Elcaset EL-7 cassette machine, chosen at the time because of its good speed stability. The microphone was a Shure SM-7 cardioid type. The speech data was sampled at 10 kHz, 16-bit resolution.
Some attempt was made to control the acoustic environment. It is evident from listening to the recordings that, while these measures produced a reasonable recording environment, the rooms were far from soundproof. A variety of external noises (engines, aircraft, etc) can be heard on some of the recordings.
Each speaker was given a pile of photographs. In response to a bleep signal, each speaker introduced himself by name and read a set of test sentences. After this, the main part of the conversation took place, in which participants were asked to determine which of each pair of photographs had been taken first (if indeed they were related at all). The conversations continued for 10 minutes until terminated by another bleep signal.
During the digitisation process, some periods of silence were removed, so some recordings now appear to be shorter than the original ten minutes. Furthermore, this means that recordings of two sides of a conversation are no longer time-aligned. In addition, to preserve the anonymity of the speakers, some passages (mainly the introductions) have been erased by replacing with binary zeroes. Finally the bleep signals have also been erased with binary zeroes. The transcriptions indicate where this has occurred.
The speech was transcribed verbatim. No attempt was made to correct grammar, fill in missing words, etc. Transcription conventions are detailed in the documentation. Every lexical word from the transcriptions is contained in the dictionary supplied in the INDEX directory. There are about 6500 word types in the 600k words of the transcripts. Contractions, part-words, slang words, hesitation sounds and non-speech sounds are all treated as words in their own right in the dictionary.
Item Name: BRAMSHILL LDC Catalog No.: LDC94S20 NIST Catalog No.: NA LDC Release Date: 8/94 (MY94) Nonmember price: 500 Special license: NO
MACROPHONE
MACROPHONE consists of approximately 200,000 utterances by 5000 speakers. It is designed to provide material sufficient and suitable for research, development, and evaluation of automatic speech recognition technology for common telephone applications, such as shopping, transportation, database access, and autodialing. In addition to application-oriented phrases and numerous digit strings, seven sentences are spoken by each talker to provide ensemble phoneme, diphone and triphone coverage of the language. The spoken material also refers to times, locations, monetary amounts, spellings, and interactive operations.
The utterances were collected automatically over the telephone network by recording directly from a T1 connection in 8 kHz, 8-bit mu-law format. The participants, roughly equal numbers of males and females, were solicited by a marketing firm from all regions of the United States. They ranged in age from the teens to the seventies, and represented a broad range of educations and incomes as well. Each recorded utterance is accompanied by an orthographic transcription which also notes any unusual acoustic events or anomalies. Macrophone is the American English contribution to an international database of telephone speech corpora called POLYPHONE. Similar data sets are expected for major languages of the world, and at least some of these will be made available through LDC. Prospects are currently good for American Spanish (by early 1995), Dutch, Standard French, Standard German, Japanese, Mandarin Chinese, Swiss French, and Danish versions of POLYPHONE, all with basically the same structure and methods of collection.
MACROPHONE was collected at SRI under LDC sponsorship. A paper describing it was presented at ICASSP-94: ``Macrophone: An American English Telephone Speech Corpus for the POLYPHONE Project,'' by Jared Bernstein, Kelsey Taussig, and Jack Godfrey.
README file is available.
Item Name: MACROPHONE LDC Catalog No.: LDC94S21 NIST Catalog No.: NA LDC Release date: August 1994 (MY94) Nonmember price: 10000 Special license: NO
The KING-92 Corpus for Speaker Verification Research
The KING corpus was collected at ITT in 1987 under a US government research contract, and although other contractors have received it, it has not been officially available for public use before now. The version now available from LDC, referred to as KING-92, is based on a 1992 reprocessing of the original recordings (see below). It contains recorded speech from 51 male speakers in two versions, which differ in channel characteristics: one from a telephone handset and one from a high-quality microphone. The speakers are further subdivided into two groups, 25 in one and 26 in the other, who were recorded at different locations. For each speaker and channel there are ten files, corresponding to sessions of about 30 to 60 seconds' duration each. The interval between sessions varies from a week to a month. The transcripts contain about 54k word tokens (4.8k types).
KING is designed principally for closed set experiments in text-independent speaker identification or verification over toll-quality telephone lines, although the single-sided collection format does not permit simulation of real telephone traffic. The ten sessions allow for a variety of divisions into training and test data, with the possibility of multiple test sets. For example, one could examine the effects of the amount of training on performance, or examine the variability of performance over several test samples (sessions) given a fixed amount of training (but see below about the "Great Divide".)
The collection method used in KING was to establish a call from a laboratory location at ITT (either San Diego, CA or Nutley, NJ) over long distance lines and back to another phone at the same location. The phones used by the test subjects were equipped with an additional microphone, so two parallel recordings were made of that side of the conversation, while the interlocutor's side was not recorded. The two parties either spoke spontaneously or carried out a variety of tasks designed to elicit natural-sounding speech: interpreting a drawing, solving a problem, describing a picture, etc.
There were 25 speakers in Nutley and 26 in San Diego. Speech-to-noise ratios average about 10 dB worse for the Nutley telephone data than for San Diego; in fact it is less than 20 dB for over half the Nutley files. Users of this corpus therefore usually run separate experiments, or at least report results separately, according to site. A more subtle difference in the recordings, however, sometimes referred to as the ``Great Divide,'' cuts across the telephone data for the San Diego speakers. This was apparently due to a minor equipment change which was made during the collection; it results in a slight but consistent change in the average long term spectrum of the telephone data recorded after the fifth session. Training and testing on data from the same side of this divide gives significantly better results than across it. Since the discovery of this difference, investigators now generally report results on the first and last five sessions of the San Diego telephone KING data separately, or they report within vs. across this boundary. A detailed description of the spectral differences can be found in a report by Thomas Crystal and Ned Neuburg which accompanies the CD-ROM version.
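To make the within-versus-across reporting convention concrete, the following sketch enumerates train/test session pairs for one San Diego telephone speaker and labels each pair by whether it stays on one side of the ``Great Divide''. It assumes sessions are numbered 1 through 10 and that the divide falls after session 5, as described above; the names are illustrative and not part of the corpus.

```python
from itertools import permutations

SESSIONS = range(1, 11)   # ten sessions per speaker and channel
DIVIDE_AFTER = 5          # spectral change reported after the fifth session


def side(session):
    """Label a session by which side of the divide it falls on."""
    return "early" if session <= DIVIDE_AFTER else "late"


def classify_pairs():
    """Split all ordered (train, test) session pairs into within and across."""
    within, across = [], []
    for train, test in permutations(SESSIONS, 2):
        bucket = within if side(train) == side(test) else across
        bucket.append((train, test))
    return within, across


if __name__ == "__main__":
    within, across = classify_pairs()
    print(len(within), "within-divide pairs;", len(across), "across-divide pairs")
```

Reporting results separately for the two lists keeps the equipment change from being confused with genuine session-to-session variability.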
Since there are a number of published papers with results based on the original KING corpus, and two versions of the data in existence, note that the new CD-ROM version, called KING-92, is based on a 1992 re-issue of the data from ITT. It differs from the original corpus in a few details:
Item Name: KING LDC Catalog No.: LDC95S2 NIST Catalog No.: NA LDC Release date: 4/95 (MY95) Nonmember price: $2500 Special license: NO
WSJCAM0
A British English Speech Corpus for Large Vocabulary
Continuous Speech Recognition
(The Cambridge University Version of the ARPA CSR Corpus "WSJ0")
This release of WSJCAM0 on CD-ROM represents version 1.1 of the corpus, which was initially released on tape by Cambridge University as of 31 August, 1994. This collection is modelled directly on the initial ARPA CSR Corpus (WSJ0, a fifteen-disc corpus released by LDC in 1993): it uses the same dual-microphone recording paradigm and a subset of prompting texts drawn from the Wall Street Journal.
There are two key differences between WSJ0 and WSJCAM0: (1) the subjects in WSJCAM0 are native speakers of British English, and (2) in addition to standard orthographic transcripts, WSJCAM0 also has information on the time alignment between the sampled waveform and both the words and the phonetic segments.
The CD-ROM publication consists of six discs, with contents organized as follows:
Within the train and test sets, speech data are organized by speaker; prompting texts, detailed transcriptions and speaker information are included in each speaker directory.
All waveform files have NIST SPHERE headers; waveform data are compressed using the "Shorten" algorithm developed by Tony Robinson at Cambridge University, as adapted for use in the NIST SPHERE software package. (This package is available via anonymous ftp from NIST, on ftp server "jaguar.ncsl.nist.gov" in the "pub" directory.) Complete documentation is provided on each disc in the set.
Item Name: WSJCAM0 LDC Catalog No.: LDC95S24 NIST Catalog No.: NA LDC Release date: February 1995 (MY95) Nonmember price: 2000 Special license: NO
The TRAINS Spoken Dialog Corpus
This CD-ROM contains a corpus of task-oriented spoken dialogs. These dialogs were collected as part of the TRAINS project, a project to develop a conversationally proficient planning assistant, which helps a user construct a plan to achieve some task involving the manufacturing and shipment of goods in a railroad freight system. The collection procedure was designed to make the setting as close to human-computer interaction as possible, but was not a "wizard" scenario, where one person pretends to be a computer. Thus these dialogs provide a snapshot into an ideal human-computer interface that would be able to engage in fluent conversations.
Altogether, this corpus includes 98 dialogs, collected using 20 different tasks and 34 different speakers. This amounts to six and a half hours of speech, about 5900 speaker turns, and 55000 transcribed words.
Item Name: TRAINS LDC Catalog No.: LDC95S25 NIST Catalog No.: NA LDC Release date: 5/95 Nonmember price: $2500 Special license: NO
The NYNEX Phonebook Database
PhoneBook is a phonetically-rich, isolated-word, telephone-speech database, created because of (1) the lack of available large-vocabulary isolated-word data, (2) anticipated continued importance of isolated-word and keyword-spotting technology to speech-recognition-based applications over the telephone, and (3) findings that continuous-speech training data is inferior to isolated-word training for isolated-word recognition.
The goal of PhoneBook is to serve as a large database of American English word utterances incorporating all phonemes in as many segmental/stress contexts as are likely to produce coarticulatory variations, while also spanning a variety of talkers and telephone transmission characteristics. We anticipate that it will be useful in ways analogous to TIMIT/NTIMIT.
The core section of PhoneBook consists of a total of 93,667 isolated-word utterances, totalling 23 hours of speech. This breaks down to 7979 distinct words, each said by an average of 11.7 talkers, with 1358 talkers each saying up to 75 words. All data were collected in 8-bit mu-law digital form directly from a T1 telephone line. Talkers were adult native speakers of American English chosen to be demographically representative of the U.S.
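The figures quoted for the core section can be cross-checked against one another. The sketch below derives the average number of talkers per word, the average number of words per talker, and the mean utterance duration from the totals given above; it uses only those totals, and the function name is illustrative.

```python
TOTAL_UTTERANCES = 93667
DISTINCT_WORDS = 7979
TALKERS = 1358
TOTAL_HOURS = 23


def phonebook_stats():
    """Derive per-word, per-talker and per-utterance averages from the totals."""
    return {
        "talkers_per_word": TOTAL_UTTERANCES / DISTINCT_WORDS,
        "words_per_talker": TOTAL_UTTERANCES / TALKERS,
        "seconds_per_utterance": TOTAL_HOURS * 3600 / TOTAL_UTTERANCES,
    }


if __name__ == "__main__":
    for name, value in phonebook_stats().items():
        print(f"{name}: {value:.2f}")
```

The first ratio rounds to the 11.7 talkers per word stated in the description, and the second stays below the stated ceiling of 75 words per talker.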
Given the large set of talkers being recruited for the PhoneBook database, it made sense to exploit the opportunity to collect additional utterances. We have chosen spontaneous numerical utterances, because of widespread interest in them and the need for very large numbers of talkers for research into spontaneous-speech effects. We restricted these to just three spontaneous digit sequences and one money amount, as the lists for the core of PhoneBook have been designed to approach the limit of reasonable duration for a caller's session. As a result, PhoneBook contains a total of 5105 spontaneous utterances.
Item Name: PHONEBOOK LDC Catalog No.: LDC95S27 NIST Catalog No.: NA LDC Release date: 7/95 Nonmember price: $5000 Special license: NO
LATINO-40 Spanish Read News Corpus
This database provides a set of recordings for training speaker-independent systems that recognize Latin-American Spanish. It was recorded by the Entropic Research Laboratory in the period from July 11 through September 9, 1994 in Palo Alto, California. The database comprises about 5000 utterance files. These files include about 125 utterances from each of 40 different speakers, 20 male and 20 female.
The recordings were all made with a high-quality, head-mounted microphone (Shure SM10A) in an office environment, and the utterances were digitized in 16-bit samples at 16 kHz.
The Linguistic Data Consortium provided 13,000 sentences that had been selected from Latin American newspaper text by people working at Texas Instruments. The sentences are all shorter than 80 characters, and are not grouped into larger constituents like paragraphs or stories. The speech files have NIST SPHERE headers, and are presented in compressed format, using the "shorten" speech compression algorithm developed by Tony Robinson at Cambridge University, as implemented in the NIST SPHERE software package. This software is included on the CD-ROM with the data.
Item Name: LATINO40 LDC Catalog No.: LDC95S28 NIST Catalog No.: NA LDC Release date: 11/95 Nonmember price: $2000 Special license: YES
Frontiers in Speech Processing
This CD reflects the cooperative efforts of 28 researchers who attended the 1993 summer workshop in speech processing hosted by the Center for Computer Aids for Industrial Productivity (CAIP) at Rutgers University and sponsored by the National Security Agency. The workshop was an outgrowth of summers at the Center for Communication Research in Princeton (CCR-P) and targeted problems concerning general-purpose speech recognition with particular emphasis on front end processing. The project was held from July 6th to August 13th and utilized extensive computational resources: both equipment native to CAIP and additional hardware acquired for the workshop.
Item Name: Frontiers in Speech Processing LDC Catalog No.: LDC96S29 NIST Catalog No.: #15 LDC Release date: 9/95 Nonmember price: 0 Special license: NO
CTIMIT: Cellular TIMIT Speech Corpus
The CTIMIT corpus is a cellular-bandwidth adjunct to the TIMIT Acoustic Phonetic Continuous Speech Corpus (NIST Speech Disc CD1-1.1/NTIS PB91-505065, October 1990). The corpus was contributed by Lockheed-Martin Sanders to the LDC for distribution on CD-ROM media.
The CTIMIT read speech corpus has been designed to provide a large, phonetically labeled database for use in the design and evaluation of speech processing systems operating in diverse, often hostile, cellular telephone environments. CTIMIT was collected by members of the Voice Communication Initiative (VCI) at Lockheed-Martin Sanders' Signal Processing Center of Technology (SPCOT) as part of internal R&D efforts, with additional sponsorship from the Wireless Communications Group in the company's Advanced Engineering and Technology (AE&T) Division.
Like NTIMIT, CTIMIT is based on the original TIMIT recordings, which were passed through a sample of actual telephone circuits---cellular circuits in the case of CTIMIT. Thus the original phonetic segmentation and labeling of TIMIT continue to be applicable to CTIMIT as well as NTIMIT.
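Because the TIMIT segmentation carries over, label files can be read the same way for either corpus. The sketch below assumes the TIMIT convention of one segment per line in the form `start_sample end_sample label`, with sample offsets counted at TIMIT's 16 kHz rate. The example lines are illustrative, and the layout of label files on the CTIMIT disc should be checked against its documentation.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start_s: float  # segment start in seconds
    end_s: float    # segment end in seconds
    label: str      # phonetic label


def read_phn(text: str, sample_rate: int = 16000) -> List[Segment]:
    """Convert TIMIT-style phone labels from sample offsets to seconds."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        start, end, label = int(parts[0]), int(parts[1]), parts[2]
        segments.append(Segment(start / sample_rate, end / sample_rate, label))
    return segments


# Illustrative label lines in the assumed format.
example = "0 3050 h#\n3050 4559 sh\n4559 5723 ix\n"
segments = read_phn(example)
```

Expressing the boundaries in seconds rather than samples keeps them usable even if the derived recordings are stored at a different sampling rate.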
Item Name: CTIMIT LDC Catalog No.: LDC96S30 NIST Catalog No.: NA LDC Release date: 3/96 Nonmember price: $100 Special license: NO
FFMTIMIT: Far Field Microphone Recordings of the TIMIT Speech Corpus
The FFMTIMIT corpus contains the previously-unreleased secondary microphone waveforms for the TIMIT Acoustic-Phonetic Continuous Speech corpus. The primary microphone waveforms, which were recorded using a close-talking noise-cancelling head-mounted Sennheiser microphone (model HMD-414), are available from the LDC on NIST Speech Disc 1-1.1 (LDC93S1). The secondary microphone used in the recording of the TIMIT corpus was a Bruel & Kjaer 1/2" free-field microphone (model 4165).
While the Sennheiser microphone recordings are relatively "clean" with respect to non-speech noise, the FFMTIMIT recordings include significant low frequency noise, which was due to the HVAC system and mechanical vibration transmitted through the floor of the double-walled sound booth used in recording. Because it is noisier than its TIMIT counterpart, the data of FFMTIMIT may be used in the development of more noise-robust speech recognition systems. In addition, this data may be of value to researchers involved in vocal tract modeling because the B&K microphone has extremely flat free-field frequency response and calibration tones are provided.
Note that the B&K TIMIT data contained in this release has not been processed through any highpass filter (e.g., the 1581-point filter described in the paper "The DARPA Speech Recognition Research Database" by Fisher, Doddington and Goudie-Marshall in "DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM," NISTIR 4930 / NTIS Order No. PB93-173938).
Item Name: FFMTIMIT LDC Catalog No.: LDC96S32 NIST Catalog No.: 21-1.1 LDC Release date: 5/96 Nonmember price: $100 Special license: NO
Association for Computational Linguistics Data Collection Initiative (ACL/DCI)
The ACL Data Collection Initiative disc contains text from: Wall Street Journal, copyright 1987, 1988, 1989, provided by Dow Jones, Inc.; the Collins English Dictionary, Copyright 1979, William Collins Sons Co., Ltd.; scientific abstracts provided by the U.S. Department of Energy; and a variety of grammatically tagged and parsed materials from the Treebank project at the University of Pennsylvania, copyright 1990, 1991, University of Pennsylvania. The total amount of uncompressed text is 620 Mbytes.
The many formats in which the originals of these texts came have all, to one extent or another, been mapped into a markup language consistent with the SGML standard (ISO 8879).
The format of the material from the Wall Street Journal uses a
labelled bracketing, expressed in the style of SGML, although no
formal SGML DTD is provided. The tag set has been modified by turning
the Dow Jones header categories into tags and by creating ad hoc tags
such as ``
The Collins English Dictionary is present in two forms. One form was
approximately parsed into fielded records as an exercise in learning a
language called ``FIT'', by a student working under the direction of
Lloyd Nakatani at AT&T Bell Laboratories during the summer of 1990.
The original digital image of the typographer's tape that the database
version was prepared from had serious flaws that were not detected and
corrected until later; the corrected version, a clean typographer's
tape, is presented in a separate directory. A properly-analyzed
database version will be provided in the future. The documentation
includes notes developed during the new attempt to analyze the tape
from scratch.
The Department of Energy abstracts reside in files that are
approximately one megabyte each. The original 950 separators have
been replaced with newlines, and space padding between articles was
removed. An acronym dictionary that was extracted from the database
as an indication of the material's topic areas has been included in a
separate directory.
Provisional material from the Penn Treebank project is divided into
two subdirectories on this disk. The subdirectory ``postext'' contains
text with part-of-speech annotations; ``parstext'' contains text with
syntactic bracketing.
README file is available.
Original treebank release
This CD-ROM contains over 1.6 million words of hand-parsed material
from the Dow Jones News Service, plus an additional 1 million words
tagged for part-of-speech. This material is a subset of the corpus for
the current DARPA large-vocabulary speech recognition project.
It also contains the first fully parsed version of the Brown Corpus,
which has also been completely retagged using the Penn Treebank tag
set. Also included are tagged and parsed data from Department of
Energy abstracts, IBM computer manuals, MUC-3, and ATIS.
In addition, the CD-ROM includes source code for several software
packages, including tgrep, which permits the user to search for
specific constituents in tree structures.
Release - 2
The Penn Treebank Project Release 2 CDROM features the new Penn
Treebank II bracketing style, which is designed to allow the
extraction of simple predicate/argument structure. Over one million
words of text are provided with this bracketing applied, along with
a complete style manual explaining the bracketing, and new versions
of tools for searching and treating bracketed data.
This CDROM also contains all the annotated text material from the
earlier Treebank Preliminary Release, including the Brown Corpus.
While these materials have not all been converted to the newer
bracketing style, they have been cleaned up to remove problems that
had appeared in the earlier release.
The contents of Treebank Release 2 are as follows:
Detailed questions about the corpus may be sent to
treebank@unagi.cis.upenn.edu, while questions and requests for
obtaining Treebank Release 2 should be sent to
ldc@unagi.cis.upenn.edu.
Further information is available.
The TIPSTER project is sponsored by the Software and Intelligent
Systems Technology Office of the Advanced Research Projects Agency
(ARPA/SISTO) in an effort to significantly advance the state of the
art in effective document detection (information retrieval) and data
extraction from large, real-world data collections.
The detection data is comprised of a new test collection built at NIST
to be used both for the TIPSTER project and the related TREC project.
The TREC project has many other participating information retrieval
research groups, working on the same task as the TIPSTER groups, but
meeting once a year in a workshop to compare results (similar to MUC).
The test collection built at NIST consists of 3 disks (gigabytes) of
documents, 150 topics, and the answers (relevant documents) for those
topics.
The documents in the test collection are varied in style, size, and
subject domain. The first disk contains material from the Wall Street
Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal
Register (1989), information from Computer Select disks (Ziff-Davis
Publishing), and short abstracts from the Department of Energy. The
second disk contains information from the same sources, but from
different years. The third disk contains more information from the
Computer Select disks, plus material from the San Jose Mercury News
(1991), more AP newswire (1990), and about 250 megabytes of formatted
U.S. Patents. The format of all the documents is relatively clean and
easy to use, with SGML-like tags separating documents and document
fields. There is no part-of-speech tagging or breakdown into
individual sentences or paragraphs as the purpose of this collection
is to test retrieval against real-world data.
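A minimal sketch of how such tagged files might be split into documents and fields follows. The tag names `DOC`, `DOCNO`, and `TEXT` are assumptions made for the example, as are the sample contents; the README on each disc gives the actual tag inventory for each source.

```python
import re
from typing import Dict, List

# Assumed document delimiter; non-greedy so each match is one document.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)


def field(doc: str, tag: str) -> str:
    """Return the stripped contents of the first <tag>...</tag> in doc."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", doc, re.DOTALL)
    return match.group(1).strip() if match else ""


def split_documents(data: str) -> List[Dict[str, str]]:
    """Split a collection file into records with an id and a text body."""
    return [
        {"docno": field(doc, "DOCNO"), "text": field(doc, "TEXT")}
        for doc in DOC_RE.findall(data)
    ]


# Hypothetical file contents, for illustration only.
sample = """<DOC>
<DOCNO> EX-0001 </DOCNO>
<TEXT>
First example article.
</TEXT>
</DOC>
<DOC>
<DOCNO> EX-0002 </DOCNO>
<TEXT>
Second example article.
</TEXT>
</DOC>
"""

docs = split_documents(sample)
```

Since the collection carries no sentence or paragraph segmentation, any further breakdown of the text field is left to the user's own tools.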
The three Tipster discs so far released have been re-issued with
updates and corrections, and all recipients of the earlier versions
should have received these replacements free of charge. If you think
you have the unrevised original, contact LDC for confirmation.
README file is available.
Directory Name & Description
This set of three compact discs contains documents provided
to the LDC by the United Nations, for use in research on machine
translation technology. The documents come from the Office of
Conference Services at the UN in New York, and are drawn from
archives that span the period between 1988 and 1993.
This publication contains the English, French and Spanish archives,
with data from each language stored on a separate disc in the set.
Care has been taken to arrange the document files in a parallel
directory structure for each language, so that corresponding
translations of a document are found directly by means of the
directory paths and file names.
All parallel files in this corpus are English-based: for every file on
the English disc, there will be a corresponding file on either the
French or Spanish disc, or both. Tables are included on all discs to
assist in determining which parallels are present. Due to the nature
and organization of UN translation services and the original
electronic text archives, the process of finding and sorting out
parallel documents yielded numerous gaps, with many files in each
language having no parallel in other languages.
In preparing the text for publication, we have applied a
fully-compliant SGML format (Standard Generalized Markup Language).
For those researchers who use SGML, a working DTD (Document Type
Definition) is provided on each disc. For those who do not need SGML
markup, a simple script is included that can be used to filter out the
SGML-specific material, and leave only the plain text. The character
set used is the 8-bit ISO 8859-1 Latin1, in which accented letters and
some other non-ASCII characters occupy the upper 128 entries of the
character table.
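The filtering step described above can be approximated as follows. This is only a sketch and not the script supplied on the discs: it decodes the bytes as ISO 8859-1, removes anything that looks like an SGML tag, and drops empty lines. It makes no attempt to expand SGML entities, and the sample markup is invented for the example.

```python
import re

# Anything between "<" and ">" is treated as markup (a simplification).
TAG_RE = re.compile(r"<[^>]+>")


def sgml_to_plain(raw: bytes) -> str:
    """Decode Latin-1 bytes and strip SGML-style tags, keeping the text."""
    text = raw.decode("iso-8859-1")
    text = TAG_RE.sub("", text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# Byte 0xE9 is "e" with an acute accent in ISO 8859-1 (upper half of the table).
raw = b"<doc id=example>\n<p>La s\xe9curit\xe9 internationale</p>\n</doc>\n"
plain = sgml_to_plain(raw)
```

Decoding explicitly as ISO 8859-1 matters because the accented letters occupy single bytes above 127, which a UTF-8 decoder would reject or misread.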
README file.
The Linguistic Data Consortium announces the availability of a Japanese
language text corpus composed of business and financial news from two sources:
The data was received at the LDC on 9-track magnetic tape; the character
encoding was EBCDIC, but was standardized to EUC, which the LDC has chosen as
its standard for Japanese.
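For readers unfamiliar with EUC, the sketch below shows the encoding using Python's built-in `euc_jp` codec. The conversion from the original EBCDIC tapes is not reproduced here, since Japanese EBCDIC variants are vendor-specific and the one used is not stated. The sample string is arbitrary.

```python
def to_euc_jp(text: str) -> bytes:
    """Encode a Unicode string as EUC-JP bytes."""
    return text.encode("euc_jp")


def from_euc_jp(data: bytes) -> str:
    """Decode EUC-JP bytes back to a Unicode string."""
    return data.decode("euc_jp")


sample = "日本 LDC"
encoded = to_euc_jp(sample)

# Kanji occupy two bytes each, both with the high bit set;
# ASCII characters remain single bytes below 0x80.
high_bytes = [b for b in encoded if b >= 0x80]
```

This property lets ASCII-oriented tools pass EUC text through unchanged, although they will still count each kanji as two characters.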
The copyright holders of this text have requested that it be made available to LDC members only. Inquiries
about the corpus or requests for it, or information about becoming members for the 1995 membership year
should be directed to ldc@unagi.cis.upenn.edu.
Further information about the LDC and its available corpora can be accessed on the Linguistic Data
Consortium WWW Home Page at URL http://www.cis.upenn.edu/~ldc. Information is also available via
ftp at ftp.cis.upenn.edu under pub/ldc; for ftp access, please use "anonymous" as your login name, and give
your email address when asked for password.
The Spanish News Corpus consists of journalistic text data from one
newspaper (El Norte, Mexico) and from the Spanish-language services
of three newswire sources: Agence France Presse, Associated Press
Worldstream, and Reuters. (The Reuters collection comprises two
distinct services: Reuters Spanish Language News Service and Reuters
Latin American Business Report.)
All text data are stored on one CD-ROM, in a standard compressed
form. The four sets of newswire data (AFP, APWS, and two Reuters
services) are each organized as one data file per day of collection.
The period covered by these collections runs from December 1993 (for
APWS and Reuters) or May 1994 (AFP) through December 1995. (The El
Norte data, provided to us by INFOSEL Mexico, are arbitrarily grouped
into files of about 1 megabyte in size when uncompressed; date
information is not available for individual articles, but the general
period of the collection is 1993.)
The approximate amounts of data per source (when uncompressed) are
indicated below (in total megabytes and millions of words of text):
The copyright holders of this text have requested that it be made
available to LDC members only. Due to the release date this corpus is
available to 1995 and 1996 members. In order to obtain this corpus,
current LDC members must submit a signed User Agreement Form.
Documentation available.
The first release of the European Corpus Initiative, the Multilingual
Corpus 1 (ECI/MCI), has 46 subcorpora in 27 (mainly European)
languages. The total size of these is roughly 92 million (lexical)
words. The corpora are marked up using TEI P2 conformant SGML (to
varying levels of detail), with easy access to the source text without
markup. Twelve of the component corpora are multilingual parallel corpora
with from two to nine sub-corpora. All the alphabetic corpora (there
is some Japanese and Chinese) are encoded in the ISO LATIN family of
8-bit character sets (ISO 8859-1, -5 and -7). The CD-ROM is in High
Sierra format (ISO 9660), readable on UNIX, MSDOS and Apple systems at
least.
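Because the three character sets share the same byte range, the same byte stands for different letters depending on which set a subcorpus uses. The short sketch below decodes one byte value under each of the three sets using Python's standard codecs; which set applies to which subcorpus is not shown here and must be taken from the corpus documentation.

```python
CODECS = ("iso8859_1", "iso8859_5", "iso8859_7")  # Latin-1, Cyrillic, Greek


def decode_all(data: bytes) -> dict:
    """Decode the same bytes under each ISO 8859 variant used on the disc."""
    return {codec: data.decode(codec) for codec in CODECS}


# One byte from the upper half of the 8-bit table.
decoded = decode_all(b"\xe1")
for codec, char in decoded.items():
    print(codec, repr(char))
```

The byte 0xE1 comes out as a Latin, a Cyrillic, and a Greek letter in turn, so text must be decoded with the set declared for its subcorpus.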
The amount of material per language varies, from about 36 million
words (German) to about 5 thousand words (Bulgarian). The majority of
sources are journalistic in nature (newspapers, magazines,
broadcasts); additional sources include dictionaries (Albanian,
Gaelic, Turkish, Japanese/English), literature, technical reports, and
proceedings or publications of international organizations. The table
on the next page lists the languages included, the subcorpus numbers
for each language (in parentheses), and the amount of data per
language in thousands of lexical words.
This corpus contains ASCII versions of the CELEX lexical databases of
English (version 2.5), Dutch (version 3.1) and German (version 2.0).
CELEX was developed as a joint enterprise of the University of
Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck
Institute for Psycholinguistics in Nijmegen, and the Institute for
Perception Research in Eindhoven. Pre-mastering and CD-ROM production
was done by the LDC.
For each language, this CD-ROM contains detailed information on :
A detailed User Guide
describing the various kinds of lexical information available is supplied.
All sections of this guide are
POSTSCRIPT files, except for some additional notes on the German
lexicon in plain ASCII.
CELEX-2
The second release of CELEX contains an enhanced, expanded version of the German lexical
database (2.5), featuring approximately 1000 new lemma entries, revised morphological
parses, verb argument structures, inflectional paradigm codes, and a corpus type lexicon.
A complete PostScript version of the Germanic Linguistic Guide is also included, in both
European A4 format and American Letter format. For German, the total number of lemmas
included is now 51,728, while all their inflected forms number 365,530.
Moreover, phonetic syllable frequencies have been added for (British) English and Dutch.
Apart from this, and provision of frequency information alongside every lexical feature,
no changes have been made to Dutch and English lexicons.
Complete AWK-scripts are now provided to compute representations not found in the (plain
ASCII) lexical data files, corresponding to the features described in CELEX User Guide,
which is included on the CD as well.
For each language, i.e. English, German, and Dutch, the CD-ROM contains detailed
information on the orthography (variations in spelling, hyphenation), the phonology
(phonetic transcriptions, variations in pronunciation, syllable structure, primary
stress), the morphology (derivational and compositional structure, inflectional
paradigms), the syntax (word class, word-class specific subcategorisation, argument
structures), and word frequency (summed word and lemma counts, based on recent and
representative text corpora) of both wordforms and lemmas. Unique identity numbers allow
the linking of information from different files with the aid of an efficient, index-based
C-program.
Like its predecessor, the CD-ROM is mastered using the ISO 9660 data format, with the Rock
Ridge extensions, allowing it to be used in VMS, MS-DOS, Macintosh and UNIX environments.
As the new release does not omit any data from the first edition, the current release will
replace the old one.
This is a three-part project: COMLEX English Syntax, COMLEX English
Pronunciation, and COMLEX English Semantics. The first two have
resulted in electronic dictionaries, released by LDC as MY94 products
and described below.
The Semantics component will result in a corpus annotated using
WordNet, a public-domain compendium of lexical semantic relations, in
1995. Annotation of the same corpus using COMLEX Syntax is also
planned for 1995.
For a description of WordNet, see George Miller (ed.), WordNet: An
on-line lexical database, in International Journal of Lexicography
(special issue), 3(4):235-312, 1990, or George Miller, Claudia
Leacock, Randee Tengi, and Ross Bunker: A semantic concordance, in
Proceedings of the Human Language Technology Workshop, pages 303--308,
Princeton, NJ, March 1993.
These products are intended to provide a comprehensive set of lexical
resources for research and development in computational linguistics.
They will be revised and expanded continuously, with feedback from the
community of users, and current members will receive all new versions.
The initial (MY94) versions of the electronic dictionaries are being
distributed only by ftp. Contact LDC for instructions to obtain
license forms and the dictionaries.
This is a moderately broad
coverage English lexicon (with about 38,000 lemmas) developed at New
York University under LDC sponsorship. It contains detailed
information about the syntactic characteristics of each lexical item,
and is particularly detailed in its treatment of subcategorization
(complement structures). It includes 92 different subcategorization
features for verbs, 14 for adjectives, and 9 for nouns. These
features distinguish not only the different constituent structures
which may appear in a complement, but also the different control
features associated with a constituent structure.
Version 0, released in August 1994, is available by ftp to members who
sign a license agreement, which is also found on the LDC ftp site.
Some references for the syntax and semantics work:
Ralph Grishman, Catherine Macleod, and Adam
Meyers. Comlex syntax: Building a computational lexicon. To appear
in Proc. 15th Int'l Conf. Computational Linguistics (COLING 94),
Kyoto, Japan, August 1994.
The COMLEX English Pronunciation Dictionary, also known as PRONLEX, was first released in July
1994 as Version 0, and in revised form as Version 0.1 in February 1995. Version 0 contained 30,354
entries with representations of one or more citation pronunciations each, covering essentially the WSJ30K
vocabulary. Version 0.1 contains 66,135 entries, adding coverage of WSJ64K and SWITCHBOARD.
[WSJ30K and WSJ64K are word lists selected from several years of Wall Street Journal texts used in recent
ARPA Continuous Speech Recognition corpora. SWITCHBOARD is a three million word corpus of
telephone conversations on a variety of topics. All are available from LDC.]
The PRONLEX documentation, which is accessible by anonymous ftp, describes the principles observed for
word transcription (see the file PRONUNCIATION). Although predictable variation in pronunciation due
to dialect or variable reduction has not been notated, the documentation notes systematic dialectal variants
which may be generated by rule. In addition, alternate pronunciations are given for words whose
pronunciation varies by part of speech (e.g., abstrAct, Abstract), or in less systematic but salient ways
(especially names). Classes of exceptions to the transcription principles, such as names, function words,
and foreign words, are tagged, as described in the PRONUNCIATION file.
PRONLEX is a dynamic enterprise, intended to enhance the research capabilities of the entire LDC
community with publicly accessible resources of high quality and broad utility at reasonable cost. Its
success depends on members providing feedback in the form of corrections, additions, comments, and
suggestions for improvement. Please see the README file for instructions.
PRONLEX Version 0.1 was created under the direction of Cynthia McLemore at the Linguistic Data
Consortium, with research assistant Paul Kingsbury coordinating transcription activities. License forms
are available by ftp in either PostScript or LaTeX form, at ftp.cis.upenn.edu, in the directory
pub/ldc/license_forms. LDC members receive PRONLEX free; nonmembers may purchase a research-use
license only.
The
COMLEX English Pronunciation Dictionary Version 0.2, also known as PRONLEX Version 0.2,
released in July 1995, is a 90,694 word pronouncing dictionary of English, including
WSJ30K, WSJ64K, Switchboard, and additional lemmas from COMLEX syntax. (WSJ30K and WSJ64K
are word lists selected from several years of Wall Street Journal texts used in recent
ARPA Continuous Speech Recognition corpora. Switchboard is a three million word corpus of
telephone conversations on a variety of topics.) PRONLEX is available by ftp to
members who sign a license agreement, which is also found on the LDC ftp site. The
PRONLEX documentation describes the principles observed for word transcription. Although
predictable variation in pronunciation due to dialect or variable reduction has not been
notated in the lexicon itself, the documentation notes systematic dialectal variants,
which may be generated by rule. In addition, alternate pronunciations are given for words
whose pronunciation varies by part of speech (e.g., abstrAct, Abstract), or in less
systematic but salient ways (especially names). Classes of exceptions to the
transcription principles, such as names, function words, and foreign words, are tagged.
PRONLEX Version 0.2 was created under the direction of Cynthia McLemore at the
Linguistic Data Consortium, with research assistant Paul Kingsbury coordinating
transcription activities.
Item Name: ACL/DCI
LDC Catalog No.: LDC93T1
NIST Catalog No.: NA
LDC Release date: 4/92 (MY93)
Nonmember price: 100
Special license: YES
The Penn Treebank Project - Release 2.
In addition, the Penn Treebank Project will be providing updates,
announcements and a discussion forum for users. A file of updates and
further information is available via anonymous ftp from
ftp.cis.upenn.edu, in pub/treebank/doc/update.cd2. This file will
also contain pointers to a gradually expanding body of relatively
technical suggestions on how to extract certain information from the
corpus.
Item Name: PENN TREEBANK - 2
LDC Catalog No.: LDC95T7
NIST Catalog No.: NA
LDC Release date: 2/95 (MY95)
Nonmember price: 2500
Special license: NO
TIPSTER Information Retrieval Text Research Collection
Item Name: TIPSTER Complete
LDC Catalog No.: LDC93T3
NIST Catalog No.: NA
LDC Release date: 4/92 (MY93)
Nonmember price: 2500
Special license: YES
TIPSTER Volume 1, March 1992
/ap Associated Press Newswire material, copyright 1989
/fr Federal Register material, 1989
/wsj Wall Street Journal, copyright 1987, 1988, 1989
/doe Department of Energy abstracts
Item Name: TIPSTER vol.1
LDC Catalog No.: LDC93T3-1.1
NIST Catalog No.: NA
LDC Release date: 4/92 (MY93)
Nonmember price: 1000
Special license: YES
TIPSTER Volume 2, July 1992
/ap Associated Press Newswire material, copyright 1988
/fr Federal Register, 1988
/wsj Wall Street Journal, copyright 1990, 1991, 1992
/ziff Ziff-Davis Publishing, copyright 1989, 1990
/doe Department of Energy abstracts
Item Name: TIPSTER vol.2
LDC Catalog No.: LDC93T3-2.1
NIST Catalog No.: NA
LDC Release date: 7/92 (MY93)
Nonmember price: 1000
Special license: YES
TIPSTER Volume 3, April 1993
/ap Associated Press material, copyright 1990
/patents U.S.Patent documents, 1983-1991
/sjm San Jose Mercury News, copyright 1991
Item Name: TIPSTER vol.3
LDC Catalog No.: LDC93T3-3.1
NIST Catalog No.: NA
LDC Release date: 7/92 (MY93)
Nonmember price: 1000
Special license: YES
United Nations Parallel Text Corpus (English, French,
Spanish)
Item Name: UNITED NATIONS PARALLEL TEXT Complete Set
LDC Catalog No.: LDC94T4A
NIST Catalog No.: NA
LDC Release date: 4/94 (MY94)
Nonmember price: 2500
Special license: YES
Item Name: UNITED NATIONS PARALLEL TEXT English
LDC Catalog No.: LDC94T4B-1
NIST Catalog No.: NA
LDC Release date: 4/94 (MY94)
Nonmember price: 1000
Special license: YES
Item Name: UNITED NATIONS PARALLEL TEXT French
LDC Catalog No.: LDC94T4B-2
NIST Catalog No.: NA
LDC Release date: 4/94 (MY94)
Nonmember price: 1000
Special license: YES
Item Name: UNITED NATIONS PARALLEL TEXT Spanish
LDC Catalog No.: LDC94T4B-3.1
NIST Catalog No.: NA
LDC Release date: 4/94 (MY94)
Nonmember price: 1000
Special license: YES
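The catalog numbers in these entries follow the scheme explained in the introduction: membership year, a type letter, a corpus number, an optional configuration letter, and an optional disc number with a revision. The sketch below splits a catalog number into those parts. The pattern is inferred from the examples in this catalog rather than taken from an official specification, and the reading of `S` and `L` as speech and lexicon is an assumption by analogy with `T` for text.

```python
import re
from typing import Dict, Optional

CATALOG_RE = re.compile(
    r"^LDC(?P<year>\d{2})"        # membership year, e.g. 94
    r"(?P<kind>[STL])"            # assumed: S speech, T text, L lexicon
    r"(?P<number>\d+)"            # corpus number within that type
    r"(?P<config>[A-Z])?"         # optional configuration, e.g. A or B
    r"(?:-(?P<disc>\d+)(?:\.(?P<revision>\d+))?)?$"  # optional disc.revision
)


def parse_catalog_number(code: str) -> Dict[str, Optional[str]]:
    """Split an LDC catalog number into its named components."""
    match = CATALOG_RE.match(code.strip())
    if match is None:
        raise ValueError(f"unrecognized catalog number: {code!r}")
    return match.groupdict()


spanish_disc = parse_catalog_number("LDC94T4B-3.1")
complete_set = parse_catalog_number("LDC94T4A")
tipster_vol1 = parse_catalog_number("LDC93T3-1.1")
```

Components that are absent from a given number, such as the disc suffix of a complete set, come back as `None`.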
Japanese Business News Text
Item Name: Japanese Business News Text
LDC Catalog No.: LDC95T8
NIST Catalog No.: NA
LDC Release date: 7/95
Nonmember price: Members Only
Special license: YES
Spanish News Text Collection
Source MB MW
-------------------
AFP 345 44
APWS 253 33
REUSL 333 41
REULA 233 23
INFOSEL 209 31
The presentation of text data in these collections is modeled on the
TIPSTER corpus. Within each data file, SGML tagging is used (1) to
mark article boundaries, (2) to delimit the text portion within each
article, and (3) to label various pieces of information about the
article that are external to the text content (e.g. headlines,
bylines, and so on).
Item Name: Spanish News Text Collection
LDC Catalog No.: LDC95T9
NIST Catalog No.: NA
LDC Release Date: 3/96
Nonmember price: Members Only
Special license: YES
ECI-1
Language (Subcorpus #) Kwords Totals
German (70) 34291 (09) 191 (65) 20 (28) 187
(29) 59 (30) 76 (47) 24 (59) 50
(71) 21 (70A) 999 35918
French (31) 4775 (04) 4121 (28) 187 (29) 59
(30) 76 (47) 24 (51) 6 (59) 50
(71) 21 (32) 1667 10986
Spanish (31) 4500 (13) 830 (14) 1041 (15) 447
(47) 24 (32) 1667 (59) 50 (71) 21 8580
English (31) 4222 (36) 1141 (74) 95 (28) 187
(47) 24 (51) 6 (56) 97 (59) 50
(71) 21 (32) 1667 7510
Dutch (03) 5500 (02) 600 (47) 24 (71) 21 6145
Czech (44) 4726 4726
Italian (11) 3518 (42) 303 (58) 13 (29) 59
(30) 76 (47) 24 (71) 21 4014
Chinese (78) 2895 2895
Greek (10) 2515 (47) 24 (59) 50 (71) 21 2610
Norwegian (41) 2226 2226
Swedish (37) 1718 1718
Serb/Croat/Slov (24) 700 (56) 289 989
Tibetan (76) 834 834
Portuguese (60) 675 (47) 24 (71) 21 720
Malay (80) 563 563
Russian (73) 364 364
Japanese (57) 203 203
Turkish (20) 173 (20A) 110 283
Albanian (82) 205 205
Gaelic (55) 141 141
Estonian (39) 100 100
Usbek (81) 88 88
Latin (74) 75 75
Danish (47) 24 (71) 21 45
Lithuanian (89) 20 20
Bulgarian (84) 5 5
Total 91969
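Each row of the table lists subcorpus numbers in parentheses, each followed by its size in thousands of words, and ends with the total for that language. The sketch below checks one row's arithmetic, assuming the row has been joined onto a single line. The two rows used are copied from the table; the function name is hypothetical.

```python
import re

# A subcorpus entry: "(number) kilowords", e.g. "(70A) 999".
ENTRY_RE = re.compile(r"\((\w+)\)\s+(\d+)")


def row_is_consistent(row: str) -> bool:
    """True when the per-subcorpus counts add up to the trailing row total."""
    entries = ENTRY_RE.findall(row)
    counted = sum(int(kwords) for _, kwords in entries)
    stated_total = int(row.split()[-1])
    return counted == stated_total


dutch = "Dutch (03) 5500 (02) 600 (47) 24 (71) 21 6145"
greek = "Greek (10) 2515 (47) 24 (59) 50 (71) 21 2610"

dutch_ok = row_is_consistent(dutch)
greek_ok = row_is_consistent(greek)
```

Note that multilingual subcorpora such as (47) and (71) appear under several languages, so the same subcorpus number can contribute to more than one row.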
README file is available.
Item Name: ECI/MCI
LDC Catalog No.: LDC94T5
NIST Catalog No.: NA
LDC Release date: 6/94 (MY94)
Nonmember price: 35
Special license: YES
Lexical Databases: Descriptions and Ordering Information
The databases have not been tailored to fit any particular
database management program. Instead, the information is in ASCII
files in a UNIX directory tree that can be queried with tools such as
AWK or ICON. Unique identity numbers allow the linking of information
from different files. Some kinds of information have to be computed
on-line; wherever necessary, AWK functions have been provided to
recover this information. README files specify the details of their
use.
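The linking by identity number can be sketched in a few lines. The example assumes one record per line, fields separated by a backslash, and the identity number in the first field. The file contents and column meanings below are hypothetical, and the README files remain the authority on the real layout.

```python
from typing import Dict, Iterable, List


def load_records(lines: Iterable[str], sep: str = "\\") -> Dict[int, List[str]]:
    """Index records by the identity number in their first field."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ident, *rest = line.split(sep)
        table[int(ident)] = rest
    return table


def link(left: Dict[int, List[str]], right: Dict[int, List[str]]) -> Dict[int, List[str]]:
    """Join two tables on identity numbers present in both."""
    shared = sorted(left.keys() & right.keys())
    return {ident: left[ident] + right[ident] for ident in shared}


# Hypothetical file contents: spelling in one file, a count in another.
spelling_lines = ["1\\abandon", "2\\abbey"]
count_lines = ["1\\120", "2\\35", "3\\7"]

linked = link(load_records(spelling_lines), load_records(count_lines))
```

Records whose identity number is missing from either file are simply left out of the result, which mirrors an inner join.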
Item Name: CELEX-2
LDC Catalog No.: LDC96L19
NIST Catalog No.: NA
LDC Release date: 12/95 (MY96)
Nonmember price: 150
Special license: YES
COMLEX: COMmon LEXical Database of English
Item Name: COMLEX English Syntax Lexicon, Version 1.1.1
LDC Catalog No.: LDC94L2, LDC95L4, LDC95L6
NIST Catalog No.: NA
LDC Release date: 6/94 (MY94)
Nonmember price: 10,000
Special license: YES
COMLEX English Pronunciation
Item Name: COMLEX Pronouncing Dictionary, Version 0.1
LDC Catalog No.: LDC94L3, LDC95L5, LDC96L7
NIST Catalog No.: NA
LDC Release date: 6/94 (MY94)
Nonmember price: 10,000
Special license: YES
COMLEX English Pronunciation Version 0.2
Item Name: COMLEX Pronouncing Lexicon, Version 0.2
LDC Catalog No.: LDC95L3,LDC96L7
NIST Catalog No.: NA
LDC Release date: 7/95 (MY95)
Nonmember price: 10,000
Special license: YES