Studies
Studies: 2018 | Downloads: 48111
CIEMPIESS Light by Mena, Carlos Daniel Hernández; Herrera, Abel
Description:

CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Light was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 18 hours of Mexican Spanish radio and television speech and associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. For more information and documentation see the CIEMPIESS-UNAM Project website.

CIEMPIESS Light is an updated version of CIEMPIESS, released by LDC as LDC2015S07. This “light” version contains speech and transcripts presented in a revised directory structure that allows for use with the Kaldi toolkit.
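The release's exact directory layout is not detailed here, but Kaldi recipes conventionally expect a data directory containing a small set of plain-text mapping files. The sketch below is a minimal check for those conventional files; the path and file names are illustrative assumptions, not a description of this corpus.

```python
from pathlib import Path

# Files a typical Kaldi data directory provides (illustrative; the actual
# CIEMPIESS Light layout may differ).
REQUIRED = ["wav.scp", "text", "utt2spk", "spk2utt"]

def check_kaldi_dir(data_dir: str) -> None:
    """Report which standard Kaldi mapping files are present."""
    d = Path(data_dir)
    for name in REQUIRED:
        status = "found" if (d / name).is_file() else "MISSING"
        print(f"{name:10s} {status}")

if __name__ == "__main__":
    check_kaldi_dir("data/train")  # hypothetical path
```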

Data

The speech recordings were collected from Podcast UNAM, a program created by Radio-IUS, and Mirador Universitario, a TV program broadcast by UNAM. The recordings consist of spontaneous conversations in Mexican Spanish between a moderator and guests. Approximately 75% of the speakers were male and 25% were female.

The audio was recorded in MP3 stereo format, using a 44.1 kHz sample rate and bit-rate of 128 kbps or higher. Only “clean” utterances were selected from the raw data, meaning that the utterances were made by only one person with no background noises, whispers, music, foreign accents, white noise or static. The audio files were converted to 16 kHz, 16-bit PCM flac format for this release.
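As a rough illustration of that conversion step (not the authors' actual pipeline), the following sketch calls the ffmpeg command-line tool to resample an MP3 file to 16 kHz, 16-bit FLAC; the file names are hypothetical.

```python
import subprocess

def mp3_to_flac(src: str, dst: str) -> None:
    """Convert an MP3 file to 16 kHz, 16-bit FLAC using ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-ar", "16000",        # resample to 16 kHz
         "-sample_fmt", "s16",  # 16-bit samples
         dst],                  # .flac extension selects the FLAC encoder
        check=True,
    )

if __name__ == "__main__":
    mp3_to_flac("raw/program_01.mp3", "flac/program_01.flac")  # hypothetical paths
```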

Transcripts are presented as UTF-8 encoded plain text.

Acknowledgements

The authors would like to thank Alejandro V. Mena, Elena Vera, Angélica Gutiérrez and Beatriz Ancira for their support with the social service program "Desarrollo de Tecnologías del Habla." They would also like to thank the social service students for their hard work.

hdl:11272/7WUJ0
0 downloads
Last Released: Dec 5, 2017
TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014
Description:

TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014 was developed by the Linguistic Data Consortium and contains training and evaluation data produced in support of the TAC KBP Chinese Cross-lingual Entity Linking tasks in 2011, 2012, 2013 and 2014. It includes queries and gold standard entity type information, Knowledge Base links, and equivalence class clusters for NIL entities along with the source documents for the queries, specifically, English and Chinese newswire, discussion forum and web data. The corresponding knowledge base is available as TAC KBP Reference Knowledge Base (LDC2014T16).

The Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. Through its various evaluations, the Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base and extract novel information about entities from a document collection and add it to a new or existing knowledge base.

Chinese Cross-lingual Entity Linking was first conducted as part of the 2011 TAC KBP evaluations. The track was an extension of the monolingual English Entity Linking track (EL) whose goal is to measure systems’ ability to determine whether an entity, specified by a query, has a matching node in a reference knowledge base (KB) and, if so, to create a link between the two. If there is no matching node for a query entity in the KB, EL systems are required to cluster the mention together with others referencing the same entity. More information about the TAC KBP Entity Linking task and other TAC KBP evaluations can be found on the NIST TAC website.

Data

All source documents were originally released as XML but have been converted to text files for this release. This change was made primarily because the documents were used as text files during data development but also because some fail XML parsing.
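A minimal sketch of why such a conversion is convenient: try to parse each document as XML and fall back to the raw text when parsing fails. The file handling below is illustrative, not the procedure LDC used.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def read_document(path: str) -> str:
    """Return document text, stripping XML markup when the file parses."""
    raw = Path(path).read_text(encoding="utf-8", errors="replace")
    try:
        root = ET.fromstring(raw)
        # Concatenate all text nodes when the document is well-formed XML.
        return "".join(root.itertext())
    except ET.ParseError:
        # Some source documents fail XML parsing; keep them as plain text.
        return raw

if __name__ == "__main__":
    print(read_document("source_docs/example_doc.txt")[:200])  # hypothetical path
```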

Acknowledgement

This material is based on research sponsored by Air Force Research Laboratory and Defense Advanced Research Projects Agency under agreement number FA8750-13-2-0045. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory and Defense Advanced Research Projects Agency or the U.S. Government.

hdl:11272/GVMSH
0 downloads
Last Released: Dec 5, 2017
CHiME2 WSJ0 by Vincent, Emmanuel; Barker, Jon; Watanabe, Shinji; Le Roux, Jonathan; Nesta, Francesco; Matassoni, Marco
Description:

CHiME2 WSJ0 was developed as part of The 2nd CHiME Speech Separation and Recognition Challenge and contains approximately 166 hours of English speech from a noisy living room environment. The CHiME Challenges focus on distant-microphone automatic speech recognition (ASR) in real-world environments.

CHiME2 WSJ0 reflects the medium vocabulary track of the CHiME2 Challenge. The target utterances were taken from CSR-I (WSJ0) Complete (LDC93S6A), specifically, the 5,000 word subset of read speech from Wall Street Journal news text.

LDC also released CHiME2 Grid (LDC2017S07).

Data

Data is divided into training, development and test sets. All data is provided as 16-bit WAV files sampled at 16 kHz. The noisy utterances are provided in isolated form and in embedded form; the latter includes five seconds of background noise before and after the utterance. Seven hours of background noise not used in the training set are also included.
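As a sketch of the relationship between the two forms (assuming exactly five seconds of padding on each side, per the description; file names are hypothetical), an embedded utterance can be trimmed back to its isolated span like this:

```python
import wave

PAD_SECONDS = 5  # noise before and after the utterance in the embedded form

def trim_embedded(src: str, dst: str) -> None:
    """Cut the leading/trailing noise padding from an embedded utterance."""
    with wave.open(src, "rb") as w:
        params = w.getparams()
        frame_bytes = w.getsampwidth() * w.getnchannels()
        pad_frames = PAD_SECONDS * w.getframerate()
        frames = w.readframes(w.getnframes())
    # Keep everything between the 5 s of leading and trailing noise.
    trimmed = frames[pad_frames * frame_bytes: len(frames) - pad_frames * frame_bytes]
    with wave.open(dst, "wb") as out:
        out.setparams(params)
        out.writeframes(trimmed)  # header frame count is corrected on close

if __name__ == "__main__":
    trim_embedded("embedded/utt_001.wav", "isolated/utt_001.wav")  # hypothetical paths
```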

Also included are baseline scoring, decoding and retraining tools based on Cambridge University's HTK (the Hidden Markov Model Toolkit) and related recipes. These tools include three baseline speaker-independent recognition systems trained on clean, reverberated and noisy data, respectively, and a number of scripts.

hdl:11272/YLKXL
0 downloads
Last Released: Dec 4, 2017
UCLA High-Speed Laryngeal Video and Audio by Chen, Gang; Neubauer, Juergen; Garellek, Marc; Samlan, Robin; Gerratt, Bruce R.; Kreiman, Jody; Alwan, Abeer
Description:

UCLA High-Speed Laryngeal Video and Audio was developed by UCLA Speech Processing and Auditory Perception Laboratory and is comprised of high-speed laryngeal video recordings of the vocal folds and synchronized audio recordings from nine subjects collected between April 2012 and April 2013. Speakers were asked to sustain the vowel /i/ for approximately ten seconds while holding voice quality, fundamental frequency, and loudness as steady as possible.

In the field of speech production theory, data such as that contained in this release may be used to study the relationship between vocal fold vibration and the resulting voice quality.

Data

None of the subjects had a history of a voice disorder. There was no native language requirement for recruiting subjects; participants were native speakers of various languages, including English, Mandarin Chinese, Taiwanese Mandarin, Cantonese and German.

Audio data is presented as 16 kHz, 16-bit FLAC, and video is in AVI format at 5 fps (frames per second).

hdl:11272/I4CML
0 downloads
Last Released: Dec 4, 2017
RATS Keyword Spotting by Graff, David; Ma, Xiaoyi; Strassel, Stephanie; Walker, Kevin; Jones, Karen
Description:

RATS Keyword Spotting was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 3,100 hours of Levantine Arabic and Farsi conversational telephone speech with automatic and manual annotation of speech segments, transcripts and keywords generated from transcript content. The corpus was created to provide training, development and initial test sets for the keyword spotting (KWS) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.

The goal of the RATS program was to develop human language technology systems capable of performing speech detection, language identification, speaker identification and keyword spotting on the severely degraded audio signals that are typical of various radio communication channels, especially those employing various types of handheld portable transceiver systems. To support that goal, LDC assembled a system for the transmission, reception and digital capture of audio data that allowed a single source audio signal to be distributed and recorded over eight distinct transceiver configurations simultaneously. Those configurations included three frequencies – high, very high and ultra high – variously combined with amplitude modulation, frequency hopping spread spectrum, narrow-band frequency modulation, single-side-band or wide-band frequency modulation. Annotations on the clear source audio signal, e.g., time boundaries for the duration of speech activity, were projected onto the corresponding eight channels recorded from the radio receivers.

Data

The source audio consists of conversational telephone speech recordings collected by LDC: (1) data collected for the RATS program from Levantine Arabic and Farsi speakers; and (2) material from Levantine Arabic QT Training Data Set 5, Speech (LDC2006S29) and CALLFRIEND Farsi Second Edition Speech (LDC2014S01).

Annotation was performed in two steps. Transcripts of calls were either produced or already available from the source corpora. For the CALLFRIEND Farsi calls, transcripts were updated by native Farsi speakers. Potential target keywords were selected from the transcripts on the basis of overall word frequencies to fall within a given range of target-word likelihood per hour of speech. The selected words were then reviewed by native speakers to confirm that each selection was a regular word or multi-word expression of more than three syllables.
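A toy sketch of that frequency-based selection step (the real thresholds, tokenization and syllable review were carried out by LDC and its native-speaker annotators; the rate limits below are placeholders):

```python
from collections import Counter

def candidate_keywords(transcript_lines, hours_of_speech,
                       min_per_hour=0.5, max_per_hour=5.0):
    """Pick words whose occurrence rate per hour of speech falls in a target range."""
    counts = Counter()
    for line in transcript_lines:      # each line: one transcribed utterance
        counts.update(line.split())
    return sorted(
        word for word, count in counts.items()
        if min_per_hour <= count / hours_of_speech <= max_per_hour
    )

if __name__ == "__main__":
    sample = ["alpha beta gamma", "beta gamma gamma", "delta beta"]
    print(candidate_keywords(sample, hours_of_speech=2.0))
```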

All audio files are presented as single-channel, 16-bit PCM, 16000 samples per second; lossless FLAC compression is used on all files; when uncompressed, the files have typical “MS-WAV” (RIFF) file headers.

The data is divided into a training set, an initial development set and an initial evaluation set (note that the initial evaluation used only Levantine Arabic data).

hdl:11272/OR6J5
0 downloads
Last Released: Dec 4, 2017
BOLT English Discussion Forums by Tracey, Jennifer; Lee, Haejoong; Strassel, Stephanie
Description:

BOLT English Discussion Forums was developed by the Linguistic Data Consortium (LDC) and consists of 830,440 discussion forum threads in English harvested from the Internet using a combination of manual and automatic processes.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources – discussion forums, text messaging and chat – in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference. The material in this release represents the unannotated English source data in the discussion forum genre.

Data

Collection was seeded based on the results of manual data scouting by native speaker annotators. Scouts were instructed to seek content in English that was original, interactive and informal. Upon locating an appropriate thread, scouts submitted the URL and some simple judgments about it to a database, via a web browser plug-in. When multiple threads from a forum were submitted, the entire forum was automatically harvested and added to the collection. The scale of the collection precluded manual review of all data. Only a small portion of the threads included in this release were manually reviewed, and it is expected that there may be some offensive or otherwise undesired content as well as some threads that contain a large amount of non-English content. Language identification was performed on all threads in this corpus (using CLD2), and threads for which the results indicate a high probability of largely non-English content are listed in eng_suspect_LID.txt in the docs directory of this package.
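When working with this corpus, one may want to skip the threads flagged as likely non-English. A minimal sketch, assuming eng_suspect_LID.txt lists one thread identifier per line (an assumption about its format; the thread ID below is hypothetical):

```python
from pathlib import Path

def load_suspect_ids(docs_dir: str) -> set:
    """Read the thread identifiers flagged by language identification."""
    path = Path(docs_dir) / "eng_suspect_LID.txt"
    # Assumed format: one thread identifier per line.
    return {line.strip() for line in path.read_text(encoding="utf-8").splitlines()
            if line.strip()}

def keep_thread(thread_id: str, suspect_ids: set) -> bool:
    """True if the thread was not flagged as largely non-English."""
    return thread_id not in suspect_ids

if __name__ == "__main__":
    suspects = load_suspect_ids("docs")            # docs directory of the package
    print(keep_thread("thread_000001", suspects))  # hypothetical thread id
```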

The corpus comprises zipped HTML and XML files. Each HTML file is the raw HTML downloaded from a discussion thread; if a thread spanned multiple URLs, it was stored as a concatenation of the downloaded HTML files. The XML files were converted from the raw HTML.

hdl:11272/FLLRC
0 downloads
Last Released: Dec 4, 2017
Labour Force Survey
Description:

The Labour Force Survey provides estimates of employment and unemployment which are among the most timely and important measures of performance of the Canadian economy. With the release of the survey results only 10 days after the completion of data collection, the LFS estimates are the first of the major monthly economic data series to be released.

The Canadian Labour Force Survey was developed following the Second World War to satisfy a need for reliable and timely data on the labour market. Information was urgently required on the massive labour market changes involved in the transition from a war to a peace-time economy. The main objective of the LFS is to divide the working-age population into three mutually exclusive classifications - employed, unemployed, and not in the labour force - and to provide descriptive and explanatory data on each of these.

LFS data are used to produce the well-known unemployment rate as well as other standard labour market indicators such as the employment rate and the participation rate. The LFS also provides employment estimates by industry, occupation, public and private sector, hours worked and much more, all cross-classifiable by a variety of demographic characteristics. Estimates are produced for Canada, the provinces, the territories and a large number of sub-provincial regions. For employees, wage rates, union status, job permanency and workplace size are also produced. For a full listing and description of LFS variables, see the Guide to the Labour Force Survey (71-543-G).

These data are used by different levels of government for evaluation and planning of employment programs in Canada. Regional unemployment rates are used by Employment and Social Development Canada to determine eligibility, level and duration of insurance benefits for persons living within a particular employment insurance region. The data are also used by labour market analysts, economists, consultants, planners, forecasters and academics in both the private and public sector.

Important note -- 4 August 2017

Labour Force Survey (LFS) data from January 2017 to July 2017 contained errors in numerical variables: variables such as HRLYARN and UHRSMAIN were missing decimal placeholders, so their values were off by a factor of 100. The issue has been addressed and the data for the year re-released.
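The re-released files already contain the corrected values, so no adjustment is needed; purely as an illustration of the scale of the error, dividing an affected variable by 100 restores the intended decimal placement. A sketch using pandas, with the variable names as listed in the notice and made-up example values:

```python
import pandas as pd

# Variables named in the notice as affected in the January-July 2017 files.
AFFECTED = ["HRLYARN", "UHRSMAIN"]

def restore_decimals(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: undo a missing-decimal error that inflated values 100x."""
    fixed = df.copy()
    for col in AFFECTED:
        if col in fixed.columns:
            fixed[col] = fixed[col] / 100.0
    return fixed

if __name__ == "__main__":
    demo = pd.DataFrame({"HRLYARN": [1525, 2200], "UHRSMAIN": [3750, 4000]})
    print(restore_decimals(demo))  # e.g. 1525 -> 15.25
```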

hdl:11272/10439
140 downloads + analyses
Last Released: Dec 1, 2017
Ancient Chinese Corpus by Chen, Xiaohe; Li, Bin; Feng, Minxuan; Xu, Chao; Xu, Runhua; Shi, Min; Yu, Lili; Xiao, Lei; Wang, Qingqing
Description:

Ancient Chinese Corpus was developed at Nanjing Normal University. It contains word-segmented and part-of-speech tagged text from Zuozhuan, an ancient Chinese work believed to date from the Warring States Period (475-221 BC). Zuozhuan is a commentary on the Chunqiu, a history of the Chinese Spring and Autumn period (770-476 BC). This release is part of a continuing project to develop a large, part-of-speech tagged ancient Chinese corpus.

Data

Ancient Chinese Corpus consists of 180,000 Chinese characters and 195,000 segment units (including words and punctuation). The part-of-speech tag set was developed by Nanjing Normal University and contains 17 tags.

This release contains two text files comprising 268 paragraphs and 10,560 lines. Each line is one sentence; paragraphs are separated by an empty line. Each word is tagged with its part of speech, and tokens are separated by spaces.

The files are presented as UTF-8 plain text using traditional Chinese script.
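A small sketch of reading that format. The separator between a word and its tag is not specified above; the code assumes a word/TAG convention, which is an assumption, and the file name is hypothetical.

```python
def read_tagged_file(path, sep="/"):
    """Yield one sentence per line as a list of (word, pos_tag) pairs.

    Assumes each token is written as word<sep>TAG and tokens are separated
    by spaces; empty lines mark paragraph boundaries and are skipped.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            pairs = []
            for token in line.split():
                word, found, tag = token.rpartition(sep)
                if not found:          # no separator: keep token, leave tag empty
                    word, tag = token, ""
                pairs.append((word, tag))
            yield pairs

if __name__ == "__main__":
    for sentence in read_tagged_file("zuozhuan_part1.txt"):  # hypothetical file name
        print(sentence[:5])
        break
```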

hdl:11272/Q2YZM
0 downloads
Last Released: Nov 29, 2017
English Web Treebank Propbank by O'Gorman, Tim; Conger, Katherine; Palmer, Martha
Description:

English Web Treebank Propbank, LDC Catalog Number LDC2017T15 and ISBN 1-58563-818-8, was developed by the University of Colorado Boulder - CLEAR (Computational Language and Education Research) and provides predicate-argument structure annotation for English Web Treebank (LDC2012T13).

The goal of Propbank (proposition bank) annotation is to annotate text with information about basic semantic propositions. English Web Treebank Propbank provides semantic role annotation and predicate sense disambiguation for roughly 50,000 predicates, corresponding to all verbs, all adjectives in equational clauses and all nouns considered to be predicative. Mark-up is in the “unified” Propbank annotation format, which combines the representations for nouns, verbs and adjectives.

Data

The source data consists of weblogs, newsgroups, email, reviews and questions-answers. Human annotators followed the guidelines included with this release. Annotated propositions were automatically validated to ensure that (1) pointers to the tree nodes were valid, (2) Propbank labels were valid, and (3) Propbank annotation was consistent with the associated frameset.

Additionally, XML frame files were validated against the included DTD and were checked for frame-internal consistency (e.g., misspellings, extraneous characters, general correctness). Data is presented in UTF-8 XML files.

hdl:11272/Y5ABH
0 downloads
Last Released: Nov 29, 2017
MWE-Aware English Dependency Corpus 2.0 by Kato, Akihiko; Shindo, Hiroyuki; Matsumoto, Yuji
Description:

MWE-Aware English Dependency Corpus Version 2.0 was developed by the Nara Institute of Science and Technology Computational Linguistics Laboratory and consists of English compound function words annotated in dependency format. The data is derived from OntoNotes Release 5.0 (LDC2013T19).

Compound function words are a type of multiword expression (MWE). MWEs are groups of tokens that can be treated as a single semantic or syntactic unit. Doing so facilitates natural language processing tasks such as constituency and dependency parsing.

Version 2.0 adds annotations of named entities (persons, locations, organizations) into dependency trees that are aware of compound function words. Version 1.0 is available from LDC as MWE-Aware English Dependency Corpus (LDC2017T01).

Data

MWE-Aware English Dependency Corpus Version 2.0 was derived from the Wall Street Journal portion of OntoNotes Release 5.0. MWEs were identified in OntoNotes’ phrase structure trees and each MWE was established as a single subtree. Those phrase structure subtrees were then converted to a dependency structure (the Stanford dependencies) in CoNLL format.

The data is split into 1,728 phrase structure trees stored in *.parse files and a single 14-column, tab-separated dependency file (*.conll). Both file types are encoded as UTF-8.
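A minimal sketch of reading the dependency file, grouping rows into sentences at blank lines. The semantics of the 14 columns are not described here, so the code only splits fields; the file name is hypothetical.

```python
def read_conll(path):
    """Yield sentences from a tab-separated CoNLL-style file.

    Each sentence is a list of rows; each row is the list of column values
    for one token (column meanings are not specified in this description).
    """
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                 # blank line ends a sentence
                if sentence:
                    yield sentence
                    sentence = []
                continue
            sentence.append(line.split("\t"))
    if sentence:
        yield sentence

if __name__ == "__main__":
    for sent in read_conll("mwe_dependencies.conll"):  # hypothetical file name
        print(len(sent), "tokens; first row:", sent[0])
        break
```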

hdl:11272/EIZA8
0 downloads
Last Released: Nov 29, 2017
 
 