Studies: 2088 | Downloads: 73368
CIEMPIESS Experimentation by Mena, Carlos Daniel Hernández

CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Experimentation was developed by the social service program "Desarrollo de Tecnologías del Habla" of the "Facultad de Ingeniería" (FI) at the National Autonomous University of Mexico (UNAM) and consists of approximately 22 hours of Mexican Spanish broadcast and read speech with associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. For more information and documentation see the CIEMPIESS-UNAM Project website.

CIEMPIESS Experimentation is a set of three different data sets, specifically Complementary, Fem and Test. Complementary is a phonetically-balanced corpus of isolated Spanish words spoken in Central Mexico. Fem contains broadcast speech from 21 female speakers, collected to balance by gender the number of recordings from male speakers in other CIEMPIESS collections. Test consists of 10 hours of broadcast speech and transcripts and is intended for use as a standard test data set alongside other CIEMPIESS corpora. See the included documentation for more details on each corpus.

LDC has released the following data sets in the CIEMPIESS series:

  • CIEMPIESS (LDC2015S07)
  • CHM150 (LDC2016S04)
  • CIEMPIESS Light (LDC2017S23)
  • CIEMPIESS Balance (LDC2018S11)


The majority of the speech recordings in Fem and Test were collected from Radio-IUS, a UNAM radio station. Other recordings were taken from IUS Canal Multimedia and Centro Universitario de Estudios Jurídicos (CUEJ UNAM). Those two channels feature videos with speech around legal issues and topics related to UNAM. The Complementary recordings consist of read speech collected for that corpus.

Complementary includes specifications for creating transcripts using the phonetic alphabet Mexbet and for converting Mexbet output to the International Phonetic Alphabet and X-SAMPA. An automatic phonetizer for Mexbet, written in Python 2.7, to create pronouncing dictionaries is provided as well.

The audio files in this release are 16 kHz, 16-bit PCM in FLAC format. Transcripts are UTF-8 encoded plain text.

Last Released: Jun 21, 2019
Anthology of Fourteen Canadian Poets, 1972 by Djwa, Sandra; Coulthard, WJ; Herring, W

The poetry tape contains the collected poems of fourteen Canadian poets, stored in a compact format and keyed by uniquely constructed poem code numbers. These poems were collected and keypunched under the supervision of Mrs. Sandra Djwa of the Department of English at UBC.


The original FORTRAN IV programs to parse the text data are not available, although the functionality was limited to:

  1. Creating a list of poem titles
  2. Displaying poems by index number

UBC Library Data Services note, 21 June 2019

Last Released: Jun 21, 2019

TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 was developed by the Linguistic Data Consortium (LDC) and contains training and evaluation data produced in support of the TAC KBP Chinese Regular Slot Filling evaluation track conducted in 2014.

Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. Through its various evaluations, the Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base and extract novel information about entities from a document collection and add it to a new or existing knowledge base.

The regular Chinese Slot Filling evaluation track involved mining information about entities from text. Slot Filling can be viewed as more traditional Information Extraction, or alternatively, as a Question Answering task, in which the questions are static but the targets change. In completing the task, participating systems and LDC annotators searched a corpus for information on certain attributes (slots) of person and organization entities and attempted to return all valid answers (slot fillers) in the source collection. For more information about Chinese Slot Filling, please refer to the 2014 track home page.


This release contains all evaluation and training data developed in support of TAC KBP Chinese Regular Slot Filling. This includes queries, the 'manual runs' (human-produced responses to the queries), the final rounds of assessment results and the complete set of Chinese source documents. All text data is encoded as UTF-8.

Last Released: Jun 20, 2019
FactBank 1.0 by Sauri, Roser; Pustejovsky, James

FactBank 1.0, Linguistic Data Consortium (LDC) catalog number LDC2009T23 and ISBN 1-58563-522-7, consists of 208 documents (over 77,000 tokens) from newswire and broadcast news reports in which event mentions are annotated with their degree of factuality, that is, the degree to which they correspond to those events. FactBank 1.0 was built on top of TimeBank 1.2 and a fragment of the AQUAINT TimeML Corpus, both of which used the TimeML specification language. This resulted in a double-layered annotation of event factuality. TimeBank 1.2 and AQUAINT TimeML encode most of the basic structural elements expressing factuality information while FactBank 1.0 represents the resulting factuality interpretation. The combination of the factuality values in FactBank with the structural information in TimeML-annotated corpora facilitates the development of tools aimed at automatically identifying the factuality values of events, a component fundamental in tasks requiring some degree of text understanding, such as Textual Entailment, Question Answering, or Narrative Understanding.

FactBank annotations indicate whether the event mention describes actual situations in the world, situations that have not happened, or situations of uncertain interpretation. Event factuality is not an inherent feature of events but a matter of perspective. Different discourse participants may present divergent views about the factuality of the very same event. Consequently, in FactBank, the factuality degree of events is assigned relative to the relevant sources at play. In this way, it can adequately reflect the divergence of opinions regarding the factual status of events, as is common in news reports.

The annotation language is grounded in established linguistic analyses of the phenomenon, which facilitated the creation of a battery of discriminatory tests for distinguishing between factuality values. Furthermore, the annotation procedure was carefully designed and divided into basic, sequential annotation tasks. This made it possible for hard tasks to be built on top of simpler ones, while at the same time allowing annotators to become incrementally familiar with the complexity of the problem. As a result, FactBank annotation achieved a relatively high inter-annotator agreement (kappa = 0.81), a positive result when considered against similar annotation efforts.


All FactBank markup is standoff and is represented through a set of 20 tables which can be easily loaded into a database. Each table resides in an independent text file, where fields are separated by three consecutive bars (i.e., |||). The data in fields of string type are enclosed in single quotation marks (').
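The table layout described above can be read with a few lines of Python. This is a minimal sketch, not the release's own loader: the function name `parse_factbank_line` and the sample row are made up for illustration, and the sketch assumes that string fields never contain the ||| separator or escaped quotes, which should be checked against the actual files.

```python
def parse_factbank_line(line, sep="|||"):
    """Split one table row on the ||| separator and unquote string fields.

    Assumption: a field wrapped in single quotes is a string; anything else
    is tried as an integer and otherwise kept as raw text.
    """
    fields = []
    for raw in line.rstrip("\n").split(sep):
        if len(raw) >= 2 and raw.startswith("'") and raw.endswith("'"):
            fields.append(raw[1:-1])  # drop the surrounding single quotes
        else:
            try:
                fields.append(int(raw))
            except ValueError:
                fields.append(raw)
    return fields


# Hypothetical row, invented for this example (not taken from the corpus).
sample_row = "'wsj_0001.tml'|||3|||'e12'|||'said'\n"
print(parse_factbank_line(sample_row))
```

Rows parsed this way can be passed to a database insert statement, one table file at a time.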

Because FactBank 1.0 was built on top of TimeBank 1.2 and AQUAINT TimeML, both of which are marked up with inline XML-based annotation, this release contains the TimeBank 1.2 and AQUAINT TimeML annotation in standoff, table-based format as well.

Last Released: Jun 20, 2019

The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff. The corpus includes:

  • Over 1.8 million articles (excluding wire services articles that appeared during the covered period).
  • Over 650,000 article summaries written by library scientists.
  • Over 1,500,000 articles manually tagged by library scientists with tags drawn from a normalized indexing vocabulary of people, organizations, locations and topic descriptors.
  • Over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff.
  • Java tools for parsing corpus documents from .xml into a memory resident object.

As part of the New York Times' indexing procedures, most articles are manually summarized and tagged by a staff of library scientists. This collection contains over 650,000 article-summary pairs which may prove to be useful in the development and evaluation of algorithms for automated document summarization. Also, over 1.5 million documents have at least one tag. Articles are tagged for persons, places, organizations, titles and topics using a controlled vocabulary that is applied consistently across articles. For instance if one article mentions "Bill Clinton" and another refers to "President William Jefferson Clinton", both articles will be tagged with "CLINTON, BILL".

The New York Times has established a community website for researchers working on the data set and encourages feedback and discussion about the corpus.


The text in this corpus is formatted in News Industry Text Format (NITF) developed by the International Press Telecommunications Council, an independent association of news agencies and publishers. NITF is an XML specification that provides a standardized representation for the content and structure of discrete news articles. NITF encompasses structural markup such as bylines, headlines and paragraphs. The format also provides management attributes for categorizing articles into topics, summarization usage restrictions and revision histories. The goals of NITF are to answer the essential questions inherent in news articles: Who, What, When, Where and Why.

  • Who: Who owns the copyright, who has rights to republish the article and who the article is about.
  • What: The subjects reported, the named entities inside the article and the events it describes.
  • When: When the article was written, when it was issued and when it was revised.
  • Where: Where the article was written, where the events took place and where it was delivered.
  • Why: The metadata describing the newsworthiness of the article.
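Because NITF is XML, an article can be read with a standard XML parser. The following is a minimal sketch using Python's built-in `xml.etree.ElementTree`. The sample document is invented for illustration, and the element names (`hedline`, `hl1`, `byline`, `body.content`, `block`, `p`) are assumed from the NITF specification; they should be verified against actual corpus files. The function name `read_article` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up NITF-style document; real corpus files carry far more metadata.
SAMPLE_NITF = """<nitf>
  <head>
    <title>Example Headline</title>
  </head>
  <body>
    <body.head>
      <hedline><hl1>Example Headline</hl1></hedline>
      <byline>By A. Reporter</byline>
    </body.head>
    <body.content>
      <block>
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </block>
    </body.content>
  </body>
</nitf>"""


def read_article(xml_text):
    """Return the headline, byline and paragraph texts of one article."""
    root = ET.fromstring(xml_text)
    # Take the first matching element anywhere in the tree, or None.
    headline = next((e.text for e in root.iter("hl1")), None)
    byline = next((e.text for e in root.iter("byline")), None)
    paragraphs = [p.text or "" for p in root.iter("p")]
    return {"headline": headline, "byline": byline, "paragraphs": paragraphs}


article = read_article(SAMPLE_NITF)
print(article["headline"], len(article["paragraphs"]))
```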

Last Released: Jun 17, 2019
Korean Telephone Conversations Transcripts by Ko, Eon-Suk; Han, Na-Rae; Strassel, Stephanie; Martey, Nii
Last Released: Jun 14, 2019

The Social Policy Simulation Database and Model (SPSD/M) is a tool designed to assist those interested in analyzing the financial interactions of governments and individuals in Canada. It can help one to assess the cost implications or income redistributive effects of changes in the personal taxation and cash transfer system. As the name implies, SPSD/M consists of two integrated parts: a database (SPSD), and a model (SPSM). The SPSD is a non-confidential, statistically representative database of individuals in their family context, with enough information on each individual to compute taxes paid to and cash transfers received from government. The SPSM is a static accounting model which processes each individual and family on the SPSD, calculates taxes and transfers using legislated or proposed programs and algorithms, and reports on the results. A sophisticated software environment gives the user a high degree of control over the inputs and outputs to the model and can allow the user to modify existing programs or test proposals for entirely new programs. The model comes with full documentation including an on-line help facility.
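The idea of a static accounting model can be illustrated with a toy sketch: each weighted record is passed through a tax rule and a transfer rule, and the results are summed. This is not the SPSM itself. The records, the 20% flat tax, the per-child benefit and all function names below are hypothetical and chosen only to show the mechanics.

```python
def simulate(records, tax_fn, transfer_fn):
    """Apply tax and transfer rules to every record and total the weighted results."""
    total_tax = 0.0
    total_transfers = 0.0
    for rec in records:
        # Each record stands for `weight` similar individuals in the population.
        total_tax += rec["weight"] * tax_fn(rec)
        total_transfers += rec["weight"] * transfer_fn(rec)
    return {"tax": total_tax, "transfers": total_transfers}


def flat_tax(rec):
    """Hypothetical rule: a 20% flat tax on income."""
    return 0.20 * rec["income"]


def child_benefit(rec):
    """Hypothetical rule: 1,000 dollars per child."""
    return 1000 * rec["children"]


# Invented records, not drawn from the SPSD.
records = [
    {"weight": 1000, "income": 40000, "children": 2},
    {"weight": 500, "income": 90000, "children": 0},
]
print(simulate(records, flat_tax, child_benefit))
```

Swapping in a different `tax_fn` or `transfer_fn` and comparing the totals mirrors, in miniature, how a proposed program is compared against a baseline.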

Users and Applications

The SPSD/M has been used in hundreds of sites across Canada. These sites have diverse research interests in the area of income tax-transfer and commodity tax systems in Canada as well as varied experience in micro-simulation. Our growing client base includes federal departments, provincial governments, universities, interest groups, corporate divisions, and private consultants. The diverse applications of the SPSD/M can be seen in the following examples of studies and published research reports:

  • Costing out proposals for amendments to the Income Tax Act affecting the tax treatment of seniors and the disabled
  • Estimating the fiscal viability of major personal tax reform options, including three flat tax scenarios
  • The comparison of low income (poverty) measures and their effect on the estimates of the number of poor
  • An Analysis of the Distributional Impact of the Goods and Services Tax
  • Married and Unmarried Couples: The Tax Question
  • Taxes and Transfers in Rural Canada
  • Equivalencies in Canadian Public Policy
  • When the Baby Boom Grows Old: Impact on Canada's Public Sector

Some potential uses of the model are illustrated by the following list of questions which may be answered using the SPSM:

  • How large an increase in the federal Child Tax Benefit could be financed by allocating an additional $500 million to the program?
  • Which province would have the most advantageous tax structure for an individual with $45,000 earned income, 2 children and $15,000 of investment income?
  • What is the after-tax value of the major federal child support programs on a per child basis, and how are these benefits distributed across family types and income groups?
  • How many individuals otherwise paying no tax would have to pay tax under various minimum tax systems, and what would additional government revenues be?
  • How much money would be needed to raise all low income families and persons to Statistics Canada's low income cut-offs in 2014?
  • How much would average household "consumable" income rise if a province eliminated its gasoline taxes?
  • How much would federal government revenue rise if the GST rate were increased?

Last Released: Jun 14, 2019
RST Discourse Treebank by Carlson, Lynn; Marcu, Daniel; Okurowski, Mary Ellen

Rhetorical Structure Theory (RST) Discourse Treebank was developed by researchers at the Information Sciences Institute (University of Southern California), the US Department of Defense and the Linguistic Data Consortium (LDC). It consists of 385 Wall Street Journal articles from the Penn Treebank annotated with discourse structure in the RST framework along with human-generated extracts and abstracts associated with the source documents.

In the RST framework (Mann and Thompson, 1988), a text's discourse structure can be represented as a tree in four aspects: (1) the leaves correspond to text fragments called elementary discourse units (the minimal discourse units); (2) the internal nodes of the tree correspond to contiguous text spans; (3) each node is characterized by its nuclearity, or essential unit of information; and (4) each node is also characterized by a rhetorical relation between two or more non-overlapping, adjacent text spans.
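The four aspects above map naturally onto a small tree data structure. The sketch below is one possible in-memory representation, not the treebank's file format: the class name `RSTNode`, its field names, the relation label and the example sentence are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RSTNode:
    nuclearity: str                 # e.g. "Nucleus", "Satellite" or "Root"
    relation: Optional[str] = None  # rhetorical relation holding among the children
    text: Optional[str] = None      # set only on leaves (elementary discourse units)
    children: List["RSTNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def edus(self) -> List[str]:
        """Return the elementary discourse units under this node, left to right."""
        if self.is_leaf():
            return [self.text or ""]
        units: List[str] = []
        for child in self.children:
            units.extend(child.edus())
        return units


# Invented two-unit example: a nucleus elaborated by a satellite.
tree = RSTNode(
    nuclearity="Root",
    relation="elaboration",
    children=[
        RSTNode(nuclearity="Nucleus", text="The company reported a profit,"),
        RSTNode(nuclearity="Satellite", text="its first in three years."),
    ],
)
print(tree.edus())
```

Concatenating the result of `edus()` for any internal node yields the contiguous text span that node covers.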


The data in this release is divided into a training set (347 documents) and a test set (38 documents). All annotations were produced using a downloadable discourse annotation tool.

Human-generated material in the corpus includes (1) long and short abstracts for 30 documents that were intended to convey the essential information and the main topic of the article, respectively; and (2) long, short and informative extracts for 180 documents, some of which were created from scratch and some of which were derived from the human-produced abstracts indicated above.

Last Released: Jun 13, 2019

LFS data are used to produce the well-known unemployment rate as well as other standard labour market indicators such as the employment rate and the participation rate. The LFS also provides employment estimates by industry, occupation, public and private sector, hours worked and much more, all cross-classifiable by a variety of demographic characteristics. Estimates are produced for Canada, the provinces, the territories and a large number of sub-provincial regions. For employees, data on wage rates, union status, job permanency and establishment size are also produced.
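The three headline indicators are simple ratios of population counts. The sketch below uses the standard definitions (the labour force is the employed plus the unemployed; the employment and participation rates are taken over the working-age population). The function name and the counts are made up for illustration and are not LFS estimates.

```python
def labour_market_rates(employed, unemployed, population):
    """Compute standard labour market indicators as percentages.

    `population` is the working-age population, including people
    who are not in the labour force.
    """
    labour_force = employed + unemployed
    return {
        "unemployment_rate": 100.0 * unemployed / labour_force,
        "employment_rate": 100.0 * employed / population,
        "participation_rate": 100.0 * labour_force / population,
    }


# Invented counts, in thousands of persons.
rates = labour_market_rates(employed=900, unemployed=100, population=1500)
print(rates)
```

Note that the unemployment rate is taken over the labour force, not the whole population, so it can rise when discouraged workers resume their job search even if employment is unchanged.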

These data are used by different levels of government for evaluation and planning of employment programs in Canada. Regional unemployment rates are used by Employment and Social Development Canada to determine eligibility, level and duration of insurance benefits for persons living within a particular employment insurance region. The data are also used by labour market analysts, economists, consultants, planners, forecasters and academics in both the private and public sector.

Last Released: Jun 13, 2019

Small area data on citizens and families aged 55 years and over for Canada, British Columbia and the following British Columbia postal communities: Dawson Creek, Fort St. John, Prince George, Prince Rupert, Quesnel, Smithers, Terrace, Williams Lake.

Last Released: May 15, 2019
Abacus Dataverse Network - British Columbia Research Library Data Services - Hosted at the University of British Columbia © 2018