The UBC Library data collection includes all data from common data sources for easy, one-stop data shopping.
UBC Library Data Services
Studies: 1999 | Downloads: 56444
GALE Phase 4 Arabic Broadcast News Speech by Walker, Kevin; Caruso, Christopher; Maeda, Kazuaki; DiPersio, Denise; Strassel, Stephanie
Description:

GALE Phase 4 Arabic Broadcast News Speech was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009 by LDC and MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Arabic Broadcast News Transcripts (LDC2018T14).

Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites: Hong Kong University of Science and Technology (Chinese), Medianet (Arabic) and MTC (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

LDC’s local broadcast collection system is highly automated, easily extensible and robust and capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular. All signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output. An overview of the system, the sources recorded and the configuration of the recording laboratory are contained in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.

LDC designed a portable platform for remote broadcast collection. This is a TiVO-style digital video recording (DVR) system that records two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside of the United States. It has a small footprint, weighs less than 30 pounds and can be transported as carry-on luggage.

Medianet collected Arabic programming from across the Gulf region using its internal system and LDC’s portable broadcast collection platform installed in 2008. The portable platform deployed at the Medianet Tunisian collection facility collected multiple streams of regional Arabic programming from various sources. MTC collected Arabic programming using its internal collection system.

Data

The recordings in this release feature news broadcasts focusing principally on current events from the following sources: Abu Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates; Al Arabiya, a news television station based in Dubai; Al Baghdadya, an Iraqi broadcast programmer; Alhurra, a U.S. government-funded regional broadcaster; Al Iraqiyah, an Iraqi television station; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Kuwait TV, a national broadcast station based in Kuwait; Radio Sawa, a U.S. government-funded regional broadcaster; Saudi TV, a national television station based in Saudi Arabia; Syria TV, the national television station in Syria; and Yemen TV, a television station based in Yemen.

This release contains 51 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program’s genre, data type and topic.
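
As a quick sanity check on the stated audio format, the sketch below reads the STREAMINFO block at the start of a FLAC file using only the Python standard library and compares it with the parameters given above (16000 Hz, single channel, 16 bits). The function names and the idea of validating files this way are illustrative assumptions and are not part of the release; the field layout follows the FLAC format specification.

```python
def parse_flac_streaminfo(data: bytes) -> dict:
    """Decode the mandatory STREAMINFO block from the first bytes of a FLAC file."""
    if data[:4] != b"fLaC":
        raise ValueError("missing 'fLaC' stream marker")
    block_type = data[4] & 0x7F                 # low 7 bits: metadata block type
    length = int.from_bytes(data[5:8], "big")   # 24-bit block length
    body = data[8:8 + length]
    if block_type != 0 or length < 34 or len(body) < 34:
        raise ValueError("first metadata block is not a complete STREAMINFO")
    # Bytes 10-17 pack four fields: 20-bit sample rate, 3-bit (channels - 1),
    # 5-bit (bits per sample - 1) and 36-bit total sample count.
    packed = int.from_bytes(body[10:18], "big")
    sample_rate = packed >> 44
    total_samples = packed & ((1 << 36) - 1)
    return {
        "sample_rate": sample_rate,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "duration_s": total_samples / sample_rate if sample_rate else 0.0,
    }


def matches_release_format(info: dict) -> bool:
    """True if the header describes 16000 Hz, single-channel, 16-bit audio."""
    found = (info["sample_rate"], info["channels"], info["bits_per_sample"])
    return found == (16000, 1, 16)


def check_file(path: str) -> bool:
    """Read only the first 42 bytes of a file, which is enough for STREAMINFO."""
    with open(path, "rb") as handle:
        return matches_release_format(parse_flac_streaminfo(handle.read(42)))
```

Calling `check_file` on each of the 51 files would flag any file whose header does not match the description; it inspects the header only and does not decode the audio.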

Production Date: May 15, 2018
Producer: Linguistic Data Consortium (LDC), University of Pennsylvania
Distributor: Linguistic Data Consortium (LDC), University of Pennsylvania
hdl:11272/TSKSR
0 downloads
Last Released: Jun 20, 2018
GALE Phase 4 Arabic Broadcast News Transcripts by Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki
Description:

GALE Phase 4 Arabic Broadcast News Transcripts was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009 by the Linguistic Data Consortium (LDC), MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 4 Arabic Broadcast News Speech (LDC2018S05).

The recordings for transcription feature news broadcasts focusing principally on current events from the following sources: Abu Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates; Al Arabiya, a news television station based in Dubai; Al Baghdadya, an Iraqi broadcast programmer; Alhurra, a U.S. government-funded regional broadcaster; Al Iraqiyah, an Iraqi television station; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Kuwait TV, a national broadcast station based in Kuwait; Radio Sawa, a U.S. government-funded regional broadcaster; Saudi TV, a national television station based in Saudi Arabia; Syria TV, the national television station in Syria; and Yemen TV, a television station based in Yemen.

Data

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 204,735 tokens. The transcripts were created with the LDC tool XTrans, which supports manual transcription and annotation of audio recordings. XTrans is available at https://www.ldc.upenn.edu/language-resources/tools/xtrans/downloads.
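
Because the transcripts are tab-delimited UTF-8 text, they can be read with the Python standard library alone. The sketch below is a minimal reader under several assumptions that should be checked against the release documentation: that the first non-comment line is a header row, that header fields may carry a type suffix after a semicolon (for example `transcript;unicode`), that lines beginning with `;;` are comments, and that the text column is named `transcript`. The function names are hypothetical.

```python
import csv


def read_tdf(text: str) -> list[dict]:
    """Parse tab-delimited transcript text into a list of row dictionaries."""
    # Assumption: blank lines and lines starting with ';;' carry no data rows.
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith(";;")]
    # QUOTE_NONE keeps quotation marks inside transcripts as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader, None)
    if header is None:
        return []
    # Assumption: drop an optional ';type' suffix from each header field.
    names = [field.split(";")[0] for field in header]
    return [dict(zip(names, row)) for row in reader]


def count_tokens(rows: list[dict], column: str = "transcript") -> int:
    """Count whitespace-separated tokens in one column (column name assumed)."""
    return sum(len(row.get(column, "").split()) for row in rows)


def load_tdf(path: str) -> list[dict]:
    """Read a TDF file from disk using the UTF-8 encoding stated above."""
    with open(path, encoding="utf-8") as handle:
        return read_tdf(handle.read())
```

Whitespace splitting is only an approximation of tokenization, so a count obtained this way need not reproduce the 204,735 figure exactly.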

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release. QTR transcription consists of quick (near-) verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR in the filename were developed using QTR transcription. Files with QRTR in the filename were developed using QRTR transcription.
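
The filename convention described above can be applied mechanically when sorting files by transcription style. The sketch below assumes a case-insensitive substring match, which is a guess about how the marker appears in the filenames; the function name and any filenames passed to it are hypothetical.

```python
def transcription_style(filename: str) -> str:
    """Classify a transcript file as 'QRTR', 'QTR' or 'unknown' from its name."""
    name = filename.lower()  # assumption: the marker may appear in either case
    if "qrtr" in name:
        return "QRTR"
    if "qtr" in name:
        return "QTR"
    return "unknown"


def group_by_style(filenames: list[str]) -> dict[str, list[str]]:
    """Group filenames by the transcription style found in each name."""
    groups: dict[str, list[str]] = {}
    for filename in filenames:
        groups.setdefault(transcription_style(filename), []).append(filename)
    return groups
```

Files that land in the `unknown` group would need to be checked against the release documentation by hand.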

Production Date: May 15, 2018
Producer: Linguistic Data Consortium (LDC), University of Pennsylvania
Distributor: Linguistic Data Consortium (LDC), University of Pennsylvania
hdl:11272/G0A6N
0 downloads
Last Released: Jun 20, 2018
Rhythm and Pitch by Dilley, Laura C.; Breen, Mara; Brown, Meredith; Gibson, Edward
Description:

Rhythm and Pitch contains approximately 27 minutes of spontaneous English conversations and radio news stories annotated with the Rhythm and Pitch (RaP) scheme. Speech data for annotation was taken from two corpora released by LDC, CALLHOME American English Speech (LDC97S42) and Boston University Radio Speech Corpus (LDC96S36).

The RaP system permits the capture of both intonational and rhythmic aspects of speech. Four labeling tiers are used for annotating speech prosody. These tiers carry information about the syllabic organization and orthography of the speech, its rhythmic structure, tonal patterns, and other information. More information about the RaP system is available on the RaP homepage.

Data

Speech data are presented as FLAC-compressed 16-bit WAV files. The Boston data are single-channel 16 kHz files, while the CALLHOME data are either one- or two-channel 8 kHz files. Annotations are UTF-8 encoded Praat TextGrids.
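
Since the annotations are UTF-8 Praat TextGrids, the tiers in a file can be listed with a few lines of standard-library Python. The sketch below assumes the files use Praat's long (verbose) text format, in which each tier is introduced by a `class = "..."` line followed by a `name = "..."` line; short-format TextGrids would need a different approach. The function names are hypothetical, and no tier names from the release are assumed.

```python
import re

# Matches a tier header in a long-format TextGrid: class line, then name line.
TIER_PATTERN = re.compile(
    r'class\s*=\s*"(?P<kind>[^"]+)"\s*name\s*=\s*"(?P<name>[^"]*)"'
)


def list_tiers(textgrid_text: str) -> list[tuple[str, str]]:
    """Return (tier name, tier class) pairs in the order they appear."""
    return [(m["name"], m["kind"]) for m in TIER_PATTERN.finditer(textgrid_text)]


def load_tiers(path: str) -> list[tuple[str, str]]:
    """Read a TextGrid from disk using the UTF-8 encoding stated above."""
    with open(path, encoding="utf-8") as handle:
        return list_tiers(handle.read())
```

For a file annotated with the full RaP scheme, one would expect `load_tiers` to report the four labeling tiers described above.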

Production Date: May 15, 2018
Producer: Linguistic Data Consortium (LDC), University of Pennsylvania
Distributor: Linguistic Data Consortium (LDC), University of Pennsylvania
hdl:11272/FZBLL
0 downloads
Last Released: Jun 20, 2018
Description:

The "Direction Of Trade Statistics" data file contains annual, quarterly, and monthly IMF time series of country-level distributions of exports and imports, from 1948 to the latest received -- currently December 1997.

Production Date: 1997
Producer: Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan
Distribution Date: November 18, 2009
Distributor: Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan
hdl:11272/VG4LS
0 downloads
Last Released: Jun 20, 2018
Description:

Dissipation Rate (ε, units: W/kg) of Turbulent Kinetic Energy, derived from microstructure measurements of velocity shear and temperature gradients. Shear and temperature-gradient turbulence spectra are also included, as are CTD measurements and key glider variables. All measurements are taken from a Slocum G2 ocean glider.

hdl:11272/10613
0 downloads
Last Released: Jun 15, 2018
Description:

Canada’s rapidly changing demographic profile, along with its accompanying social and economic issues, has led to much discussion concerning the relationship between work, lifestyle and well-being. Gauging the quality of life at work can help diagnose issues relating to productivity, morale, efficiency and equity. Charting patterns of home and leisure activities can take the temperature of Canadian culture. Bringing these two together will provide insight on the health and well-being of Canadians as they meet the challenges of the future.

The General Social Survey Program’s new cycle, Canadians at Work and Home, takes a comprehensive look at the way Canadians live by incorporating the realms of work, home, leisure, and overall well-being into a single unit. Data users have expressed a strong interest in knowing more about the lifestyle behaviours of Canadians that affect their health and well-being both in the workplace and at home. The strength of this survey is its ability to take diverse information Canadians provide on various facets of life and combine it in ways not previously possible with surveys that covered one main topic only.

The survey includes a multitude of themes. In the work sphere, it explores important topics such as work ethic, work intensity and distribution, compensation and employment benefits, work satisfaction and meaning, intercultural workplace relations, and bullying and harassment. On the home front, questions include family activity time, the division of labour and work-life balance. The survey also covers eating habits and nutritional awareness, the use of technology, sports and outdoor activities, and involvement in cultural activities. New-to-GSS questions on purpose in life, opportunities, life aspirations, outlook and resilience complement previously asked ones on subjective well-being, stress management and other socioeconomic variables.

Within Canada, all levels of government, academics and not-for-profit organizations have expressed interest in the results. Data from this survey will assist with program and policy decisions and research of all kinds interested in exploring the workplace, home life and leisure activities of Canadians from all areas of life. In addition, some of the data from this survey will be comparable internationally.

Production Date: June, 2018
Producer: Statistics Canada (Statcan)
Distribution Date: June 07, 2018
Distributor: Statistics Canada (Statcan)
hdl:11272/10612
8 downloads + analyses
Last Released: Jun 12, 2018
Description:

LFS data are used to produce the well-known unemployment rate as well as other standard labour market indicators such as the employment rate and the participation rate. The LFS also provides employment estimates by industry, occupation, public and private sector, hours worked and much more, all cross-classifiable by a variety of demographic characteristics. Estimates are produced for Canada, the provinces, the territories and a large number of sub-provincial regions. For employees, data on wage rates, union status, job permanency and establishment size are also produced.
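
The three headline indicators named above are simple ratios of population counts. The sketch below uses the standard definitions (unemployment rate as a share of the labour force, employment and participation rates as shares of the working-age population); the function name is hypothetical and any counts passed to it are illustrative, not LFS estimates.

```python
def labour_market_rates(employed: float, unemployed: float, population: float) -> dict:
    """Compute headline labour market indicators, in percent.

    `population` is the working-age population the survey covers; the
    labour force is everyone who is either employed or unemployed.
    """
    labour_force = employed + unemployed
    if labour_force <= 0 or population <= 0:
        raise ValueError("labour force and population must be positive")
    return {
        "unemployment_rate": 100 * unemployed / labour_force,
        "employment_rate": 100 * employed / population,
        "participation_rate": 100 * labour_force / population,
    }
```

Note that the unemployment rate uses the labour force as its denominator, while the other two rates use the whole working-age population, so the three figures do not sum to 100.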

These data are used by different levels of government for evaluation and planning of employment programs in Canada. Regional unemployment rates are used by Employment and Social Development Canada to determine eligibility, level and duration of insurance benefits for persons living within a particular employment insurance region. The data are also used by labour market analysts, economists, consultants, planners, forecasters and academics in both the private and public sector.

Production Date: February 09, 2018
Producer: Statistics Canada (Statcan)
Distributor: Statistics Canada (Statcan)
hdl:11272/10575
28 downloads + analyses
Last Released: Jun 12, 2018
Description:

ICIS Cadastre. The ICIS Cadastre is ICIS’s answer to the challenge of providing a single source parcel layer for its members. The data in this layer includes the best available parcel data from both Provincial and Local Government sources with standardized and uniform attribution. In areas where both Local Government and GeoBC contribute data, the Local Government data prevails. All ICIS Cadastre data is delivered via GeoShare, ICIS’s automated delivery mechanism, and is refreshed on a weekly basis. The data for each jurisdiction comes from either the local government or GeoBC, and is joined to standardized attribution established by GeoBC to create a uniform cadastre across boundaries and to verify that all registered parcels are included in the fabric.
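
The precedence rule described above (each jurisdiction is drawn from one source, and Local Government data prevails where both contribute) can be expressed as a small selection function. This is only an illustration of the rule as stated; the function name and jurisdiction labels are hypothetical, and nothing here reflects how GeoShare is actually implemented.

```python
def choose_sources(local_government_areas: set, geobc_areas: set) -> dict:
    """Pick one parcel source per jurisdiction, preferring Local Government."""
    all_areas = sorted(local_government_areas | geobc_areas)
    return {
        area: "Local Government" if area in local_government_areas else "GeoBC"
        for area in all_areas
    }
```

A jurisdiction covered only by GeoBC keeps GeoBC as its source, while any overlap resolves in favour of the local government submission.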

ICF. The Integrated Cadastral Fabric is produced by GeoBC’s Parcel Fabric Section on behalf of ICIS. The ICF layer includes many jurisdictions that are maintained on a bi-weekly basis by GeoBC to published attribute and currency standards. In other jurisdictions where the data is not actively maintained by GeoBC, the ICF includes local government parcel shapes with standardized Provincial attribution. An attribute on the fabric distinguishes between parcels that have been maintained by GeoBC and parcels that have been integrated from other ICIS members. More information about the ICF compilation program can be found at http://apps.gov.bc.ca/pip. GeoBC’s ICF Status Map provides a visual index to the current state of completion and maintenance of the ICF.

Local Government Cadastre. The Local Government Cadastre (As-Is-Cadastre) is a parcel fabric assembled entirely from local government data. Submissions are provided manually on an ad hoc basis and are not included in the ICIS Cadastre. Whatever datasets have been received within the last 30 days are manually loaded into the Local Government Cadastre once a month. This layer does not contain standardized attribution, although we attempt to reconcile fields with similar content as best we can. The positional accuracy and attribution vary from local government to local government.

BC Assessment. The BC Assessment Fabric is a geospatial representation of the assessment roll. It contains a record for almost every assessed property in British Columbia. Unlike a legal cadastre fabric, it is an ownership fabric. This means there may be many legal lots represented by one assessed property or folio. Properties that do not have a spatial representation defined by their corresponding local government are occasionally represented using a diamond shape as a placeholder. This is done to assist in finding a property’s location, and is usually a precursor to having a property boundary defined. The BC Assessment Fabric represents 99.14% of the assessment roll as of January 2015.

AddressBC. AddressBC is a centralized, geospatial, civic address repository, populated with address data from ICIS Local Government members and standardized to the AddressBC data model as a point feature class.

Other data: Agricultural Land Reserve Parcels, BC Centre for Disease Control Growing Days, Canadian Wildlife Service Boundaries, Conservation Parcels, Health Care Facilities, Police Jurisdiction Boundaries.

Producer: Integrated Cadastral Information Society (ICIS); BCGOV ILMB Crown Registry and Geographic Base Branch (CRGB)
hdl:11272/10167
117 downloads
Last Released: May 28, 2018
Description:

The SOG biogeochemical model of the Strait of Georgia is a 1-dimensional vertical model capable of predicting near-surface mixing, phytoplankton and zooplankton cycles, and carbonate chemistry. This dataset contains SOG model output data from 2001 to 2012 across 18 freshwater dissolved inorganic carbon (DIC) and total alkalinity (TA) scenarios. The 18 scenarios span a grid of 6 freshwater TA cases and 3 freshwater DIC:TA cases. The values for these freshwater scenarios were chosen based on a detailed analysis of the Fraser River. Please refer to the methods section of the primary publication for more information.
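
The scenario count follows from crossing the two sets of cases: 6 freshwater TA cases times 3 freshwater DIC:TA cases gives 18 scenarios. The sketch below enumerates such a grid with placeholder labels; the actual TA and DIC:TA values are given in the primary publication and are deliberately not guessed here, and the function and key names are hypothetical.

```python
from itertools import product


def scenario_grid(ta_cases: list, dic_ta_cases: list) -> list[dict]:
    """Cross every freshwater TA case with every freshwater DIC:TA case."""
    return [
        {"freshwater_TA": ta, "freshwater_DIC_TA": ratio}
        for ta, ratio in product(ta_cases, dic_ta_cases)
    ]


# Placeholder labels only; substitute the values from the primary publication.
TA_CASES = [f"TA_case_{i}" for i in range(1, 7)]             # 6 cases
DIC_TA_CASES = [f"DIC_TA_case_{j}" for j in range(1, 4)]     # 3 cases
SCENARIOS = scenario_grid(TA_CASES, DIC_TA_CASES)            # 6 x 3 = 18 entries
```

Iterating over `SCENARIOS` is one way to match each model output file to its pair of freshwater settings, assuming the files are labeled by scenario.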

Production Date: January 26, 2018
Producer: Ben Moore-Maley, UBC EOAS
Distribution Date: May 25, 2018
Distributor: Ben Moore-Maley, UBC EOAS
hdl:11272/10606
1 download
Last Released: May 27, 2018
Concretely Annotated New York Times by Ferraro, Francis; Thomas, Max; Wolfe, Travis; R. Gormley, Matthew; Harman, Craig; Van Durme, Benjamin
Description:

Introduction

Concretely Annotated New York Times was developed by Johns Hopkins University’s Human Language Technology Center of Excellence. It adds multiple kinds and instances of automatically-generated syntactic, semantic and coreference annotations to The New York Times Annotated Corpus (LDC2008T19).

Concrete is a schema for representing structured, hierarchical and overlapping linguistic annotations. This release provides the outputs of multiple tools that produce the same annotation types, represented as different annotation theories under a shared tokenization.

Data

Concretely Annotated New York Times contains all of the 1.8 million articles in The New York Times Annotated Corpus. Those articles were written and published by the New York Times between January 1, 1987 and June 19, 2007; the 2008 corpus also includes metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com.

The following layers of annotation were added by processing the articles under the Concrete schema:

  • Segmented sentences and Penn Treebank-style tokenized words
  • Treebank-style constituent parse trees
  • Four different syntactic dependency trees
  • Named entities
  • Part of speech tags
  • Lemmas
  • In-document entity coreference chains
  • Three different frame semantic parses

See analytics.pdf for the list of tools used to create those annotations.

The data is stored in a binary form called Concrete, which is based on Apache Thrift. Concrete can be read and written in many common programming languages, such as Java, Python, JavaScript and C++. Concrete also includes a number of utilities to access and view the data in human-readable forms.

The original NITF (News Industry Text Format) document structure in The New York Times Annotated Corpus was preserved in this Concrete version.

Production Date: April 16, 2018
Producer: Linguistic Data Consortium (LDC), University of Pennsylvania
Distributor: Linguistic Data Consortium (LDC), University of Pennsylvania
hdl:11272/FGDLB
0 downloads
Last Released: May 4, 2018
 
 
Abacus Dataverse Network - British Columbia Research Library Data Services - Hosted at the University of British Columbia © 2018