Studies
Studies: 2091 | Downloads: 58470
GALE Phase 4 Arabic Broadcast News Speech by Walker, Kevin; Caruso, Christopher; Maeda, Kazuaki; DiPersio, Denise; Strassel, Stephanie
Description:

GALE Phase 4 Arabic Broadcast News Speech was developed by the Linguistic Data Consortium (LDC) and comprises approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009 by LDC; MediaNet (Tunis, Tunisia); and MTC (Rabat, Morocco) during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Arabic Broadcast News Transcripts (LDC2018T14).

Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites: Hong Kong University of Science and Technology (Chinese), Medianet (Arabic) and MTC (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

LDC’s local broadcast collection system is highly automated, easily extensible and robust, and is capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular. All signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output. An overview of the system, the sources recorded and the configuration of the recording laboratory is contained in the Guidelines for Broadcast Audio Collection Version 3.0 included in this release.

LDC designed a portable platform for remote broadcast collection. This is a TiVo-style digital video recording (DVR) system that records two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside of the United States. It has a small footprint, weighs less than 30 pounds and can be transported as carry-on luggage.

Medianet collected Arabic programming from across the Gulf region using its internal system and LDC’s portable broadcast collection platform installed in 2008. The portable platform deployed at the Medianet Tunisian collection facility collected multiple streams of regional Arabic programming from various sources. MTC collected Arabic programming using its internal collection system.

Data

The recordings in this release feature news broadcasts focusing principally on current events from the following sources: Abu Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates; Al Arabiya, a news television station based in Dubai; Al Baghdadya, an Iraqi broadcast programmer; Alhurra, a U.S. government-funded regional broadcaster; Al Iraqiyah, an Iraqi television station; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Kuwait TV, a national broadcast station based in Kuwait; Radio Sawa, a U.S. government-funded regional broadcaster; Saudi TV, a national television station based in Saudi Arabia; Syria TV, the national television station in Syria; and Yemen TV, a television station based in Yemen.

This release contains 51 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program’s genre, data type and topic.
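As a rough illustration of what the stated audio format implies, the following sketch estimates the uncompressed PCM size of the collection. It assumes exactly 37 hours of audio (the description says "approximately"), and the function name pcm_bytes is hypothetical. FLAC is lossless compression, so the files as distributed will be smaller than this figure.

```python
def pcm_bytes(duration_s: int, sample_rate_hz: int = 16000,
              channels: int = 1, bits_per_sample: int = 16) -> int:
    """Uncompressed PCM size in bytes for audio of the given duration."""
    bytes_per_sample = bits_per_sample // 8
    return duration_s * sample_rate_hz * channels * bytes_per_sample


# Assumption: exactly 37 hours; the release states "approximately 37 hours".
total = pcm_bytes(37 * 3600)
print(total)                        # 4262400000
print(f"{total / 1e9:.2f} GB")      # 4.26 GB
```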

hdl:11272/TSKSR
0 downloads
Last Released: Jun 20, 2018
GALE Phase 4 Arabic Broadcast News Transcripts by Glenn, Meghan; Lee, Haejoong; Strassel, Stephanie; Maeda, Kazuaki
Description:

GALE Phase 4 Arabic Broadcast News Transcripts was developed by the Linguistic Data Consortium (LDC) and contains transcriptions of approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009 by LDC; MediaNet (Tunis, Tunisia); and MTC (Rabat, Morocco) during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 4 Arabic Broadcast News Speech (LDC2018S05).

The recordings for transcription feature news broadcasts focusing principally on current events from the following sources: Abu Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates; Al Arabiya, a news television station based in Dubai; Al Baghdadya, an Iraqi broadcast programmer; Alhurra, a U.S. government-funded regional broadcaster; Al Iraqiyah, an Iraqi television station; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Kuwait TV, a national broadcast station based in Kuwait; Radio Sawa, a U.S. government-funded regional broadcaster; Saudi TV, a national television station based in Saudi Arabia; Syria TV, the national television station in Syria; and Yemen TV, a television station based in Yemen.

Data

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 204,735 tokens. The transcripts were created with the LDC tool XTrans, which supports manual transcription and annotation of audio recordings. XTrans is available at https://www.ldc.upenn.edu/language-resources/tools/xtrans/downloads.

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release. QTR transcription consists of quick (near-) verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR in the filename were developed using QTR transcription. Files with QRTR in the filename were developed using QRTR transcription.
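The filename convention described above can be applied programmatically when sorting transcripts by annotation style. The following is a minimal sketch. The function name and the sample filenames are hypothetical, and matching the markers case-insensitively is an assumption, since the exact naming pattern is not given here.

```python
from pathlib import PurePath


def transcription_style(filename: str) -> str:
    """Return 'QRTR', 'QTR', or 'unknown' from the marker in the filename."""
    # Assumption: markers may appear in either case, so compare in upper case.
    name = PurePath(filename).name.upper()
    # Neither marker is a substring of the other, so the order of these
    # checks does not affect the result.
    if "QRTR" in name:
        return "QRTR"
    if "QTR" in name:
        return "QTR"
    return "unknown"


# Hypothetical filenames, for illustration only.
for sample in ["EXAMPLE_SHOW_A.qrtr.tdf", "EXAMPLE_SHOW_B.qtr.tdf", "notes.txt"]:
    print(sample, "->", transcription_style(sample))
```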

hdl:11272/G0A6N
0 downloads
Last Released: Jun 20, 2018
Rhythm and Pitch by Dilley, Laura C.; Breen, Mara; Brown, Meredith; Gibson, Edward
Description:

Rhythm and Pitch contains approximately 27 minutes of spontaneous English conversations and radio news stories annotated with the Rhythm and Pitch (RaP) scheme. Speech data for annotation was taken from two corpora released by LDC, CALLHOME American English Speech (LDC97S42) and Boston University Radio Speech Corpus (LDC96S36).

The RaP system permits the capture of both intonational and rhythmic aspects of speech. Four labeling tiers are used for annotating speech prosody. These tiers carry information about the syllabic organization and orthography of the speech, its rhythmic structure, tonal patterns, and other information. More information about the RaP system is available on the RaP homepage.

Data

Speech data are presented as FLAC-compressed 16-bit WAV files. The Boston data are one-channel 16 kHz files, while the CALLHOME data are either one- or two-channel 8 kHz files. Annotations are UTF-8 encoded Praat TextGrids.

hdl:11272/FZBLL
0 downloads
Last Released: Jun 20, 2018
Description:

The "Direction Of Trade Statistics" data file contains annual, quarterly, and monthly IMF time series of country-level distributions of exports and imports, from 1948 to the latest received -- currently December 1997.

hdl:11272/VG4LS
0 downloads
Last Released: Jun 20, 2018
Description:

Dissipation Rate (ε, units: W/kg) of Turbulent Kinetic Energy, derived from microstructure measurements of velocity shear and temperature gradients. Shear and temperature-gradient turbulence spectra are also included, as are CTD measurements and key glider variables. All measurements are taken from a Slocum G2 ocean glider.

hdl:11272/10613
0 downloads
Last Released: Jun 15, 2018
Description:

Canada’s rapidly changing demographic profile, along with its accompanying social and economic issues, has led to much discussion concerning the relationship between work, lifestyle and well-being. Gauging the quality of life at work can help diagnose issues relating to productivity, morale, efficiency and equity. Charting patterns of home and leisure activities can take the temperature of Canadian culture. Bringing these two together will provide insight on the health and well-being of Canadians as they meet the challenges of the future.

The General Social Survey Program’s new cycle, Canadians at Work and Home, takes a comprehensive look at the way Canadians live by incorporating the realms of work, home, leisure, and overall well-being into a single unit. Data users have expressed a strong interest in knowing more about the lifestyle behaviours of Canadians that impact their health and well-being both in the workplace and at home. The strength of this survey is its ability to take diverse information Canadians provide on various facets of life and combine them in ways not previously possible with surveys that covered one main topic only.

The survey includes a multitude of themes. In the work sphere, it explores important topics such as work ethic, work intensity and distribution, compensation and employment benefits, work satisfaction and meaning, intercultural workplace relations, and bullying and harassment. On the home front, questions include family activity time, the division of labour and work-life balance. The survey also covers eating habits and nutritional awareness, the use of technology, sports and outdoor activities, and involvement in cultural activities. New-to-GSS questions on purpose in life, opportunities, life aspirations, outlook and resilience complement previously asked ones on subjective well-being, stress management and other socioeconomic variables.

Within Canada, all levels of government, academics and not-for-profit organizations have expressed interest in the results. Data from this survey will assist with program and policy decisions and research of all kinds interested in exploring the workplace, home life and leisure activities of Canadians from all areas of life. In addition, some of the data from this survey will be comparable internationally.

hdl:11272/10612
0 downloads + analyses
Last Released: Jun 12, 2018
Description:

LFS data are used to produce the well-known unemployment rate as well as other standard labour market indicators such as the employment rate and the participation rate. The LFS also provides employment estimates by industry, occupation, public and private sector, hours worked and much more, all cross-classifiable by a variety of demographic characteristics. Estimates are produced for Canada, the provinces, the territories and a large number of sub-provincial regions. For employees, data on wage rates, union status, job permanency and establishment size are also produced.

These data are used by different levels of government for evaluation and planning of employment programs in Canada. Regional unemployment rates are used by Employment and Social Development Canada to determine eligibility, level and duration of insurance benefits for persons living within a particular employment insurance region. The data are also used by labour market analysts, economists, consultants, planners, forecasters and academics in both the private and public sector.

hdl:11272/10575
28 downloads + analyses
Last Released: Jun 12, 2018
Monitoring changes in the Gene Ontology and their impact on genomic data analysis. by Jacobson, Matthew; Sedeño-Cortés, Adriana Estela; Pavlidis, Paul
Description:

Data and analysis of Gene Ontology annotations, to support reproducibility of results presented in the above-cited preprint. There are two major parts to the data. The first is an analysis of the contents of the database supporting https://gotrack.msl.ubc.ca/ and represents direct downloads of files from that site at the time of our analysis. The second, concerning the analysis of the effects of changes in GO over time on enrichment analysis, includes Python scripts and intermediate data and analysis files.

hdl:11272/10596
1 download
Last Released: Jun 1, 2018
Description:

These data include biological (i.e. animal counts) and environmental data from three remotely operated vehicle (ROV) imagery surveys in Saanich Inlet, a fjord incising Vancouver Island, British Columbia, Canada. All three dives occurred in 2016. Data are organized by row, with each row representing a second of ROV video with metadata related to the ROV dive, water column data (e.g. temperature, salinity, dissolved oxygen, etc.) and species data (animal counts or binary presence/absence tallies). Data were collected as part of an MSc thesis (Gasbarro) and a Canadian Healthy Oceans Network II project (2.1.3).

hdl:11272/10609
1 download
Last Released: May 31, 2018
Description:

This dataset contains three types of data associated with Gasbarro et al. (2018) Marine Ecology Progress Series, collected from remotely operated vehicle imagery surveys in Douglas Channel, British Columbia, Canada. Biological data (i.e. animal abundance and size measurements) from photoquadrats, ROV/water column data (e.g. temperature, depth, dissolved oxygen, etc.) from four dives, and seasonal along-channel current data from a year-long deployment of two moored Acoustic Doppler Current Profilers comprise the three data types. Data was collected as part of an MSc thesis (Gasbarro).

hdl:11272/10608
0 downloads
Last Released: May 30, 2018
Abacus Dataverse Network - British Columbia Research Library Data Services - Hosted at the University of British Columbia © 2018