Studies
Studies: 1977 | Downloads: 37180
UBC Research Data Management Survey: Health Sciences
by Barsky, Eugene; Brown, Helen; Ellis, Ursula; Ishida, Mayu; Janke, Robert; Menzies, Erin; Miller, Katherine; Mitchell, Marjorie; Vis-Dunbar, Mathew
Description:

In 2016, the Canadian federal funding agencies introduced the Tri-Agency Statement of Principles on Digital Data Management, which advocates developing data management plans (DMPs) and making data available for future research. A data management plan addresses research data types and formats, metadata standards, ethics and legal compliance, data storage and reuse, assignment of data management responsibilities, and resource requirements. Anticipating that DMPs will increasingly be required in grant applications, librarians at the University of British Columbia (UBC) surveyed researchers about their research data management (RDM) practices and needs in three phases, each targeting different disciplines: 1) the Sciences and Engineering (fall 2015), 2) the Social Sciences and Humanities (fall 2016), and 3) the Health Sciences (spring 2017). The surveys illuminate disciplinary differences in RDM and will inform the University in developing infrastructure and services to support researchers in RDM. This report describes findings from the third survey at UBC, which targeted researchers in the Health Sciences.

hdl:11272/10491
3 downloads
Last Released: May 18, 2017
Multi-Language Conversational Telephone Speech 2011 -- Turkish
by Jones, Karen; Graff, David; Walker, Kevin; Strassel, Stephanie
Description:

Introduction

Multi-Language Conversational Telephone Speech 2011 -- Turkish was developed by the Linguistic Data Consortium (LDC) and comprises approximately 18 hours of telephone speech in Turkish.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). LRE 2011 focused on language pair discrimination for 24 languages/dialects, some of which could be considered mutually intelligible or closely related.

LDC has released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:

  • Slavic Group (LDC2016S11)
  • Turkish (LDC2017S09)

Data

Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, of up to 15 minutes, to each acquaintance. The data were collected using LDC's telephone collection infrastructure, which comprises three computer telephony systems. Human auditors labeled calls for callee gender, dialect type and noise. Demographic information about the participants was not collected.

All audio data are presented in FLAC-compressed MS-WAV (RIFF) file format (*.flac); when uncompressed, each file is 2 channels, recorded at 8000 samples/second with samples stored as 16-bit signed integers, representing a lossless conversion from the original mu-law sample data as captured digitally from the public telephone network. The following table summarizes the total number of calls, total number of hours of recorded audio, and the total size of compressed data:

group    lng  #calls  #hours  #MB
turkish  tur      87    18.6  975
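
As a rough consistency check on the figures above, the uncompressed size can be estimated from the stated format (2 channels, 8000 samples/second, 16-bit samples). This is only a sketch: it assumes that the 18.6 hours is the total recorded duration, that both channels span that duration, and that "MB" means 10^6 bytes. The function name is illustrative and not part of the release.

```python
def uncompressed_bytes(hours, sample_rate=8000, channels=2, bytes_per_sample=2):
    """Estimate the raw PCM size, in bytes, of audio with the given duration."""
    seconds = hours * 3600
    return int(round(seconds * sample_rate * channels * bytes_per_sample))


raw_bytes = uncompressed_bytes(18.6)   # figures taken from the table above
raw_mb = raw_bytes / 1_000_000         # assumes 1 MB = 10^6 bytes
ratio = raw_mb / 975                   # uncompressed size / compressed size

print(f"uncompressed: {raw_mb:.0f} MB, compression ratio: {ratio:.1f}x")
```

Under those assumptions the FLAC files are a little under half the size of the uncompressed audio.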

hdl:11272/ACOKK
0 downloads
Last Released: May 18, 2017
Phrase Detectives Corpus
by Chamberlain, Jon; Poesio, Massimo; Kruschwitz, Udo
Description:

Introduction

Phrase Detectives Corpus was developed by the School of Computer Science and Electronic Engineering at the University of Essex and consists of approximately 19,012 words across 40 documents anaphorically-annotated by the Phrase Detectives game, an online interactive "game-with-a-purpose" (GWAP) designed to collect data about English anaphoric coreference.

The use of GWAPs for creating language resources is growing. In general, they employ non-monetary incentives, such as entertainment, to motivate participation and can be successful for large-scale, persistent annotation efforts.

Data

The documents in the corpus are taken from Wikipedia articles and from narrative text in Project Gutenberg. Wikipedia articles and annotation files are presented as XML and Project Gutenberg source files are presented as plain text. All text is encoded as UTF-8. Annotations comprise a gold standard version created by multiple experts, as well as a set created by a large non-expert crowd (via the Phrase Detectives game).

The data was annotated according to a prevalent linguistically-oriented approach for anaphora used in several tasks, including OntoNotes Release 5.0 (LDC2013T19), SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages (LDC2011T01) and The ARRAU Corpus of Anaphoric Information (LDC2013T22).

hdl:11272/QCTAL
0 downloads
Last Released: May 18, 2017
The EventStatus Corpus
by Huang, Ruihong; Jurafsky, Daniel; Riloff, Ellen
Description:

Introduction

The EventStatus Corpus was developed by researchers at Texas A&M University, Stanford University and The University of Utah. It consists of approximately 3,000 English and 1,500 Spanish news articles about civil unrest events annotated with temporal tags.

This corpus was designed to support the study of the temporal and aspectual properties of major events, that is, whether an event has already happened, is currently happening or may happen in the future. Since it focuses on a single domain (civil unrest events), it may be appropriate for tasks such as event extraction and temporal question answering.

Data

The relevant news articles were sourced from English Gigaword Fifth Edition (LDC2011T07) and Spanish Gigaword Third Edition (LDC2011T12). The civil unrest events include protests, demonstrations, marches and strikes. The data was annotated as PAST, ON-GOING or FUTURE and within each of those categories, as PLANNED, ALERT or POSSIBLE.

In addition to the annotated articles, the file lists used in experiments for tuning and testing are included. Ten-fold cross-validation was performed, and the specific 10-fold splits of the test set are included as well. All text is presented as plain text and encoded in UTF-8.
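
For readers unfamiliar with the procedure, the sketch below illustrates what a 10-fold split is. It is a generic illustration, not the authors' code: the function name, the random seed and the file IDs are all hypothetical, and the split lists shipped with the corpus should be used to reproduce the published experiments.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) pairs; each item appears in exactly one test fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k folds in round-robin order.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


doc_ids = [f"doc{n:04d}" for n in range(25)]  # hypothetical file IDs
splits = list(k_fold_splits(doc_ids))

for fold_number, (train, test) in enumerate(splits, start=1):
    print(fold_number, len(train), len(test))
```

Fixing the seed makes the splits reproducible, which serves the same purpose as distributing explicit split lists.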

hdl:11272/UICPL
0 downloads
Last Released: May 18, 2017
Description:

CanMap® Address Points are unique and discrete representations of civic address assignments across Canada. They are the ultimate in answering the question of “where” and an anchor for a single source of accuracy in your mission-critical data. When building your location intelligence solution, this component can represent the single most important geometry feature, providing high precision to your application.

Benefits:

  • Enhance your corporate location intelligence capabilities
  • Provide highest precision coordinates for geocoding
  • Optimize current automated workflows and reduce costly secondary manual processes
  • Provide clear identification of actual addresses within a defined zone of interest
  • Gain new competitiveness by powering your analytics with rooftop precision data features
  • Enhance your end-user experience by getting them curbside to their exact destination

With urban and rural coverage across Canada, the point feature is the ultimate in high-confidence, high-accuracy geographic representations demarcating the physical location of an address.

hdl:11272/10487
4 downloads
Last Released: May 9, 2017
Description:

CanMap® Address Points are unique and discrete representations of civic address assignments across Canada. They are the ultimate in answering the question of “where” and an anchor for a single source of accuracy in your mission-critical data. When building your location intelligence solution, this component can represent the single most important geometry feature, providing high precision to your application.

Benefits:

  • Enhance your corporate location intelligence capabilities
  • Provide highest precision coordinates for geocoding
  • Optimize current automated workflows and reduce costly secondary manual processes
  • Provide clear identification of actual addresses within a defined zone of interest
  • Gain new competitiveness by powering your analytics with rooftop precision data features
  • Enhance your end-user experience by getting them curbside to their exact destination

With urban and rural coverage across Canada, the point feature is the ultimate in high-confidence, high-accuracy geographic representations demarcating the physical location of an address.

hdl:11272/10486
6 downloads
Last Released: May 9, 2017
Description:

The Labour Force Survey provides estimates of employment and unemployment which are among the most timely and important measures of performance of the Canadian economy. With the release of the survey results only 10 days after the completion of data collection, the LFS estimates are the first of the major monthly economic data series to be released.

The Canadian Labour Force Survey was developed following the Second World War to satisfy a need for reliable and timely data on the labour market. Information was urgently required on the massive labour market changes involved in the transition from a wartime to a peacetime economy. The main objective of the LFS is to divide the working-age population into three mutually exclusive classifications - employed, unemployed, and not in the labour force - and to provide descriptive and explanatory data on each of these.

LFS data are used to produce the well-known unemployment rate as well as other standard labour market indicators such as the employment rate and the participation rate. The LFS also provides employment estimates by industry, occupation, public and private sector, hours worked and much more, all cross-classifiable by a variety of demographic characteristics. Estimates are produced for Canada, the provinces, the territories and a large number of sub-provincial regions. For employees, wage rates, union status, job permanency and workplace size are also produced. For a full listing and description of LFS variables, see the Guide to the Labour Force Survey (71-543-G), available through the "Publications" link above.
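
The three headline indicators follow directly from the three classifications. The sketch below uses the standard textbook definitions (unemployment rate as a share of the labour force; employment and participation rates as shares of the working-age population). The function name and the counts are hypothetical, and the Guide to the Labour Force Survey remains the authoritative source for the exact concepts.

```python
def labour_market_rates(employed, unemployed, not_in_labour_force):
    """Return the three headline rates, in percent, from the three LFS groups."""
    labour_force = employed + unemployed
    working_age_population = labour_force + not_in_labour_force
    return {
        "unemployment_rate": 100 * unemployed / labour_force,
        "employment_rate": 100 * employed / working_age_population,
        "participation_rate": 100 * labour_force / working_age_population,
    }


# Hypothetical counts (in thousands), for illustration only.
rates = labour_market_rates(employed=600, unemployed=50, not_in_labour_force=350)

for name, value in rates.items():
    print(f"{name}: {value:.1f}%")
```

Note that the unemployment rate uses the labour force as its denominator, so it is not simply 100 minus the employment rate.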

These data are used by different levels of government for evaluation and planning of employment programs in Canada. Regional unemployment rates are used by Employment and Social Development Canada to determine eligibility, level and duration of insurance benefits for persons living within a particular employment insurance region. The data are also used by labour market analysts, economists, consultants, planners, forecasters and academics in both the private and public sectors.

hdl:11272/10439
23 downloads + analyses
Last Released: May 5, 2017
Transcriptomic correlates of neuron electrophysiological diversity
by Tripathy, Shreejoy; Toker, Lilah; Li, Brenna; Crichlow, Cindy-Lee; Tebaykin, Dmitry; Mancarci, Ogan
Description:

How neuronal diversity emerges from complex patterns of gene expression remains poorly understood. Here we present an approach to understand electrophysiological diversity through gene expression by integrating transcriptomics with intracellular electrophysiology. Using a brain-wide dataset of 34 neuron types, we identified 653 genes whose expression levels significantly correlated with variability in one or more of 11 electrophysiological parameters. The majority of these correlations were further consistent in an independent sample of 12 visual cortex cell types. Many associations reported here have the potential to provide new insights into how neurons generate functional diversity, and correlations of ion channel genes like Gabrd and Scn1a (Nav1.1) with resting potential and spiking frequency are consistent with known causal mechanisms. These results suggest that despite the complexity linking gene expression to electrophysiology, there are likely some general principles that govern how individual genes establish phenotypic diversity across very different cell types.

hdl:11272/10485
1 download
Last Released: May 4, 2017
2010 NIST Speaker Recognition Evaluation Test Set
by Greenberg, Craig; Martin, Alvin; Graff, David; Brandschain, Linda; Walker, Kevin
Description:

Introduction

2010 NIST Speaker Recognition Evaluation Test Set was developed by the Linguistic Data Consortium (LDC) and NIST (National Institute of Standards and Technology). It contains 2,255 hours of American English telephone speech and speech recorded over a microphone channel involving an interview scenario used as test data in the NIST-sponsored 2010 Speaker Recognition Evaluation (SRE).

The ongoing series of yearly SRE evaluations conducted by NIST is intended to be of interest to researchers working on the general problem of text-independent speaker recognition. To this end the evaluations are designed to be simple, to focus on core technology issues, to be fully supported and to be accessible to those wishing to participate.

The 2010 evaluation was similar to the 2008 evaluation in that the training and test conditions for the core test included not only conversational telephone speech (CTS) recorded over ordinary telephone channels, but also CTS and conversational interview speech recorded over a room microphone channel. Unlike prior evaluations, some of the conversational telephone style speech was collected in a manner to produce particularly high, or particularly low, vocal effort on the part of the speaker of interest.

Data

The speech recordings in this release were collected in 2009 and 2010 by LDC at its Human Subjects Collection facility in Philadelphia. This collection was part of the Mixer 6 project, which was designed to support the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones.

The telephone speech segments include two-channel excerpts of approximately 10 seconds and 5 minutes. There are also summed-channel excerpts in the range of 5 minutes. The microphone excerpts are 3-15 minutes in duration. As in prior evaluations, intervals of silence were not removed. The data included in this release is 8-bit mu-law (ulaw) with a sample rate of 8000 Hz.
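
Most processing tools expect linear PCM, so 8-bit ulaw samples are usually expanded to 16-bit values before use. The sketch below shows the standard G.711 mu-law expansion for a single byte. It is an illustration only and is not taken from the release's own tooling; the function name is hypothetical, and the file container and header handling for this release are not covered.

```python
BIAS = 0x84  # 132, the bias used by G.711 mu-law encoding


def ulaw_to_linear(byte):
    """Expand one 8-bit mu-law code (0-255) to a signed linear sample."""
    u = ~byte & 0xFF                  # mu-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07        # 3-bit segment number
    mantissa = u & 0x0F               # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode_ulaw(data):
    """Decode a bytes object of mu-law codes into a list of linear samples."""
    return [ulaw_to_linear(b) for b in data]


samples = decode_ulaw(bytes([0xFF, 0x7F, 0x80, 0x00]))
print(samples)
```

The decoded values fit comfortably in a signed 16-bit integer, which is why mu-law audio is conventionally expanded to 16-bit PCM.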

In addition to evaluation data, this package also includes answer keys, trial and train files, development data and evaluation documentation.

hdl:11272/V7OXL
0 downloads
Last Released: Apr 28, 2017
CHiME2 Grid
by Vincent, Emmanuel; Barker, Jon; Watanabe, Shinji; Le Roux, Jonathan; Nesta, Francesco; Matassoni, Marco
Description:

Introduction

CHiME2 Grid was developed as part of The 2nd CHiME Speech Separation and Recognition Challenge and contains approximately 120 hours of English speech from a noisy living room environment. The CHiME Challenges focus on distant-microphone automatic speech recognition (ASR) in real-world environments.

CHiME2 Grid reflects the small vocabulary track of the CHiME2 Challenge. The target utterances were taken from the Grid corpus and consist of 34 speakers reading simple 6-word sequences.

Data

Data is divided into training, development and test sets. All data is provided as 16-bit WAV files sampled at 16 kHz. The noisy utterances are provided both in isolated form and in embedded form. The latter either involve five seconds of background noise before and after the utterance (in the training set) or are mixed into continuous five-minute noise background recordings (in the development and test sets). Seven hours of noise background not part of the training set are also included. The data is accompanied by one annotation file per speaker that includes additional technical information.

Also included are a baseline Hidden Markov Model (HMM)-based speech recogniser and a scoring tool designed for the 2nd CHiME Challenge, which allow users to obtain keyword recognition scores from formatted result files, perform recognition and score the challenge data, and estimate parameters of speaker-dependent HMMs.

hdl:11272/GJ8WY
0 downloads
Last Released: Apr 27, 2017
 
 
Abacus Dataverse Network - British Columbia Research Library Data Services - Hosted at the University of British Columbia © 2017