The UBC Library data collection includes data from common data sources for easy, one-stop data shopping.
UBC Library Data Services
Studies: 1954 | Downloads: 45798

The population of Metro Vancouver is projected to increase substantially by 2040 (20110729Regional Growth Strategy Projections Population, Housing and Employment 2006 – 2041 File), so finding new sources of drinking water (2015_ Water Consumption_ Statistics File) will be essential. Drinking water supply needs to be optimized and estimated (Data Mining file) to support the region's development. Metro Vancouver currently draws on three reservoirs, Capilano, Seymour, and Coquitlam, from which treated water is supplied to customers. The linear programming (LP) model (Optimization, Sensitivity Report File) gives the amount of drinking water for each reservoir and region. The B.C. government has a specific strategy for accommodating the growing population through 2040. A further factor is the new drinking water source (wells), which needs to be estimated and monitored to anticipate how much water it can feasibly supply until 2040. The government will therefore have to decide how much groundwater is used. The project has two goals: (1) an optimization model for the three water reservoirs, and (2) an estimate of the new water source through 2040.

The data analysis for the project uses Trifacta Wrangler, AMPL, Excel Solver, ArcGIS, and SQL, with visualization in Tableau:
1. Trifacta Wrangler cleans the data (Data Mining file).
2. AMPL and Excel Solver optimize drinking water consumption for Metro Vancouver (data in the Optimization and Sensitivity Report file).
3. ArcMap, part of the ArcGIS software, combines the raw data with the results of the reservoir optimization and the population estimates to 2040 (GIS Map for Tableau file).
4. Tableau, using SQL, visualizes the estimated and optimized sources of drinking water for Metro Vancouver until 2040 (export tableau data file).

Production Date: November 23, 2017
Producer: University of British Columbia (UBC), Master of Engineering Leadership
Last Released: Nov 24, 2017

Orthorectified aerial imagery of the UBC Vancouver campus, 2017. Ortho Pixel size - 10 cm

Production Date: July 04, 2017
Producer: University of British Columbia. Campus and Community Planning. (UBC), University of British Columbia
Distributor: University of British Columbia (UBC), University of British Columbia
Last Released: Nov 24, 2017

The 2016 Census Geographic Attribute File contains information at the dissemination block level, based on 2016 Census standard geographic areas. The data available include population counts, dwelling counts and land area. In addition, the 2016 Census Geographic Attribute File contains higher level standard geographic codes, names and, where applicable, types and classes. Data for higher level standard geographic areas can be derived by aggregating dissemination block-level data. The dissemination area representative point coordinates are also included in the 2016 Census Geographic Attribute File.

This version of the Geographic Attribute File is a dissemination block (DB)-level dataset which also includes data for the following 2016 Census standard geographic areas:

  • province and territory (PR)
  • economic region (ER)
  • census division (CD)
  • census consolidated subdivision (CCS)
  • census subdivision (CSD)
  • designated place (DPL)
  • federal electoral district (FED) (2013 Representation Order)
  • census metropolitan area (CMA), census agglomeration (CA) and census metropolitan influenced zone (MIZ)
  • census tract (CT)
  • population centre (POPCTR) and rural area (RA)
  • aggregate dissemination area (ADA)
  • dissemination area (DA)
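
The abstract notes that data for higher level geographic areas can be derived by aggregating dissemination block-level data. A minimal Python sketch of that aggregation follows. The field names (`db_uid`, `csd_uid`, `population`, `dwellings`) and the sample values are hypothetical and are not the file's actual column names or data.

```python
from collections import defaultdict

# Hypothetical dissemination block (DB) records. Field names and values
# are illustrative assumptions, not the real columns or data of the file.
blocks = [
    {"db_uid": "A001", "csd_uid": "CSD1", "population": 120, "dwellings": 50},
    {"db_uid": "A002", "csd_uid": "CSD1", "population": 80, "dwellings": 35},
    {"db_uid": "B001", "csd_uid": "CSD2", "population": 45, "dwellings": 20},
]


def aggregate(records, key):
    """Sum population and dwelling counts for each value of `key`."""
    totals = defaultdict(lambda: {"population": 0, "dwellings": 0})
    for rec in records:
        group = totals[rec[key]]
        group["population"] += rec["population"]
        group["dwellings"] += rec["dwellings"]
    return dict(totals)


# Roll the block-level counts up to the census subdivision (CSD) level.
csd_totals = aggregate(blocks, "csd_uid")
```

The same function would roll counts up to any other level by passing a different code field as `key`, provided each block record carries that higher level code.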

Production Date: March 03, 2017
Producer: Statistics Canada (Statcan)
Distribution Date: March 03, 2017
Distributor: Statistics Canada (Statcan)
Last Released: Nov 22, 2017

This series of cross-tabulations presents a portrait of Canada based on the various census topics. They range in complexity and are available for various levels of geography.

Production Date: October 25, 2017
Producer: Statistics Canada (Statcan)
Distribution Date: October 25, 2017
Distributor: Statistics Canada (Statcan)
Last Released: Nov 20, 2017

No abstract available.

Production Date: 1976
Producer: Statistics Canada
Distribution Date: November 19, 2009
Last Released: Nov 9, 2017
Abstract Meaning Representation (AMR) Annotation Release 2.0 by Knight, Kevin; Badarau, Bianca; Baranescu, Laura; Bonial, Claire; Bardocz, Madalina; Griffitt, Kira; Hermjakob, Ulf; Marcu, Daniel; Palmer, Martha; O'Gorman, Tim; Schneider, Nathan

Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado’s Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums.

AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.

LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12).


The source data includes discussion forums collected for the DARPA BOLT and DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming from China Central TV, Wall Street Journal text, translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE program. The following table summarizes the number of training, dev, and test AMRs for each dataset in the release. Totals are also provided by partition and dataset:

Dataset                   Training    Dev   Test   Totals
BOLT DF MT                    1061    133    133     1327
Broadcast conversation         214      0      0      214
Weblog and WSJ                   0    100    100      200
BOLT DF English               6455    210    229     6894
DEFT DF English              19558      0      0    19558
Guidelines AMRs                819      0      0      819
2009 Open MT                   204      0      0      204
Proxy reports                 6603    826    823     8252
Weblog                         866      0      0      866
Xinhua MT                      741     99     86      926
Totals                       36521   1368   1371    39260

For those interested in utilizing a standard/community partition for AMR research (for instance in development of semantic parsers), data in the “split” directory contains 39,260 AMRs split roughly 93%/3.5%/3.5% into training/dev/test partitions, with most smaller datasets assigned to one of the splits as a whole. Note that splits observe document boundaries. The “unsplit” directory contains the same 39,260 AMRs with no train/dev/test partition.
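
The rough 93%/3.5%/3.5% split can be checked against the partition totals in the table above. The short Python sketch below does that arithmetic; the variable names are illustrative.

```python
# Partition totals taken from the "Totals" row of the table above.
totals = {"training": 36521, "dev": 1368, "test": 1371}

# Total number of AMRs across all three partitions.
all_amrs = sum(totals.values())

# Share of each partition as a percentage, rounded to one decimal place.
shares = {name: round(100 * count / all_amrs, 1) for name, count in totals.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")
```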

Production Date: June 15, 2017
Producer: Linguistic Data Consortium (LDC), University of Pennsylvania; SDL/Language Weaver, Inc; Computational Language and Educational Research Group, University of Colorado; Information Sciences Institute, University of Southern California
Distributor: Linguistic Data Consortium (LDC), University of Pennsylvania
1 download
Last Released: Nov 6, 2017

The Labour Force Survey provides estimates of employment and unemployment which are among the most timely and important measures of performance of the Canadian economy. With the release of the survey results only 10 days after the completion of data collection, the LFS estimates are the first of the major monthly economic data series to be released.

The Canadian Labour Force Survey was developed following the Second World War to satisfy a need for reliable and timely data on the labour market. Information was urgently required on the massive labour market changes involved in the transition from a war to a peace-time economy. The main objective of the LFS is to divide the working-age population into three mutually exclusive classifications - employed, unemployed, and not in the labour force - and to provide descriptive and explanatory data on each of these.

LFS data are used to produce the well-known unemployment rate as well as other standard labour market indicators such as the employment rate and the participation rate. The LFS also provides employment estimates by industry, occupation, public and private sector, hours worked and much more, all cross-classifiable by a variety of demographic characteristics. Estimates are produced for Canada, the provinces, the territories and a large number of sub-provincial regions. For employees, wage rates, union status, job permanency and workplace size are also produced. For a full listing and description of LFS variables, see the Guide to the Labour Force Survey (71-543-G), available through the "Publications" link above.
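
As an illustration of how the three classifications feed the standard indicators, the Python sketch below computes the unemployment, employment and participation rates from counts of employed, unemployed and not-in-the-labour-force persons. It follows the commonly stated definitions (unemployment rate relative to the labour force, the other two relative to the working-age population). The function name and the sample counts are hypothetical, and the Guide to the Labour Force Survey remains the authoritative source for the definitions.

```python
def labour_indicators(employed, unemployed, not_in_labour_force):
    """Return the three standard rates, in percent, from class counts."""
    # The labour force is everyone employed or unemployed.
    labour_force = employed + unemployed
    # The working-age population is the union of all three classes.
    working_age = labour_force + not_in_labour_force
    return {
        "unemployment_rate": 100 * unemployed / labour_force,
        "employment_rate": 100 * employed / working_age,
        "participation_rate": 100 * labour_force / working_age,
    }


# Hypothetical counts (in thousands), chosen only to keep the arithmetic simple.
rates = labour_indicators(employed=600, unemployed=40, not_in_labour_force=360)
```

With these made-up counts the labour force is 640 out of a working-age population of 1,000, so the unemployment rate is 6.25%, the employment rate 60% and the participation rate 64%.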

These data are used by different levels of government for evaluation and planning of employment programs in Canada. Regional unemployment rates are used by Employment and Social Development Canada to determine eligibility, level and duration of insurance benefits for persons living within a particular employment insurance region. The data are also used by labour market analysts, economists, consultants, planners, forecasters and academics in both the private and public sector.

Important note -- 4 August 2017

Labour Force Survey (LFS) data from January 2017 – July 2017 contained errors in numerical variables. Variables such as HRLYARN and UHRSMAIN were missing decimal place holders. As such, their values were off by a factor of 100. The issue has been addressed and the data for the year have been re-released.

Production Date: January, 2017
Producer: Statistics Canada (Statcan)
Distribution Date: February, 2017
Distributor: Statistics Canada (Statcan)
116 downloads + analyses
Last Released: Nov 3, 2017
Metalogue Multi-Issue Bargaining Dialogue by Petukhova, Volha; Malchanau, Andrei; Oualil, Youssef; Klakow, Dietrich; Stevens, Christopher; Weerd, Harmen de; Taatgen, Niels

Metalogue Multi-Issue Bargaining Dialogue was developed by the Metalogue Consortium under the European Community’s Seventh Framework Programme for Research and Technological Development. This release consists of approximately 2.5 hours of semantically annotated English dialogue data that includes speech and transcripts.

The goal of the Metalogue project was to develop a dialogue system with flexible dialogue management to enable the system’s behavior in setting goals, choosing strategies and monitoring various processes. Participants were involved in a multi-issue bargaining scenario in which a representative of a city council and a representative of small business owners negotiated the implementation of new anti-smoking regulations. The negotiation involved four issues, each with four or five options. Participants received a preference profile for each scenario and negotiated for an agreement with the highest value based on their preference information. Negotiators were not allowed to accept an agreement with a negative value or to share their preference profiles with other participants.


Six unique subjects (undergraduates between 19 and 25 years of age) participated in the collection. The dialogue speech was captured with two headset microphones and saved in 16kHz, 16-bit mono linear PCM FLAC format. Speech signal files are of two types: full dialogue session; and segmented speech signal, cut per speaker and roughly per turn.

Transcripts were produced semi-automatically, using an automatic speech recognizer followed by manual correction.

Seven types of annotation were performed manually using the Anvil tool: dialogue act annotations; discourse structure acts; contact management acts; task management dialogue acts; negotiation moves; rhetorical relations; and disfluencies in speech production. More information about the annotation process is included in the documentation.

All text is presented in UTF-8 as either plain text or XML.

Production Date: July 18, 2017
Producer: Linguistic Data Consortium (LDC), University of Pennsylvania
Distributor: Linguistic Data Consortium (LDC), University of Pennsylvania
Last Released: Nov 2, 2017
KSUEmotions by Meftah, Ali Hamid; Alotaibi, Yousef Ajami; Selouani, Sid-Ahmed

KSUEmotions was developed by King Saud University (KSU) and contains approximately five hours of emotional Modern Standard Arabic (MSA) speech from 23 subjects. Speakers were from three countries: Yemen, Saudi Arabia and Syria.

Subjects read MSA sentences from newswire text in the following emotions: neutral, anger, sadness, happiness, surprise, and interrogative (asking a question). Human reviewers then listened to the recordings to identify the emotion they heard…


Audio was recorded in each participant’s home. Audio is presented as 16-bit 16 kHz flac compressed wav. In addition to speech files and metadata about the speakers, timeless label files and automatic time segmentation alignment files are included. Text is presented as UTF-8 plain text.

Production Date: July 18, 2017
Producer: Linguistic Data Consortium (LDC), University of Pennsylvania
Distributor: Linguistic Data Consortium (LDC), University of Pennsylvania
Last Released: Nov 2, 2017
SRI-FRTIV by Shriberg, Elizabeth; Kathol, Andreas; Graciarena, Martin; Bratt, Harry; Kajarekar, Sachin; Jameel, Huda; Richey, Colleen; Goodman, Fred

SRI-FRTIV (Five-way Recorded Toastmaster Intrinsic Variation) was developed by SRI International in 2007-2008 and consists of approximately 232 hours of English speech from thirty-four speakers who were members of Toastmasters clubs. Participants were asked to speak at three different levels of effort (low, normal and high) in four different styles (interview, conversation, reading and oration) to study how intrinsic variations -- associated with the speaker rather than the recording environment -- affect text-independent speaker verification.


Participants were native speakers of North American English who were members of local Toastmasters clubs and had experience in public speaking. This release includes demographic information for 30 speakers (15 male, 15 female), including gender, birth year, height, education level, years in Toastmasters, and a self-evaluation of speaking skills.

Not all effort levels were applicable to each speaking style, so some combinations were not collected. Interviews and phone conversations were not recorded at high effort, and oration was not recorded at low or normal effort levels.

Speech data is presented as 16kHz 16-bit single channel flac compressed pcm wav (.flac).

Production Date: September 14, 2017
Producer: Linguistic Data Consortium (LDC), University of Pennsylvania
Distributor: Linguistic Data Consortium (LDC), University of Pennsylvania
Last Released: Nov 2, 2017
Abacus Dataverse Network - British Columbia Research Library Data Services - Hosted at the University of British Columbia © 2017