Data management

From HLWIKI Canada
Does the rise in data management represent the move into a 4th "data-intensive" paradigm?


Last Update

  • This entry is out of date, and will not be updated, August 2018


See also Bioinformatics | Data management portal | Data preservation | e-Science | Open data | Research Portal for Academic Librarians | Semantic web | Text-mining

"... data curation is the "active and on-going management of data through its lifecycle of interest and usefulness to scholarly and educational activities across the sciences, social sciences, and humanities ... it is an emerging field that brings new opportunities and challenges for libraries. The growing movement to effectively manage, archive, preserve, retrieve and reuse research data is one that compliments traditional library missions ..." — CIRSS, Data Curation Education Program 2012

Data management (also called research data management, data curation and data science) refers to maintaining the accessibility, storage and preservation of data for research purposes. Authors writing in the field say that curating data involves selection and appraisal, much as the archiving of documents and records does. Data management covers a range of issues such as intellectual access, redundant storage, data transformation and preservation. According to Stuart (2010) "...if we [plan] to be relevant in the age of Google we must move beyond documents, and facilitate access to increased amounts of data on the web. ..." Academic libraries have taken on responsibility for coordinating data and have made it part of their long-term mandate. Global research in health and medicine, including clinical trials, is born-digital and increasingly data-driven.

Issues & concerns about data

Most researchers are not trained to manage digital data, let alone to deal with data policies, digital preservation, data sharing or the systems that make sharing possible. Data curation is also worth exploring because every knowledge worker can take practical steps to make their data safer and better organized. Academic librarians should consider seeking out courses and workshops on the fundamentals of data management, especially file management, file naming and organization systems and conventions, storage and backup practices, documentation, and ways of keeping files usable in the future. These practices help the average researcher with their data and are useful to anyone with digital files.
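Two of the practices mentioned above, consistent file naming and verified backups, can be illustrated with a minimal Python sketch. The naming pattern (project, description, ISO date, two-digit version) and the function names are assumptions chosen for illustration, not a convention prescribed by any of the sources cited in this entry.

```python
import hashlib
import re
from datetime import date
from pathlib import Path


def standard_name(project: str, description: str, day: date, version: int, ext: str) -> str:
    """Build a file name such as 'trial-x_baseline-survey_2018-08-01_v02.csv'.

    The pattern itself is a hypothetical convention used only for illustration.
    """
    def slug(text: str) -> str:
        # Lowercase, then collapse every run of non-alphanumeric characters into '-'.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    clean_ext = ext.lstrip(".").lower()
    return f"{slug(project)}_{slug(description)}_{day.isoformat()}_v{version:02d}.{clean_ext}"


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original: Path, backup: Path) -> bool:
    """True when the backup copy has the same checksum as the original."""
    return sha256_of(original) == sha256_of(backup)
```

Putting the date in ISO order (year-month-day) makes files sort chronologically by name, and comparing checksums is one simple way to confirm that a backup copy has not been truncated or corrupted.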


The Fourth Paradigm is a term connected to e-data. It refers to the vision of pioneering computer scientist Jim Gray for a new, fourth paradigm of discovery based on data-intensive science; the extensive monograph of the same name offers insights into how it can be fully realized. To take a look at a guide primarily geared toward researchers and data librarians, see here.

In 2006, Harvard University created the Dataverse Network Project, which is a "...repository for research data that takes care of long term preservation and good archival practices, while researchers can share, keep control of and get recognition for their data. Dataverse also supports the sharing of research data with a persistent data citation, and enables reproducible research." A related data initiative is REDCap, a free, web-based and user-friendly electronic data capture (EDC) tool for research studies, developed at Vanderbilt University. More recently, in 2013, the Council on Library and Information Resources published Research data management: principles, practices and prospects, which outlines the emerging landscape of research data management responses and interventions in the United States. Wikidata, an offshoot of Wikipedia, is an interesting new data repository that feeds information to Wikipedia. For clinicians interested in tracking "missing data", see Missing Data UK.

What do we mean by research data and data management planning?

Data has enormous value if it is managed well and made accessible. Research data may be defined as the information (e.g. data sets, microarrays, numerical data, clinical trial information, textual records, images, sound, etc.) generated or used as quantitative evidence in primary biomedical research. This research data is distinguished by the fact that it is accepted by the research community as a means to validate research findings, observations and hypotheses. According to CARL/ABRC, the majority of research data produced by academic institutions in Canada is not being properly or systematically archived in repositories. This suggests that a more concerted effort is needed to bring together experts at Canadian academic institutions to initiate data management projects. A study conducted by the Social Sciences and Humanities Research Council (SSHRC) found that only 3 of 110 Canadian organizations systematically archive data, and in all three cases the data was archived in the US. Research data generated in Canada is not managed properly, and much of it is under-utilized or inaccessible. While some disciplines and research areas have institutional, national and international supports for data curation, this support is neither comprehensive nor well known.

Canadian projects & websites

  • DataCite Canada, NRC is part of an international collaboration to improve access to research data by enabling organizations to register datasets and assign digital object identifiers (DOIs). Research data is defined as any research output that has not been published before, such as raw data, slide presentations, lab notes, etc. CISTI is responsible for assigning unique identifiers for Canadian data sets. CISTI is not yet ready to accept data sets itself, but it plans to assign DOIs to data and to work with data centres in Canada interested in participating in DataCite.
  • Research Data Canada, Government of Canada
  • Research Data Canada is a collaborative effort to address the challenges and issues surrounding the access and preservation of data arising from Canadian research. This multi-disciplinary group of universities, institutes, libraries, granting agencies, and individual researchers has a shared recognition of the pressing need to deal with Canadian data management issues from a national perspective.
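Because DOIs are central to registering and citing datasets, a minimal Python sketch of handling them may be useful. It strips common prefixes, applies a loose shape check and builds a resolver link through https://doi.org/. The function names, the pattern and the example DOI are illustrative assumptions; the check is deliberately loose and does not confirm that a DOI is actually registered.

```python
import re

# Loose shape check (an assumption, not the full DOI syntax):
# "10.", a 4-9 digit registrant code, "/", then a non-empty suffix without spaces.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes people commonly paste in front of a DOI.
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Return the bare DOI, or raise ValueError if the text is not DOI-shaped."""
    doi = raw.strip()
    for prefix in PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a DOI-shaped string: {raw!r}")
    return doi


def doi_url(raw: str) -> str:
    """Build a resolver link for a DOI."""
    return "https://doi.org/" + normalize_doi(raw)


# Hypothetical DOI, used only to show the call.
print(doi_url("doi:10.1234/example.dataset.v1"))
```

Storing the bare DOI and generating the link on demand keeps dataset records independent of any particular resolver address.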

Data management courses

  • MANTRA (or Management Training) is an integral part of the University of Edinburgh's Research Data Management (RDM) programme. EDINA and the Data Library at the University of Edinburgh curate this resource, based on internal and external feedback. MANTRA is an open, web-based training course intended for self-paced learning by students, researchers and others who manage digital data. It illustrates good practice in research data management with real-life stories. Librarians’ training needs are met through a companion resource, the DIY RDM Training Kit for Librarians.

Data literacy

  • "Data literacy must also include the ability to do something with raw information - to process it in some way. In an era where spreadsheets help us to make the grandest of decisions, we must have basic statistical literacy and fluency in the tools that allow us to make sense out of numerical data, not just words and ideas." ~ Johnson, "The Information Diet: A Case for Conscious Consumption"
  • Khan Academy. Statistics (4 stars; US): basics of reading and interpreting data; descriptive and inferential statistics covered in an introductory course

International projects & websites

4 stars denote librarian-selected, high-quality information. Starred sites are great places to begin your research.
Managing data is central to health care
  • CSAIL looks at the issue of big data as "fundamentally multi-disciplinary"; the MIT team includes faculty and researchers across related technology areas, including algorithms, architecture, data management, machine learning, privacy and security, user interfaces, and visualization; as well as domain experts in finance, medical, smart infrastructure, education and science
  • Databib is a tool for helping people identify and locate online repositories of research data
  • helping you to find, access and use data
  • DataCite Canada's services are offered in cooperation with DataCite, an international consortium of national-scale libraries and research organizations committed to increasing access to research data on the Internet
  • DataCite Canada is DataCite's DOI allocation agent for Canada
  • DataCite promotes the value of data archiving, citation and discoverability within Canada
  • table lists NIH-supported data repositories that accept submissions of appropriate data from NIH-funded investigators (and others). Also included are resources that aggregate information about biomedical data and information sharing systems
  • DMPTool adheres to National Institutes of Health (NIH) data sharing requirements
  • DMPTool provides step-by-step guidance to help users create ready-to-use data management plans and meet funder data management requirements. While anyone can create an account and use this resource, many institutions have partnered with the DMPTool to allow login through their home institution, and, in some cases have provided customized help and support
  • Dryad is an international repository of data underlying peer-reviewed articles in the basic and applied biosciences. Dryad enables scientists to validate published findings, explore new analysis methodologies, repurpose data for research questions unanticipated by the original authors, and perform synthetic studies. Dryad also aims to make data archiving as simple as possible via a suite of services not necessarily provided by publishers or institutional websites.
  • Thelwall M, Kousha K. Do journal data sharing mandates work? Life sciences evidence from Dryad. Aslib Journal of Information Management. 2017 Jan 16;69(1).
  • a collaborative project devoted to educating science and medical librarians on e-Science, the portal was initiated at the University of Massachusetts Medical School through funding from the National Network of Libraries of Medicine
  • a vision of pioneering computer scientist Jim Gray for a new fourth paradigm of discovery based on data-intensive science; this extensive monograph offers insights into how it can be fully realized
  • U.S. federal government initiatives to make data more accessible for monitoring, assessment and policy development
  • access to high quality data improves understanding of a community’s health status and determinants
  • provide a single, user-friendly, source for national, state, and community health indicators
  • minimize duplication of effort in provision of digital preservation training and education programmes
  • describe, promote and contextualize current training and education offerings
  • identify and exploit collaborative training and education opportunities
  • maximize inter-disciplinary training and education opportunities
  • develop a shared digital preservation training infrastructure to enable reuse of training and education materials
  • ensure synergy and complementarity between emerging curation and preservation education programmes with professional development training courses
  • a research and teaching unit at Harvard University dedicated to exploring and expanding the frontiers of networked culture in the arts and humanities
  • a social web site for researchers sharing research objects such as scientific workflows
  • aims to solve name ambiguity problem in scholarly communications by creating a registry of persistent unique identifiers for individual researchers and an open and transparent linking mechanism between ORCID, other ID schemes, and research objects such as publications, grants, and patents
  • aimed at helping researchers share biomedical data and models; PhysiomeSpace has just completed its beta implementation and is open to users
Data Curation Continuum (Treloar, 2007)
  • centralized, standards-compliant, public repository for proteomics data; developed to provide the proteomics community with a repository for protein and peptide identifications together with the evidence supporting them, including details of post-translational modifications coordinated relative to the peptides in which they have been found
  • Find, use and share numerical data; search over 8,000,000 financial, economic, and social datasets

  • Need to create a data plan for a grant proposal? Find out what to include & see examples.
  • Wolfram Alpha provides access to a world of factual data, without searching, calling itself the first computational knowledge engine. On the web, there is increased emphasis on repositories of data maintained by national or international agencies, organizations and individuals. Wolfram Alpha now hosts the Wolfram Data Summit to bring together those responsible for data repositories and to develop innovative concepts for the future.
  • provide all users with improved access to World Bank data and to make that data easy to find and use

Data Information Literacy at Purdue

In partnership with librarians at the University of Minnesota, University of Oregon and Cornell University, the Purdue University Libraries received $250,000 from IMLS to develop programs that enable the next generation of scientists to find, organize and share data. The program is intended for graduate students in science working their way toward careers as research scientists. As of 2012, technology makes it easier to share research data beyond the lab, but in many cases data is not managed in ways that allow it to be easily discovered, understood or re-purposed by others. This training is vital to scientists as they look to secure research funding. The National Science Foundation issued a report in 2007 on the need to build public collections of research data; since 2011, it has required scientists to include data management plans in their grant applications.

The Data Information Literacy effort will be carried out over two years by five teams, each consisting of a data librarian, a subject librarian and a faculty researcher. Two teams are based at Purdue, with one team at each of the other institutions. The teams are constructed to represent various subjects, from computer engineering to landscape architecture, so that commonalities and differences in data curation can be explored. Each team will assess the data needs of its discipline, including by interviewing and observing researchers. Teams will then develop and implement targeted instruction and assess its impact on the data information literacy skills of graduate students.

More information on the data information literacy project is available at

See also Indiana University-Purdue University Indianapolis. Data Services Program

Data storage costs and data curation in libraries

