Big data

Big data's role in predictive modelling & clinical surveillance
See also Altmetrics | Bioinformatics | Data science | Data management portal | e-Science | Open data | Semantic search | Text-mining

"‘Big data’ is dead. What’s next? ...various vendors are using the term ‘smart data’"
"Big data refers to the massive amounts of data around us, which can be aggregated and measured by technological advances in micro- and nano-electronics, nano materials, interconnectivity telecommunication infrastructure, massive network-attached storage capabilities, and commodity-based high-performance computing. All credit card transactions, cell phone traffic, e-mail traffic, video and images from networks of surveillance devices, satellite and ground sensing data for weather and climate, now generate massive data and information. Personal health information related to genome sequencing and extensive imaging in medicine has driven a revolution in data analytics and predictive models that inform decision making whether identifying security threats or making diagnoses and treatment decisions for patients." (Schadt, 2012)

“Big data” (also enterprise big data, smart data and even data science) is the buzzword or catch-phrase of 2015-2016, and for good reason. Almost everything, it seems, is being digitized, so huge data sets are available to researchers and data scientists. How do researchers use this data? Having data just a few clicks away is appealing, but when the data is not structured in a way that makes it easy to search or extract, access remains problematic. There are also questions about ownership, management, preservation, and the rights that the library offering the data may or may not have regarding access.

In simple terms, big data refers to the tools, processes and procedures that permit the creation, manipulation and management of large data sets. A new data paradigm has thus emerged with broad application in research areas such as medical informatics, bioinformatics, genome searching and other data-driven research, where large volumes of data are transformed into knowledge. The term is contested in some circles, though it is widely used to refer to the availability and use of data, broadly speaking, in structured and unstructured formats. At least one expert says that other terms, such as data curation and data science, are preferable. The "big data" era is a result of search and discovery technologies that can extract value from massive amounts of information. "Big data" is connected to health care, Silicon Valley, e-commerce and the private sector, where it is used competitively to predict market growth. The prediction is that we create and replicate 2.8 ZB (zettabytes; 2.8 million million gigabytes) of data, and that the ‘digital universe’ will reach 40 ZB by 2020.
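The zettabyte figures above can be checked with a little arithmetic. The sketch below assumes decimal (SI) units, where 1 GB is 10^9 bytes and 1 ZB is 10^21 bytes; the function name `zb_to_gb` and the variable names are illustrative only and do not come from the cited prediction.

```python
# Decimal (SI) unit sizes in bytes. This is an assumption: binary units
# (GiB, ZiB) would give slightly different numbers.
BYTES_PER_GB = 10**9
BYTES_PER_ZB = 10**21

def zb_to_gb(zettabytes):
    """Convert a size in zettabytes to gigabytes using SI prefixes."""
    return zettabytes * (BYTES_PER_ZB // BYTES_PER_GB)

current_gb = zb_to_gb(2.8)    # about 2.8e12 GB, i.e. 2.8 million million GB
projected_gb = zb_to_gb(40)   # 4e13 GB, i.e. 40 million million GB
growth_factor = 40 / 2.8      # a bit more than a 14-fold increase
```

On these assumptions, the predicted growth from 2.8 ZB to 40 ZB is a little more than a fourteen-fold increase.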

The debate about big data and its value for innovation and growth is prominent in the pharmaceutical and medical industries. Big data is associated with open data and text-mining, and linked to the clinical and patient data harvestable from patient records and related health systems. It goes beyond the literature and refers to the vast stores of data in databases, especially clinical and research data from clinical trials, most of which is waiting to be mined. According to the McKinsey Global Institute, several major domains are affected by big data: 1) healthcare in the United States, 2) the public sector in Europe, 3) the retail sector in the United States, and 4) manufacturing and personal-location data globally. The Harvard Business Review has said that data scientist is an emerging profession; McAfee says that "...big data is far more powerful than the analytics of the past. Executives can measure and therefore manage more precisely than ever before. They can make better predictions and smarter decisions." One of the related areas of big data in medicine is translational medicine; another is data fabrication. Ross & Krumholz argued in 2013 for sharing clinical trials data more openly.

See this editorial by two UK medical librarians: Tattersall A, Grant MJ. Big Data - What is it and why it matters. Health Info Libr J. 2016 Jun;33(2):89-91.

Borgman on #bigdata and libraries

Big Data 100

The BigData Top100 List initiative is an open community-based effort for benchmarking big data systems.

Examples of Big Data products


Key websites & projects

  • The Data Citation Index from Thomson Reuters is the first source of data discovery for the sciences, social sciences and arts and humanities; DCI indexes leading data repositories of interest to the scientific community, including two million data studies and datasets
  • Apply open innovation to conceptualizing novel approaches to “big data” from various US government agencies, e.g., health, energy and earth sciences
  • Brings global community of data scientists, data technologies, data visualisers and data businesses together from commercial, financial, social and tech sectors
  • operated by the National Research Council of Canada with the support of the Canadian Space Agency
  • a starting point for large datasets in particle physics, astrophysics, etc
  • CSAIL looks at the issue of big data as "fundamentally multi-disciplinary"; the MIT team includes faculty and researchers across technology areas, including algorithms, architecture, data management, machine learning, privacy and security, user interfaces, and visualization; also domain experts in finance, medical, smart infrastructure, education and science
  • Databib is a tool for helping people identify and locate online repositories of research data
  • helping you to find, access and use data
  • The Institute for Empowering Long Tail Research is conceptualizing and building tools and services to help small laboratories make full use of their valuable research data
  • The website (and book) revolves around the idea that the planet is a developing nervous system of data that will have a greater impact on our lives than the Internet
  • videos about IBM's Big Data Initiative; demos, interviews, presentations, tutorials and more; according to IBM, big data spans four dimensions: volume, velocity, variety and veracity
  • A Wired article about the data scientist and data explorer
  • Research project which aims to understand how people make sense of big data visualizations
  • News and events about big data in a blog format


