Text-mining

[Image: Statistics have a role in predictive modelling & clinical surveillance]

Last Update

  • 5 May 2016

Introduction

See also Big data | Bioinformatics | Data preservation | Data visualization | e-Science | ImpactStory | Open data | PubMed Alternative Interfaces | Semantic search

"...text mining is the indexing of content. Words that are part of a fixed vocabulary are found within a text and extracted to create an index that shows where in the text each word was found. The index can be used in the traditional way to locate the parts of the text that contain those words. The index can also be used as a database and analysed to discover patterns: for example, how often certain words occur. In simple terms, text mining is the process that turns text into data that can be analysed." — Clark, 2013

Text-mining (also referred to as data mining or content-mining) is the process of discovering and extracting text-based content from unstructured, miscellaneous data. Text-mining is often mentioned in the context of several information-age trends such as big data, bioinformatics, data curation, e-Science and the semantic web. A number of social media monitoring tools currently perform various types of text-mining. In 2013, the US government acknowledged that it extracts data from the e-mails and telephone calls of American citizens, referring to this process (which includes text-mining) as its metadata program.

Typically, text-mining comprises three major activities: 1) information retrieval (IR) to gather relevant unstructured text from heterogeneous databases, documents and websites; 2) information extraction (IE) to identify and extract entities, facts and relationships among those entities; and 3) data-mining to find associations among the pieces of information extracted from the texts located (a minimal sketch of this pipeline follows the list of reasons below). The goal of text-mining is to extract and discover knowledge hidden in text by identifying concepts, extracting facts and relationships, discovering implicit links and generating hypotheses. One of the main reasons text-mining may be important is that it helps to deal with the information overload created by blogs, wikis, clinical data, surveys, heterogeneous databases and the web. Text-mining is especially useful in areas where large collections of data and information are held in documents. Scientific applications developed with the help of text-mining include drug discovery, predictive toxicology, competitive intelligence and patent searching.

Other reasons why text-mining may be important are:

  • Biomedical science is inundated with data, datasets and information of various kinds
  • Much of the information is in an unstructured format (text)
  • There are as many text types, genres and domains as there are documents
  • Some of the information is in a semi-structured format (XML + text)
  • Some of the information is in a structured format (databases)
  • Biomedical science researchers need to make sense of data
  • Biomedical researchers and health librarians need to manage this information and knowledge effectively
  • Text-mining can be used to improve indexing, which is essential for findability; because it is machine-aided, it can also create indexes more efficiently than manual indexing alone
  • Text-mining is an important component of the semantic web, which Sir Tim Berners-Lee calls a database of linked data
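
The sketch below walks through the three activities described earlier in miniature. It is written in Python; the records, the entity dictionary and the query are all invented for illustration, and a real system would use far more sophisticated retrieval, extraction and mining components.

  from collections import Counter
  from itertools import combinations
  import re

  # Hypothetical mini-corpus standing in for a large document collection.
  corpus = {
      "rec1": "Metformin lowered HbA1c in patients with type 2 diabetes.",
      "rec2": "Patients on metformin and insulin reported hypoglycemia.",
      "rec3": "A review of library data-literacy workshops.",
  }

  # Hypothetical dictionary of entities of interest (e.g., drugs and conditions).
  entities = {"metformin", "insulin", "hypoglycemia", "diabetes"}

  def tokens(text):
      return re.findall(r"[a-z0-9]+", text.lower())

  # 1) Information retrieval: keep only the records relevant to a simple query.
  query = {"metformin"}
  retrieved = {doc_id: text for doc_id, text in corpus.items()
               if query & set(tokens(text))}

  # 2) Information extraction: pull known entities out of each retrieved record.
  extracted = {doc_id: sorted(entities & set(tokens(text)))
               for doc_id, text in retrieved.items()}

  # 3) Data mining: count which entities co-occur within the same record.
  cooccurrence = Counter()
  for found in extracted.values():
      cooccurrence.update(combinations(found, 2))

  print(extracted)
  print(cooccurrence.most_common())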

Questions for librarians

The rise of data, and its concomitant curation, management and use, is a growing trend in academic libraries. However, rather than waiting for your library organization to hire a data librarian or to create a data repository, why not try to introduce some data science skills (or exercises) into your library workshops?

  • First, how might you start to incorporate data science concepts into your information literacy programs?
  • Brainstorm and (re)write definitions, models and standards for our programs to include data
  • Develop discipline-based frameworks for information and data literacy
  • How should academic libraries provide data literacy education?
  • Should workshops be designed as standalone or integrated into courses?
  • Should they be part of research methods, theory courses or integrated across curricula?
  • Who should teach and support data literacy: data librarians, academic domain experts, LIS academics, or other subject experts?

Benefits of text-mining

  • Text-mining can aid in systematically reviewing a large body of literature
  • Text-mining can help researchers keep up in their fields, reducing the risk they've missed something relevant
  • Text-mining aids in the discovery of patterns and trends in data, associations among entities, predictive rules, etc.
  • Text-mining has the ability to enrich unstructured text with semantic tags and annotations (e.g., FOAF, Friend of a Friend); see the tagging sketch after this list
  • Text-mining assists authors with tools to develop semantic annotations
  • Text-mining is a form of document and information management
  • Text-mining enables the enrichment of your digital libraries
  • Text-mining makes it easier for scientists to engage in intelligent searching, linking and integration of text and databases
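
As a rough illustration of the semantic tagging mentioned above, the Python sketch below wraps recognised terms in inline type tags. The term dictionary is hypothetical; a real annotation pipeline would draw its types from an established vocabulary or ontology (FOAF, MeSH, etc.).

  import re

  # Hypothetical dictionary mapping surface terms to semantic types.
  term_types = {
      "aspirin": "Drug",
      "warfarin": "Drug",
      "stroke": "Disease",
  }

  def annotate(text):
      """Wrap known terms in simple inline tags, e.g. <Drug>aspirin</Drug>."""
      def tag(match):
          word = match.group(0)
          semantic_type = term_types.get(word.lower())
          if semantic_type:
              return f"<{semantic_type}>{word}</{semantic_type}>"
          return word
      return re.sub(r"[A-Za-z]+", tag, text)

  print(annotate("Aspirin and warfarin are both used after a stroke."))
  # <Drug>Aspirin</Drug> and <Drug>warfarin</Drug> are both used after a <Disease>stroke</Disease>.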

Disadvantages of text-mining

  • Data collection in text-mining requires managing a lot of “free text”
  • Often, the data is ill-organized and not described in any way
  • The data is both semi-structured and unstructured
  • Natural language text contains ambiguities and often requires human intervention to interpret
  • There are lexical, syntactic, semantic and pragmatic ambiguities, among other challenges
  • Learning techniques for processing text typically need annotated training examples (see the sketch after this list)
  • Developing resources (ontologies, corpora) to improve text mining research is not a simple matter
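
To show what "annotated training examples" look like in practice, the following Python sketch trains a toy document classifier. It assumes the scikit-learn library is installed; the texts and labels are invented, and a handful of examples like this would be far too few for real use.

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.naive_bayes import MultinomialNB

  # Hypothetical annotated training examples: each text is labelled by hand.
  texts = [
      "randomised controlled trial of a new statin",
      "cohort study of dietary sodium and blood pressure",
      "editorial on open access publishing",
      "letter to the editor about peer review",
  ]
  labels = ["research", "research", "commentary", "commentary"]

  # Learn a simple bag-of-words classifier from the annotated examples.
  vectorizer = CountVectorizer()
  X = vectorizer.fit_transform(texts)
  classifier = MultinomialNB().fit(X, labels)

  # Apply the learned model to new, unlabelled text.
  new_docs = ["double blind trial of aspirin", "letter about open access"]
  print(classifier.predict(vectorizer.transform(new_docs)))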

Examples of text mining

  • BrainMap.Org is a database of published functional and structural neuroimaging experiments; the database can be analysed to study human brain function and structure
  • iCommons promotes collaboration among proponents of open education, open access publishing and free culture communities
  • iScience Maps is a free Web service for scientists interested in using Twitter content in their research
  • SureChem is a search engine for patents that allows chemists to search by chemical structure, chemical name, keyword or patent

See also GoPubMed | PubMed Alternative Interfaces | PubReMiner

Key websites & white papers

References
