To browse other articles on a range of HSL topics, see the A-Z index.
The data glossary is a work in progress. In any case, the definitions are here for general consumption and access; feel free to correct any of those provided. Sign up for editing privileges by contacting the wiki administrator.
Administrative metadata refers to information used to manage or control access to an object or collection. This metadata may include information on how the object or collection is stored, how it was scanned, copyrights and licenses, and what is required for long-term preservation.
Anonymization refers to the practice of removing personally identifiable information (PII), i.e., any information that might be used to uniquely identify, contact or locate someone such as a social security number, email address, credit card or fixed IP address. In healthcare, this is particularly important due to the emphasis placed on data privacy.
An application programming interface, or "API", is a set of definitions and protocols that enables one computer program or application to exchange data with another.
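As a minimal sketch of what that exchange looks like in practice, the snippet below parses the kind of JSON payload an API might return; the field names and values are invented for illustration, not taken from any real service.

```python
import json

# A hypothetical JSON payload such as a REST API might return.
# The "id", "title" and "format" fields are illustrative only.
response_body = '{"id": 42, "title": "Growth charts", "format": "CSV"}'

record = json.loads(response_body)  # parse the exchanged data
print(record["title"])              # Growth charts
```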
The American Standard Code for Information Interchange (ASCII) is a character-based encoding scheme that was originally based on the English alphabet. ASCII encodes 128 specific characters into 7-bit binary integers: the numbers 0-9, the letters a-z and A-Z, basic punctuation, control codes that originated with Teletype machines, and a blank space. ASCII codes represent text in computers, communications equipment, and other devices that transmit text.
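The mapping between characters and 7-bit code points can be seen directly in Python:

```python
# Each ASCII character maps to an integer in the 7-bit range 0-127.
text = "Data"
codes = [ord(c) for c in text]            # characters -> ASCII code points
assert all(code < 128 for code in codes)  # all fit in 7 bits

decoded = "".join(chr(code) for code in codes)  # and back again
print(codes)    # [68, 97, 116, 97]
print(decoded)  # Data
```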
Browsing is a way of exploring information, typically using some logical or visual organization as an aid, as on the Internet or in a digital file, other media or a collection of documents. There is some question as to whether browsing is even a function in data-related management.
A byte is a unit of digital information that is eight "bits" long (8 bits = 1 byte). Historically, the byte was the number of bits used to encode a single character of text on a computer and was the smallest unit of memory in computer architectures. The size of a byte has historically been hardware dependent and no definitive standard exists to mandate the size.
Values or observations that can be sorted into categories (or groups); bar charts and pie graphs are used to graph categorical data.
Similar platforms: ContentDM, DSpace, Eprints, Fedora, Digital Commons, Hydra, Greenstone.
A controlled vocabulary (or thesaurus) is a list of terms, and groups of related terms, in which the most commonly-used form of a concept is stipulated within a database or index. Common semantic relations incorporated include: synonym, antonym, broader term (hyperonym), narrower term (hyponym), part-whole (meronyms and holonyms), used for, and related term. Thesauri (the plural form) may cover a natural language or a specialized domain, such as art and architecture. In information retrieval, controlled terms are used to index articles and search large databases more consistently. A more general definition of a thesaurus is a list of synonyms or grouping of terms with similar meanings. In information retrieval, this list comprises a set of subject headings in a catalogue or database (http://www.library.cornell.edu/olinuris/ref/research/vocab.html). A thesaurus is used as a tool for vocabulary control (http://publish.uwo.ca/~craven/677/thesaur/main01.htm), where a pre-authorized list of terms governs the descriptions of items within a database. The purpose of vocabulary control is to ensure efficient and consistent searching. Similar concepts are grouped under a single heading, such as business listings in a phonebook (http://www.controlledvocabulary.com/).
How does data apply within our copyright framework? See the full entry on copyright resources.
Also called a bot; a computer program that crawls an archive, the Internet or a hyperbase in a methodical, automated manner. Bots crawl data as well as documents.
Querying, searching, and accessing information written in a natural language different from the language of a user’s query. This type of retrieval may be critical in data management where information is in multiple languages.
Crosswalks are sets of rules that show how fields in different systems are related. These rules enable systems to interpret and translate data to different standards for sharing or conversion. A crosswalk can be a table of equivalent elements (or "fields") in more than one database; it maps elements in one metadata scheme to equivalent elements in another, for example from MARC standards to Dublin Core.
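A toy crosswalk can be sketched as a simple lookup table. The mappings shown here (245 to Title, 100 to Creator, and so on) are common MARC-to-Dublin-Core equivalences, but a production crosswalk would be far larger and the record values are invented:

```python
# A toy MARC -> Dublin Core crosswalk as a lookup table.
marc_to_dc = {
    "245": "Title",
    "100": "Creator",
    "260": "Publisher",
    "650": "Subject",
}

def crosswalk(marc_record):
    """Translate a {MARC field: value} record into Dublin Core terms.

    Fields with no mapping (e.g. purely local fields) are dropped.
    """
    return {marc_to_dc[field]: value
            for field, value in marc_record.items()
            if field in marc_to_dc}

record = {"245": "Data Glossary", "100": "HSL", "999": "local note"}
print(crosswalk(record))  # {'Title': 'Data Glossary', 'Creator': 'HSL'}
```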
Curation refers to the organization and maintenance of (digital) information and, by extension, data.
Data is a collective noun describing discrete or individual pieces of information; it may be quantitative or qualitative in nature. In computing (data processing), data may be represented in tabular form (rows and columns), in tree structures (sets of nodes with explicit parent-child relationships), or in graphs (sets of connected nodes). Data can also be the result of measurements that require human judgement, and can be visualized using graphs or images. Data, information and knowledge are often seen as a continuous flow, with data comprising the bits and bytes of information. Data becomes information through analysis: e.g., the height of an adult male/female is generally considered "data", measurements on human growth may be "information", but articles or books on this topic containing analysis and synthesis move it towards "knowledge". The philosophical discussion around data vis-à-vis information ultimately moves towards the questions "what is knowledge?" and, further, "what is wisdom?" Knowledge may be thought of as a synthesis of all information available about a subject, whereas wisdom refers to the ability (gained through experience) to be selective about what information is most important.
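The three representations named above can be sketched with the same small, invented data set; the node names are placeholders:

```python
# Tabular form: rows and columns.
table = [
    {"name": "A", "parent": None},
    {"name": "B", "parent": "A"},
    {"name": "C", "parent": "A"},
]

# Tree structure: explicit parent-child relationships.
tree = {"A": ["B", "C"], "B": [], "C": []}

# Graph: a set of connected nodes (edges need not form a hierarchy).
graph = {("A", "B"), ("A", "C"), ("B", "C")}

print(len(table), len(tree), len(graph))  # 3 3 3
```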
A data dictionary is a database about data and database structures: a catalogue of data elements, containing names, structures and information about usage, for the benefit of programmers and others interested in data elements and their use.
The exercise of decision-making and authority for data-related matters: the organizational bodies, rules, decision rights and accountabilities of people and information systems as they perform information-related processes. Data governance can help to determine how organizations make decisions about data.
The DLI (Data Liberation Initiative) is a Statistics Canada program through which institutions pay an annual subscription fee for access to public microdata files, databases and geographic files. Use is limited to academic research and teaching (Data Services, UBC Library).
A data library is a collection of numeric, audio-visual, textual or geospatial data sets maintained for secondary use in research. Often, data libraries are part of larger institutions (academic, corporate, scientific, medical, governmental, etc.) and were established to serve data users of those organizations. A data library may house local data and provide access in an open fashion or via password controls. Data libraries may also maintain subscriptions to licensed resources for use by researchers. Whether a data library is also "a data archive" will depend on the extent of unique holdings and whether long-term preservation services are offered. A good example of a data library is PANGAEA (Data Publisher for Earth & Environmental Science), which is both a library and publisher for earth system sciences data. Data can be georeferenced in time (date/time or geological age) and space (latitude, longitude, depth/height).
see the full entry
see the full entry
A data management plan is a formal document that outlines how you will manage your data before, during and after your research has been conducted. It describes the kind of data that will be gathered, any relevant standards that will be used to describe it (metadata), who will own it, who can access it and how long it will be preserved (or made accessible). A DMP will also outline the facilities and equipment needed to disseminate, share, and/or preserve data. Several funding agencies, including the NIH and CIHR, require or encourage the development of data management plans for research. (See Elements of a Data Management Plan.)
A data structure is a particular way of storing and organizing data on a computer so that it can be accessed and used efficiently.
see the full entry
Descriptive Metadata is understood as “information describing the intellectual content of the object, such as MARC cataloguing records, finding aids or similar schemes” (ODL). This information enables data to be accessed, arranged and evaluated (Reitz).
According to Wikipedia, "...digital forensics (or digital forensic science) is a branch of forensic science encompassing the recovery and investigation of material found in digital devices, often in relation to computer crime."
A digital object refers to item(s) stored in a digital library, and a digital object identifier, or DOI, is a permanent identifier associated with that object. Digital objects are content-independent data structures composed of digital data and metadata (such as policy expressions dictating use). DOIs facilitate searching for, identifying, locating, accessing and disseminating the information embedded within the object. Examples of digital objects are images, e-journals, e-books, digitized copies of print materials, and audio and video files: these items are technology-dependent and bound by intellectual property rights.
Dublin Core is the name given to the set of cataloguing elements used for web-based electronic resources. A conference held in Dublin, Ohio, developed the standard, which explains its origin. Fifteen elements (such as title, creator, subject, publisher, etc.) have been established as the “simple” set that is to be used for cataloguing these items. Additional elements are available. The elements are called “metadata” because they are used to describe data.
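A record using the fifteen "simple" Dublin Core elements can be sketched as below; the element names are the standard set, while the values are invented for illustration:

```python
# The fifteen elements of the "simple" Dublin Core set.
dublin_core_elements = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

# A record is a value for each element; unused elements stay empty.
record = {element: None for element in dublin_core_elements}
record.update({
    "title": "Data Glossary",
    "creator": "HSL Wiki",
    "date": "2012",
    "format": "text/html",
})
print(len(dublin_core_elements))  # 15
```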
Dryad is an international repository of data underlying peer-reviewed articles in the basic and applied biosciences. Dryad enables scientists to validate published findings, explore new analysis methodologies, repurpose data for research questions unanticipated by the original authors, and perform synthetic studies. Dryad also aims to make data archiving as simple as possible via a suite of services not necessarily provided by publishers or institutional websites.
EAD (Encoded Archival Description) is a project maintained by the Library of Congress, in partnership with the Society of American Archivists, and an industry-accepted XML standard. It is used by librarians and archivists for the consistent encoding of archival finding aids.
Fedora (Flexible Extensible Digital Object Repository Architecture) is an open source project whose focus is on durable, persistent access to digital data. It was developed by researchers at Cornell as an architecture to store, manage and access digital content as digital objects. Similar platforms: ContentDM, DSpace, Eprints, Digital Commons, Hydra, Greenstone.
To filter is to select what to pass through from a stream. In a pipeline of software routines, each routine only includes in its output a suitable portion of its input. In a system for Selective Dissemination of Information, each client only receives personalized information.
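A pipeline of filters can be sketched as chained functions, each passing through only a suitable portion of its input; the item fields and topics here are invented:

```python
# Each routine includes in its output only a portion of its input.
def by_topic(items, topic):
    return [i for i in items if i["topic"] == topic]

def by_year(items, year):
    return [i for i in items if i["year"] >= year]

stream = [
    {"title": "Growth charts", "topic": "health", "year": 2010},
    {"title": "Old survey",    "topic": "health", "year": 1998},
    {"title": "Road atlas",    "topic": "maps",   "year": 2011},
]

# Chain the filters: only recent health items survive the pipeline.
result = by_year(by_topic(stream, "health"), 2005)
print([i["title"] for i in result])  # ['Growth charts']
```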
A conceptual entity-relationship model developed by the International Federation of Library Associations and Institutions (IFLA) that relates the data recorded in library authority records to the needs of users of those records thereby facilitating the sharing of that data. http://www.ifla.org/node/947
A conceptual entity-relationship model developed by the International Federation of Library Associations and Institutions (IFLA) of three "many-to-many entities that serve as subjects of intellectual and artistic endeavor". It is meant to ease the global-sharing and use of subject authority data. http://www.ifla.org/node/947
A conceptual entity-relationship model developed by the International Federation of Library Associations and Institutions (IFLA) that relates user tasks of retrieval and access in online library catalogues and bibliographic databases. It represents a more holistic approach to retrieval and access as the relationships between entities provide links to navigate a hierarchy of relationships. The model is significant because it is separate from cataloguing standards such as AACR2, International Standard Bibliographic Description (ISBD) and RDA. http://www.ifla.org/about-the-frbr-review-group
Google Book Search is an online search tool and database of online books created by search giant Google. The digitized books in this collection can be viewed in part or whole as they are a) out of copyright, or because b) publishers and/or authors have granted permission for partial or complete access. Books in the public domain can be downloaded, saved and printed as PDFs.
Granularity refers to the composition of a piece of information and the extent to which it is broken down into smaller parts. The more parts it has, the greater its granularity, enabling a progressive search through a hierarchy. The finer the grain, the greater the flexibility in searching a digital collection. In a collection of sports digital images, granularity depends on the degree to which the collection is broken into sub-categories: a breakdown such as European, field games, use of a ball, cricket, year, game, team, player, strike, etc. is an example of a finely grained collection.
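The sports-image example above can be sketched as nested categories, where the depth of nesting is a rough measure of granularity:

```python
# The sports-image hierarchy from the entry, as nested categories.
hierarchy = {
    "European": {
        "field games": {
            "use of a ball": {
                "cricket": ["year", "game", "team", "player", "strike"],
            },
        },
    },
}

def depth(node):
    """Depth of nesting: a rough measure of granularity."""
    if isinstance(node, dict):
        return 1 + max(depth(v) for v in node.values())
    return 1  # a leaf list of the finest-grained facets

print(depth(hierarchy))  # 5
```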
GIF is an acronym for Graphics Interchange Format, a file format used to store image information such as corporate logos, simple animations and the like. GIF restricts images to 256 (8-bit palette) colours in the RGB (Red, Green, Blue) colour space, which reduces the cost of image storage (memory) and transmission. GIF is not a suitable file format for continuous tone (or detailed) images such as photographs or for printer reproduction, and uses lossless compression (the decompressed image is identical to the original).
Similar platforms: ContentDM, DSpace, Eprints, Fedora, Digital Commons, Hydra, Greenstone.
Interoperability is a system's ability to work with other systems to create, or to work on, a particular project. Interoperable systems rely on each other's information and capabilities in order to produce a valuable result.
Universally compatible with browsers, viewers, and image-editing software, JPEG is a method of compressing, storing and transmitting color-rich photographic images on the World Wide Web. As a lossy format, it makes files smaller and quicker to download. Files that have undergone JPEG compression usually have extensions such as .jpeg or .jpg.
Metadata is a set of data (literally, "data about data") designed to facilitate the discovery and use of the data it describes. In library collections, three types of metadata are used to provide descriptive/content, structural/format or administrative/copyright information for each item in the collection.
METS (Metadata Encoding and Transmission Standard) is a scheme for encoding descriptive, administrative and structural metadata using an XML schema. METS tutorials are available in many languages, and there is a METS listserv maintained by the Library of Congress in the Network Development and MARC Standards Office.
MODS was developed by the Library of Congress' Network Development and MARC Standards Office in June 2002. It is an XML (Extensible Markup Language) schema used for carrying selected data from MARC 21 records or for creating original descriptive bibliographic records. It includes a subset of MARC fields and uses language-based tags rather than numeric tags. This allows core fields to be converted while some specific data is dropped, producing a simpler record with more general tags than those available in the MARC record. It was designed as a compromise between the complexity of the MARC format used by libraries and the simplicity of Dublin Core metadata.
A metadata schema is a standardized system of description used to identify the bibliographic details of a digital object (i.e. title, date of creation, author/contributor, etc.). A well-designed schema outlines the formatting and required content standards for each of these individual details, which are known as metadata elements.
Values or observations that can be measured; these numbers can be placed in ascending or descending order. Scatter plots and line graphs are used to graph numerical data. Numerical data can also be seen as consisting of digits (numbers) as opposed to letters or characters. Numerical data is often analyzed using statistical methods, with the results displayed in tables, charts and graphs.
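Ordering and basic statistical summary of numerical observations can be sketched with the standard library; the height values are invented:

```python
import statistics

# Numerical observations can be ordered and summarized statistically.
heights_cm = [172, 165, 180, 158, 175]

print(sorted(heights_cm))             # ascending order
print(statistics.mean(heights_cm))    # arithmetic mean
print(statistics.median(heights_cm))  # middle value when sorted
```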
The Open Archives Initiative (OAI) is a growing organization and effort that develops and promotes technological frameworks and low-barrier interoperability standards aiming to facilitate the efficient dissemination of digital content. It allows users to gather metadata from a number of repositories through standards such as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
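An OAI-PMH harvesting request is just an HTTP URL with protocol-defined parameters. The `verb` and `metadataPrefix` parameters below are defined by the protocol (`ListRecords` and `oai_dc` are standard values); the repository base URL is hypothetical:

```python
from urllib.parse import urlencode

# Build an OAI-PMH request URL. The base URL is a made-up example;
# verb=ListRecords and metadataPrefix=oai_dc come from the protocol.
base_url = "https://repository.example.org/oai"
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}

request_url = base_url + "?" + urlencode(params)
print(request_url)
```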
Object Reuse and Exchange (OAI-ORE) is a specification from the OAI defining standards for the exchange of aggregations of Web resources, i.e., compound digital objects, using URIs, resource maps, and proxies.
In information science, an ontology is a formal representation (model) of knowledge in a domain as a set of relevant concepts along with relationships, usually more comprehensive than a taxonomy, sometimes including definitions and properties associated with the concepts.
Open data is "...data that can be freely used, reused and redistributed by anyone, subject at most to the requirement to attribute and share-alike", in the spirit of the Creative Commons movement. See the Open Data Commons Attribution License.
OCR (Optical Character Recognition) is software that reads handwritten, printed or typed text that has been scanned into a system, and translates it into digital (electronic) text. The benefits include making long pieces of text available quickly and electronically, which is cost- and space-effective. Today’s OCR engines can combine results from different indicators, such as stroke edge and the space between characters.
Distributed networks of computers that function as both client and server. The term peer-to-peer implies the lack of a centralized server and any related form of control. As a result, P2P networks are often used for file- and data-sharing among users.
The REDCap project, a data initiative at Harvard University, is a set of free, web-based and user-friendly electronic data capture (EDC) tools for research studies.
Resource Description and Access (RDA) emerged from the International Conference on the Principles & Future Development of AACR in 1997. Intended to replace AACR2, RDA will be released in late 2009. RDA departs from AACR in its foundation on the Functional Requirements for Bibliographic Records (FRBR). RDA is being developed as a new standard for resource description and access in the digital world.
A repository is a place (or container) where documents (and data) can be deposited for storage, safekeeping and access. In the context of digital libraries, a repository is a central place where data, in the form of digital information and collections of information, is stored and maintained in accessible formats. A digital repository is available to a community of users through an interface and can also be called a “digital library.”
The semantic web (SW) is a vision of the web that will extend its attributes for better searching and processing by computers and/or software agents (bots). While the current web uses HTML - marking up documents and creating links between them - its language is designed to be read by humans not computers. The SW uses frameworks such as Resource Description Framework (RDF), Web Ontology Language (OWL), and Extensible Markup Language (XML) - all designed to be read by machines. These languages enrich document description and provide more information about the properties of items on the web as well as the relationships between them. This data enables computers and software agents to understand the meaning of web pages, moving beyond iterative Google searching to carry out more sophisticated tasks for users.
A social network is a social structure of inter-related people and organizations. It can be described in a computer as a network or graph, where nodes represent entities and links designate relationships, and it underpins systems like Facebook and LinkedIn. The interactions that these relationships and linkages represent create a great deal of data, which can be mined to determine ties between researchers (see Social network analysis).
SGML is a meta language of descriptive rules for other markup languages. It is an abstract syntax that specifies rules for tagging elements in a document. These tags can be interpreted by different hardware and software programs to implement format elements in different ways. HTML is an example of a specific markup language that interprets tags according to SGML rules. SGML was designed to be a framework that can remain readable for many decades.
Structural metadata refers to the internal organization of physical files, in logical order. Another way to think of structural metadata is as the digital counterpart of cataloguing materials, such as books and photographs, in logical order.
A style sheet is a set of statements that inform the browser how to present information for various mediums. Using a style sheet, the author has precise control of spacing, text alignment, font characteristics, audio and speech output, position of elements etc. Style sheets can be reused or altered to change the look and feel of your documents. These sheets are separate from the document, thereby helping with device independence and access for those with disabilities.
The Text Encoding Initiative (TEI) develops guidelines for encoding methods for machine-readable texts. The guidelines are used by libraries, museums, publishers and individual scholars to present texts for online research, teaching, and preservation. TEI is a non-profit member consortium primarily made up of academic institutions and scholars from around the world. Some examples of groups using TEI are the African American Women Writers of the 19th Century, the EpiDoc Collaborative, and the Versioning Machine.
A thesaurus (or controlled vocabulary) is a list of terms, and groups of terms, in which terms point to related terms and to the commonly-used form within a database or index. Common semantic relations incorporated include: synonym, antonym, broader term (hyperonym), narrower term (hyponym), part-whole (meronyms and holonyms), used for, and related term. Thesauri (the plural form) may cover a natural language or a specialized domain, such as art and architecture. In information retrieval they may describe a controlled vocabulary and be utilized in search. A more general definition of a thesaurus is a list of synonyms or grouping of terms with similar meanings. In information retrieval, it is a list of subject headings in a catalogue or database (http://www.library.cornell.edu/olinuris/ref/research/vocab.html). A thesaurus can also be used as a tool for vocabulary control (http://publish.uwo.ca/~craven/677/thesaur/main01.htm), meaning that there is a pre-authorized list of approved terms that may be used as subject headings. The purpose of vocabulary control is to ensure efficient and consistent searching. Similar concepts are grouped under a single heading, such as business listings in a Yellow Pages phonebook (http://www.controlledvocabulary.com/).
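A few of these relationships can be sketched as lookup tables, with query expansion mapping non-preferred ("used for") terms to the preferred heading; the terms themselves are invented for illustration:

```python
# "Used for" synonyms resolve to a single preferred heading.
preferred = {"car": "automobiles", "auto": "automobiles"}
# Broader-term and narrower-term relations form the hierarchy.
broader = {"automobiles": "vehicles"}
narrower = {"automobiles": ["sedans", "trucks"]}

def expand(term):
    """Map a query term to its preferred heading (vocabulary control)."""
    return preferred.get(term, term)

print(expand("car"))           # automobiles
print(broader[expand("car")])  # vehicles
```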
TIFF is the acronym for Tagged Image File Format. The format is an extensible metadata standard for scanned images that has been around for over 20 years and is accepted by most applications. It is not itself a data format for images (unlike JPEG), and various image formats can be used to represent the digitized image. It can be used for vector-based (i.e. non-bitmapped) images, as well as high-resolution images (16-bit or higher) or basic black and white. A single TIFF file can contain a number of images, up to a limit of 4 GB for the whole file. The usual file extension is .tif.
The World Wide Web Consortium (W3C) has become the primary organization for creating web specifications; its principal goals are interoperability of systems and the sharing of information (and data).
Extensible Markup Language (XML) is an open standard developed and recommended by the World Wide Web Consortium (W3C). It uses tagging to encode documents so that important aspects of their structure are made explicit and can be readily analyzed by software systems. As such, XML is a simple, flexible format derived from SGML that aids information systems in exchanging structured information by encoding documents and serializing data. XML makes it possible to define the content of a document separately from its formatting, making it easy to reuse the content in other applications or presentation environments.
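A minimal XML document, parsed with Python's standard library, shows how the structure made explicit by the tags can be analyzed by software; the element names are illustrative:

```python
import xml.etree.ElementTree as ET

# A small XML document whose structure is explicit in its tags.
doc = """<record>
  <title>Data Glossary</title>
  <creator>HSL Wiki</creator>
</record>"""

root = ET.fromstring(doc)              # parse the document tree
print(root.tag)                        # record
print(root.find("title").text)         # Data Glossary
```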
XML DTD is an abbreviation of Extensible Markup Language Document Type Definition. A DTD includes a set of mark-up tags and their interpretation: it tells editors what tags are allowed and how they can be applied, and directs browsers about what kind of page to display. For example, a DTD might specify that MESSAGE tags may contain TABLE tags, but that TABLE tags cannot contain MESSAGE tags. By setting such rules, an XML DTD ensures that all documents are formatted the same way and will be displayed properly. Changing the format of a document can be done easily by modifying the DTD.