Digital Libraries Glossary
The Digital Libraries Glossary is a work in progress. Library-related definitions have been revised at least twice: once by students taking a course entitled LIBR 1395: Special Topics - Creating and Managing Digital Collections, and twice by me as their instructor. In any case, here they are for general consumption and access. Feel free to correct any problems; sign up for editing privileges at the top right of the page.
Administrative metadata refers to information used to manage or control access to an object or collection. This metadata may include information on how the object or collection is stored, how it was scanned, copyrights and licenses, and what is required for long-term preservation.
Analog (also analogue) formats are those considered to be of the print or paper era, e.g., print books, audio, video. Analog may also be defined as "...of or relating to a device or process in which data is represented by physical quantities that change continuously..." In other words, analog describes a device or system that represents changing values as continuously variable "physical quantities". A typical analog device is a clock whose hands move continuously around a face. Such a clock is capable of indicating every possible time of day. By contrast, digital clocks are capable of representing only a finite number of times (every tenth of a second, for example). In general, humans experience the world analogically. Vision, for example, is an analog experience because we perceive infinitely smooth gradations of shapes and colors.
Born digital is digital content that originated as a digital file and is distinct from other digital media created through digitization. To be "born digital" means to come into being as an electronic or digital file; examples include original images made with digital cameras, spreadsheets, and word processing documents, among others.
Browsing is a way of exploring for information, typically using some logical or visual organization as an aid, as on the Internet, in a digital file, or in another medium or collection of documents.
Vannevar Bush (1890 – 1974) was born and raised in Everett, Massachusetts, and was one of the most visible American scientists of the first half of the 20th century.
Completeness is a pervasive quality dimension associated with many digital library concepts; it reflects whether everything appropriate is included. One approach to assessing completeness quantitatively is to divide the number of units of some type that are associated with a concept by the ideal number of units for that concept.
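The ratio described above can be sketched in a few lines. This is only an illustration: the function name and the sample figures (8 of 10 expected metadata fields present) are assumptions made for the example, not part of any standard.

```python
def completeness(actual_units: int, ideal_units: int) -> float:
    """Return the share of the ideal number of units that is actually present."""
    if ideal_units <= 0:
        raise ValueError("ideal_units must be a positive number")
    return actual_units / ideal_units


# Hypothetical record: 8 of the 10 expected metadata fields are filled in.
score = completeness(8, 10)
print(f"completeness = {score:.2f}")
```

A value of 1.0 would mean that every expected unit is present; what counts as a "unit" and what the ideal number is must be decided for each concept.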
A crawler (also called a bot) is a computer program that crawls an archive, the Internet, or a hyperbase in a methodical, automated manner.
Cross-language information retrieval is the querying, searching, and accessing of information written in a natural language different from the language of a user’s query.
Crosswalks are sets of rules that show how the fields in different metadata systems are related. These rules enable networked systems to interpret and translate data to different metadata standards for sharing or conversion. A crosswalk can also be presented as a table showing equivalent elements (or "fields") in more than one database; it maps elements in one metadata scheme to equivalent elements in another, for example from MARC to Dublin Core.
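In software, a crosswalk of this kind is often just a lookup table. The sketch below assumes a deliberately simplified, four-row MARC-to-Dublin Core mapping chosen for illustration; it is not the official Library of Congress crosswalk, and the sample record is invented.

```python
# Illustrative, simplified crosswalk: MARC tag -> Dublin Core element.
# (Assumption: real crosswalks work at subfield level and cover many more tags.)
MARC_TO_DC = {
    "100": "creator",    # main entry, personal name
    "245": "title",      # title statement
    "260": "publisher",  # publication statement (subfield $b holds the publisher)
    "650": "subject",    # topical subject heading
}


def crosswalk(marc_fields):
    """Translate (tag, value) pairs into a dict of Dublin Core elements."""
    dc_record = {}
    for tag, value in marc_fields:
        element = MARC_TO_DC.get(tag)
        if element is None:
            continue  # no equivalent element in this simplified table
        dc_record.setdefault(element, []).append(value)
    return dc_record


# Hypothetical MARC-like input; tag 999 has no mapping and is dropped.
sample = [
    ("100", "Doe, Jane"),
    ("245", "Digital Libraries"),
    ("650", "Metadata"),
    ("650", "Archives"),
    ("999", "local note"),
]
print(crosswalk(sample))
```

Fields without an equivalent are silently dropped here, which shows why converting between schemes can lose information.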
Curation refers to the organization and maintenance of (digital) data.
Descriptive Metadata is understood as “information describing the intellectual content of the object, such as MARC cataloguing records, finding aids or similar schemes” (ODL). This information enables data to be accessed, arranged and evaluated (Reitz).
The digital classroom (see the related flipped classroom, blended learning models and smart classrooms) is the technology-enabled classroom where learning is fully supported online through the strategic use of information and communication technologies (ICTs) for all learners in the 24/7 classroom.
According to Wikipedia, "...digital forensics (or digital forensic science) is a branch of forensic science encompassing the recovery and investigation of material found in digital devices, often in relation to computer crime".
The Digital Library Federation <http://www.diglib.org/about.htm> is an association of libraries and allied institutions that work together to establish an international network of digital libraries. Its aim is to promote strategies for collection development, identify best practices and standards for the production of electronic-information technologies, and provide practical initiatives for the preservation of digital collections.
Digital Object is a term used to describe item(s) stored in a digital library. A digital object is a content-independent data structure principally composed of digital material or data as well as metadata (such as policy expressions dictating use). A digital object will include an identifier to facilitate searching, identifying, locating, accessing, and dissemination. Examples of digital objects are images, e-journals, e-books, digitized copies of print materials, and audio and video files: these items are technology-dependent and bound by intellectual property rights.
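The structure described above (an identifier, the digital material itself, and metadata) can be sketched as a small data class. The class name, field names, sample identifier and the in-memory repository are all hypothetical choices for this illustration and do not follow any particular digital library system.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """A content-independent container: identifier + data + metadata."""
    identifier: str                               # used to locate and cite the object
    data: bytes                                   # the digital material itself
    metadata: dict = field(default_factory=dict)  # e.g. title, rights policy


# A toy in-memory repository keyed by identifier (assumption for the example).
repository = {}


def deposit(obj: DigitalObject) -> None:
    repository[obj.identifier] = obj


def locate(identifier: str) -> DigitalObject:
    return repository[identifier]


page_scan = DigitalObject(
    identifier="example:object-1",  # made-up identifier
    data=b"...image bytes...",
    metadata={"title": "Scanned page", "rights": "public domain"},
)
deposit(page_scan)
print(locate("example:object-1").metadata["title"])
```

Because the container does not care what kind of bytes it holds, the same structure serves for images, e-books, audio or video.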
Digital rights management (DRM) refers to the protection of content from logical security attacks and abuses that infringe intellectual property rights.
Digital watermarks (also called data embedding) are invisible or visible marks formed by a pattern of “bits” inserted into an electronic image, music, video, or other material to protect the copyright of the material's owner. Digital watermarks are like those used in currencies to deter counterfeiting. Steganography can also be applied in digital watermarks to make the watermark less detectable. The digital watermark must be robust enough to withstand normal changes to the file, such as reductions from “lossy compression”.
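To make the idea of inserting a pattern of bits concrete, here is a minimal sketch of least-significant-bit (LSB) embedding, one of the simplest ways to hide bits in pixel values. It is a swapped-in toy technique, not a production watermark: LSB marks are fragile and would not survive lossy compression, so they do not meet the robustness requirement mentioned above. The pixel values and bit pattern are invented for the example.

```python
def embed(pixels, bits):
    """Hide one watermark bit in the least significant bit of each pixel value."""
    if len(bits) > len(pixels):
        raise ValueError("not enough pixels to hold the watermark")
    marked = list(pixels)
    for i, bit in enumerate(bits):
        marked[i] = (marked[i] & ~1) | bit  # clear the lowest bit, then set it
    return marked


def extract(pixels, n_bits):
    """Read the watermark back from the lowest bit of the first n_bits pixels."""
    return [p & 1 for p in pixels[:n_bits]]


# Hypothetical 8-bit grayscale pixel values and a 4-bit watermark.
original = [200, 13, 77, 54]
watermark = [1, 0, 1, 1]

marked = embed(original, watermark)
print(marked)              # each value changes by at most 1
print(extract(marked, 4))  # recovers the watermark bits
```

Each pixel changes by at most one intensity level, which is why such a mark is effectively invisible to the eye.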
Digitization is the process of taking traditional library materials that are in the form of books and papers and converting them to an electronic form in which they can be stored and manipulated by a computer (Witten & Bainbridge, 2003 cited in Kamusiime & Mukasa, 2012). The United Nations (2011) describes digitization as the process of converting analog items, such as paper records, photographs or graphic items, into an electronic representation or image that can be accessed and stored electronically. It involves translation of data into digital form (binary coded files for use in computers). Scanning images, sampling sound, and converting text on paper into text in computer files are all examples of digitization (Lopatin, 2006).
Dublin Core is the name given to the set of cataloguing elements used for web-based electronic resources. The standard was developed at a conference held in Dublin, Ohio, which explains its name. Fifteen elements (such as title, creator, subject, publisher, etc.) have been established as the “simple” set that is to be used for cataloguing these items. Additional elements are available. The elements are called “metadata” because they are used to describe data.
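A simple Dublin Core description can be written out as XML with Python's standard library. The sketch below uses the Dublin Core element namespace (http://purl.org/dc/elements/1.1/); the wrapping "record" element, the function name and the sample values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)  # serialize elements with the "dc:" prefix


def dc_record(fields):
    """Build an XML string with one dc:* child element per field."""
    root = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Invented sample values for four of the fifteen simple elements.
xml_text = dc_record({
    "title": "Digital Libraries",
    "creator": "Doe, Jane",
    "subject": "Metadata",
    "publisher": "Example Press",
})
print(xml_text)
```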
The EAD Document Type Definition (DTD) is a nonproprietary standard for encoding in Standard Generalized Markup Language (SGML) or Extensible Markup Language (XML) the finding aids (registers, inventories, indexes, etc.) used in archives, libraries, museums, and other repositories of manuscripts and primary sources to facilitate use of their materials. EAD was developed in 1993 on the initiative of the UC Berkeley Library and is maintained by the Library of Congress, in partnership with the Society of American Archivists.
To filter is to select what to pass through from a stream. In a pipeline of software routines, each routine only includes in its output a suitable portion of its input. In a system for Selective Dissemination of Information, each client only receives personalized information.
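The pipeline idea above maps naturally onto Python generators, where each stage passes along only a suitable portion of its input. The stage names and the sample documents are hypothetical and serve only to show the pattern.

```python
def only_english(docs):
    """Stage 1: pass through documents whose language is English."""
    for doc in docs:
        if doc["lang"] == "en":
            yield doc


def about(topic):
    """Build a stage that passes through documents tagged with the topic."""
    def stage(docs):
        for doc in docs:
            if topic in doc["subjects"]:
                yield doc
    return stage


def pipeline(docs, *stages):
    """Chain the stages so each one filters the output of the previous one."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


# Invented incoming stream of documents.
stream = [
    {"id": 1, "lang": "en", "subjects": ["metadata"]},
    {"id": 2, "lang": "fr", "subjects": ["metadata"]},
    {"id": 3, "lang": "en", "subjects": ["copyright"]},
]
selected = pipeline(stream, only_english, about("metadata"))
print([doc["id"] for doc in selected])
```

In a Selective Dissemination of Information setting, each client would have its own combination of stages representing a personal profile.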
Functional Requirements for Authority Data (FRAD) is a conceptual entity-relationship model developed by the International Federation of Library Associations and Institutions (IFLA) that relates the data recorded in library authority records to the needs of users of those records, thereby facilitating the sharing of that data. http://www.ifla.org/node/947
Functional Requirements for Subject Authority Data (FRSAD) is a conceptual entity-relationship model developed by the International Federation of Library Associations and Institutions (IFLA) of three "many-to-many entities that serve as subjects of intellectual and artistic endeavor". It is meant to ease the global sharing and use of subject authority data. http://www.ifla.org/node/947
Functional Requirements for Bibliographic Records (FRBR) is a conceptual entity-relationship model developed by the International Federation of Library Associations and Institutions (IFLA) that relates user tasks of retrieval and access in online library catalogues and bibliographic databases. It represents a more holistic approach to retrieval and access, as the relationships between entities provide links to navigate a hierarchy of relationships. The model is significant because it is separate from cataloguing standards such as AACR2, International Standard Bibliographic Description (ISBD) and RDA. http://www.ifla.org/about-the-frbr-review-group
Google Book Search is an online search tool and database of online books created by search giant Google. The digitized books in this collection can be viewed in part or in whole because a) they are out of copyright, or b) publishers and/or authors have granted permission for partial or complete access. Books in the public domain can be downloaded, saved and printed as PDFs.
Granularity refers to the extent to which a system contains separate components (like granules). The more components in a system, or the greater the granularity, the more flexible it is. The composition of a piece of information and the extent to which it can be broken into smaller parts is a type of granularity. The greater the number of parts it has, the greater its granularity, enabling a gradual, progressive search through a hierarchy. The finer the grains, the greater the flexibility in searching a digital collection. In a collection of digital images of sporting events, granularity depends on the degree to which the collection is broken into sub-categories; a collection divided into European, field games, use of a ball, cricket, year, game, team, player, strike, etc. is an example of a finely grained collection.
GIF is an acronym for Graphics Interchange Format, a file format used to store image information such as corporate logos, simple animations and the like. GIF restricts images to 256 colours (an 8-bit palette) in the RGB (Red, Green, Blue) colour space, which reduces the cost of image storage (memory) and transmission. GIF uses lossless compression (the decompressed image is exactly the same as the image before compression), but it is not a suitable file format for continuous-tone (or detailed) images such as photographs or for print reproduction.
Interoperability is a system's ability to operate by working with another system, to create, or to work on a particular project. Interoperable systems rely on each other's information and abilities in order to produce a valuable result.
Universally compatible with browsers, viewers, and image editing software, JPEG is a method of compressing, storing and transmitting color-rich photographic images on the World Wide Web. As a lossy format, it helps to make files smaller and quicker to download. Files that have undergone JPEG compression usually have extensions such as .jpeg or .jpg.
Metadata is a set of data (literally data about data) designed to facilitate the discovery and use of linked data. In library collections, three types of designated metadata are used to provide descriptive/content, structure/format or administrative/copyright information for each item in the collection.
METS (Metadata Encoding and Transmission Standard) is a scheme for encoding the descriptive, administrative and structural metadata of digital objects using the XML schema language. METS tutorials are available in many languages, and there is a METS listserv maintained by the Library of Congress in the Network Development and MARC Standards Office.
MODS (Metadata Object Description Schema) was developed by the Library of Congress’ Network Development and MARC Standards Office in June 2002. It is an XML (Extensible Markup Language) schema used for carrying selected data from MARC 21 records or for creating original descriptive bibliographic records. It includes a subset of MARC fields and uses language-based tags rather than numeric tags. This allows for the conversion of core fields while some specific data can be dropped, so that a simpler record can be created using more general tags than those available in the MARC record. It was designed as a compromise between the complexity of the MARC format used by libraries and the simplicity of Dublin Core metadata.
A metadata schema is a standardized system of description used to identify the bibliographic details of a digital object (e.g. title, date of creation, author/contributor). A well-designed schema outlines the formatting and required content standards for each of these individual details, which are known as metadata elements.
The Open Archives Initiative (OAI) is a growing organization/effort that develops and promotes a technological framework and low-barrier interoperability standards aiming to facilitate the efficient dissemination of digital content. It allows users to gather metadata from a number of repositories through the use of standards such as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
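An OAI-PMH harvest begins with an ordinary HTTP request whose query string names a verb such as ListRecords and a metadata format such as oai_dc (unqualified Dublin Core). The sketch below only builds such a URL and does not contact any server; the base URL is a made-up placeholder, and the helper name is an assumption for this example.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional: restrict the harvest to one set
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint, used for illustration only.
url = list_records_url("https://repository.example.org/oai")
print(url)
```

The repository would answer with an XML document containing the metadata records, which a harvester can then parse and store.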
Object Reuse and Exchange (OAI-ORE) is a specification from OAI defining standards for the exchange of aggregations of Web resources, i.e., compound digital objects, using URIs, resource maps, and proxies.
In information science, an ontology is a formal representation (model) of knowledge in a domain as a set of relevant concepts along with relationships, usually more comprehensive than a taxonomy, sometimes including definitions and properties associated with the concepts.
OCR (Optical Character Recognition) is software that reads handwritten, printed, or typed text that has been scanned into a system and translates it into digital (electronic) text. The benefits include making long pieces of text available quickly and electronically, which saves both cost and space. Today’s OCR engines can pool results from many different indicators, such as stroke edge and the space between characters.
Peer-to-peer (P2P) networks are distributed networks of computers in which each machine functions as both client and server. The term peer-to-peer implies the lack of a centralized server and any related form of control. As a result, P2P networks are often used for file sharing between users.
Project Gutenberg <http://www.gutenberg.org/wiki/Main_Page> is the largest and oldest creator/distributor of free eBooks. (The eBook was created by Michael Hart in 1971.) Through PG, Hart has endeavoured to provide free books to as many people as possible in the most accessible format. Two major ideas behind PG are that books should cost so little that no one cares about cost and that PG is a grassroots volunteer organization. The second philosophy is about ease of use. Almost all eBooks are available in ASCII or “Plain Vanilla ASCII” so that nearly all machines and software can display them. There are also sound and image files available; PG is the epitome of a digital library. All files available for download are in the public domain, and as of April 2008 there were 25,000 eBooks. The goal is to have one million eBooks by 2015 and to begin translating works into as many languages as possible.
Resource Description and Access (RDA) emerged from the International Conference on the Principles & Future Development of AACR in 1997. Intended to replace AACR2, RDA will be released in late 2009. RDA departs from AACR in its foundation on the Functional Requirements for Bibliographic Records (FRBR). RDA is being developed as a new standard for resource description and access in the digital world.
A repository is a place (or container) where documents are deposited for storage, safekeeping and access. In the context of digital libraries, a repository is a central place where data, in the form of digital information and collections of information, is stored and maintained in accessible formats. A digital repository is available to a community of users through an interface and can also be called a “digital library.”
Resolution, or image resolution, refers to the number of dots, or pixels, per inch used to display an image. The term applies to digital, film and other types of images. The quality of a digital image depends on the number of pixels in the image, and resolution is measured in many ways. Resolution determines how close lines can be to each other and still be resolved; resolution units are tied either to physical sizes (e.g. lines per mm, lines per inch) or to the overall size of a picture (lines per picture height, or lines). Line pairs are often used instead of lines: a line pair is a pair of adjacent dark and light lines, while lines count both dark lines and light lines. The higher the resolution of a digital image, the more pixels are used to create it, which gives sharper contrast and a crisper, cleaner look. A low-resolution digital image may suffice for websites but may appear blurry and distorted when enlarged. The pixel dimensions of a digital image are indicated by a number combination, such as 800 x 600. This means there are 800 dots horizontally across the monitor by 600 lines of dots vertically, or 480,000 dots that make up the image on the screen.
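The arithmetic in the 800 x 600 example, and the link between pixel counts and physical size, can be checked with a short sketch. The function names and the figure of 100 pixels per inch are assumptions chosen to keep the numbers round.

```python
def total_pixels(width_px, height_px):
    """Number of dots that make up the image."""
    return width_px * height_px


def physical_size_inches(width_px, height_px, pixels_per_inch):
    """Width and height in inches when shown at a given pixel density."""
    return (width_px / pixels_per_inch, height_px / pixels_per_inch)


print(total_pixels(800, 600))               # 480000
print(physical_size_inches(800, 600, 100))  # (8.0, 6.0)
print(physical_size_inches(800, 600, 300))  # a much smaller image at print density
```

The same 480,000 pixels cover 8 x 6 inches at 100 pixels per inch but much less area at 300, which is why an image that looks fine on a website can appear blurry when enlarged.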
The semantic web (SW) is a vision of the web that will extend its attributes for better searching and processing by computers and/or software agents (bots). While the current web uses HTML - marking up documents and creating links between them - its language is designed to be read by humans not computers. The SW uses frameworks such as Resource Description Framework (RDF), Web Ontology Language (OWL), and Extensible Markup Language (XML) - all designed to be read by machines. These languages enrich document description and provide more information about the properties of items on the web as well as the relationships between them. This data enables computers and software agents to understand the meaning of web pages, moving beyond iterative Google searching to carry out more sophisticated tasks for users.
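RDF describes resources as subject-predicate-object statements ("triples"). The sketch below imitates that model with plain Python tuples so that a program can follow relationships between items; it does not use a real RDF library, and the ex:, dc: and foaf: names and the sample data are illustrative assumptions.

```python
# Each statement is a (subject, predicate, object) triple.
TRIPLES = [
    ("ex:book1", "dc:title", "Digital Libraries"),
    ("ex:book1", "dc:creator", "ex:author1"),
    ("ex:author1", "foaf:name", "Jane Doe"),
]


def match(subject=None, predicate=None, obj=None):
    """Return the triples that agree with every part that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in TRIPLES
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def creator_names(resource):
    """Follow resource -> creator -> name, a two-step machine-readable link."""
    names = []
    for _, _, creator in match(subject=resource, predicate="dc:creator"):
        for _, _, name in match(subject=creator, predicate="foaf:name"):
            names.append(name)
    return names


print(creator_names("ex:book1"))
```

Because the relationships are explicit, software can answer "who wrote this?" without parsing human-oriented page text.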
A social network is a social structure that typically involves people and organizations that are inter-related, which can be described in a computer using a network or graph, where nodes represent entities and links designate relationships, and that can support key operations of systems like Facebook and LinkedIn.
SGML is a meta language of descriptive rules for other markup languages. It is an abstract syntax that specifies rules for tagging elements in a document. These tags can be interpreted by different hardware and software programs to implement format elements in different ways. HTML is an example of a specific markup language that interprets tags according to SGML rules. SGML was designed to be a framework that can remain readable for many decades.
Structural metadata refers to the internal organization of physical files, in logical order. Another way to think about structural metadata is as the digital method of cataloguing materials, such as books and photographs, in logical order.
A style sheet is a set of statements that inform the browser how to present information for various media. Using a style sheet, the author has precise control of spacing, text alignment, font characteristics, audio and speech output, position of elements, etc. Style sheets can be reused or altered to change the look and feel of your documents. These sheets are separate from the document, thereby helping with device independence and access for those with disabilities.
The Text Encoding Initiative (TEI) develops guidelines for encoding methods for machine-readable texts. The guidelines are used by libraries, museums, publishers and individual scholars to present texts for online research, teaching, and preservation. TEI is a nonprofit member consortium primarily made up of academic institutions and scholars from around the world. Some examples of groups using TEI are the African American Women Writers of the 19th Century, the EpiDoc Collaborative, and the Versioning Machine.
A thesaurus is a list of groups of terms, wherein each group includes semantically related entries and usually relates to a concept. Common semantic relations incorporated include: synonym, antonym, broader term (hyperonym), narrower term (hyponym), part-whole (meronyms and holonyms), used for, and related term. Thesauri (plural form) may be for a natural language or for a specialized domain, like Art and Architecture. In information retrieval they may describe a controlled vocabulary and be utilized in search. A more general definition of a thesaurus is a list of synonyms or a grouping of terms with similar meanings. In information retrieval, it is a list of subject headings in a catalogue or database (http://www.library.cornell.edu/olinuris/ref/research/vocab.html). A thesaurus can also be used as a tool for vocabulary control (http://publish.uwo.ca/~craven/677/thesaur/main01.htm), meaning that there is a pre-authorized list of approved terms that may be used as subject headings. The purpose of vocabulary control is to ensure efficient and consistent searching. Similar concepts are grouped under a single heading, such as business listings in a Yellow Pages phonebook (http://www.controlledvocabulary.com/).
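Vocabulary control can be sketched as a lookup from the words a searcher types to the approved heading, followed by optional expansion to narrower terms. The tiny thesaurus below, with its single heading and its relation names, is invented for illustration and is not taken from any published vocabulary.

```python
# One approved heading with "used for", broader and narrower relations.
THESAURUS = {
    "Automobiles": {
        "used_for": ["cars", "motorcars"],      # non-preferred entry terms
        "broader": ["Vehicles"],
        "narrower": ["Electric automobiles"],
    },
}


def preferred_term(term):
    """Map an entry term to its authorized heading, or None if unknown."""
    wanted = term.lower()
    for heading, relations in THESAURUS.items():
        variants = [heading] + relations["used_for"]
        if wanted in (v.lower() for v in variants):
            return heading
    return None


def expand_query(term):
    """Search under the authorized heading plus its narrower terms."""
    heading = preferred_term(term)
    if heading is None:
        return [term]  # unknown term: search it as typed
    return [heading] + THESAURUS[heading]["narrower"]


print(preferred_term("Cars"))
print(expand_query("motorcars"))
```

Whatever synonym the user types, the search runs against the same heading, which is what makes retrieval consistent.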
TIFF is the acronym for Tagged Image File Format. The format is an extensible metadata standard for scanned images that has been around for over 20 years and is accepted by most applications. It is not a data format for images (unlike JPEG), and various image formats can be used to represent the digitized image. It can be used for vector-based images (i.e. non-bitmapped), as well as high-resolution images (16 bit or higher) or basic black and white. A single TIFF file can contain a number of images, up to a limit of 4 GB for the whole file. The usual file extension is .tif.
Extensible Markup Language (XML) is a standard developed by W3C that uses tagging to encode documents so that important aspects of their structure are made explicit and can be readily analyzed by software systems. As such, XML is a simple, flexible format derived from SGML that aids information systems in the exchange of structured information by encoding documents and serializing data. XML is an open standard recommended by the World Wide Web Consortium (W3C). It is possible to define the content of a document separately from its formatting, making it easy to reuse the content in other applications or for other presentation environments.
XML DTD is an abbreviation of Extensible Markup Language Document Type Definition. It includes a set of mark-up tags and their interpretation. A DTD tells editors which tags are allowed and how they can be applied, and directs browsers about what kinds of page to display. For example, a DTD might specify that MESSAGE tags can contain TABLE tags but that TABLE tags cannot contain MESSAGE tags. By setting rules, an XML DTD ensures that all documents are formatted the same way and will be displayed properly. Changing the format of the document can be easily done by modifying the DTD.
A weighting scheme is a way to determine the weight or importance of different components in a search. In information retrieval, it is used to assign values in a vector space, to give priority to fields of documents and specify the relative values of different signals considered when assigning similarity scores.
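One widely used weighting scheme, which the entry above does not name, is TF-IDF: a term weighs more when it is frequent in a document and rare across the collection. The sketch below is a minimal version under simple assumptions (documents already split into words, natural logarithm, no smoothing); the three sample documents are invented.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (docs are lists of words)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented mini-collection: "digital" occurs everywhere, "metadata" in one place.
collection = [
    ["digital", "library"],
    ["digital", "archive"],
    ["digital", "metadata", "metadata"],
]
weights = tf_idf(collection)
print(weights[2])
```

A term found in every document gets a weight of zero because it does not help to tell documents apart, while a term concentrated in one document gets the highest weight there. These per-term weights are the values a vector space model would use when computing similarity scores.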