Metadata underpins everything a digital collection can do — from search and discovery to long-term preservation and reuse across systems. Without agreed standards and formats, even the most carefully digitised materials can become difficult to manage, share, or sustain over time.

This guide provides a practical overview of the main types of metadata and the standards commonly used in digital collections, with a particular focus on digitised newspapers, books, and archival materials. It explains how different metadata standards work together, and reinforces why standards-based approaches are critical for long-term access and preservation.


 

What is Metadata?

Metadata is commonly described as “data about data,” but in digital collections it plays a much more active role. Metadata enables:

  • Search and discovery across large collections.
  • Navigation within complex digital resources, such as multi-page newspapers.
  • Sharing and reusing collections across organisations.
  • Long-term digital preservation and management.



Why Metadata Standards matter

Systems, vendors, and technologies naturally change over time, but digital collections are often expected to remain accessible and usable for decades. A platform that works well today may not be the one an institution relies on ten or twenty years from now.

Because of this, long-term preservation can’t depend solely on the platform used to provide access at any given moment. Instead, it relies on shared, industry-recognised metadata standards that are designed to last beyond individual systems.

In practice, not all digital collection platforms take the same approach. Some systems — including platforms such as CONTENTdm, Olive, Islandora, and PastPerfect — use platform-specific (often proprietary) internal data models to support storage, management, and presentation within their own environments.

Related reading: Can already digitized newspapers be converted to METS/ALTO?


Veridian Software also uses an internal data model to support performance, indexing, and a good user experience. However, Veridian takes a standards-first approach by preserving digitised collections using formal metadata standards such as METS and ALTO as the long-term source of truth for each digital newspaper or book.

This matters because when collections are preserved in standards-based formats, the content remains portable and reusable, regardless of how access interfaces or software evolve. Migration then becomes a process of moving or reusing existing standards-based content, rather than trying to extract or reconstruct information from a platform-specific internal model.



XML: The building block for Archival Metadata

Many archival metadata standards are built using XML (Extensible Markup Language).
XML is not a metadata standard itself. Instead, it provides a structured, machine-readable format that allows metadata to be:

  1. Validated: Automatically checked to confirm that the metadata follows the required structure and rules of a given standard.
  2. Preserved long-term: Stored in an open, non-proprietary format designed to remain usable and readable as systems and technologies change.
  3. Interpreted consistently across systems: Understood in a predictable way by different software platforms based on shared, standardised definitions.
XML's long-term stability and clear structure are key reasons it remains widely used in cultural heritage digitisation projects.
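The three points above can be illustrated with a few lines of Python, using only the standard library. The record below is invented for illustration; parsing it automatically checks well-formedness, and any conforming XML parser will read the same structure in the same way:

```python
import xml.etree.ElementTree as ET

# An invented, minimal metadata record (not any formal standard).
record_xml = """<record>
  <title>The Daily Example</title>
  <date>1895-06-01</date>
</record>"""

# Parsing checks well-formedness: malformed XML raises a
# ParseError instead of failing silently.
root = ET.fromstring(record_xml)

# Any XML-aware system reads the same elements the same way.
title = root.findtext("title")
date = root.findtext("date")
```

Validation against a specific standard goes further, checking the record against that standard's published XML schema, but the principle is the same: structure that machines can verify, not just display.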




Types of Metadata in digital collections

Digital collections typically rely on multiple types of metadata, each serving a different purpose. The Library of Congress broadly categorises metadata into three types:

  1. Descriptive metadata supports identification and discovery. It holds information such as titles, creators, dates, subjects, and identifiers, helping users find and understand resources. Examples include Dublin Core and MODS.
  2. Administrative metadata supports the management, use, and long-term stewardship of digital collections. It includes technical information about files, rights and access conditions, and metadata needed for system operation, control, and preservation. Examples include MIX, PREMIS, and RightsMD.
  3. Structural metadata describes how digital files relate to one another, such as the order of pages in a book or newspaper, or the relationship between files and descriptive metadata. Examples include METS and ALTO.




Descriptive Metadata

Dublin Core

Dublin Core is a widely used, straightforward descriptive metadata standard, commonly implemented in XML. It defines a set of 15 core elements — such as title, creator, date, subject, and publisher — that can describe many different types of digital resources.

These elements are intentionally simple and consistent, which makes Dublin Core easy to reuse across different repositories, discovery tools, and aggregation services.
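As a sketch of that simplicity, a basic Dublin Core record can be built with the Python standard library alone. The element names come from the real 15-element vocabulary; the item described and the `<metadata>` wrapper are illustrative assumptions:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# A hypothetical item; the element names are real Dublin Core terms,
# while the <metadata> wrapper is a common but unofficial convention.
root = ET.Element("metadata")
for name, value in [
    ("title", "Letter to the Editor"),
    ("creator", "Smith, Jane"),
    ("date", "1902-03-14"),
    ("subject", "Local politics"),
]:
    ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value

dc_xml = ET.tostring(root, encoding="unicode")
```

Because every element is drawn from the same small, shared vocabulary, a record like this can be understood by any repository or aggregator that speaks Dublin Core.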

It takes its name from Dublin, Ohio, where librarians and information specialists met at the 1995 OCLC/NCSA Metadata Workshop. Those discussions led to the idea of a shared “core” set of metadata elements that could be used across different institutions and collection types. Dublin Core has since been maintained and further developed by the Dublin Core Metadata Initiative (DCMI).

Dublin Core is often a good fit when:

  • Simplicity and ease of use are priorities, and full library cataloguing alignment isn’t required.
  • Metadata needs to be shared or reused across systems.
  • A consistent, high-level description is needed across different types of material. For example, when a collection includes photographs, letters and diaries, and each item needs the same basic descriptive information (such as title, creator and date) to support discovery.

While Dublin Core is flexible and easy to implement, it lacks the depth required for some use cases.

Dublin Core is less suited to:

  • Highly structured newspapers.
  • Manuscripts with complex structures.
  • Digitisation projects designed for long-term preservation, not just short-term access.

MODS (Metadata Object Description Schema)

MODS is an XML-based descriptive metadata standard that provides more detailed description than Dublin Core. It supports approximately 20 top-level elements, each with extensive sub-elements and attributes to capture complex descriptive information.

The standard was developed in 2002 by the Library of Congress to help translate traditional library catalogue information — particularly records based on MARC — into a format that works well in modern digital systems. Many MODS elements are derived from, or closely aligned with, MARC fields, which makes MODS familiar to cataloguers while being better suited to XML-based digital collections.

MODS is often chosen when:

  • Materials have complex details, such as multiple titles, creators, or roles.
  • Metadata needs to align with existing library catalogue records.
  • A collection needs to integrate with standards like METS and PREMIS.

Because of its depth and flexibility, MODS is commonly used as the authoritative descriptive metadata in digitisation projects, particularly for books, newspapers, and archival materials.

While MODS provides far greater descriptive detail than Dublin Core, it is not intended to describe structural relationships or page-level content. For this reason, MODS is frequently used alongside METS, where MODS supplies the descriptive metadata and METS provides the structural framework.
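A minimal, hypothetical MODS record hints at the nested structure that gives the standard its extra depth (real records typically carry many more elements and attributes):

```python
import xml.etree.ElementTree as ET

MODS = "{http://www.loc.gov/mods/v3}"

# A heavily simplified MODS record for an invented newspaper title.
mods_xml = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>The Evening Example</title></titleInfo>
  <originInfo><dateIssued>1901-05-04</dateIssued></originInfo>
</mods>"""

root = ET.fromstring(mods_xml)

# Unlike Dublin Core's flat elements, MODS nests related details,
# so titles, dates, and names each sit inside a structured wrapper.
title = root.findtext(f"{MODS}titleInfo/{MODS}title")
date_issued = root.findtext(f"{MODS}originInfo/{MODS}dateIssued")
```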

Checking the quality of Descriptive Metadata

Quality issues are often less about whether the metadata is technically valid and more about whether it is consistent and complete. Common challenges we’ve seen include:

  • Titles, dates, or names being recorded in different ways.
  • Mixing free-text entries with controlled or standardised values.
  • Missing or unclear identifiers.
  • Similar resources being described differently across the same collection.

These kinds of issues often only become apparent when metadata is aggregated, shared, or reused across systems — which is why ongoing review and normalisation are an important part of metadata management.
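A first pass at this kind of review can be automated. The sketch below flags records whose dates are not in ISO 8601 form; the records and field names are hypothetical:

```python
import re

# Hypothetical records drawn from the same collection.
records = [
    {"title": "Issue 1", "date": "1895-06-01"},
    {"title": "Issue 2", "date": "June 1st, 1895"},
]

# Flag any record whose date is not a simple ISO 8601 date (YYYY-MM-DD).
iso_date = re.compile(r"^\d{4}-\d{2}-\d{2}$")
flagged = [r["title"] for r in records if not iso_date.match(r["date"])]
```

Checks like this are crude on their own, but run across a whole collection they surface exactly the inconsistencies that otherwise only appear at aggregation time.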




Administrative Metadata

MIX (Metadata for Images in XML)

MIX is an XML-based metadata standard used to capture the technical characteristics of image files, particularly those created through digitisation. It records details about how an image file was created and what it contains at a technical level, rather than describing the visual content of the image itself.

Developed in the early 2000s and maintained by the Library of Congress, MIX is widely used in digitisation projects where consistent documentation of image files supports quality assurance, preservation planning, and future reuse.

MIX is often chosen when:

  • Technical characteristics of image files need to be recorded.
  • Scanning settings and equipment form part of the digitisation record.
  • Technical metadata supports quality checks or long-term preservation decisions.

A MIX record can include information such as image dimensions, resolution, colour space, bit depth, file format, and compression. In practice, MIX metadata is usually stored or referenced within a METS document, where it contributes to the administrative metadata that supports the ongoing management and preservation of digitised images.

Checking the quality of MIX Metadata

For MIX metadata, quality issues are usually less visible to end users, but still important for digitisation accuracy and long-term management. Common issues can include:

  • Image dimensions, resolution, or colour space that don’t quite match the source files.
  • Missing or inconsistent recording of scanning equipment or settings.
  • Technical values that look correct, but are inaccurate due to automated extraction errors.
  • Small differences in how similar images are documented across a project.

In practice, quality assurance usually involves spot-checking against the original image files, reviewing consistency across batches, and taking a close look at digitisation workflows to make sure technical metadata is being captured reliably.
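That spot-checking step can be sketched as a simple comparison. Both dictionaries here are hypothetical stand-ins: one for values parsed from a MIX record, one for values re-extracted from the image file by an imaging tool:

```python
# Values as recorded in the MIX metadata (hypothetical).
recorded = {"width": 2480, "height": 3508, "bit_depth": 8}

# Values re-extracted from the image file itself (hypothetical);
# in practice these would come from an imaging library or tool.
extracted = {"width": 2480, "height": 3508, "bit_depth": 16}

# Any field that disagrees is a candidate for closer review.
mismatches = {k for k in recorded if recorded[k] != extracted[k]}
```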

PREMIS (Preservation Metadata: Implementation Strategies)

PREMIS is a widely adopted administrative metadata standard designed to support the long-term preservation and management of digital resources by recording what has happened to them over time — including actions taken, technical changes, and who or what was responsible.

This standard is often chosen when:

  • Keeping digital files usable over the long term is a priority.
  • There’s a need to record what actions have been taken on files, and why.
  • Being able to review, verify, or explain how resources have been managed matters.
  • Collections need to remain reliable and usable over time, even as systems change.

For example, in a digitised newspaper collection, PREMIS metadata might record when OCR files were created, when fixity checks were carried out, whether files were migrated to new formats, and which systems or staff were responsible for those actions.

The PREMIS standard was developed by the PREMIS Working Group, originally brought together by OCLC and the Research Libraries Group, and is now maintained by the Library of Congress, which oversees its ongoing development.

PREMIS isn’t tied to a single file format, but it’s most often used with XML. In practice, PREMIS information is usually included alongside other metadata in a METS file, so preservation details stay connected to the rest of the collection over time.
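As an illustration of the fixity-check example above, the sketch below records a PREMIS-style event as a plain Python dictionary. A real workflow would serialise this as PREMIS XML, typically inside a METS file; the field names mirror PREMIS concepts but the structure here is simplified:

```python
import hashlib
from datetime import datetime, timezone

def fixity_check_event(data: bytes) -> dict:
    """Record a fixity check as a simplified, PREMIS-style event."""
    return {
        "eventType": "fixity check",  # a PREMIS event type
        "eventDateTime": datetime.now(timezone.utc).isoformat(),
        "messageDigestAlgorithm": "SHA-256",
        "messageDigest": hashlib.sha256(data).hexdigest(),
    }

# Hypothetical file content standing in for a page image.
event = fixity_check_event(b"page image bytes")
```

Recomputing the digest later and comparing it to the recorded value confirms the file has not changed — which is exactly the kind of verifiable history PREMIS is designed to preserve.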

Checking the quality of PREMIS Metadata

While PREMIS metadata can be checked to make sure it follows the right structure, its real value comes from how complete and consistent the preservation history is. Problems don’t always show up straight away, and often only become obvious later — for example during an audit, when planning preservation work, or when moving to a new system.

Common quality issues include:

  • Preservation actions that were never recorded.
  • Missing details about things like fixity checks or format migrations.
  • Dates, results, or responsibilities being recorded in different ways.

RightsMD

RightsMD is used to record information about rights, licences, and access conditions for digital resources, helping clarify how a resource can be used, shared, or restricted over time. This type of administrative metadata is usually recorded inside a METS file, alongside the rest of a collection’s metadata.

In practice, RightsMD often points to external rights statements or licences — such as those from RightsStatements.org — or includes short notes explaining how a resource can be accessed or reused. Keeping this information with the rest of the metadata makes it easier to understand, and ensures it can move with the collection as systems change.




Structural Metadata

METS (Metadata Encoding and Transmission Standard)

METS is a popular XML-based metadata standard that helps describe how complex digital resources are put together. A helpful way to think about METS is as a wrapper or container that brings different types of metadata into one organised, consistent structure.

It typically brings together:

  • Descriptive metadata, such as MODS or Dublin Core.
  • Administrative metadata, including MIX and PREMIS.
  • File-level information, like page images and OCR (optical character recognition) text.

Related reading: What does a METS XML file contain?


METS is especially useful for digital resources made up of many files, such as digitised books, newspapers, manuscripts, and archival collections. It lets institutions clearly show how files relate to one another, what order they belong in, and which metadata applies to which part of the resource.

The METS standard grew out of digital library work in the late 1990s and was first released in 2001 as an initiative of the Digital Library Federation. It is maintained today by the Library of Congress as a community-supported standard for organising complex digital resources.

METS is often a good choice when:

  • A digital resource is made up of many files or has a clear hierarchy (for example, issues, sections, and pages).

  • The order of pages — and how files relate to one another — needs to be preserved.
  • A collection needs to be preserved long term and remain portable as systems change.
  • Different types of metadata need to be managed together in a single, structured package.
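A pared-down structural map shows the idea in practice. The sketch below describes a hypothetical two-page issue and recovers its page order; real METS files also carry file, descriptive, and administrative metadata sections:

```python
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"

# A heavily simplified structMap for a hypothetical two-page issue.
mets_xml = """<mets xmlns="http://www.loc.gov/METS/">
  <structMap>
    <div TYPE="issue">
      <div TYPE="page" ORDER="1"><fptr FILEID="IMG001"/></div>
      <div TYPE="page" ORDER="2"><fptr FILEID="IMG002"/></div>
    </div>
  </structMap>
</mets>"""

root = ET.fromstring(mets_xml)

# Collect the page divs in their declared ORDER and read the
# file IDs they point to.
page_divs = sorted(
    (d for d in root.iter(f"{METS}div") if d.get("TYPE") == "page"),
    key=lambda d: int(d.get("ORDER")),
)
page_order = [d.find(f"{METS}fptr").get("FILEID") for d in page_divs]
```

Because the page order lives in the METS document rather than in any platform's database, any future system can reconstruct the issue exactly.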

Checking the quality of METS Metadata

With METS, most problems aren’t technical errors but structural ones. Common issues include:

  • Structural maps that are incomplete or don’t reflect the intended page order.
  • Broken or inconsistent links between files and metadata sections.
  • Relationships between descriptive, administrative, and file-level metadata that don’t quite line up.

These kinds of problems aren’t always visible, especially when content appears to “work” on screen. They tend to surface later, during migration, transformation, or reuse, when the structural integrity of the METS document becomes critical.


 

ALTO (Analyzed Layout and Text Object)

ALTO is a widely used, XML-based metadata standard that describes both the text in a digitised document and how that text appears on the page. It’s most commonly used with OCR output, where understanding where words sit on a page is just as important as the words themselves.

This makes ALTO especially important for page-based materials like newspapers, books, journals, and manuscripts. In these formats, meaning often comes from layout — columns, headings, articles, and line breaks all help readers make sense of the content. ALTO captures this layout information alongside the recognised text, helping digital systems keep scanned page images closely aligned with their text over time.

An ALTO file usually describes a single page and may include:

  • Page dimensions and orientation.
  • Layout elements such as text blocks, lines, and words.
  • Coordinates that map text to the underlying image.
  • Reading order and basic structural relationships.
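As a sketch, the fragment below (invented, and heavily simplified from a real ALTO page) shows how recognised words and their image coordinates travel together, and how easily software can recover both:

```python
import xml.etree.ElementTree as ET

ALTO = "{http://www.loc.gov/standards/alto/ns-v4#}"

# An invented fragment of an ALTO page; real files also include
# Description and Styles sections and full page measurements.
alto_xml = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="LOCAL" HPOS="120" VPOS="340" WIDTH="210" HEIGHT="48"/>
    <String CONTENT="NEWS" HPOS="350" VPOS="340" WIDTH="180" HEIGHT="48"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

root = ET.fromstring(alto_xml)

# Each String carries both the recognised word and its position
# on the page image.
words = [
    (s.get("CONTENT"), int(s.get("HPOS")), int(s.get("VPOS")))
    for s in root.iter(f"{ALTO}String")
]
text = " ".join(w for w, _, _ in words)
```

The coordinates are what make features like search-term highlighting on page images possible: the text and its location on the scan are never separated.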

Related reading: What does an ALTO XML file contain?


ALTO was originally developed by the METAe project in Europe. Maintenance later passed to the Library of Congress, which continues to maintain and evolve the standard.

ALTO is often included where:

  • Page layout and reading order must be preserved.
  • Text must remain accurately aligned with page images.
  • Word- and line-level structure is important.
  • Collections need to support accessibility and detailed page-level interaction.

Used together, METS and ALTO provide a complete picture of a digitised resource. METS defines how files and pages fit together, while ALTO supplies the detailed page-level text and layout needed to work with those pages over time.

Related reading: Enhancing newspaper digitization with METS/ALTO


Checking the Quality of ALTO Metadata

ALTO metadata is detailed and operates at the level of individual pages, which means some data issues may remain hidden unless the metadata is closely examined or validated through quality control checks. Learn more about uncovering and fixing hidden ALTO data errors.




Bringing Metadata Standards together

As mentioned throughout this guide, digital collections rarely rely on a single metadata standard. Instead, several standards work together, with each one handling a different part of how a digital resource is created, described, managed, and preserved over time.

A typical digitised newspaper or book workflow helps show how this fits together in practice:

Scanning

Scanning is the process that converts original collection materials into high-resolution digital master images, typically uncompressed TIFFs.

 

OCR processing

This step extracts text and structure from the scanned images, making collections searchable and accessible online.

Descriptive Metadata

MODS (or Dublin Core) provides descriptive information about the historic material, such as title, date, publication details, and subjects.

Administrative Metadata

For example PREMIS, which records preservation actions and events, documenting how files have been checked, managed, or changed over time.

Structural Metadata

METS then brings all of these pieces together, defining how page images, ALTO files, descriptive metadata, and preservation information relate to one another within a single issue, volume, or object.


A practical example

For example, in a digitised newspaper collection, METS may describe the overall structure of an issue, link each page image to its corresponding ALTO file, reference MODS metadata describing the title or issue, and include PREMIS metadata documenting actions such as OCR generation, fixity checks, or format migrations.


Together, these standards form a complete, standards-based representation of a digital resource. Descriptive metadata supports discovery, structural metadata preserves relationships and navigation, and administrative metadata supports long-term management and preservation — helping reinforce why no single standard is sufficient on its own.



 

How to choose the right Metadata Standards

The best choice depends on what you’re digitising, how the collection will be used, and how long it needs to remain accessible. In practice, most digitisation projects use several standards together rather than relying on just one.

The questions below outline some high-level considerations. For project-specific advice, it’s often helpful to talk with an experienced digitisation and metadata team.

What kind of material are you working with?

  • For simpler or mixed collections—such as photographs, letters, or reports—Dublin Core may be enough to support discovery and basic description.
  • For books, newspapers, or other structured text, richer standards like MODS, combined with METS and ALTO, are often needed to capture layout, structure, and relationships between pages and files.

How important is long-term preservation?

  • If a collection is mainly intended for short-term access, simpler metadata may be sufficient.
  • If the collection is expected to remain usable for decades, preservation-focused standards such as METS, ALTO, and PREMIS play a much bigger role.

Will the collection need to move or be reused?

  • When metadata needs to be shared, aggregated, or migrated between systems over time, standards-based formats make that process far easier and less risky.
  • Platform-specific metadata can work well in the short term, but often creates challenges when systems change.

How complex is the source historic material?

  • Single-file items require less structural metadata.
  • Multi-page or hierarchical resources—such as newspaper issues or books—benefit from METS to describe structure and ALTO to capture detailed, page-level content.



Final thoughts

Digital collections are rarely static. Over time, content grows, systems change, and expectations around access and preservation evolve. Metadata standards provide the foundation that allows collections to adapt to that change without losing meaning, structure, or trust.

No single standard does everything on its own. Instead, long-term, sustainable collections rely on a combination of descriptive, structural, and administrative metadata working together — each doing a specific job, and each designed to last beyond any single platform.