What is BagIt and why is it used?
BagIt is a hierarchical file packaging format designed to maintain data integrity and completeness when digital content is stored on disk and transferred over networks. For example, if a batch of METS/ALTO data is added to a BagIt bag, the bag's checksum manifest makes it possible to confirm, at any point in the batch's life cycle, that every file is still present and unchanged.
A BagIt 'bag' is a digital container for storing or sending digital content, in much the same way a cardboard box is a physical container that can be used for storing or sending physical items.
BagIt is used as a standard format for transferring digital content between different organisations, such as archives, libraries and museums. It provides a convenient and reliable way to package and transfer digital content, while ensuring that the content and its associated metadata remain intact and complete during the transfer process.
Additionally, the hierarchical file structure of a BagIt package makes it easy for organisations to manage and organise their digital content, and for recipient institutions to understand the structure and contents of the package.
The BagIt specification is described in RFC 8493 (https://www.rfc-editor.org/rfc/rfc8493).
Why do we use BagIt at Veridian?
We use BagIt as part of our processes to maintain data integrity and completeness, and it’s an integral part of our backup systems. We also use BagIt for the same purposes when transferring data to and from organisations such as local, state or national libraries, when they also use BagIt.
What are some examples of usage?
Here at Veridian we use the Java BagIt Library (BIL), developed by the Library of Congress and available here: https://sourceforge.net/projects/loc-xferutils/files/loc-bil-java-library/4.4/
BAGIT LIBRARY (BIL)
Version 4.4
BagIt-Version: 0.97
Because this BagIt library is written in Java, it should run from the command line on most Windows, macOS or Linux systems that have a Java Virtual Machine (JVM) installed.
Here are some common commands we use on the command line when dealing with 'bags' (or batches), on an example batch. BATCH1 in this example is a directory structure containing a batch of METS/ALTO data for a newspaper publication.
bag baginplace BATCH1
Creates a bag in place. The directory structure and files that exist directly in the BATCH1/* directory are automatically moved into the BATCH1/data/* directory. This is the command we most often use to create a bag.
bag update BATCH1
Updates the manifests and (if it exists) the bag-info.txt for a bag. Use this after the contents of a bag's data/* directory have changed.
bag verifycomplete BATCH1
Verifies the completeness of a bag. This command checks that all of the files in the bag (data/* directory) are listed in the manifest, and that all of the files listed in the manifest actually exist in the bag (data/* directory). However, it does not verify the checksums of the files, so it is not a complete check of a bag (unlike verifyvalid). It is very fast and much less processor intensive, which makes it useful as a quick check that a bag currently has all the files we expect. To properly check that a bag's data is intact, use verifyvalid.
bag verifyvalid BATCH1
Verifies the validity of a bag. This command does the same cross checking of the manifest against files in the bag (data/* directory) as the verifycomplete command, but in this case also performs a verification of the checksums for all files.
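To make the difference between these commands concrete, here is a minimal Python sketch of the ideas behind baginplace, verifycomplete and verifyvalid. It is not the BIL's implementation: the function names (bag_in_place, write_manifest, is_complete, is_valid) are hypothetical, only MD5 payload manifests are handled, and tag manifests and bag-info.txt are ignored.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def payload_files(bag_dir: Path) -> list[str]:
    """Paths of all files under data/, relative to the bag, with '/' separators."""
    data_dir = bag_dir / "data"
    return sorted(p.relative_to(bag_dir).as_posix() for p in data_dir.rglob("*") if p.is_file())


def write_manifest(bag_dir: Path) -> None:
    """(Re)write manifest-md5.txt with one '<md5> <path>' line per payload file."""
    lines = [f"{md5_of(bag_dir / rel)} {rel}\n" for rel in payload_files(bag_dir)]
    (bag_dir / "manifest-md5.txt").write_text("".join(lines), encoding="utf-8")


def bag_in_place(bag_dir: Path) -> None:
    """Hypothetical helper: move everything into data/ and add the required tag files.

    Assumes bag_dir does not already contain a 'data' entry.
    """
    entries = list(bag_dir.iterdir())  # snapshot taken before data/ is created
    (bag_dir / "data").mkdir()
    for entry in entries:
        shutil.move(str(entry), str(bag_dir / "data" / entry.name))
    write_manifest(bag_dir)
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )


def read_manifest(bag_dir: Path) -> dict[str, str]:
    """Map each path listed in manifest-md5.txt to its expected checksum."""
    expected = {}
    for line in (bag_dir / "manifest-md5.txt").read_text(encoding="utf-8").splitlines():
        if line.strip():
            checksum, rel = line.split(maxsplit=1)
            expected[rel.strip()] = checksum.lower()
    return expected


def is_complete(bag_dir: Path) -> bool:
    """Cheap check: the manifest and data/ name exactly the same files."""
    return set(read_manifest(bag_dir)) == set(payload_files(bag_dir))


def is_valid(bag_dir: Path) -> bool:
    """Full check: complete, and every file's checksum matches the manifest."""
    if not is_complete(bag_dir):
        return False
    return all(md5_of(bag_dir / rel) == checksum for rel, checksum in read_manifest(bag_dir).items())
```

In this sketch is_complete only compares file names, so it never reads file contents, while is_valid has to read every byte of the payload. That is why the completeness check is so much faster on large batches. Calling write_manifest again after changing data/ roughly corresponds to what an update needs to do for the payload manifest.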
How is a BagIt bag structured?
Here we describe first a minimal example, then a more complete example of the file and directory structure of a BagIt bag. Note that the BagIt bags we manage typically also include the optional tagmanifest-md5.txt and bag-info.txt files.
Minimal example:
BATCH1/ (BagIt bag)
|-- data (required, contents of bag as a directory structure of files)
|   \-- BATCH1
|       \-- NYT
|           \-- 1940
|               \-- 07
|                   \-- 01
|                       |-- mets.xml
|                       |-- 0001.jp2
|                       |-- 0001.xml
|                       \-- ... etc
|
|-- manifest-md5.txt (required)
|     sample contents:
|       dc6f05fdaf9893fbcaf2602da0621c3d data/BATCH1/NYT/1940/07/01/mets.xml
|       fb38e018cbffe3f2a820576182d9fe7c data/BATCH1/NYT/1940/07/01/0001.jp2
|       89ce616b59ab5d4e7131c18d0f8c8cd7 data/BATCH1/NYT/1940/07/01/0001.xml
|       a5d1407b13d303a3c4742ce91a8e2f5c data/BATCH1/NYT/1940/07/01/... etc.
|       ...
|     (A manifest file is needed that lists the filenames contained in the
|     'data' directory along with their corresponding checksums.)
|
\-- bagit.txt (required)
      sample contents:
        BagIt-Version: 0.97
        Tag-File-Character-Encoding: UTF-8
      (Fulfils requirements that:
        - existence of this file identifies the directory as a bag
        - BagIt specification version is listed
        - Character encoding utilized for tag files is specified)
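As an illustration of the minimal structure, the following Python sketch reports which required parts of a bag are absent. The helper names (read_tag_file, missing_required_parts) are hypothetical and this is not how the BIL performs its checks; it only looks for the data directory, at least one payload manifest, and the two fields that bagit.txt is expected to declare.

```python
from pathlib import Path


def read_tag_file(path: Path) -> dict[str, str]:
    """Parse simple 'Label: value' lines (folded continuation lines are not handled)."""
    fields = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        if ":" in line:
            label, value = line.split(":", 1)
            fields[label.strip()] = value.strip()
    return fields


def missing_required_parts(bag_dir: Path) -> list[str]:
    """Return a description of each required part that is absent from bag_dir."""
    missing = []
    if not (bag_dir / "data").is_dir():
        missing.append("data/")
    # Matches manifest-md5.txt and similar, but not tagmanifest-md5.txt.
    if not any(bag_dir.glob("manifest-*.txt")):
        missing.append("manifest-<algorithm>.txt")
    bagit_txt = bag_dir / "bagit.txt"
    if not bagit_txt.is_file():
        missing.append("bagit.txt")
    else:
        fields = read_tag_file(bagit_txt)
        for label in ("BagIt-Version", "Tag-File-Character-Encoding"):
            if label not in fields:
                missing.append(f"bagit.txt: {label}")
    return missing
```

An empty list means the directory has the shape of a minimal bag. It says nothing about whether the manifest matches the payload, which is what verifycomplete and verifyvalid are for.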
More complete example:
BATCH1/ (BagIt bag)
|-- data (required, contents of bag as a directory structure of files)
|   \-- BATCH1
|       \-- NYT
|           \-- 1940
|               \-- 07
|                   \-- 01
|                       |-- mets.xml
|                       |-- 0001.jp2
|                       |-- 0001.xml
|                       \-- ... etc
|
|-- manifest-md5.txt (required)
|     sample contents:
|       dc6f05fdaf9893fbcaf2602da0621c3d data/BATCH1/NYT/1940/07/01/mets.xml
|       fb38e018cbffe3f2a820576182d9fe7c data/BATCH1/NYT/1940/07/01/0001.jp2
|       89ce616b59ab5d4e7131c18d0f8c8cd7 data/BATCH1/NYT/1940/07/01/0001.xml
|       a5d1407b13d303a3c4742ce91a8e2f5c data/BATCH1/NYT/1940/07/01/... etc.
|       ...
|     (A manifest file is needed that lists the filenames contained in the
|     'data' directory along with their corresponding checksums.)
|
|-- bagit.txt (required)
|     sample contents:
|       BagIt-Version: 0.97
|       Tag-File-Character-Encoding: UTF-8
|     (Fulfils requirements that:
|       - existence of this file identifies the directory as a bag
|       - BagIt specification version is listed
|       - Character encoding utilized for tag files is specified)
|
|-- tagmanifest-md5.txt (optional)
|     sample contents:
|       de37d632e49392ee8afde796291e709e bag-info.txt
|       9e5ad981e0d29adc278f6a294b8c2aca bagit.txt
|       b5154cd3cf0434f7c40349a4dd8a2638 manifest-md5.txt
|     (tag manifest file which lists tag files and their associated checksums)
|
|-- bag-info.txt (optional)
|     sample contents:
|       Payload-Oxum: 70106496963.23926
|       Bagging-Date: 2019-06-14
|       Bag-Size: 65.3 GB
|     (contains metadata for the bag listed as colon-separated key/value pairs)
|
\-- fetch.txt (optional)
      sample contents:
        https://cdn.loc.gov/master/pnp/highsm/23300/23364a.tif 216951362 data/23364a.tif
      (lists URLs from which payload files can be retrieved, in addition to
      or in place of payload files in the 'data' directory)
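Two of the optional files above carry values that are easy to compute or parse. The Payload-Oxum in bag-info.txt has the form "<total bytes>.<number of files>" for the payload, so the sample value 70106496963.23926 describes 23926 files totalling 70106496963 bytes, which is consistent with the 65.3 GB Bag-Size. Each fetch.txt entry is a URL, a length in bytes (or "-" when unknown) and the payload path. The Python sketch below illustrates both; the helper names (payload_oxum, parse_fetch_line) are hypothetical and not part of the BIL.

```python
from pathlib import Path
from typing import Optional


def payload_oxum(bag_dir: Path) -> str:
    """Return '<total bytes>.<file count>' for the files under data/."""
    files = [p for p in (bag_dir / "data").rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return f"{total_bytes}.{len(files)}"


def parse_fetch_line(line: str) -> tuple[str, Optional[int], str]:
    """Split one fetch.txt entry into (url, length in bytes or None, payload path)."""
    url, length, path = line.split(maxsplit=2)
    return url, (None if length == "-" else int(length)), path.strip()
```

Comparing payload_oxum(bag) with the Payload-Oxum recorded in bag-info.txt gives a very cheap first check on a received bag before any checksums are computed.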
More information about BagIt can be found on its Wikipedia page (https://en.wikipedia.org/wiki/BagIt), or in RFC 8493 (https://www.rfc-editor.org/rfc/rfc8493).