Digital preservation of information is important from many perspectives. Companies and government agencies are increasingly required to store their content in archival-quality digital formats to ensure their records and history are preserved accurately and comprehensively. Given the rising prevalence of born-digital documents, meaning those that never existed as physical objects, it has become clear that future preservation efforts will have to standardize on digital, rather than physical, formats.
Defining Long-Term Preservation of Records
Agencies such as the Library of Congress have committed to the long-term preservation of digital materials. The LOC’s National Digital Information Infrastructure and Preservation Program (NDIIPP), now inactive, was important in setting and assessing the standards and methodologies for creating trusted digital repositories. Selecting reliable and sustainable formats is essential when archiving content, and the LOC offers a recommended list of formats based on the type of data in question.
The International Organization for Standardization (ISO) maintains the specification for an Open Archival Information System (OAIS). The standard in question, ISO 14721, defines “long term” as a period long enough that the organizations and personnel maintaining the archive have to account for changing technologies, meaning that it can be indefinite. The information in the OAIS should be stored in such a way that it will be preserved and retrievable, despite the evolution of IT norms.
ISO specified that OAIS creation involves technological, legal, industrial and scientific perspectives. One of the most important questions involved in OAIS creation, however, is a technological one: What kinds of formats should organizations turn to when placing their records into storage to ensure they’ll be indefinitely accessible? In cases of page-based technical records, the answer often involves more ISO-maintained standards, namely those around PDF documents.
Preferred Long-Term Storage Formats for Textual Records
The LOC has maintained its Recommended Formats Statement since 2014, with the ambitious but relevant goal of providing a model to assist internal and external teams in the preservation of all kinds of knowledge and information. The list is updated over time, because the LOC acknowledges that the number of works being generated around the country and the world is large and growing, and the digital methods used to produce them are diversifying.
While the Recommended Formats Statement has a focus on preserving creativity and cultural knowledge, its objective of accurate and comprehensive information storage also applies to more straightforward forms of archival recordkeeping. The LOC suggestion for document formats depends on whether the content is laid out in a page format or not. While documents not set up as pages are kept in XML-based markup formats, archival PDFs are the main choice for documents using pages.
There are three versions of the PDF standard recommended by the LOC for the long-term storage of page-formatted digital content. Each represents a slight variation, with features designed to suit the needs of users.
- PDF/UA: This is a version of the PDF standard designed with accessibility in mind. Graphics objects are tagged with replacement text to indicate what is depicted, and no information in the files is conveyed by color, contrast, format or layout unless it is tagged. The content and its reading order should be conveyed with tags. These tagging guidelines ensure that reading software designed for users with disabilities can convey the information stored within the PDF/UA document. The considerations come from the Web Content Accessibility Guidelines 2.0. This level of usability is not guaranteed with alternative PDF formats. Some government agencies require that their document authoring software be able to export in PDF/UA.
- PDF/A: The PDF/A format is designed for the long-term preservation of content. It is designed to be constrained in the types of information stored, to ensure the contents of the document will be accessible in a consistent way at a later date, however far in the future that is. There are three versions within this standard: PDF/A-1 and the slightly updated PDF/A-2 and PDF/A-3, which have not replaced their predecessor. Furthermore, there are three levels of conformance: A indicates the standard has been met in full, while B is a lowered level of conformance. U means every piece of text in the document has a Unicode equivalent. Many government agencies use PDF/A for their archiving purposes, and it is a preferred format of the National Archives and Records Administration.
- PDF: The LOC’s third recommendation for archiving in PDF format doesn’t contain the archive-friendly features of the others. The organization did temper its recommendation, however, adding that users should make sure to use the highest available quality of PDF, and employ searchable text, embedded fonts, compression with no loss of quality, high-resolution images, colorspace specifications that extend to all device types and tagging. The LOC specified that this includes the use of PDF/X, a prepress format designed for graphics implementations.
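The PDF/A versions and conformance levels described in the list above are commonly written together as a label such as "PDF/A-2b". As a minimal sketch, assuming labels arrive as plain strings from some records system, the Python below splits such a label into its version and conformance level. The names `parse_pdfa_label` and `PdfAFlavor` are hypothetical and exist only for this illustration. The table of allowed levels also reflects one detail not covered in the list: the U level was introduced with PDF/A-2, so PDF/A-1 offers only A and B.

```python
import re
from typing import NamedTuple

# Conformance levels, paraphrasing the descriptions in the list above.
LEVEL_MEANING = {
    "a": "standard met in full",
    "b": "lowered level of conformance",
    "u": "every piece of text has a Unicode equivalent",
}

# PDF/A-1 defines only levels a and b; level u was added with PDF/A-2.
LEVELS_BY_VERSION = {1: {"a", "b"}, 2: {"a", "b", "u"}, 3: {"a", "b", "u"}}

# Matches labels such as "PDF/A-1b" or "pdf/a-3U" (case-insensitive).
_LABEL = re.compile(r"^PDF/A-(\d)([abu])$", re.IGNORECASE)


class PdfAFlavor(NamedTuple):
    """Hypothetical container for a parsed PDF/A label."""
    version: int
    level: str


def parse_pdfa_label(label: str) -> PdfAFlavor:
    """Split a label like 'PDF/A-2b' into version 2 and level 'b'."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a PDF/A label: {label!r}")
    version = int(match.group(1))
    level = match.group(2).lower()
    if level not in LEVELS_BY_VERSION.get(version, set()):
        raise ValueError(f"PDF/A-{version} has no conformance level {level!r}")
    return PdfAFlavor(version, level)


if __name__ == "__main__":
    flavor = parse_pdfa_label("PDF/A-2u")
    print(flavor.version, flavor.level, "-", LEVEL_MEANING[flavor.level])
```

A check like this only interprets the label. It does not verify that a file actually conforms, which requires a dedicated PDF/A validator.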
Converting Files for Long-Term Preservation and Archiving
Whether responding to a federal government requirement such as NARA’s Federal Electronic Records Modernization Initiative (FERMI) or acting on an internal initiative to create an OAIS, organizations need reliable ways to create consistent archives based around approved formats. Converting all records to an archival format such as PDF/A stands to deliver advantages beyond simple compliance considerations. PDF/A documents are easy to transmit via email and simple to search, saving man-hours that would otherwise be wasted. Furthermore, these files support compression that can reduce storage space needs and the associated costs.
Organizations should search for conversion methods that take as little time and manual effort as possible, so personnel don’t have to spend hours operating software tools. The ideal solution delivers maximum efficiency through batch processing and integration with other storage and data management systems, and can scale up when there is a need to handle large amounts of records, such as when an organization digitizes an entire archive of paper documents. Especially when dealing with physical papers converted to digital formats, optical character recognition (OCR) is essential to ensure the resulting content is internally text-searchable.
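To make the idea of batch processing concrete, here is a minimal Python sketch of a conversion loop. It is an illustration under stated assumptions, not a description of any particular product. The function names `plan_batch` and `run_batch` are hypothetical, and the `convert` argument stands in for whatever tool actually produces the PDF/A output (and performs OCR where needed). The sketch also assumes the archive folder sits outside the source folder.

```python
from pathlib import Path
from typing import Callable, List, Sequence, Tuple

# Hypothetical converter signature: takes a source path and a destination path.
Converter = Callable[[Path, Path], None]


def plan_batch(
    source_dir: Path,
    archive_dir: Path,
    suffixes: Sequence[str] = (".pdf", ".tif", ".tiff"),
) -> List[Tuple[Path, Path]]:
    """Pair every matching source file with a destination in the archive.

    The original file name is kept and ".pdf" is appended, so "a.tif" and
    "a.pdf" in the same folder never map to the same output file.
    """
    jobs = []
    for src in sorted(source_dir.rglob("*")):
        if src.is_file() and src.suffix.lower() in suffixes:
            rel = src.relative_to(source_dir)
            jobs.append((src, archive_dir / rel.with_name(rel.name + ".pdf")))
    return jobs


def run_batch(
    jobs: Sequence[Tuple[Path, Path]], convert: Converter
) -> List[Tuple[Path, str]]:
    """Run the converter on each job and return the failures.

    Existing outputs are skipped, so an interrupted batch can be rerun
    without redoing finished work.
    """
    failures = []
    for src, dst in jobs:
        if dst.exists():
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        try:
            convert(src, dst)
        except Exception as exc:  # record the error and keep the batch going
            failures.append((src, str(exc)))
    return failures
```

Collecting failures instead of stopping at the first error lets a large batch finish unattended, with only the problem files needing manual attention afterward.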
Reach out to Foxit to learn more about the best way to convert all your records into a consistent and future-proof PDF/A format.