Cornell University Library receives $275,000 in grants to help secure the future of digital documents
By Bill Steele
For about a decade now, librarians have been working to preserve deteriorating books, magazines and other documents by scanning and saving digital images of their pages as computer data. Meanwhile, the world continues to create new documents in digital form. The trouble is, those digital records may turn out to be even more fragile and short-lived than the old, brittle paper.
Two research projects under way at Cornell University seek to ensure that digital information is preserved. Cornell University Library has received a grant of $75,000 from the Council on Library and Information Resources to support development of "risk management" tools to help librarians decide how best to manage their digital data. It also has received a $200,000 grant from the federal Institute of Museum and Library Services (IMLS) to design and implement a plan for long-term preservation of the documents the library already has in digital form, which include nearly 3 million scanned pages.
"There are a lot of reports talking about technical obsolescence and preserving files, but few actual implementations," says Anne Kenney, associate director of Cornell Library's Department of Preservation and Conservation, who will direct the IMLS preservation project.
Both projects grow out of the problems of hardware and software obsolescence. In the fast-moving computer industry, replacement parts for the disk storage system purchased 10 years ago might no longer be available, or the company that made it might be out of business. Worse yet, the computer file formats in which images are stored could become obsolete. It's the computer equivalent of trying to play an 8-track music tape.
There are two ways to deal with obsolete file formats, says Gregory Lawrence, government information librarian in Cornell's Mann Library, who directs the risk management project. One is "emulation," using software that can read old formats, or that will allow old programs to run on newer systems. The other is "migration," in which documents are copied from older formats to newer ones as hardware and software evolve.
Cornell has chosen a strategy based on migration, Kenney says. A key element in this strategy is to add "metadata" to each file. Metadata is "data about data," that is, information about what's in the file, where it came from and what format it's in. Examples are: "These files were created in 1996 on XDOD scanners at 600 dots per inch resolution, one-bit tonal range," or management data like "This belongs to Cornell University." The library also will develop a method of naming and locating files that will endure over changes in the format and location.
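As a rough illustration of what such "data about data" could look like in practice, the sketch below stores the details from Kenney's example in a small record and writes it out as text that could travel alongside an image file. The field names, the identifier value, the "TIFF" format entry and the JSON layout are assumptions made for this illustration, not the library's actual metadata scheme.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class FileMetadata:
    """Hypothetical metadata record; field names are illustrative only."""
    identifier: str       # assumed persistent name, independent of file location
    created_year: int     # when the scan was made
    scanner: str          # capture device
    resolution_dpi: int   # dots per inch
    bit_depth: int        # 1 = one-bit tonal range
    file_format: str      # assumed format label
    owner: str            # management data


def to_sidecar(meta: FileMetadata) -> str:
    """Serialize the record as JSON text to be kept with the image file."""
    return json.dumps(asdict(meta), indent=2, sort_keys=True)


# Values taken from the example in the article, except where noted above.
record = FileMetadata(
    identifier="example-0001",
    created_year=1996,
    scanner="XDOD",
    resolution_dpi=600,
    bit_depth=1,
    file_format="TIFF",
    owner="Cornell University",
)

print(to_sidecar(record))
```

Because the record is plain text, it can be copied forward with the file at each migration and read without the software that created the image.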
The project also follows the rule that "use begets preservation." This means, says Kenney, that the master copy of any document will be the one people actually use to get at the information. That way, if the hardware or software used to maintain that document develops problems, user alerts will prompt an immediate response.
But migration has hazards, and evaluating them is the purpose of the risk management project. It will look at what data might be lost when files are copied from one format to another, particularly in spreadsheets and TIFF-format images. For example, Lawrence says, different versions of the Lotus spreadsheet program might have different ways of storing large numbers, so that when a spreadsheet is copied from one version to another the value of some cells might change in small but important ways.
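The kind of silent change Lawrence describes can be sketched in a few lines. The example below does not model any real Lotus file format. It assumes, purely for illustration, that the target format stores numbers as 32-bit floating-point values while the source used 64-bit values, and it reports which cells come back different after the copy. The cell names and values are made up.

```python
import struct


def migrate_to_single_precision(value: float) -> float:
    """Simulate a hypothetical target format that keeps only 32-bit floats."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def find_changed_cells(cells: dict) -> dict:
    """Return {cell: (original, migrated)} for every cell whose value changed."""
    changed = {}
    for name, original in cells.items():
        migrated = migrate_to_single_precision(original)
        if migrated != original:
            changed[name] = (original, migrated)
    return changed


# Made-up spreadsheet cells: a small exact value, a large number, a fraction.
cells = {"A1": 1.5, "A2": 123456789.0, "A3": 0.1}
changed = find_changed_cells(cells)

for name, (before, after) in sorted(changed.items()):
    print(f"{name}: {before!r} -> {after!r}")
```

Under this assumption the large number in A2 comes back as 123456792.0, a difference of 3 that no error message would flag. A check of this sort, run before and after a migration, is one way to turn "small but important" changes into something a librarian can see.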
Lawrence has so far found that nothing is lost in copying image files, but that there are differences in the programs people will use to read the images. Image files contain a number of "tags" with information about the images, and not all image-reading programs read all tags, he explains. Just which tags are really important is yet to be determined, he adds.
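One simple way to picture the tag problem is to compare the tags present in a file against the tags each reading program understands. In the sketch below, the tag numbers and names are standard TIFF tags, but the two "readers" and the sets of tags they support are invented for illustration and do not describe any real program.

```python
# Tags present in a hypothetical scanned-page file (standard TIFF tag numbers).
FILE_TAGS = {
    256: "ImageWidth",
    257: "ImageLength",
    258: "BitsPerSample",
    259: "Compression",
    270: "ImageDescription",
    282: "XResolution",
    283: "YResolution",
    296: "ResolutionUnit",
    306: "DateTime",
}

# Invented readers: the tag numbers each one is assumed to understand.
READERS = {
    "reader_a": {256, 257, 258, 259, 282, 283, 296},
    "reader_b": {256, 257, 258, 259},
}


def ignored_tags(file_tags: dict, supported: set) -> dict:
    """Return the tags in the file that a reader would silently skip."""
    return {tag: name for tag, name in file_tags.items() if tag not in supported}


# For each reader, list the names of the tags it would not read.
report = {
    reader: sorted(ignored_tags(FILE_TAGS, supported).values())
    for reader, supported in READERS.items()
}

for reader, names in report.items():
    print(reader, "ignores:", ", ".join(names))
```

Here the second reader would display the page but lose the resolution information, which matters if the image is later printed or rescaled. Deciding which of the skipped tags count as a real loss is the open question Lawrence mentions.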
The study, Lawrence points out, applies only to currently available software and can't predict what problems might arise in the future. "It's not going to be the definitive statement on TIFFs and spreadsheets," he says, "but it's going to be a methodology that people can use to evaluate risks. Cornell University Library believes some form of risk management must replace 'heroic rescue' as a means of preserving digital information."