July 27, 1999

Cornell computer scientists, librarians collaborate on system to manage digital collections

Cornell's computer scientists and librarians will form an unusual partnership to develop better ways to manage and ensure the integrity of documents and other data in the digital library of the future. The group has received a $2.2 million, four-year grant from the National Science Foundation, the National Endowment for the Humanities and other agencies to develop a working prototype digital library system with built-in mechanisms to preserve documents, protect intellectual property rights and permit interconnections with other digital library systems worldwide.

While several studies have been done to determine the future needs of digital library systems, this may be the first effort to build a working system that can enforce a wide range of security and preservation policies to protect valuable resources in a globally distributed digital library environment. The project will build on research done over the past several years by the Digital Library Research Group in the Cornell Department of Computer Science. Principal investigators are Carl Lagoze, digital library scientist in the computer science department; Sarah Thomas, the Carl A. Kroch University Librarian at Cornell; and computer science professors Ken Birman, an expert on distributed systems, and Fred Schneider, a specialist in computer security.

Anne Kenney, associate director of Cornell University Library's Department of Preservation and Conservation, will head the library side of the project. Others involved are Sandy Payette, a researcher in the Digital Library Research Group of the computer science department and technical leader for the project; Oya Rieger, coordinator of the library's Digital Imaging and Preservation Research Unit, who will define the library's policy requirements; and Geri Gay, associate professor of communication and director of the Human-Computer Interaction Laboratory in the Department of Communication, who will direct an evaluation of the results.

The ability of Cornell's librarians and computer scientists to collaborate on the project was a key factor in obtaining the grant, Thomas said. "We've worked really hard [together] over the last two years," she said. "The library will serve as a test bed and a real-life check on what the computer scientists are developing."

"I'm particularly pleased that this is a collaborative effort," Kenney said, noting that in the past computer scientists have often come up with "elegant solutions" that weren't practical, while librarians have designed systems that met their needs but were technically difficult to put into practice.

The problems are summed up in the acronym Lagoze has coined for the project, PRISM, which stands for preservation, reliability, interoperability, security and metadata. As more and more documents are being stored in digital form, concern has grown over their preservation, largely because rapidly changing technology can make today's digital documents unreadable by tomorrow's computers. But librarians also worry about the problems involved in distributing digital data over the Internet and other networks. They must deal not only with dozens of different systems for storing and reading data, but also with the differing policies of library systems and with the protection of intellectual property rights.

An important part of the solution lies in metadata -- literally, data about data. In the terms used by computer scientists, each body of information is an "object," and each object will have metadata attached to it that identifies the way it was created, what system it is stored on, who owns it and what privileges the owner has granted. According to Payette, the future digital library system will include programs that read the metadata and act on the information it contains. For example, one approach to preservation of documents might be programs that periodically scan library objects to see if they are in any kind of danger and automatically launch other programs that take corrective action, which might include copying the data to a new format.
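
The idea of metadata-driven preservation can be illustrated with a minimal sketch in Python. It is not the project's design or the CRADDL interface: the class names, the field names, the `data_format` field, the list of at-risk formats and the `migrate` step are all hypothetical, chosen only to mirror the description above (how an object was created, where it is stored, who owns it, what privileges have been granted, and a scan that triggers corrective action).

```python
from dataclasses import dataclass

# Hypothetical formats treated as "in danger" of becoming unreadable.
AT_RISK_FORMATS = frozenset({"legacy-wp", "old-image"})


@dataclass
class Metadata:
    creation_method: str   # the way the object was created
    storage_system: str    # what system it is stored on
    owner: str             # who owns it
    data_format: str       # assumed field: current storage format
    granted_privileges: frozenset = frozenset()  # privileges the owner granted


@dataclass
class LibraryObject:
    identifier: str
    metadata: Metadata


def migrate(obj: LibraryObject, target_format: str) -> None:
    """Stand-in corrective action: record that the data now has a new format.

    A real system would copy and convert the underlying data as well.
    """
    obj.metadata.data_format = target_format


def scan(objects, at_risk=AT_RISK_FORMATS, target_format="current-standard"):
    """Check each object's metadata and migrate any that are at risk.

    Returns the identifiers of the objects that were migrated.
    """
    migrated = []
    for obj in objects:
        if obj.metadata.data_format in at_risk:
            migrate(obj, target_format)
            migrated.append(obj.identifier)
    return migrated
```

In this sketch the scanning program never inspects the document itself; it decides what to do from the metadata alone, which is the point of attaching metadata to every object. A scheduler could call `scan` periodically over a collection.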

The project will begin with careful planning about what should be included in the metadata, Payette said. Actual programs will be built on a prototype system already developed at Cornell called the Cornell Reference Architecture for Distributed Digital Libraries, or CRADDL (pronounced "cradle"). The system will be tested on several digital collections already in existence at Cornell and on several other collections located around the world, representing a variety of different data formats.

The grant to Cornell is part of the Digital Libraries Initiative, a joint project of the National Science Foundation, the Defense Advanced Research Projects Agency, the National Library of Medicine, the Library of Congress, NASA, the National Endowment for the Humanities, the National Archives and Records Administration and the Smithsonian Institution.