Economist article/Sept 22 Technology Quarterly
Review
SAFE KEEPING
Digital archival repositories, swapping data Napster-style among
themselves, could ensure that today's records are kept up-to-date
and saved for future generations
THE sands of time may have left intact the stone-chiselled Egyptian
hieroglyphics from 2000BC, but a portion of the original census
reports of the United States of America for as recent a year as
1960--recorded on UNIVAC type II-A tapes--is now lost forever. Every
day, important parts of the world's intellectual record vanish: recording
systems and media fail, recording formats become obsolete, and publishers
that own the material go out of business. Digital history is also
rewritten, and digital records burned, as political regimes come and go.
Hence all the effort now going into designing digital archival
repositories (DARs) as a way of protecting digital information from
corruption or destruction. The idea is to have a widely distributed
network of independent repositories, connected via the Internet,
that can make copies of each digital object stored in one another's
archive and then spread them around to ensure that they are preserved.
To see how expensive this would be and to resolve some of the uncertainties
associated with preserving digital documents for centuries, Brian
Cooper, Arturo Crespo and Hector Garcia-Molina at Stanford University
are building the grand-daddy of all digital warehouses, the Stanford
Archival Vault (SAV). In SAV, each digital object is assigned a
numerical "handle" when added to a repository. A key property
of the handle is that it is computed as a function of the bits of
information in the object. Using this property, each object can
be tracked in the network of repositories, since each replica of
the object will have the same signature, and therefore the same
handle. By design, deletions are simply not allowed, so digital
objects are saved from ever being "burned" even if they
fall out of favour with society.
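
The handle scheme can be sketched in a few lines of Python. This is only an illustration, not SAV's actual code: the choice of SHA-256 as the function of the bits, and the names make_handle and Repository, are assumptions made for the example.

```python
import hashlib
from typing import Dict, Set


def make_handle(data: bytes) -> str:
    """Derive a handle from the object's bits (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


class Repository:
    """Hypothetical append-only store: objects can be added, never deleted."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        handle = make_handle(data)
        # Adding identical bits again changes nothing: same bits, same handle.
        self._objects.setdefault(handle, data)
        return handle

    def get(self, handle: str) -> bytes:
        return self._objects[handle]

    def handles(self) -> Set[str]:
        return set(self._objects)


# Two independent repositories assign the same handle to the same object.
stanford = Repository()
tokyo = Repository()
same_handle = stanford.add(b"1960 census") == tokyo.add(b"1960 census")
```

Because the handle depends only on the bits, no central registry is needed to recognise two replicas as the same object. The class deliberately offers no method for deletion.
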
Based on these properties, SAV offers what programmers call "application
layers" that allow them to write software to help operate an
archive. SAV also has a "view layer" which lets users
define additional ways of looking at the DAR's underlying data.
If necessary, these so-called "auxiliary structures" can
also be stored in the SAV or simply deleted when no longer needed.
Another SAV feature is its "reliability layer", which
ensures that the various mirror sites that store replicas of the
data (say, the Library of Congress, Stanford Digital Library or
Tokyo National Library) are complete and up-to-date.
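
The article does not say how the reliability layer does its job. One plausible approach, sketched below under that caveat, is for sites to compare handles and copy across whatever is missing or damaged. The dictionary layout, the use of SHA-256 and the function names are all assumptions.

```python
import hashlib
from typing import Dict, List

Archive = Dict[str, bytes]  # handle -> object bits (hypothetical layout)


def handle_of(data: bytes) -> str:
    """Handle computed from the bits; SHA-256 is an assumption."""
    return hashlib.sha256(data).hexdigest()


def synchronise(sites: List[Archive]) -> int:
    """Bring every site up to date; return the number of copies made."""
    union: Archive = {}
    for site in sites:
        for handle, data in site.items():
            # Trust a replica only if its bits still match its handle.
            if handle_of(data) == handle:
                union.setdefault(handle, data)
    copies = 0
    for site in sites:
        for handle, data in union.items():
            if site.get(handle) != data:
                # Fills gaps and overwrites corrupted replicas.
                site[handle] = data
                copies += 1
    return copies


moon = b"moon landing"
song = b"yellow submarine"
site_a = {handle_of(moon): moon}
site_b = {handle_of(song): song, handle_of(moon): b"corrupted bits"}
copies_made = synchronise([site_a, site_b])
```

In this run one site lacks an object and the other holds a damaged replica. After synchronising, both hold identical, verified copies, and a second pass has nothing left to do.
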
While DARs are getting a good deal of attention, they are being
used mostly for "data preservation" rather than retrieval
and active research. This is analogous to preserving the stone-chiselled
hieroglyphics on Egyptian obelisks in the British Museum. However,
no Rosetta Stone is yet being constructed as a means for deciphering
the data. Because of the linguistic issues involved, such "semantic
preservation" is tough enough even if users know whether the
data were written in ASCII, UTF-8, EBCDIC or some other character
code used for encoding text.
Perhaps the closest people have come to devising a Rosetta Stone
for the digital world is XML (extensible markup language), which
marks the data with tags that define the content in an agreed way
and in a form that can be read easily by human beings. If the digital
bits are preserved in SAV, and if their descriptive tags do not
lose their meaning over time, digital pictures of man's first landing
on the moon, records of the horrors of the second world war, and
MP3s of "Yellow Submarine" could be preserved for future
generations. How much people then will want to hear or see such
things is another matter.
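
As a small illustration of descriptive tagging, the sketch below uses Python's standard xml.etree.ElementTree module to wrap a record in XML and read it back. The tag names (recording, title, format) are invented for the example and do not come from any agreed schema.

```python
import xml.etree.ElementTree as ET

# Build a record whose tags say what each piece of content is.
record = ET.Element("recording")
ET.SubElement(record, "title").text = "Yellow Submarine"
ET.SubElement(record, "format").text = "MP3"

# Serialise to text that a person can read without special tools.
xml_text = ET.tostring(record, encoding="unicode")

# A later reader can recover the description from the tags alone.
parsed = ET.fromstring(xml_text)
description = {child.tag: child.text for child in parsed}
```

The tags travel with the data, but they help only as long as readers still agree on what "title" and "format" mean.
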