[SCC_Active_Members] Software Archive Solution

Mon Apr 9 11:08:29 PDT 2007

I agree with all the stuff Larry Masinter has talked about. He has 
clearly looked into these issues thoroughly.  The OAIS document talks 
about most of the problems I can see that would need to be addressed 
in a software-related museum archive.

The main requirement I see missing from your list below has to do 
with the Long Term aspects mentioned in that report, namely 
infrastructure, standards, and encoding drift.  Suppose that the CHM 
had been operating since 1960 and had archived all software-related 
stuff from 1945 until 1965. Would I be able to look at it today - 
since only IBM (and Fujitsu and Siemens?) mainframes still use the 
data encodings in use then? Would the archive be on one of those 
machines or other circa 1960 machines? And if I can look at the 
materials on a 2007 machine, how would you have gotten from where you 
were in 1965 to this state today? Was that part of the requirements 
back then, or was it left for future archivists to deal with? And if 
the materials had been translated from, say, SIXBIT to one of the 
pre-ANSI ASCIIs to ASCII or Unicode, how would we/I know it was done 
accurately or in a way that preserves the information a historian 
might need? As someone else asked: how would it be verified?
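
One way to make the verification question concrete is a round-trip check: translate the old encoding to the new one, translate it back, and confirm you get the original codes. The sketch below assumes the DEC flavor of SIXBIT, in which code n stands for the ASCII character n + 32; other SIXBIT variants differ, and the function names are made up for illustration. A clean round trip only shows that the mapping lost nothing. It does not show that the result preserves what a historian might need.

```python
def sixbit_to_text(codes):
    """Decode DEC-style SIXBIT codes (0..63) to text: code + 32 is the ASCII value."""
    for c in codes:
        if not 0 <= c <= 63:
            raise ValueError(f"not a SIXBIT code: {c}")
    return "".join(chr(c + 32) for c in codes)


def text_to_sixbit(text):
    """Encode text back to SIXBIT codes; fail loudly on characters SIXBIT cannot hold."""
    codes = []
    for ch in text:
        value = ord(ch) - 32
        if not 0 <= value <= 63:
            raise ValueError(f"character {ch!r} has no SIXBIT code")
        codes.append(value)
    return codes


def verify_transcoding(original_codes, transcoded_text):
    """Return True only if the transcoded text maps back to exactly the original codes."""
    try:
        return text_to_sixbit(transcoded_text) == list(original_codes)
    except ValueError:
        return False
```

Keeping the original codes alongside the translation is what makes a check like this possible later on.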

The Stanford AI Lab computer, SAIL, was decommissioned not much more 
than 10 years ago, maybe 15. I used that machine heavily from 1975 
through 1985 and some beyond that. A couple of SAIL hackers spent a 
couple of years preparing archives of all the files ever stored on 
that machine from the backup tapes. SAIL used a pre-ASCII character 
set (encoded in 9 bits per character), and I cannot really read a lot 
of the stuff that I had on that machine in the archive because the 
mathematical characters, among others, don't show up in the transcoding 
these experts chose (which was an HTML encoding). This is the problem you 
could face.

			-rpg-

At 8:04 -0700 4/9/07, Bob Fraley wrote:
>I think that we have all of the elements required for a solution to 
>the archive problem.  Here's my cut.  It is intended to serve as a 
>basis for further discussion.
>
>1.  Master Archive
>
>The master archive is a collection of files stored on a RAID array. 
>It is directly accessible only by the museum staff.  Each artifact 
>is a file, with an associated Info file.  (I want to distinguish 
>this from the metadata regarding the artifact.  It may or may not 
>include all of the metadata; I don't want to bias the design at this 
>point to assume that it does.  The metadata might be submitted as a 
>separate "artifact" file.)  There are no difference files or update 
>files:  if a modification is required to an artifact, the complete 
>modified file is submitted.  Once entered, a file is never removed.
>
>2.  MD5 checksums
>
>An MD5 checksum is computed for each file.  These checksums are 
>stored in a separate portion of the archive and also in the Info 
>file.
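
A minimal sketch of the checksum step, assuming the archive is just a directory tree of files. The function names and the manifest layout (relative path mapped to MD5 hex digest) are illustrative assumptions, not part of the proposal.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_dir):
    """Map each file's path (relative to the archive root) to its MD5 digest."""
    root = Path(archive_dir)
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_mismatches(archive_dir, manifest):
    """List manifest entries whose file is missing or no longer matches its digest."""
    root = Path(archive_dir)
    bad = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file() or md5_of_file(path) != expected:
            bad.append(name)
    return bad
```

The manifest would be stored apart from the files it describes, so that a damaged or altered file cannot also alter its recorded checksum.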
>
>3.  Disaster Recovery
>
>Because the master archive is on dynamic disk media accessed through 
>a computer, there is a disaster recovery site that contains a mirror.
>
>4.  Backup
>
>The archive is backed up on DVDs and/or magnetic tape.  (Perhaps 
>the main site uses one medium, the recovery site uses another?) 
>Nightly incremental backups are made.  Due to issues with the 
>longevity of the media, full backups are done periodically.  The 
>archive can be reconstructed from the full backup plus increments.
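
Reconstruction from a full backup plus increments can be sketched as follows. This is a toy model under stated assumptions: a backup is a dict of file name to contents, each increment carries a sequence number, and a later increment wins when the same name appears twice. None of these details come from the proposal.

```python
def reconstruct(full_backup, increments):
    """Rebuild the archive from a full backup plus incremental backups.

    full_backup: dict mapping file name -> contents
    increments:  iterable of (sequence_number, dict of file name -> contents)
    Increments are applied in ascending sequence order, so later ones win.
    """
    archive = dict(full_backup)  # copy, so the full backup itself is untouched
    for _seq, files in sorted(increments, key=lambda item: item[0]):
        archive.update(files)
    return archive
```

Since files are never removed from the master archive, an increment only ever adds or replaces entries, which is why a plain update is enough here.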
>
>
>5.  Working Archive
>
>There are one or more working archives.  When updates are made to 
>the Master Archive, it will send the updates to the working archive. 
>The master archive knows how to access the working archives; the 
>working archives have no means of accessing the master archive, so 
>even if the machines are penetrated in an attack, the attackers cannot 
>obtain credentials for accessing the master archive.
>
>A possible basis for the working archive is Subversion.  Again, I'm 
>using it to make things more concrete.
>
>When files are added from the archive, the Info file contains all of 
>the information required to hook the new file into the Subversion 
>file tree.  This includes such things as the artifact number, the 
>file name within the artifact directory, the person creating the 
>entry, and comments about the new entry or reasons for changing an 
>existing file.  If metadata is stored as a separate file, the Info 
>information indicates that this is a metadata file, and the data is 
>installed in the system in the appropriate manner.
>
>6.  Protecting the Working Archive.
>
>The working archive may be available on the internet, but only to 
>relatively few people, such as people who manage projects.  The only 
>access to the working archive is read access.  Should there be any 
>question about whether a file in the working archive has been 
>compromised, the MD5 checksum can be accessed to verify it.  MD5 
>checksums from the archive should be made available on a separate 
>system from the working archive.
>
>7.  Projects and Internet Access
>
>Projects all work off of their own copies of the working archive. 
>Public internet access to the archive is through a project, thus 
>giving control over the artifacts that are made public.  Software 
>projects should also use Subversion (the same system being used for 
>the working archive) so that a user can check files out of either 
>the working archive or the project using the same technology 
>(recommendation, not a requirement).
>
>Bob
>
>
>_______________________________________________
>SCC_active mailing list
>SCC_active at computerhistory.org
>http://mail.computerhistory.org/mailman/listinfo/scc_active