[SCC_Active_Members] Software Archive Solution

Tue Apr 10 23:35:10 PDT 2007

There is a difference between my proposal and the examples given 
here.  In the examples some media has been written and left 
unattended for years.  In my proposal, the master archive is active, 
kept in an operating computer.  So if storage technology changes, the 
archive will need to be copied to the new media.  Of course, if the 
archive were ever to be taken off line, the backup media would become 
the archive.  By doing periodic full backups, the technology for 
storing the backups will be recent technology.  The older artifacts 
will be available on recording media that is contemporary with this 
event, rather than media contemporary with the artifact itself.

I am not trying to solve all aspects of an archive, but create a 
structure into which future work can be added.  So let me address the 
other issues that you raise as a possible approach that goes beyond 
my proposal.  I have artifacts stored as directories, rather than 
single files.  So, for example, a 1620 card deck might have a binary 
file containing the hole images, an ASCII translation, and a Unicode 
translation.  There might be multiple interpretations of the 
character set in a single format, if that's appropriate.
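
To make the directory idea concrete, here is a minimal sketch in Python.  Everything named in it is my assumption for illustration: the file names (holes.bin and the two translation files), the 12-bit column packing, and the code table, which is deliberately partial (blank and the digits only, each digit being a single punch in its own row).  It is not a proposed standard, just one way the layout could look.

```python
from pathlib import Path

# Row labels of a punched card column, top to bottom.
ROWS = [12, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


def column_bits(punched_rows):
    """Pack the punched rows of one column into a 12-bit integer."""
    bits = 0
    for row in punched_rows:
        bits |= 1 << (11 - ROWS.index(row))
    return bits


# Partial, illustrative code table (an assumption): blank and digits only.
# Any other hole pattern is rendered as "?" in the translations.
CODE_TABLE = {column_bits([]): " "}
CODE_TABLE.update({column_bits([n]): str(n) for n in range(10)})


def write_artifact(root, artifact_id, cards):
    """Write one artifact directory holding the core data and translations.

    cards is a list of cards; each card is a list of 12-bit column integers.
    The file names used here are hypothetical.
    """
    artifact_dir = Path(root) / artifact_id
    artifact_dir.mkdir(parents=True)

    # Core data: every column stored as two big-endian bytes.
    raw = b"".join(col.to_bytes(2, "big") for card in cards for col in card)
    (artifact_dir / "holes.bin").write_bytes(raw)

    # Derived data: one line of text per card.
    text = "\n".join(
        "".join(CODE_TABLE.get(col, "?") for col in card) for card in cards
    )
    (artifact_dir / "translation.ascii.txt").write_text(text, encoding="ascii")
    (artifact_dir / "translation.utf8.txt").write_text(text, encoding="utf-8")
    return artifact_dir
```

A second interpretation of the character set would simply be another translation file in the same directory, produced from the same holes.bin with a different code table.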

When new file formats are created, an automatic process could go 
through the archive to generate the new file formats.  Similarly, 
audio or graphics files could be converted to new formats while the 
conversion programs are available.  The archives that you reference 
were allowed to sit idle until the formats were no longer in use, and 
any conversion programs that may have existed were no longer 
available.  Clearly, the curators will need to remain alert to apply 
the appropriate translations in a timely manner.  Formats that are no 
longer in use can be removed from the archive, but the core data (in 
the 1620 case the binary record of all holes in the card) should 
never be removed.
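
As a rough sketch of what such an automatic process might look like (in Python; the core file name holes.bin, the one-directory-per-artifact layout, and both function names are my assumptions, not part of the proposal):

```python
from pathlib import Path

CORE_NAME = "holes.bin"  # hypothetical name of the core record in each artifact


def artifact_dirs(archive_root):
    """Return the artifact directories directly under the archive root."""
    return sorted(p for p in Path(archive_root).iterdir() if p.is_dir())


def add_format(archive_root, new_name, convert):
    """Generate new_name in every artifact directory that lacks it.

    convert is supplied by the curator: core bytes in, derived bytes out.
    """
    created = []
    for artifact_dir in artifact_dirs(archive_root):
        core = artifact_dir / CORE_NAME
        target = artifact_dir / new_name
        if core.exists() and not target.exists():
            target.write_bytes(convert(core.read_bytes()))
            created.append(target)
    return created


def retire_format(archive_root, old_name):
    """Remove a derived format that is no longer in use, never the core data."""
    if old_name == CORE_NAME:
        raise ValueError("the core record must never be removed")
    removed = []
    for artifact_dir in artifact_dirs(archive_root):
        target = artifact_dir / old_name
        if target.exists():
            target.unlink()
            removed.append(target)
    return removed
```

The point of the sketch is that new formats are always derived from the core record rather than from an earlier translation, so an error in one translation is not carried into the next.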

Bob

At 11:08 AM 4/9/2007, Richard P. Gabriel wrote:
>I agree with all the stuff Larry Masinter has  talked about. He has 
>clearly looked into these issues thoroughly.  The OAIS document 
>talks about most of the problems I can see that would need to be 
>addressed in a software-related museum archive.
>
>The main requirement I see missing from your list below has to 
>do with the Long Term aspects mentioned in that report. Namely, 
>infrastructure, standards, and encoding drift.  Suppose that the CHM 
>had been operating since 1960 and had archived all software-related 
>stuff from 1945 until 1965. Would I be able to look at it today - 
>since only IBM (and Fujitsu and Siemens?) mainframes still use the 
>data encodings in use then? Would the archive be on one of those 
>machines or other circa 1960 machines? And if I can look at the 
>materials on a 2007 machine, how would you have gotten from where 
>you were in 1965 to this state today? Was that part of the 
>requirements back then, or was it left for future archivists to deal 
>with? And if the materials had been translated from, say, SIXBIT to 
>one of the pre-ANSI ASCIIs to ASCII or Unicode, how would we/I know 
>it was done accurately or in a way that preserves the information a 
>historian might need? As someone else asked: how would it be verified?
>
>The Stanford AI Lab computer, SAIL, was decommissioned not much more 
>than 10 years ago, maybe 15. I used that machine heavily from 1975 
>through 1985 and some beyond that. A couple of SAIL hackers spent a 
>couple of years preparing archives of all the files ever stored on 
>that machine from the backup tapes. SAIL used a pre-ascii character 
>set (encoded in 9 bits per character), and I cannot really read a 
>lot of the stuff that I had on that machine in the archive because 
>the mathematical characters don't show up etc in the transcoding 
>these experts chose (which was an html encoding). This is the 
>problem you could face.
>
>                         -rpg-
>
>At 8:04 -0700 4/9/07, Bob Fraley wrote:
>>I think that we have all of the elements required for a solution to 
>>the archive problem.  Here's my cut.  It is intended to serve as a 
>>basis for further discussion.
>>
>>1.  Master Archive
>>
>>The master archive is a collection of files stored on a RAID array. 
>>It is directly accessible only by the museum staff.  Each artifact 
>>is a file, with an associated Info file.  (I want to distinguish 
>>this from the metadata regarding the artifact.  It may or may not 
>>include all of the metadata; I don't want to bias the design at 
>>this point to assume that it does.  The metadata might be submitted 
>>as a separate "artifact" file.)  There are no difference files or 
>>update files:  if a modification is required to an artifact, the 
>>complete modified file is submitted.  Once entered, a file is never removed.
>>
>>2.  MD5 checksums
>>
>>An MD5 checksum is computed for each file.  These checksums are 
>>stored in a separate portion of the archive and also in the Info file.
>>
>>3.  Disaster Recovery
>>
>>Because the master archive is on dynamic disk media accessed 
>>through a computer, there is a disaster recovery site that contains a mirror.
>>
>>4.  Backup
>>
>>The archive is backed up on DVDs and/or magnetic tape.  (Perhaps 
>>the main site uses one medium, the recovery site uses another?) 
>>Nightly incremental backups are made.  Due to issues with the 
>>longevity of the media, full backups are done periodically.  The 
>>archive can be reconstructed from the full backup plus increments.
>>
>>
>>5.  Working Archive
>>
>>There are one or more working archives.  When updates are made to 
>>the Master Archive, it will send the updates to the working 
>>archive. The master archive knows how to access the working 
>>archives; the working archives have no means of accessing the 
>>master archive, so even if the machines are penetrated in an 
>>attack, the attacker cannot obtain credentials for accessing the master archive.
>>
>>A possible basis for the working archive is Subversion.  Again, I'm 
>>using it to make things more concrete.
>>
>>When files are added from the archive, the Info file contains all 
>>of the information required to hook the new file into the 
>>Subversion file tree.  This includes such things as the artifact 
>>number, the file name within the artifact directory, the person 
>>creating the entry, and comments about the new entry or reasons for 
>>changing an existing file.  If metadata is stored as a separate 
>>file, the Info information indicates that this is a metadata file, 
>>and the data is installed in the system in the appropriate manner.
>>
>>6.  Protecting the Working Archive.
>>
>>The working archive may be available on the internet, but only to 
>>relatively few people, such as people who manage projects.  The 
>>only access to the working archive is read access.  Should there be 
>>any question about whether a file in the working archive has been 
>>compromised, the MD5 checksum can be accessed to verify it.  MD5 
>>checksums from the archive should be made available on a system 
>>separate from the working archive.
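
A minimal sketch of that check, in Python.  The function names are mine, and how the recorded checksum is fetched from the separate system is left out, since nothing above specifies it; the sketch assumes it arrives as a hexadecimal string.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, recorded_md5):
    """Return True if the file matches the checksum recorded for it."""
    return md5_of_file(path) == recorded_md5.strip().lower()
```

Reading in chunks keeps memory use flat for large artifacts.  The same loop works unchanged with hashlib.sha256 if a stronger digest is wanted.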
>>
>>7.  Projects and Internet Access
>>
>>Projects all work off of their own copies of the working archive. 
>>Public internet access to the archive is through a project, thus 
>>giving control over the artifacts that are made public.  Software 
>>projects should also use Subversion (the same system being used for 
>>the working archive) so that a user can check files out of either 
>>the working archive or the project using the same technology 
>>(recommendation, not a requirement).
>>
>>Bob
>>
>>
>>_______________________________________________
>>SCC_active mailing list
>>SCC_active at computerhistory.org
>>http://mail.computerhistory.org/mailman/listinfo/scc_active
>