[SCC_Active_Members] Software Archive Solution

Bob Fraley fraley at acm.org
Mon Apr 9 08:04:32 PDT 2007


I think that we have all of the elements required for a solution to 
the archive problem.  Here's my cut.  It is intended to serve as a 
basis for further discussion.

1.  Master Archive

The master archive is a collection of files stored on a RAID 
array.  It is directly accessible only by the museum staff.  Each 
artifact is a file, with an associated Info file.  (I want to 
distinguish this from the metadata regarding the artifact.  It may or 
may not include all of the metadata; I don't want to bias the design 
at this point to assume that it does.  The metadata might be 
submitted as a separate "artifact" file.)  There are no difference 
files or update files:  if a modification is required to an artifact, 
the complete modified file is submitted.  Once entered, a file is 
never removed.
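To make the "never removed" rule concrete, here is a minimal sketch of adding a file to the master archive so that an existing entry can never be overwritten. The function name `add_to_archive` and the flat directory layout are assumptions for illustration only, not part of the proposal.

```python
import shutil
from pathlib import Path


def add_to_archive(source: Path, archive_dir: Path, name: str) -> Path:
    """Copy `source` into `archive_dir` as `name`, refusing to overwrite.

    Hypothetical helper: the archive is modeled as a plain directory.
    """
    dest = archive_dir / name
    # Mode "xb" creates the file exclusively and raises FileExistsError
    # if an entry with this name is already present.
    with open(source, "rb") as src, open(dest, "xb") as dst:
        shutil.copyfileobj(src, dst)
    return dest
```

A modified artifact would be submitted under a new name (for example with a revision suffix), since the original entry stays in place.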

2.  MD5 checksums

An MD5 checksum is computed for each file.  These checksums are 
stored in a separate portion of the archive and also in the Info file.
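As a rough illustration, the checksum step could look like the sketch below. It uses Python's standard `hashlib`. The `MD5:` line in the Info file and the `checksum  filename` layout of the separate index are assumed formats, chosen only to make the example concrete.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksum(artifact: Path, info_file: Path, checksum_index: Path) -> str:
    """Store the artifact's checksum in both places named above.

    The line formats written here are assumptions, not a defined standard.
    """
    checksum = md5_of_file(artifact)
    with open(info_file, "a", encoding="utf-8") as info:
        info.write(f"MD5: {checksum}\n")
    with open(checksum_index, "a", encoding="utf-8") as index:
        index.write(f"{checksum}  {artifact.name}\n")
    return checksum
```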

3.  Disaster Recovery

Because the master archive is on dynamic disk media accessed through 
a computer, there is a disaster recovery site that contains a mirror.

4.  Backup

The archive is backed up on DVDs and/or magnetic tape.  (Perhaps the 
main site uses one medium, the recovery site uses another?)  Nightly 
incremental backups are made.  Due to issues with the longevity of 
the media, full backups are done periodically.  The archive can be 
reconstructed from the full backup plus increments.
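Reconstruction from a full backup plus increments can be sketched as follows. This assumes each backup has already been read off its medium into a directory, and that increments are applied oldest first. The function name `restore` and the directory-per-backup model are illustrative assumptions.

```python
import shutil
from pathlib import Path
from typing import Sequence


def restore(full_backup: Path, increments: Sequence[Path], target: Path) -> None:
    """Rebuild the archive in `target` from a full backup plus increments.

    `increments` must be ordered oldest first. Because archive files are
    never removed, each increment only adds files to the tree.
    """
    for layer in [full_backup, *increments]:
        # dirs_exist_ok=True lets each layer merge into the existing tree.
        shutil.copytree(layer, target, dirs_exist_ok=True)
```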


5.  Working Archive

There are one or more working archives.  When updates are made to the 
Master Archive, it will send the updates to the working archive.  The 
master archive knows how to access the working archives; the working 
archives have no means of accessing the master archive, so even if 
the machines are penetrated in an attack, the attacker cannot obtain 
credentials for accessing the master archive.

A possible basis for the working archive is Subversion.  Again, I'm 
using it to make things more concrete.

When files are added from the master archive, the Info file contains all of 
the information required to hook the new file into the Subversion 
file tree.  This includes such things as the artifact number, the 
file name within the artifact directory, the person creating the 
entry, and comments about the new entry or reasons for changing an 
existing file.  If metadata is stored as a separate file, the Info 
information indicates that this is a metadata file, and the data is 
installed in the system in the appropriate manner.
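To show what such an Info file might carry, here is a sketch that parses one and works out where the file belongs in the tree. The `key: value` syntax, the field names (`artifact`, `file`, `author`, `comment`, `metadata`), and the `artifacts/<number>/` layout are all assumptions made up for this example. The Info format has not been defined.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class InfoRecord:
    artifact_number: str
    file_name: str
    author: str
    comment: str
    is_metadata: bool = False


def parse_info(text: str) -> InfoRecord:
    """Parse a hypothetical 'key: value' Info file into a record."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip().lower()] = value.strip()
    return InfoRecord(
        artifact_number=fields["artifact"],
        file_name=fields["file"],
        author=fields["author"],
        comment=fields.get("comment", ""),
        is_metadata=fields.get("metadata", "no").lower() == "yes",
    )


def tree_path(record: InfoRecord) -> PurePosixPath:
    """Return the (assumed) location of the file in the working tree."""
    base = PurePosixPath("artifacts") / record.artifact_number
    if record.is_metadata:
        return base / "metadata" / record.file_name
    return base / record.file_name
```

The `author` and `comment` fields would supply the committer and log message when the file is committed to the working archive.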

6.  Protecting the Working Archive

The working archive may be available on the internet, but only to 
relatively few people, such as people who manage projects.  The only 
access to the working archive is read access.  Should there be any 
question about whether a file in the working archive has been 
compromised, the MD5 checksum can be accessed to verify it.  MD5 
checksums from the archive should be made available on a system 
separate from the working archive.
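The verification step could be as simple as the sketch below. It assumes the checksum list fetched from the separate system is plain text with one `checksum  filename` pair per line. That layout and the function names are illustrative assumptions.

```python
import hashlib
from pathlib import Path
from typing import Dict


def load_checksums(index_text: str) -> Dict[str, str]:
    """Parse 'checksum  filename' lines into a filename -> checksum map."""
    table = {}
    for line in index_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            checksum, name = parts
            table[name.strip()] = checksum.lower()
    return table


def is_intact(path: Path, expected_md5: str) -> bool:
    """Recompute the file's MD5 and compare it with the trusted value."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()
```

A file from the working archive passes only if `is_intact(path, table[path.name])` is true for the table obtained from the separate system.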

7.  Projects and Internet Access

Projects all work off of their own copies of the working 
archive.  Public internet access to the archive is through a project, 
thus giving control over the artifacts that are made 
public.  Software projects should also use Subversion (the same 
system being used for the working archive) so that a user can check 
files out of either the working archive or the project using the same 
technology (recommendation, not a requirement).

Bob




