[SCC_Active_Members] Software Archive Solution
Bob Fraley
fraley at acm.org
Mon Apr 9 08:04:32 PDT 2007
I think that we have all of the elements required for a solution to
the archive problem. Here's my cut. It is intended to serve as a
basis for further discussion.
1. Master Archive
The master archive is a collection of files stored on a RAID
array. It is directly accessible only by the museum staff. Each
artifact is a file, with an associated Info file. (I want to
distinguish this from the metadata regarding the artifact. It may or
may not include all of the metadata; I don't want to bias the design
at this point to assume that it does. The metadata might be
submitted as a separate "artifact" file.) There are no difference
files or update files: if a modification is required to an artifact,
the complete modified file is submitted. Once entered, a file is
never removed.
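
As a rough illustration of the "complete file in, never removed" rule, here is a minimal Python sketch. The function name add_artifact and the flat directory layout are assumptions made for the example, not part of the proposal; a modified artifact would be submitted under a new name rather than replacing the old file.

```python
from pathlib import Path


def add_artifact(archive_root: Path, name: str, data: bytes) -> Path:
    """Store a complete file in the archive, refusing to replace an entry.

    Hypothetical helper: the proposal does not specify how files are named
    or laid out on the RAID array.
    """
    target = archive_root / name
    # Mode "xb" raises FileExistsError if the file already exists, so an
    # entry can be added but never silently overwritten.
    with target.open("xb") as handle:
        handle.write(data)
    return target
```
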
2. MD5 checksums
An MD5 checksum is computed for each file. These checksums are
stored in a separate portion of the archive and also in the Info file.
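
Computing the checksum could look something like the following Python sketch, which uses the standard hashlib module. The function name md5_of_file and the chunk size are assumptions for illustration; reading in chunks simply avoids loading a large artifact into memory at once.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting string is what would be recorded both in the Info file and in the separate checksum area.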
3. Disaster Recovery
Because the master archive is on dynamic disk media accessed through
a computer, there is a disaster recovery site that contains a mirror.
4. Backup
The archive is backed up on DVDs and/or magnetic tape. (Perhaps the
main site uses one medium, the recovery site uses another?) Nightly
incremental backups are made. Due to issues with the longevity of
the media, full backups are done periodically. The archive can be
reconstructed from the full backup plus increments.
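
To make the reconstruction step concrete, here is a minimal Python sketch. It models each backup as a dictionary from file name to contents, which is purely an assumption for the example; the function name reconstruct is also hypothetical. Because files are never removed or changed once entered, an increment should only ever add new names, so the sketch treats a conflicting entry as an error.

```python
def reconstruct(
    full_backup: dict[str, bytes],
    increments: list[dict[str, bytes]],
) -> dict[str, bytes]:
    """Rebuild the archive from a full backup plus increments, oldest first."""
    archive = dict(full_backup)  # copy so the full backup is left untouched
    for increment in increments:
        for name, data in increment.items():
            # Entries are write-once, so differing contents for the same
            # name indicate a damaged or mismatched backup set.
            if name in archive and archive[name] != data:
                raise ValueError(f"conflicting contents for {name!r}")
            archive[name] = data
    return archive
```
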
5. Working Archive
There are one or more working archives. When updates are made to the
master archive, it sends them to the working archives. The
master archive knows how to access the working archives; the working
archives have no means of accessing the master archive, so even if
those machines are penetrated in an attack, the attackers cannot obtain
credentials for accessing the master archive.
A possible basis for the working archive is Subversion. Again, I'm
using it to make things more concrete.
When files are added from the master archive, the Info file contains all of
the information required to hook the new file into the Subversion
file tree. This includes such things as the artifact number, the
file name within the artifact directory, the person creating the
entry, and comments about the new entry or reasons for changing an
existing file. If metadata is stored as a separate file, the Info
information indicates that this is a metadata file, and the data is
installed in the system in the appropriate manner.
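
One way to picture the Info file is as a small record, sketched below in Python. The class name InfoRecord, the field names, and the "artifact number / file name" tree layout are all assumptions for illustration; the proposal lists the kinds of information but not a format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InfoRecord:
    """Hypothetical contents of an Info file for one archived file."""

    artifact_number: str       # identifies the artifact directory
    file_name: str             # file name within the artifact directory
    submitted_by: str          # person creating the entry
    comment: str               # notes on the entry or reason for the change
    md5: str                   # checksum also kept in the separate checksum area
    is_metadata: bool = False  # True when the file carries artifact metadata

    def tree_path(self) -> str:
        """Path where the file would be hooked into the working-archive tree."""
        return f"{self.artifact_number}/{self.file_name}"
```

For example, InfoRecord("102634", "readme.txt", "bob", "initial entry", "d41d8cd98f00b204e9800998ecf8427e").tree_path() gives "102634/readme.txt"; the artifact number and file name here are made up.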
6. Protecting the Working Archive
The working archive may be available on the internet, but only to
relatively few people, such as people who manage projects. The only
access to the working archive is read access. Should there be any
question about whether a file in the working archive has been
compromised, the MD5 checksum can be consulted to verify it. MD5
checksums from the archive should be made available on a system
separate from the working archive.
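
The verification itself is a simple comparison, sketched here in Python. The function name is_unmodified is an assumption, and the expected checksum is assumed to have been obtained from the separate checksum system rather than from the working archive being checked.

```python
import hashlib


def is_unmodified(data: bytes, expected_md5: str) -> bool:
    """Check file contents against a checksum fetched from the separate system."""
    actual = hashlib.md5(data).hexdigest()
    # hexdigest() is lowercase; normalize the published value before comparing.
    return actual == expected_md5.strip().lower()
```
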
7. Projects and Internet Access
Projects all work off of their own copies of the working
archive. Public internet access to the archive is through a project,
thus giving control over the artifacts that are made
public. Software projects should also use Subversion (the same
system being used for the working archive) so that a user can check
files out of either the working archive or the project using the same
technology (a recommendation, not a requirement).
Bob