[SCC_Active_Members] Subversion as a basis for software archive

Fri Apr 6 18:34:31 PDT 2007

Hi,

  At last week's SPC meeting I kicked off a discussion about the 
technology for maintaining the software digital archive.  I proposed 
that we use a Software Configuration Management (SCM) tool to 
maintain the archive, and to make the proposal more complete, 
suggested that Subversion is a good choice.  I am quite open to other 
possible choices, but I am using Subversion as the baseline against 
which to compare them.

The resulting discussion brought up two alternative methods of 
archival storage.  One is a file-system based method in which 
changes to a file are stored as update logs, although large numbers 
of changes aren't anticipated.  The second is a comprehensive system 
for archiving, searching, and managing the archived assets.

Just to be clear, when I refer to digital artifacts in this note, I'm 
including certain files relating to the actual bit transcription of 
the delivered artifact.  For example, I include both the original bit 
image of software as delivered to the museum, and a transcription of 
that image to a contemporary character set.  In addition to a scanned 
image of a manual, I include an OCR-generated version of that 
manual.  These files must also be controlled by the museum staff, and 
associated with each other.  Whether they are ultimately viewed as 
part of the archive or as a second type of store maintained by the 
museum isn't addressed here.  For the purposes of this note, they are 
all regarded as artifacts in the archive.

There are two important aspects of Subversion, or other contemporary 
SCM tools.  One of these is a mechanism for storing and retrieving 
versioned files.  The second is a way of accessing files, setting up 
working versions of the files, updating to a new version (where 
permitted), and maintaining consistency among working copies of files.

A question raised during the discussion was how well the SCM system 
would work for files that don't change.  It works fine.  Many files 
within a software system remain static after they have been 
established, and there is no penalty for having such files in the 
system.  At the same time, those files that change (such as 
correcting a transcription error in an OCR-generated document) can 
also be handled.  With the versioning system, it is possible to 
determine the contents of a directory as they were at some past point 
in time.  Finally, from the client's perspective, one system can be 
used to access files in the museum archive that do not normally 
change, and files from a project where changes often do take place.
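
To make the "contents of a directory at some past point in time" idea 
concrete, here is a minimal sketch in Python.  It is a toy in-memory 
model, not Subversion's storage format or API; the class RevisionStore 
and the file names are made up for illustration.  Each commit records 
only the paths that changed, so files that never change cost nothing 
after their first commit.

```python
from typing import Dict, List, Optional


class RevisionStore:
    """Toy in-memory model of a versioned archive (hypothetical, not Subversion)."""

    def __init__(self) -> None:
        # One entry per commit; each maps changed paths to new content.
        # A value of None marks a deletion.
        self._commits: List[Dict[str, Optional[bytes]]] = []

    def commit(self, changes: Dict[str, Optional[bytes]]) -> int:
        """Record a set of changes and return the new revision number (1-based)."""
        self._commits.append(dict(changes))
        return len(self._commits)

    def snapshot(self, revision: int) -> Dict[str, bytes]:
        """Rebuild the whole tree as it stood at the given revision."""
        if not 0 <= revision <= len(self._commits):
            raise ValueError(f"no such revision: {revision}")
        tree: Dict[str, bytes] = {}
        for changes in self._commits[:revision]:
            for path, content in changes.items():
                if content is None:
                    tree.pop(path, None)
                else:
                    tree[path] = content
        return tree

    def listing(self, directory: str, revision: int) -> List[str]:
        """List the paths under a directory as of the given revision."""
        prefix = directory.rstrip("/") + "/"
        return sorted(p for p in self.snapshot(revision) if p.startswith(prefix))


store = RevisionStore()
# Revision 1: a scanned manual and its OCR transcription (with a typo).
r1 = store.commit({"manuals/guide.tif": b"scan", "manuals/guide.txt": b"teh manual"})
# Revision 2: only the transcription is corrected; the scan is untouched.
r2 = store.commit({"manuals/guide.txt": b"the manual"})
# Revision 3: a new file is added.
r3 = store.commit({"manuals/errata.txt": b"notes"})

print(store.listing("manuals", r1))
print(store.snapshot(r1)["manuals/guide.txt"])
print(store.snapshot(r3)["manuals/guide.txt"])
```

The scan is stored once and appears unchanged in every later snapshot, 
while the corrected transcription is visible from revision 2 onward and 
the original remains retrievable at revision 1.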

Subversion is now a well-established system.  There are 3 principal 
developers, but about 70 developers have worked on it.  There appear 
to be several hundred people who have contributed money to the 
project.  So there is substantial effort put into seeing that the 
system operates well, and into seeing that file storage of multiple 
media types is efficient.

I couldn't locate the number of Subversion installations.  But one of 
the front-end tools that works with SVN -- TortoiseSVN -- announced 
that they had 500,000 downloads during the past 6 weeks.  This is a 
large enough community that any future replacement of the SVN 
technology will need to consider migration issues.

Subversion is also an interface for accessing files.  The basic 
interface is a command-line interface.  The archive can be accessed 
from a remote computer, and there is a web interface for browsing the 
directory structure.  I've only looked at the most basic access 
control, but registered users can be given either read access or 
read-write access.  For the museum, I would expect that very few 
people would have write access to the archive itself, but when 
Subversion is used for projects, project members could have write access.

When Subversion is used on Windows through TortoiseSVN, files 
accessed from archives have a special icon.  A single operation can 
update all of the files within a directory and its 
subdirectories.  These can be pulled from one archive or many, from 
one location in the archive or from many, and from one or more 
project directories containing the work of a team.  Similarly, when a 
project member has made modifications, the changes can be checked in 
with a single operation, assuming that the person has write access.

I have concerns about a comment made during the meeting.  The 
suggestion was made that if you want to access a file in the archive, 
you just copy it.  In my experience, should a file change, having 
multiple copies can lead to many questions.  Is this a copy that 
pre-dated the archive?  Has this copy been updated to reflect an 
unanticipated change to the archive?  Is this copy intentionally 
modified from the copy stored in the archive?  These questions can 
all be answered when using a system like Subversion.
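
The difference between a bare copy and a tracked copy can be sketched 
as follows.  This is a simplified model in Python, assuming only that a 
tracked copy remembers the revision it was taken from; the names 
LocalFile and describe are hypothetical and are not part of any 
Subversion interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LocalFile:
    content: bytes
    # Revision the file was checked out at; None for a bare copy of unknown origin.
    base_revision: Optional[int] = None


def describe(local: LocalFile, history: List[bytes]) -> str:
    """Answer the 'which copy is this?' questions for one file.

    history holds every archived version, oldest first, so history[0]
    is revision 1.
    """
    if local.base_revision is None:
        # A plain copy carries no record of where it came from.
        if local.content == history[-1]:
            return "matches current archive version, origin unrecorded"
        return "unknown: stale, edited, or older than the archive"
    base = history[local.base_revision - 1]
    edited = local.content != base
    stale = local.base_revision < len(history)
    if edited and stale:
        return "locally modified and out of date"
    if edited:
        return "locally modified"
    if stale:
        return "out of date"
    return "up to date"


history = [b"teh manual", b"the manual"]             # revisions 1 and 2
bare = LocalFile(b"teh manual")                       # copied by hand at some point
tracked = LocalFile(b"teh manual", base_revision=1)   # checked out at revision 1
edited = LocalFile(b"the manual, annotated", base_revision=2)

for f in (bare, tracked, edited):
    print(describe(f, history))
```

With the bare copy, the three questions above cannot be told apart; 
with the recorded base revision, each has a definite answer.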

Two other alternatives were discussed.  I have not at this time had a 
chance to research them.  My comments are simply based on the 
statements made in the meeting and my past experience.

One system was a simple file system approach.  Changes in the archive 
were anticipated, as there was a technique of writing a serial stream 
of files, where a more recent file could be an update to a file 
written earlier.  A comment made during the session suggested that 
this is a technique developed within the museum.  My main concerns are 
that there would need to be documentation explaining this, that 
different file types will need different ways of storing differences, 
and that the implementation will not be understood well enough to 
preserve the archive for hundreds of years.  I see the Subversion 
approach as having a better basis for survivability, and I expect it 
will be easier to find people who already understand its workings to 
come and assist with the archiving effort.  The linear history that 
the file system approach gets from its serial stream is modelled in 
Subversion with modification dates and database entries.

I am also concerned about client software.  If a client wants to read 
a file from the archive, they'd need to have software for accessing 
the archive files.  If changes are made to the archive, new client 
software will need to be distributed.  This problem is already solved 
for Subversion.

The other proposal is for more comprehensive, higher-level 
software for managing a museum's virtual archive.  (Sorry if I have 
not characterized it properly.)

My biggest reservation with this system is that we would need to 
decide all aspects of the museum system before starting to do 
anything.  It is like purchasing a complete home theater rather than 
buying the components and putting them together.  While you might 
imagine doing that today, think about doing such a thing 20 years ago 
when many of the technologies were still emerging.  What I'm most 
concerned about is delaying the establishment of an archive 
technology while deciding whether the remaining technology is what we 
really want.

One aspect of this proposal that I like very much is that it is 
layered.  Layering allows us to select an implementation for one 
layer independently of the implementation of other layers.  A key to 
layering is having interface standards between the layers; it was 
mentioned at the meeting that such a standard is underway and nearing 
completion.

Comments at the meeting suggested that Subversion provides much of 
the functionality needed at the innermost layer.  I hope that 
developers of this technology would therefore have at least one 
implementation based on Subversion.  Given the choice between an 
implementation based on software already used by millions of people 
and waiting for someone to implement almost the same thing and then 
becoming a user of this new software, I would select the former.

While it is good to know that this inner layer is based on a 
standard, there are many cases where a standard is defined but 
doesn't take hold, or is superseded soon after its adoption by another 
standard.  I don't know what will happen here, but there is some 
cause to be cautious about jumping on the standard too early, or at 
least tying too much to that standard until it proves itself.

A greater concern comes with the selection of the implementation of 
that standard.  There was no mention of a storage standard that would 
be used by all developers implementing this standard.  So if we make 
the wrong choice, we may find that the implementation we initially 
choose does not survive, and we will be faced with choosing a new 
implementation and converting the archive.  There may be no 
conversion software available, so we may need to create it 
ourselves.  If we choose a core technology like Subversion, migration 
will be faced by enough people that products will be available to 
help us.  Also, Subversion's versioning mechanism sits underneath 
ordinary files in the file system, so most programs can simply read a 
local copy of the versioned files.  They don't need to be aware of 
the versioning mechanism that lies beneath.

This brings up an interesting topic.  How do we see the archive being 
used?  Suppose that there are 100 researchers working with the 
archive, searching it and bringing some files into their projects of 
whatever sort.  Not all researchers are within the museum, but all 
have been given access credentials by the museum.  Here are two 
approaches for supporting this.

The first approach has an archive that is accessible to all of the 
researchers.  They do searches using a search engine that has 
analyzed the files.  When they find an interesting artifact, they go 
to the archive file to retrieve it.  Some of these accesses will be 
from within the museum, others will be made by researchers halfway 
around the world.  The archive is also being used by museum staff to 
add new artifacts.

The second approach creates a copy of the archive that is used by the 
researchers.  The copy is periodically synchronized with the master 
archive.  The museum staff adds new artifacts to the master 
archive.  The search engine updates its database when the 
synchronization operation takes place.
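
The second approach can be sketched roughly as follows.  This is an 
illustrative Python model only: the Archive class, the synchronize and 
rebuild_index functions, and the word-based index are all assumptions 
made for the sketch, not a description of any existing museum system or 
search engine.

```python
from typing import Dict, List, Set


class Archive:
    """Toy archive: maps an artifact path to its text content."""

    def __init__(self) -> None:
        self.files: Dict[str, str] = {}


def synchronize(master: Archive, replica: Archive) -> List[str]:
    """Make the replica match the master; return the paths that changed."""
    changed = [p for p, text in master.files.items() if replica.files.get(p) != text]
    removed = [p for p in replica.files if p not in master.files]
    for p in changed:
        replica.files[p] = master.files[p]
    for p in removed:
        del replica.files[p]
    return sorted(changed + removed)


def rebuild_index(replica: Archive) -> Dict[str, Set[str]]:
    """Build a word -> paths index over the replica's current contents."""
    index: Dict[str, Set[str]] = {}
    for path, text in replica.files.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(path)
    return index


master, replica = Archive(), Archive()

# Museum staff add a new artifact to the master archive only.
master.files["manuals/guide.txt"] = "FORTRAN compiler manual"

# Until the next synchronization, researchers cannot find it.
index_before = rebuild_index(replica)

# The periodic job copies changes across, then the index is refreshed.
changed = synchronize(master, replica)
index_after = rebuild_index(replica)

print(changed)
print(sorted(index_after["fortran"]))
```

The trade-off this makes visible is the delay: researchers only see 
new artifacts after the next synchronization, but their searches and 
retrievals never touch the master archive.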

The museum staff should decide whether having an additional copy of 
the archive provides extra security they view as being useful.

Bob