[SCC_Active_Members] Subversion as a basis for software archive
Bob Fraley
fraley at acm.org
Fri Apr 6 18:34:31 PDT 2007
Hi,
At last week's SPC meeting I kicked off a discussion about the
technology for maintaining the software digital archive. I proposed
that we use a Software Configuration Management (SCM) tool to
maintain the archive, and to make the proposal more complete,
suggested that Subversion is a good choice. I am quite open to other
possible choices, but I use Subversion as a starting point for
comparing other options.
The resulting discussion brought up two alternative methods of
archival storage. One of these is a file-system based method, where
changes to a file are stored as update logs, but large numbers of
changes aren't anticipated. The second is a comprehensive system for
archiving, searching, and managing the archived assets.
Just to be clear, when I refer to digital artifacts in this note, I'm
including certain files relating to the actual bit transcription of
the delivered artifact. For example, I include both the original bit
image of software as delivered to the museum, and a transcription of
that image to a contemporary character set. In addition to a scanned
image of a manual, I include an OCR-generated version of that
manual. These files must also be controlled by the museum staff, and
associated with each other. Whether they are viewed as being part of
the archive or a second type of store maintained by the museum isn't
being addressed here. They are all regarded as being artifacts in the archive.
There are two important aspects of Subversion, or other contemporary
SCM tools. One of these is a mechanism for storing and retrieving
versioned files. The second is a way of accessing files, setting up
working versions of the files, updating to a new version (where
permitted), and maintaining consistency among working copies of files.
A question raised during the discussion was how well the SCM system
would work for files that don't change. It works fine. Many files
within a software system remain static after they have been
established, and there is no penalty for having such files in the
system. At the same time, those files that change (such as
correcting a transcription error in an OCR-generated document) can
also be handled. With the versioning system, it is possible to
determine the contents of a directory as they were at some past point
in time. Finally, from the client's perspective, one system can be
used to access files in the museum archive that do not normally
change, and files from a project where changes often do take place.
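To make the "contents of a directory at some past point" idea concrete, here is a minimal sketch in Python. It is only an illustration of revision-numbered snapshots, not a description of how Subversion stores data; the class name, method names, and file paths are all made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedStore:
    """Toy store (hypothetical): every commit records a full snapshot of the tree."""
    # Revision 0 is the empty tree.
    snapshots: list = field(default_factory=lambda: [{}])

    def commit(self, changes: dict) -> int:
        """Apply {path: content} changes (None deletes a path); return the new revision."""
        tree = dict(self.snapshots[-1])
        for path, content in changes.items():
            if content is None:
                tree.pop(path, None)
            else:
                tree[path] = content
        self.snapshots.append(tree)
        return len(self.snapshots) - 1

    def listing(self, revision: int) -> list:
        """Return the sorted paths that existed at the given revision."""
        return sorted(self.snapshots[revision])

    def read(self, path: str, revision: int) -> str:
        """Return the content of a path as it was at the given revision."""
        return self.snapshots[revision][path]


store = VersionedStore()
# Static artifact plus an OCR transcription containing an error.
r1 = store.commit({"manual/scan.tiff": "<image bits>", "manual/ocr.txt": "Teh manual"})
# Correct the transcription error; the scan is untouched.
r2 = store.commit({"manual/ocr.txt": "The manual"})
# Add a new artifact later.
r3 = store.commit({"tape/image.bin": "<tape bits>"})
```

Asking for store.listing(r1) or store.read("manual/ocr.txt", r1) gives the state before the correction, while the unchanged scan costs nothing extra to keep under version control in this model.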
Subversion is now a well-established system. There are 3 principal
developers, but about 70 developers have worked on it. There appear
to be several hundred people who have contributed money to the
project. So there is substantial effort put into seeing that the
system operates well, and into seeing that file storage of multiple
media types is efficient.
I couldn't locate the number of Subversion installations. But one of
the front-end tools that works with SVN -- TortoiseSVN -- announced
that they had 500,000 downloads during the past 6 weeks. This is a
large enough community that any future replacement of the SVN
technology will need to consider migration issues.
Subversion is also an interface for accessing files. The basic
interface is a command-line interface. Access may be made from a remote
computer, and there is also a web interface for browsing a directory
structure. I've only looked at the most basic access
control, but registered users can either be given read access or
read-write access. For the museum, I would expect that very few
people would have write access to the archive itself, but when
Subversion is used for projects then project members could have write access.
When Subversion is used on Windows through TortoiseSVN, files
accessed from archives have a special icon. A single operation can
update all of the files within a directory and its
subdirectories. These can be pulled from one archive or many, from
one location in the archive or from many, and from one or more
project directories containing the work of a team. Similarly, when a
project member has made modifications, the changes can be checked in
with a single operation, assuming that the person has write access.
I have concerns about a comment made during the meeting. The
suggestion was made that if you want to access a file in the archive,
you just copy it. In my experience, should a file change, having
multiple copies can lead to many questions. Is this a copy that
pre-dated the archive? Has this copy been updated to reflect an
unanticipated change to the archive? Is this copy intentionally
modified from the copy stored in the archive? These questions can
all be answered when using a system like Subversion.
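As a rough illustration of why those questions are answerable once the history is recorded, the sketch below classifies a loose copy by comparing its digest with the digests of the archived revisions. This is a hypothetical helper written for this note, not Subversion's own working-copy bookkeeping; the function names and sample contents are assumptions.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of some file content."""
    return hashlib.sha256(data).hexdigest()


def classify_copy(copy: bytes, archived_revisions: list) -> str:
    """Classify a loose copy against the archived history (oldest revision first)."""
    digests = [digest(rev) for rev in archived_revisions]
    d = digest(copy)
    if d not in digests:
        # Either it pre-dates the archive or it was modified after copying.
        return "not in archive"
    if d == digests[-1]:
        return "current"
    # Matches an older revision: the archive has changed since the copy was made.
    return "out of date"


# Hypothetical history of one OCR file: the original, then a corrected version.
history = [b"Teh manual", b"The manual"]
status_current = classify_copy(b"The manual", history)
status_stale = classify_copy(b"Teh manual", history)
status_unknown = classify_copy(b"The manual, with my notes", history)
```

Without a recorded history there is nothing to compare against, which is the problem with simply copying files out of the archive.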
Two other alternatives were discussed. I have not at this time had a
chance to research them. My comments are simply based on the
statements made in the meeting and my past experience.
One system was a simple file system approach. Changes in the archive
were anticipated, as there was a technique of writing a serial stream
of files, where a more recent file could be an update to a file
written earlier. A comment made during the session suggested that
this is a technique developed within the museum. My main concerns are
that there would need to be documentation explaining this, that
different file types will need different ways of storing differences,
and that the implementation will not be understood well enough to
preserve the archive for hundreds of years. I see the Subversion
approach as having a better basis for survivability, and I expect that
it will be easier to find people who already understand its workings to
come and assist with the archiving effort. In Subversion, the linear
nature of the file system approach is modelled with modification dates
and database entries.
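Since I have not seen the museum's technique, the following is only my guess at its general shape, sketched in Python: a serial stream of (name, content) records that is replayed in order, with a later record superseding an earlier one of the same name. I assume here that each record holds a complete replacement file rather than a difference; the function names and records are invented for illustration.

```python
def replay(stream: list) -> dict:
    """Replay (name, content) records in order; a later record for a name wins."""
    current = {}
    for name, content in stream:
        current[name] = content
    return current


def replay_until(stream: list, count: int) -> dict:
    """Reconstruct the archive state after only the first `count` records."""
    return replay(stream[:count])


# Hypothetical stream: the third record updates the file written by the first.
stream = [
    ("manual/ocr.txt", "Teh manual"),
    ("tape/image.bin", "<tape bits>"),
    ("manual/ocr.txt", "The manual"),
]
latest = replay(stream)
before_fix = replay_until(stream, 2)
```

Even in this simple form, a reader needs to know the record format and the replay rule, which is the documentation and client software burden I am concerned about.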
I am also concerned about client software. If a client wants to read
a file from the archive, they'd need to have software for accessing
the archive files. If changes are made to the archive, new client
software will need to be distributed. This problem is already solved
for Subversion.
The other proposal is a more comprehensive proposal for higher level
software for managing a museum's virtual archive. (Sorry if I have
not characterized it properly.)
My biggest reservation with this system is that we would need to
decide all aspects of the museum system before starting to do
anything. It is like purchasing a complete home theater rather than
buying the components and putting them together. While you might
imagine doing that today, think about doing such a thing 20 years ago
when many of the technologies were still emerging. What I'm most
concerned about is delaying the establishment of an archive
technology while deciding whether the remaining technology is what we
really want.
One aspect of the technology that I like very much is that it is
layered. Layering allows us to select an implementation for one
layer independently of the implementation of other layers. A key to
layering is having interface standards between the layers; it was
commented that this standard is underway and nearing completion.
Comments at the meeting suggested that Subversion provides much of
the functionality needed at the innermost layer. I hope that
developers of this technology would therefore have at least one
implementation based on Subversion. Given the choice between an
implementation based on software being used by millions of people and
waiting for someone to implement almost the same thing and then being a
user of this new software, I would select the former.
While it is good to know that this inner layer is based on a
standard, there are many cases where a standard is defined but
doesn't take, or is surpassed soon after its adoption by another
standard. I don't know what will happen here, but there is some
cause to be cautious about jumping on the standard too early, or at
least tying too much to that standard until it proves itself.
A greater concern comes with the selection of the implementation of
that standard. There was no mention of a storage standard that would
be used by all developers implementing this standard. So if we make
the wrong choice, we may find that the implementation we initially
choose does not survive, and we will be faced with choosing a new
implementation and converting the archive. There may be no
conversion software available, so we may need to create it
ourselves. By choosing a core technology like Subversion, migration
will be faced by enough people that products will be available to
help us. Also, the mechanism of Subversion lies below the file
system, so most programs can just access a local copy of the
versioned files. They don't need to be aware of the versioning
mechanism that lies beneath.
This brings up an interesting topic. How do we see the archive being
used? Suppose that there are 100 researchers working with the
archive, searching it and bringing some files into their projects of
whatever sort. Not all researchers are within the museum, but all
have been given access credentials by the museum. Here are two
approaches for supporting this.
The first approach has an archive that is accessible to all of the
researchers. They do searches using a search engine that has
analyzed the files. When they find an interesting artifact, they go
to the archive file to retrieve it. Some of these accesses will be
from within the museum, others will be made by researchers halfway
around the world. The archive is also being used by museum staff to
add new artifacts.
The second approach creates a copy of the archive that is used by the
researchers. The copy is periodically synchronized with the master
archive. The museum staff adds new artifacts to the master
archive. The search engine updates its database when the
synchronization operation takes place.
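The second approach can be sketched roughly as follows, in Python. This is a simplified model under my own assumptions: the master and the copy are each represented as a mapping from path to a content digest, artifacts are only added or corrected (never deleted), and the list of changed paths is what the search engine would reindex. All names are hypothetical.

```python
def plan_sync(master: dict, mirror: dict) -> list:
    """Return the sorted paths whose content is missing or different in the mirror."""
    return sorted(path for path, dig in master.items() if mirror.get(path) != dig)


def synchronize(master: dict, mirror: dict) -> list:
    """Bring the mirror up to date in place; return the paths to reindex."""
    changed = plan_sync(master, mirror)
    for path in changed:
        mirror[path] = master[path]
    return changed


# Hypothetical state: staff corrected one file and added another since the last sync.
master = {"manual/scan.tiff": "d1", "manual/ocr.txt": "d3", "tape/image.bin": "d4"}
mirror = {"manual/scan.tiff": "d1", "manual/ocr.txt": "d2"}
to_reindex = synchronize(master, mirror)
second_pass = synchronize(master, mirror)
```

After the first call the mirror matches the master and only the two changed paths need reindexing; a second call finds nothing to do.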
The museum staff should decide whether having an additional copy of
the archive provides extra security they view as being useful.
Bob