<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> <HTML> <HEAD> <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii"> <META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7036.0"> <TITLE>RE: [SCC_Active_Members] Subversion as a basis for software archive</TITLE> </HEAD> <BODY>  As other notes have suggested, the e-mail conversations within which the attachment occurs hide assumptions that might differ among authors. I would like to draw the participants' attention to one such apparent unspoken assumption: that the pace of collecting "digital stuff" should be coupled to the pace of accession.  Particularly if one has formal accession into a CHM collection in mind, close coupling of these paces seems to me impractical, because formal accession might need to be a painstaking resource-intensive process (which might be part of why Al is thinking about requirements right now), whereas "getting our hands on digital stuff" includes an objective of acquiring such stuff before some of it disappears forever.  (Since we cannot predict what might still exist today, but be lost in the next few years, a reasonable tactic is to collect without excessive concern for quality, sorting out the best stuff years from now.) (**) A crude quantification estimate might help to illustrate the point, using the Snobol stuff that we currently hold as an example.  This contains roughly 30,000 unique files (in the total 65,000 files we received from Arizona).  We already know that this collection is an incomplete representation of the program family closely related to Snobol.  One reason for poking into this collection (as Bob Goldberg and Paul McJones have started to do) is to identify major omissions which we might then seek.  Guessing that we have roughly 30% of what a "complete" collection might hold, the eventual Snobol-related collection would be represented by about 100,000 files. 
A very crude estimate of the number of collection topics that are of similar scale to Snobol-family is that there might be between 100 and 500 such topics of interest for a virtual software museum, with a total file count of something between 10,000,000 and 50,000,000.  So far, SPG has a start on about a half dozen such collections.  Is it reasonable to plan that it will take SPG between 20 and 100 years to acquire the rest? (#) The eventual Snobol collection might be describable by between 10 and 50 subtopics.  For formal accession, each subtopic might require the time and effort of a curator for between a week and two months.  (Creating metadata, investigating provenance, selecting files for which the provenance is relatively well known or discoverable, providing access paths for historical scholars, etc.)  If this is correct, accessioning the Snobol collection might cost between 10 man-weeks and 8 man-years. (#)  Accessioning the whole potential body would take between 20 man-years and 4000 man-years. (##) To me, this line of reasoning suggests that we should avoid close coupling between the pace of collection activities and that of accessioning work.  The readers of this note might want to infer additional or different tactical implications. Cheerio, Henry (**) I do not intend to imply undiscriminating collection.  Furthermore, I believe that some affordable level of provenance annotation should be part of current "getting our hands on digital stuff" activities. (#) The qualitative points that emerge are unchanged even if I am over-estimating by an order of magnitude! (##) Obviously, this will not be accomplished in the foreseeable future.  Instead, people will pick their favorite topics for resource expenditures. 
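For readers who want to vary the guesses, the crude quantification above can be redone in a few lines of Python. This is only a sketch of the arithmetic in the note: the identifiers are made up for illustration, and the conversions (52 weeks and 12 months to a man-year) are assumptions, since the note does not define a man-year.

```python
"""Back-of-envelope version of the note's collection and accession estimates."""

# Figures quoted in the note. All names are illustrative only.
SNOBOL_UNIQUE_FILES = 30_000   # unique files currently held
PERCENT_HELD = 30              # guessed share of a "complete" collection
TOPICS = (100, 500)            # low/high number of Snobol-scale topics
SUBTOPICS = (10, 50)           # low/high subtopics per topic

# Curator time per subtopic, in man-years.
# Assumption: 52 weeks or 12 months per man-year.
YEARS_PER_SUBTOPIC = (1 / 52, 2 / 12)   # "a week" to "two months"


def files_in_complete_collection(unique_held: int, percent_held: int) -> int:
    """Scale the files held up to the size of a complete collection."""
    return unique_held * 100 // percent_held


def accession_man_years(topics: int, subtopics: int, years_each: float) -> float:
    """Curator effort for the given number of topics, in man-years."""
    return topics * subtopics * years_each


files_per_topic = files_in_complete_collection(SNOBOL_UNIQUE_FILES, PERCENT_HELD)
total_files = tuple(n * files_per_topic for n in TOPICS)

# Effort for one Snobol-scale topic, then for the whole potential body.
per_topic = tuple(
    accession_man_years(1, s, y) for s, y in zip(SUBTOPICS, YEARS_PER_SUBTOPIC)
)
whole_body = tuple(
    accession_man_years(n, s, y)
    for n, s, y in zip(TOPICS, SUBTOPICS, YEARS_PER_SUBTOPIC)
)

if __name__ == "__main__":
    print("files per complete topic:", files_per_topic)
    print("total files (low, high):", total_files)
    print("man-years per topic (low, high):", per_topic)
    print("man-years for everything (low, high):", whole_body)
```

Under these assumptions the results land close to the rounded figures in the note: roughly 10 man-weeks to a little over 8 man-years for one topic, and on the order of 20 to 4000 man-years for the whole body. Dividing any input by ten still leaves accessioning far slower than collecting, which is the point of footnote (#).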
-----Original Message----- From: scc_active-bounces@computerhistory.org [<A HREF="mailto:scc_active-bounces@computerhistory.org">mailto:scc_active-bounces@computerhistory.org</A>] On Behalf Of Al Kossow Sent: Friday, April 06, 2007 7:38 PM Cc: scc_active@computerhistory.org Subject: Re: [SCC_Active_Members] Subversion as a basis for software archive Van Snyder wrote:  > How about CVS or SCCS?  I think these are based on plain files.  > Would the collective 'we' on this list please refrain from suggesting solutions to this 'problem' until the museum staff has time to actually generate a REQUIREMENTS document? I'm sorry about being so blunt, but as an engineer I cannot understand how any sort of rational discussion on this subject can be made until the actual problem to solve is presented. </BODY> </HTML>