<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7036.0">
<TITLE>RE: [SCC_Active_Members] Subversion as a basis for software archive</TITLE>
</HEAD>
<BODY>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">As other notes have suggested, the e-mail conversations in which the attachment appears hide assumptions that might differ among authors.</FONT></SPAN></P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">I would like to draw the participants' attention to one such apparent unspoken assumption: that the pace of collecting "digital stuff" should be coupled to the pace of accession. Close coupling of these paces seems impractical to me, particularly if one has formal accession into a CHM collection in mind. Formal accession might need to be a painstaking, resource-intensive process (which might be part of why Al is thinking about requirements right now), whereas "getting our hands on digital stuff" includes the objective of acquiring such stuff before some of it disappears forever. (Since we cannot predict what exists today but will be lost in the next few years, a reasonable tactic is to collect without excessive concern for quality and to sort out the best stuff years from now.) (**)</FONT></SPAN></P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">A crude quantitative estimate might help to illustrate the point, using the Snobol stuff that we currently hold as an example. This contains roughly 30,000 unique files (out of the total of 65,000 files we received from Arizona). We already know that this collection is an incomplete representation of the program family closely related to Snobol. One reason for poking into this collection (as Bob Goldberg and Paul McJones have started to do) is to identify major omissions which we might then seek. Guessing that we have roughly 30% of what a "complete" collection might hold, the eventual Snobol-related collection would be represented by about 100,000 files.</FONT></SPAN></P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">A very crude estimate is that there might be between 100 and 500 collection topics of a scale similar to the Snobol family that are of interest for a virtual software museum, with a total file count of something between 10,000,000 and 50,000,000. So far, SPG has a start on about a half dozen such collections. Is it reasonable to plan that it will take SPG between 20 and 100 years to acquire the rest? (#)</FONT></SPAN></P>
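<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">As a minimal sketch, the file-count arithmetic above can be written out in a few lines of Python. The function and constant names are illustrative only, and the inputs are the guesses quoted above, not measurements.</FONT></SPAN></P>
<PRE>
```python
# Back-of-the-envelope file counts, using only the guesses quoted above.
UNIQUE_SNOBOL_FILES = 30_000   # unique files currently held
ASSUMED_COMPLETENESS = 0.30    # guess: about 30% of a "complete" collection
FILES_PER_TOPIC = 100_000      # the rounded per-topic figure used in the note
TOPIC_RANGE = (100, 500)       # guess: number of Snobol-scale topics


def complete_collection_size(held, completeness):
    """Extrapolate the size of a complete collection from the part held."""
    return round(held / completeness)


def total_file_range(files_per_topic, topic_range):
    """Scale one topic's file count to the low and high topic counts."""
    low_topics, high_topics = topic_range
    return (low_topics * files_per_topic, high_topics * files_per_topic)


snobol_files = complete_collection_size(UNIQUE_SNOBOL_FILES, ASSUMED_COMPLETENESS)
total_low, total_high = total_file_range(FILES_PER_TOPIC, TOPIC_RANGE)

print(f"Snobol family, complete: about {snobol_files:,} files")
print(f"All topics: {total_low:,} to {total_high:,} files")
```
</PRE>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Changing the completeness guess or the topic range shows how sensitive the totals are to each assumption.</FONT></SPAN></P>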
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">The eventual Snobol collection might be describable by between 10 and 50 subtopics. For formal accession, each subtopic might require the time and effort of a curator for between a week and two months (creating metadata, investigating provenance, selecting files for which the provenance is relatively well known or discoverable, providing access paths for historical scholars, etc.). If this is correct, accessioning the Snobol collection might cost between 10 man-weeks and 8 man-years. (#) Accessioning the whole potential body would take between 20 man-years and 4000 man-years. (##)</FONT></SPAN></P>
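<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">The accession-effort ranges above can be sketched the same way in Python. The names are illustrative, and two conversions are assumptions of this sketch rather than figures from the note: a man-year is taken as 52 weeks and a month as 52/12 weeks.</FONT></SPAN></P>
<PRE>
```python
# Rough accession effort, using only the guesses quoted above.
WEEKS_PER_YEAR = 52                    # assumption of this sketch
WEEKS_PER_MONTH = WEEKS_PER_YEAR / 12  # assumption of this sketch

SUBTOPIC_RANGE = (10, 50)                        # subtopics in the Snobol collection
WEEKS_PER_SUBTOPIC = (1, 2 * WEEKS_PER_MONTH)    # one week to two months each
TOPIC_RANGE = (100, 500)                         # Snobol-scale topics overall


def effort_range_weeks(count_range, weeks_each_range):
    """Pair the low count with the low effort and the high with the high."""
    low = count_range[0] * weeks_each_range[0]
    high = count_range[1] * weeks_each_range[1]
    return (low, high)


snobol_low, snobol_high = effort_range_weeks(SUBTOPIC_RANGE, WEEKS_PER_SUBTOPIC)

# Whole potential body: every topic is assumed to cost what Snobol costs.
whole_low = TOPIC_RANGE[0] * snobol_low
whole_high = TOPIC_RANGE[1] * snobol_high

print(f"Snobol: {snobol_low} man-weeks to "
      f"{snobol_high / WEEKS_PER_YEAR:.1f} man-years")
print(f"Whole body: {whole_low / WEEKS_PER_YEAR:.0f} to "
      f"{whole_high / WEEKS_PER_YEAR:.0f} man-years")
```
</PRE>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">With these conversions the upper bounds come out slightly above the rounded figures quoted above (about 8.3 man-years for Snobol, and roughly 4,200 man-years for the whole body), and the lower bound for the whole body is about 19 man-years. The qualitative conclusion is unaffected.</FONT></SPAN></P>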
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">To me, this line of reasoning suggests that we should avoid close coupling between the pace of collection activities and that of accessioning work. The readers of this note might want to infer additional or different tactical implications.</FONT></SPAN></P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Cheerio, Henry</FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">(**) I do not intend to imply undiscriminating collection. Furthermore, I believe that some affordable level of provenance annotation should be part of current "getting our hands on digital stuff" activities.</FONT></SPAN></P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">(#) The qualitative points that emerge are unchanged even if I am over-estimating by an order of magnitude!</FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">(##) Obviously, this will not be accomplished in the foreseeable future. Instead, people will pick their favorite topics for resource expenditures.</FONT></SPAN></P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">-----Original Message-----</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">From: scc_active-bounces@computerhistory.org [</FONT></SPAN><A HREF="mailto:scc_active-bounces@computerhistory.org"><SPAN LANG="en-us"><U><FONT COLOR="#0000FF" SIZE=2 FACE="Arial">mailto:scc_active-bounces@computerhistory.org</FONT></U></SPAN></A><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">] On Behalf Of Al Kossow</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Sent: Friday, April 06, 2007 7:38 PM</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Cc: scc_active@computerhistory.org</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Subject: Re: [SCC_Active_Members] Subversion as a basis for software archive</FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Van Snyder wrote:</FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial"> > How about CVS or SCCS? I think these are based on plain files.</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial"> ></FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Would the collective 'we' on this list please refrain from suggesting solutions to this 'problem' until the museum staff has time to actually generate a REQUIREMENTS document?</FONT></SPAN></P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">I'm sorry about being so blunt, but as an engineer I cannot understand how any sort of rational discussion on this subject can be made until the actual problem to solve is presented.</FONT></SPAN></P>
<BR>
</BODY>
</HTML>