<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7036.0">
<TITLE>Request: a program for managing duplicates</TITLE>
</HEAD>
<BODY>
<!-- Converted from text/rtf format -->
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Does any recipient of this note know of a program that can detect duplicate files in a collection and, for each replicated instance, replace all occurrences but one with references to the one remaining? If no such program can be found, does some recipient have the skills and willingness to create and share such a program?</FONT></SPAN></P>
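<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">As a rough illustration of the first request, the following Python sketch groups files by a hash of their contents and proposes which copy each duplicate should refer to. It is only a sketch under stated assumptions: the function names (file_digest, find_duplicates, plan_references) are hypothetical, SHA-256 is just one reasonable choice of hash, and keeping the first path in sorted order is an arbitrary rule. It changes nothing on disk.</FONT></SPAN></P>
<PRE>
```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Group regular files under root by content; return groups of two or more paths."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links so an existing reference is not counted as a copy.
            if os.path.isfile(path) and not os.path.islink(path):
                by_digest[file_digest(path)].append(path)
    # Every list holds at least one path; keep only those with more than one.
    return [sorted(paths) for paths in by_digest.values() if len(paths) != 1]


def plan_references(groups):
    """Map each redundant path to the one copy kept (assumed: first in sorted order)."""
    plan = {}
    for paths in groups:
        keeper = paths[0]
        for other in paths[1:]:
            plan[other] = keeper
    return plan
```
</PRE>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">The resulting plan could then be turned into symbolic links or into entries in a manifest, depending on what kind of reference the collection software accepts; that choice is left open here.</FONT></SPAN></P>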
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">In view of operating system difficulties with aberrant file names (described in the attached e-mails), another helpful program would replace problematic file names with similar names free of identified problems. Why similar? Should be obvious.</FONT></SPAN></P>
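<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">A minimal Python sketch of such a renaming rule might look like the following. It covers only the problems named in the attached e-mails (reserved device names, colons, trailing periods, embedded spaces). The function name safe_name and the particular substitutions (an underscore for colons and spaces, a leading underscore for reserved names) are assumptions made for illustration, not an established convention.</FONT></SPAN></P>
<PRE>
```python
# Reserved device names as listed in the MSDN text quoted in the attached e-mail.
RESERVED = {"CON", "PRN", "AUX", "NUL", "CLOCK$"}
RESERVED.update("COM%d" % i for i in range(1, 10))
RESERVED.update("LPT%d" % i for i in range(1, 10))


def safe_name(name):
    """Return a similar file name that avoids the problems named in the thread.

    The substitutions used here are illustrative choices only.
    """
    # Colons and embedded spaces become underscores.
    fixed = name.replace(":", "_").replace(" ", "_")
    # Trailing periods are dropped.
    fixed = fixed.rstrip(".")
    # A device name is reserved with or without an extension, so test the
    # part before the first period, ignoring letter case.
    stem = fixed.split(".", 1)[0]
    if stem.upper() in RESERVED:
        fixed = "_" + fixed
    # Never return an empty name.
    return fixed or "_"
```
</PRE>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">For example, safe_name("CON.SNO") gives "_CON.SNO", while a name such as "CONSOLE.SNO" is left alone. The sketch does not check whether a renamed file collides with another name already in the same directory; a real tool would need to.</FONT></SPAN></P>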
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">This request is prompted by what I am learning while experimenting with the Greenstone Digital Library (GSDL) software on the September 2006 collection of Snobol-related files from U. Arizona. Specifically, roughly 30% of the 66,000 files in this collection are duplicates. Loading files into a Greenstone collection is a CPU-intensive process, partly because it includes constructing inverted indices for full-text search and packaging each file in XML to enable metadata labeling. (This is a guess; I do not yet understand GSDL well enough to speak authoritatively about it.) To give a sense of the magnitude, creating a GSDL image of the Snobol collection takes approx. 50 hours on the Linux system that I'm using. (It is based on a 1.8 GHz Celeron CPU, which is slow compared with machines that are inexpensive today.) By default, GSDL collection building constructs a tree isomorphic to the directory/file structure provided, with tree component labels copied from the names in the input structure.</FONT></SPAN></P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Summing it up, I think it would be helpful to ensure that file names used within a virtual museum collection do not create problems for any potential museum visitor who extracts portions into a directory in his favorite computing environment. I further think it prudent to avoid storing duplicates, but to retain/represent the connections that any file has at the time a collection is delivered to the museum.</FONT></SPAN></P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Cheerio, Henry</FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">-----Original Message-----</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">From: Gregg Townsend [</FONT></SPAN><A HREF="mailto:gmt@CS.Arizona.EDU"><SPAN LANG="en-us"><U><FONT COLOR="#0000FF" SIZE=2 FACE="Arial">mailto:gmt@CS.Arizona.EDU</FONT></U></SPAN></A><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">] </FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Sent: Thursday, April 26, 2007 8:57 AM</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">To: Paul McJones</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Cc: chm-snobol@CS.Arizona.EDU</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Subject: Re: New Snobol/Icon history DVD</FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">> $ find . -type f -exec cat '{}' > /dev/null \; -print</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">> cat: ./CSCSNO/CON.SNO: No such file or directory</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">> cat: ./CSCSNO/PRN.SNO: Invalid request code</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">> cat: ./DRH1/CON.SNO: No such file or directory</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">> cat: ./DRHMSC/CON.SNO: No such file or directory</FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Here's a ZIP file of those four files after renaming.</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">They're not very interesting.</FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">There are many files on the DVD whose names contain colons or end with periods; those names are also problematic according to the web page you referenced:</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial"> </FONT></SPAN><A HREF="http://msdn2.microsoft.com/en-us/library/aa365247.aspx"><SPAN LANG="en-us"><U><FONT COLOR="#0000FF" SIZE=2 FACE="Arial">http://msdn2.microsoft.com/en-us/library/aa365247.aspx</FONT></U></SPAN></A><SPAN LANG="en-us"></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">(There are also many directories and files with embedded spaces, which cause problems with Unix shell scripts.)</FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Gregg</FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">--------------------------------------------------------------------------------</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">From: Paul McJones [</FONT></SPAN><A HREF="mailto:paul@mcjones.org"><SPAN LANG="en-us"><U><FONT COLOR="#0000FF" SIZE=2 FACE="Arial">mailto:paul@mcjones.org</FONT></U></SPAN></A><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">] </FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Sent: Wednesday, April 25, 2007 6:20 PM</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">To: Gregg Townsend</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Cc: chm-snobol@CS.Arizona.EDU</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Subject: Re: New Snobol/Icon history DVD</FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">I look forward to receiving the new DVD; I am very grateful for the work that you are doing. By the way, I found another class of filenames on the first edition DVD that causes problems for Windows even though the names are legitimate on Unix/Linux (and probably Macintosh): files where the body of the filename (ignoring the extension) is CON or PRN can't be opened. It turns out there is a kludge in Windows going back to MS-DOS where the built-in device names are recognized regardless of the directory or extension part of the filename! Here's a full list from MSDN (</FONT></SPAN><A HREF="http://msdn2.microsoft.com/en-us/library/aa365247.aspx"><SPAN LANG="en-us"><U><FONT COLOR="#0000FF" SIZE=2 FACE="Arial">http://msdn2.microsoft.com/en-us/library/aa365247.aspx</FONT></U></SPAN></A><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">):</FONT></SPAN></P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Do not use the following reserved device names for the name of a file: CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9. Also avoid these names followed by an extension, for example, NUL.txt.</FONT></SPAN></P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Windows NT: CLOCK$ is also a reserved device name.</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Examples of files on the DVD that violate this are:</FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">pmcjones@pmcjones-t60p /cygdrive/d/ARCHIVE_1996/SHP00</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">$ find . -type f -exec cat '{}' > /dev/null \; -print</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">cat: ./CSCSNO/CON.SNO: No such file or directory</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">cat: ./CSCSNO/PRN.SNO: Invalid request code</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">cat: ./DRH1/CON.SNO: No such file or directory</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">cat: ./DRHMSC/CON.SNO: No such file or directory</FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">pmcjones@pmcjones-t60p /cygdrive/d/ARCHIVE_1996/SHP00</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">$</FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Paul</FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">-----Original Message-----</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">From: Gregg Townsend [</FONT></SPAN><A HREF="mailto:gmt@CS.Arizona.EDU"><SPAN LANG="en-us"><U><FONT COLOR="#0000FF" SIZE=2 FACE="Arial">mailto:gmt@CS.Arizona.EDU</FONT></U></SPAN></A><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">] </FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Sent: Tuesday, April 24, 2007 7:00 PM</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">To: chm-snobol@CS.Arizona.EDU</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Subject: New Snobol/Icon history DVD</FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Greetings, all,</FONT></SPAN>
</P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">I have prepared a "second edition" DVD of historical Snobol and Icon files. I'm sending copies to Henry, Mark, Paul, and Bob. If anyone else on this list would like a copy, please send me your postal address.</FONT></SPAN></P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">This disc differs from the first edition mainly in the archive-1996 directory. I've renamed that to "s4-hist" after it became apparent that its contents were collected as part of a local "Snobol History Project".</FONT></SPAN></P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">I found several copies of that SHP archive in different forms and merged them together. Some of these supplied new directories to join the existing ones. It became clear that the earlier division into SHP00 etc. was just an artifact of packaging, so I have flattened the directory structure and removed that level.</FONT></SPAN></P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Some miscellaneous problems were fixed:</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">-- Some Unix directories originally had files that differed only by lettercase, such as foo and Foo, and one of these was typically lost from the DVD. Such cases are now handled by storing two files named foo and +Foo.</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">-- I managed to repair two tar files that wouldn't unpack earlier.</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">-- Some files that were incorrectly combined have now been properly split.</FONT></SPAN>
<BR><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">-- I removed some irrelevant files (very few -- under a dozen).</FONT></SPAN>
</P>
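<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">The lettercase fix in the list above can be sketched in Python as follows. This is not the procedure actually used to build the DVD, only an illustration of the idea. The function name resolve_case_collisions is hypothetical, the rule for which variant keeps its original name is an assumption chosen to reproduce the foo and +Foo example, and the use of extra plus signs for a third or fourth variant is an extension not described in the message.</FONT></SPAN></P>
<PRE>
```python
from collections import defaultdict


def resolve_case_collisions(names):
    """Map each name in one directory to a name that stays unique when case is ignored."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    mapping = {}
    for variants in groups.values():
        # Assumption: the variant that sorts last keeps its name, so that
        # foo is kept and Foo becomes +Foo, as in the example in the list.
        # Further variants get one more plus sign each (hypothetical extension).
        for extra, name in enumerate(sorted(variants, reverse=True)):
            mapping[name] = "+" * extra + name
    return mapping
```
</PRE>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Names with no case-insensitive twin map to themselves. The sketch does not guard against a generated name such as +Foo clashing with a file that already carries that name.</FONT></SPAN></P>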
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">What's new on this disc? I've realized in writing this message that I don't know the answer. While I've spent a lot of time carefully combining and checking archives, I'm still not very familiar with the content and I can't really answer that question. I'm concentrating on the collection role, which I must fill, and I'm leaving the curatorial activities to others.</FONT></SPAN></P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">I had originally intended that this would be the "final" collection, but I still have more archives to check. In the last remaining set of files I've already spotted a copy of Version 2 of Icon, which is not on this new DVD, and I'm hoping to find a copy of Version 1, which I now have only in hardcopy printout.</FONT></SPAN></P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">I'd suggest treating this new DVD as a replacement for the first edition; that's the intent, and it does correct several problems. I'm hoping that anything more to come will be more supplemental in nature.</FONT></SPAN></P>
<P><SPAN LANG="en-us"><FONT SIZE=2 FACE="Arial">Gregg</FONT></SPAN>
</P>
</BODY>
</HTML>