[SCC_Active_Members] Request: a program for managing duplicates

H.M. Gladney hgladney at gmail.com
Thu Apr 26 10:09:51 PDT 2007


Does any recipient of this note know of a program that can detect duplicate
files in a collection and, for each replicated instance, replace all
occurrences but one with references to the one remaining?  If no such
program can be found, does some recipient have the skills and willingness to
create and share such a program?
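
As a starting point for discussion, here is a minimal Python sketch of such a program. It is an illustration under stated assumptions, not a tested tool: it assumes that "references" can be hard links, that the whole collection sits on one file system that supports them, and that files with the same SHA-256 digest are identical. The function names and the ".dedup-tmp" suffix are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Group regular files under root by content; return groups of 2+ paths."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups


def link_duplicates(groups):
    """Keep the first path of each group; hard-link the others to it."""
    for keep, *others in groups:
        for path in others:
            if os.path.samefile(keep, path):
                continue  # already the same underlying file
            tmp = path + ".dedup-tmp"  # hypothetical temporary name
            os.link(keep, tmp)         # create the new link first
            os.replace(tmp, path)      # then swap it in for the duplicate
```

Calling find_duplicates on the collection root and passing the result to link_duplicates keeps every original path in place while storing each distinct content once. Symbolic links, or a separate index of duplicate paths, would be alternatives where hard links are unavailable.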

In view of operating system difficulties with aberrant file names (described
in the attached e-mails), another helpful program would replace problematic
file names with similar names free of identified problems.  Why similar?
Should be obvious.
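
A renaming helper might look something like the Python sketch below. It covers only the problems identified in the attached e-mails (reserved Windows device names, colons, trailing periods, embedded spaces). The function name, the underscore replacement, and the rule of checking the text before the first period are assumptions chosen for illustration.

```python
# Reserved device names as listed in the MSDN excerpt quoted in this thread.
RESERVED = {"CON", "PRN", "AUX", "NUL", "CLOCK$"} | {
    f"{dev}{n}" for dev in ("COM", "LPT") for n in range(1, 10)
}


def safe_name(name):
    """Return a similar file name that avoids the problems noted above."""
    # Colons and embedded spaces become underscores.
    name = name.replace(":", "_").replace(" ", "_")
    # Trailing periods are dropped; an empty result becomes "_".
    name = name.rstrip(".") or "_"
    # A reserved device name is a problem with or without an extension.
    stem = name.split(".", 1)[0]
    if stem.upper() in RESERVED:
        name = "_" + name
    return name
```

For example, safe_name("CON.SNO") gives "_CON.SNO". The sketch does not check whether the new name collides with a file already in the same directory; a real tool would need to do that before renaming, and to record the old name alongside the new one.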

This request is stimulated by things I am learning while playing with the
September 2006 collection of Snobol-related files from U. Arizona and the
Greenstone Digital Library (GSDL) software.  Specifically, roughly 30% of
the 66,000 files in this collection are duplicates.  Loading files into a
Greenstone collection is a CPU-intensive process; this is partly because the
process includes constructing inverted indices for full-text search as well
as XML-packaging of each file to enable metadata labeling (a guess, because
I do not yet understand GSDL well enough to speak authoritatively about it).
To give a sense of the magnitude: creating a GSDL image of the Snobol
collection takes approx. 50 hours on the Linux system that I'm using.  (It's
a relatively slow machine, based on a 1.8 GHz Celeron CPU, compared to
machines that are inexpensive today.)  The default GSDL collection
building behavior includes building a tree isomorphic to the directory/file
structure provided, with tree component labels that copy the names of the
input object structure.

Summing it up, I think it would be helpful to ensure that file names used
within a virtual museum collection do not create problems for any potential
museum visitor who extracts portions into a directory in his favorite
computing environment.  I further think it prudent to avoid storing
duplicates, but to retain/represent the connections that any file has at the
time a collection is delivered to the museum.

Cheerio, Henry

-----Original Message-----
From: Gregg Townsend [mailto:gmt at CS.Arizona.EDU] 
Sent: Thursday, April 26, 2007 8:57 AM
To: Paul McJones
Cc: chm-snobol at CS.Arizona.EDU
Subject: Re: New Snobol/Icon history DVD

> $ find . -type f -exec cat '{}' > /dev/null \; -print
> cat: ./CSCSNO/CON.SNO: No such file or directory
> cat: ./CSCSNO/PRN.SNO: Invalid request code
> cat: ./DRH1/CON.SNO: No such file or directory
> cat: ./DRHMSC/CON.SNO: No such file or directory

Here's a ZIP file of those four files after renaming.
They're not very interesting.

There are many files on the DVD that contain colons or end with periods;
those names are also problematic according to the web page you
referenced:
    http://msdn2.microsoft.com/en-us/library/aa365247.aspx

(There are also many directories and files with embedded spaces, which cause
problems with Unix shell scripts.)

Gregg

--------------------------------------------------------------------------------
From: Paul McJones [mailto:paul at mcjones.org] 
Sent: Wednesday, April 25, 2007 6:20 PM
To: Gregg Townsend
Cc: chm-snobol at CS.Arizona.EDU
Subject: Re: New Snobol/Icon history DVD

I look forward to receiving the new DVD; I am very grateful for the work that
you are doing. By the way, I found another class of filenames on the first
edition DVD that cause problems for Windows even though they are legitimate
on Unix/Linux (and probably Macintosh): there are files where the body of
the filename (ignoring the extension) is CON or PRN; these files can't be
opened. It turns out there is a kludge in Windows going back to MS-DOS where
the built-in device names are recognized regardless of the directory or
extension part of the filename! Here's a full list from MSDN
(http://msdn2.microsoft.com/en-us/library/aa365247.aspx):

Do not use the following reserved device names for the name of a file: CON,
PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1,
LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9. Also avoid these names
followed by an extension, for example, NUL.tx7. 
Windows NT:  CLOCK$ is also a reserved device name.
Examples of files on the DVD that violate this are:

pmcjones at pmcjones-t60p /cygdrive/d/ARCHIVE_1996/SHP00
$ find . -type f -exec cat '{}' > /dev/null \; -print
cat: ./CSCSNO/CON.SNO: No such file or directory
cat: ./CSCSNO/PRN.SNO: Invalid request code
cat: ./DRH1/CON.SNO: No such file or directory
cat: ./DRHMSC/CON.SNO: No such file or directory

pmcjones at pmcjones-t60p /cygdrive/d/ARCHIVE_1996/SHP00
$

Paul

-----Original Message-----
From: Gregg Townsend [mailto:gmt at CS.Arizona.EDU] 
Sent: Tuesday, April 24, 2007 7:00 PM
To: chm-snobol at CS.Arizona.EDU
Subject: New Snobol/Icon history DVD

Greetings, all,

I have prepared a "second edition" DVD of historical Snobol and Icon files.
I'm sending copies to Henry, Mark, Paul, and Bob.  If anyone else on this
list would like a copy, please send me your postal address.

This disc differs from the first edition mainly in the archive-1996
directory.  I've renamed that to "s4-hist" after it became apparent that its
contents were collected as part of a local "Snobol History Project".

I found several copies of that SHP archive in different forms and merged
them together.  Some of these supplied new directories to join the existing
ones.  It became clear that the earlier division into SHP00 etc. was just an
artifact of packaging, so I have flattened the directory structure and
removed that level.

Some miscellaneous problems were fixed:
-- Some Unix directories originally had files that differed only by
lettercase, such as foo and Foo, and one of these was typically lost from
the DVD.  Such cases are now handled by storing two files named foo and
+Foo.
-- I managed to repair two tar files that wouldn't unpack earlier.
-- Some files that were incorrectly combined have now been properly split.
-- I removed some irrelevant files (very few -- under a dozen).
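
The lettercase problem in the first item above can be detected mechanically. The following Python sketch groups the names in one directory that differ only by case. The function name is hypothetical, and it uses simple lowercasing, which may not match every file system's own case-folding rules.

```python
from collections import defaultdict


def case_collisions(names):
    """Return groups of names that are equal when lettercase is ignored."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    # Only groups with two or more members are collisions.
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

Given the names foo, Foo, and bar, it reports one group containing Foo and foo. What to do with each group (for instance the "+" prefix described above) is left to the caller.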

What's new on this disc?  I've realized in writing this message that I don't
know the answer.  While I've spent a lot of time carefully combining and
checking archives, I'm still not very familiar with the content and I can't
really answer that question.  I'm concentrating on the collection role,
which I must fill, and I'm leaving the curatorial activities to others.

I had originally intended that this would be the "final" collection, but I
still have more archives to check.  In the last remaining set of files I've
already spotted a copy of Version 2 of Icon, which is not on this new DVD,
and I'm hoping to find a copy of Version 1, which I now have only in
hardcopy printout.

I'd suggest treating this new DVD as a replacement for the first edition;
that's the intent, and it does correct several problems.  I'm hoping that
anything more to come will be more supplemental in nature.

Gregg
