FW: [SCC_Active_Members] Request: a program for managing duplicates

H.M. Gladney hgladney at gmail.com
Thu Apr 26 11:16:50 PDT 2007


FYI, Hans Pufal (in Grenoble, France) and I just discussed by telephone my
request and his proposed program.  As I understand it, what's wanted can be
accomplished in a fairly brief AWK script.  Furthermore, he's been working
with Al Kossow on something similar for collections the CHM holds.

Specifically, we discussed technical aspects that would make the program
useful in whatever similar situations we can anticipate today, with a view
both to one-time collection clean-up and to the eventual convenience of
virtual museum visitors (e.g., in extracting parts of a collection into
directories/files in their own OS environment).

Hans volunteered to upgrade a similar program that he has in hand, both to
meet the requirements we discussed and with a view to potential third-party
tailoring (i.e., help with internal documentation).  He plans to write this
program over the weekend.  Then I will test it, and he and I will discuss
what we have with a view to refinements.  The first version, at least, might
leave unsatisfied some anticipated requirements that are not short-term
needs, but these would be identified in the program's documentation.  For
instance, handling of references has some subtleties that seem unimportant
now, but might not be unimportant in some distant future.  Hans did identify
one addition to my specification: providing a log of all changes made, so
that an eventual user bothered by a change could reverse it.
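
To make the "references plus change log" idea concrete, here is a minimal
sketch in Python.  It is not Hans's program (which is to be an AWK script);
it assumes that a "reference" is a symbolic link, that the platform supports
symlinks, and that the log is a tab-separated text file.  The function names
and the log format are hypothetical, chosen only for illustration.

```python
import os
import shutil


def replace_with_links(groups, log_path):
    """For each group of identical files, keep the first and symlink the rest.

    Every change is appended to `log_path` as "<replaced path>\t<kept path>".
    Assumption: file names contain no tab or newline characters.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        for group in groups:
            kept = os.path.abspath(group[0])
            for duplicate in group[1:]:
                # Create the link under a temporary name first, then swap it
                # into place, so the duplicate is never simply deleted.
                temp_link = duplicate + ".tmp-link"
                os.symlink(kept, temp_link)
                os.replace(temp_link, duplicate)
                log.write(f"{duplicate}\t{kept}\n")


def undo_from_log(log_path):
    """Reverse logged changes by copying the kept file back over each link."""
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            replaced, kept = line.rstrip("\n").split("\t")
            if os.path.islink(replaced):
                os.remove(replaced)
                shutil.copy2(kept, replaced)
```

Each group passed in is a list of paths already known to be identical.  The
subtleties mentioned above (for example, what happens to a reference when
the kept file is later moved) are deliberately not handled here.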


Eventually the program being discussed might be placed in a museum
collection of tools useful to museums.

We would welcome further suggestions.

Cheerio, Henry

-----Original Message-----
From: hans at pufal.net [mailto:hans at pufal.net] 
Sent: Thursday, April 26, 2007 10:26 AM
To: chm-snobol at CS.Arizona.EDU; 'H . M . Gladney'
Subject: Re: [SCC_Active_Members] Request: a program for managing duplicates

On Thu Apr 26 10:09, "H.M. Gladney" sent:
> Request: a program for managing duplicates
> Does any recipient of this note know of a program that can detect
> duplicate files in a collection and, for each replicated instance,
> replace all occurrences but one with references to the one remaining?
> If no such program can be found, does some recipient have the skills
> and willingness to create and share such a program?

I can provide some input on this, having done something similar recently.  I
tackled the duplicate file problem by generating a list of all the files,
then using a simple AWK script to generate the MD5 checksum of each.  Sorting
on the checksums and running uniq shows up all duplicate files.  If this is
too cryptic and you have AWK available, I can build you an AWK script which
does all this in one swoop.  Another simple AWK script can read the file
list and rename "problematic" file names, specifically replacing spaces with
'_'.  Again, I can provide a script if necessary.  Alternatively, if you can
alter the scripts you are using, enclose all filenames in double quotes;
this gets rid of most (but not all) problems.
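
For readers without AWK to hand, the checksum approach described above can
be sketched in Python roughly as follows.  This is an illustration under
stated assumptions, not the script offered in this message: the function
names are hypothetical, symbolic links are skipped, and files are grouped in
a dictionary rather than sorted and passed through uniq.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group regular files under `root` by checksum.

    Returns a list of groups; each group is a sorted list of paths that
    share one MD5 checksum.  Files with a unique checksum are omitted.
    """
    by_checksum = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_checksum[md5_of_file(path)].append(path)
    return [sorted(paths) for paths in by_checksum.values() if len(paths) > 1]


def safe_name(name):
    """Replace each space in a file name with an underscore."""
    return name.replace(" ", "_")
```

Calling find_duplicates on a collection's top directory returns the groups
of candidate duplicates.  Equal checksums are strong evidence, not proof, of
identical content, so a cautious tool would compare the files byte for byte
before changing anything.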

-- Hans  