[SCC_Active_Members] Capturing information from the WWW

Len Shustek len at shustek.com
Tue Jan 23 08:18:46 PST 2007


Paul's points are excellent.  Archive.org is doing a great service, 
and it has gotten considerably better at making acceptable copies of 
some pretty fancy websites.  But its crawler still breaks on some 
sophisticated scripts (particularly Java-based ones) and on most 
dynamically generated pages.  We have most of our collections catalog 
online, for example, and archive.org captures none of it.
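
For what it's worth, the robots.txt blocking that Paul mentions below 
is easy to check programmatically.  Here is a rough sketch in Python 
using the standard urllib.robotparser module; the "ia_archiver" user 
agent is the one the Internet Archive's crawler has historically used, 
and the bobbemer.com URL is simply Paul's example (this assumes the 
site were still reachable):

    import urllib.robotparser

    # Fetch and parse the site's robots.txt (assumes the host still responds).
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.bobbemer.com/robots.txt")
    rp.read()

    # A "User-agent: ia_archiver" (or "User-agent: *") section containing
    # "Disallow: /" makes this return False, which is why the Wayback
    # Machine reports the site as blocked by its robots.txt.
    print(rp.can_fetch("ia_archiver", "http://www.bobbemer.com/"))

The flip side is that a site owner who wants to be archived can grant 
ia_archiver access explicitly, which is the sort of thing we could ask 
authors of historically important sites to do.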
-- Len


At 07:21 AM 1/23/2007, Paul McJones wrote:
>Here's a "random example": look for www.bobbemer.com at 
>www.archive.org. It says the site's robots.txt blocked their spider. 
>Bob Bemer had a very interesting web site that went offline a few 
>months after he died, in 2004. Perhaps the Internet Archive actually 
>has a copy, but who's to know; Al and I and various other people 
>made copies. The reasons the Museum can and should maintain archives 
>of web sites include:
>
>    Having specific goals of what's worth keeping (curation);
>    Having a commitment for long-term preservation;
>    Having a commitment to work with authors of 
> historically-relevant server-based web sites to mirror their 
> underlying databases, not just to crawl their content;
>    Etc.
>
>
>Paul
>
>Larry Masinter wrote:
>>This is really an area where there is already an organization
>>that archives public web sites and makes the archives available,
>>
>>http://www.archive.org/web/web.php
>>
>>Why should CHM do its own spidering? What am I missing?
>>
>>
>>Larry
>>
>>