[SCC_Active_Members] Capturing information from the WWW
Len Shustek
len at shustek.com
Tue Jan 23 08:18:46 PST 2007
Paul's points are excellent. Archive.org is doing a great service,
and they've gotten considerably better at making acceptable copies of
some pretty fancy websites. But its capture breaks on sophisticated
scripts (particularly Java-based ones) and on most dynamically-generated
pages. We have most of our collections catalog online, for example,
and archive.org captures none of it.
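To make the failure mode concrete, here's a rough sketch (in Python; the
start URL and page limit are just placeholders, not any real spider's
configuration) of the kind of static, link-following crawl these spiders
do. A walk like this only saves pages reachable through plain <a href>
links, so catalog entries that exist only behind a search form or a
script-generated query are never even on its list to fetch.

    # Rough sketch of a static, link-following crawl; the URL and limits
    # here are illustrative only.
    import urllib.parse
    import urllib.request
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        # Collects the href target of every <a> tag on a fetched page.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += [v for k, v in attrs if k == "href" and v]

    def crawl(start_url, max_pages=50):
        # Breadth-first walk over statically linked pages only; anything
        # that requires submitting a form or running a script is never
        # reached.
        seen, queue, captured = set(), [start_url], {}
        while queue and len(captured) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    page = resp.read().decode("utf-8", errors="replace")
            except (OSError, ValueError):
                continue  # unreachable or non-HTTP link; nothing to save
            captured[url] = page
            parser = LinkExtractor()
            parser.feed(page)
            queue += [urllib.parse.urljoin(url, h) for h in parser.links]
        return captured

    if __name__ == "__main__":
        pages = crawl("http://example.com/")
        print(len(pages), "statically linked pages captured")

If the individual catalog records are reachable only through the search
form, a crawl like this comes back with the static index pages and
nothing else.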
-- Len
At 07:21 AM 1/23/2007, Paul McJones wrote:
>Here's a "random example": look for www.bobbemer.com at
>www.archive.org. It says the site's robots.txt blocked their spider.
>Bob Bemer had a very interesting web site that went offline a few
>months after he died, in 2004. Perhaps the Internet Archive actually
>has a copy, but who's to know; Al and I and various other people
>made copies. The reasons the Museum can and should maintain archives
>of web sites include:
>
> Having specific goals of what's worth keeping (curation);
> Having a commitment for long-term preservation;
> Having a commitment to work with authors of
> historically-relevant server-based web sites to mirror their
> underlying databases, not just to crawl their content;
> Etc.
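For anyone unfamiliar with the blocking Paul mentions above: it takes
very little. A robots.txt at the site root with a blanket rule, or one
aimed specifically at the Internet Archive's crawler (which identifies
itself as ia_archiver), is enough to keep the Wayback Machine from
showing the site. The file below is purely illustrative; it is not the
one actually served from www.bobbemer.com.

    # Illustrative robots.txt only, not the real file from www.bobbemer.com.
    # A blanket rule like this keeps all well-behaved crawlers out:
    User-agent: *
    Disallow: /

    # Or, aimed only at the Internet Archive's crawler:
    User-agent: ia_archiver
    Disallow: /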
>
>
>Paul
>
>Larry Masinter wrote:
>>This is really an area where there is already an organization
>>that archives public web sites and makes the archives available:
>>
>>http://www.archive.org/web/web.php
>>
>>Why should CHM do its own spidering? What am I missing?
>>
>>
>>Larry
>>