[SCC_Active_Members] Capturing information from the WWW

Tue Jan 23 08:17:01 PST 2007

In addition to these reasons, there is a simple mechanical one: selecting
and organizing material is much easier when a wealth of candidate material
sits on a local machine rather than elsewhere.  For instance, this enables
desktop search that is orders of magnitude faster than Web search.  (The
expensive resource is the time and energy of interested experts, not any
machine resource.)

Collecting a lot of stuff onto a local machine does not imply any intention
to keep most of it locally for a long time; long-term retention is a major
purpose of the Internet Archive (IA).  From time to time we might call on IA
resources when there is no more convenient source for specific desirable
content.

By copy of this e-mail, I am asking Bruce Baumgart whether I have overlooked
pertinent IA ease-of-use features that did not exist 3 years ago.

Cheerio, Henry

-----Original Message-----
From: scc_active-bounces at computerhistory.org
[mailto:scc_active-bounces at computerhistory.org] On Behalf Of Paul McJones
Sent: Tuesday, January 23, 2007 7:22 AM
To: Larry Masinter
Cc: 'SCC at CHM'
Subject: Re: [SCC_Active_Members] Capturing information from the WWW

Here's a "random example": look for www.bobbemer.com at www.archive.org. 
It says the site's robots.txt blocked their spider. Bob Bemer had a very
interesting web site that went offline a few months after he died, in 2004.
Perhaps the Internet Archive actually has a copy, but who's to know; Al and
I and various other people made copies. The reasons the Museum can and
should maintain archives of web sites include:

    Having specific goals of what's worth keeping (curation);
    Having a commitment for long-term preservation;
    Having a commitment to work with authors of historically relevant
    server-based web sites to mirror their underlying databases, not just
    to crawl their content;
    Etc.
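
For anyone wondering how a robots.txt file ends up blocking a spider, here
is a minimal sketch in Python using the standard library's
urllib.robotparser. The robots.txt text below is hypothetical (the actual
file from www.bobbemer.com is not reproduced here), and the user-agent name
"ia_archiver" is an assumption about what the Internet Archive's crawler
identified itself as; is_allowed is just an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# "ia_archiver" is an assumed crawler name, not taken from this thread.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.bobbemer.com/index.html"
    for agent in ("ia_archiver", "SomeOtherBot"):
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, agent, page) else "blocked"
        print(agent, verdict)
```

With a file like this, a well-behaved archive crawler is shut out of the
whole site while other visitors are not, which is one reason a curated
local copy can exist where a public archive has none.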

Paul

Larry Masinter wrote:
> This is really an area where there is already an organization that 
> archives public web sites and makes the archives available,
>
> http://www.archive.org/web/web.php
>
> Why should CHM do its own spidering? What am I missing?
>
>
> Larry
>