[SCC_Active_Members] Capturing information from the WWW
Gordon Bell
gbell at microsoft.com
Tue Jan 23 13:22:43 PST 2007
I was looking at this to solve my own problem.
My own Microsoft site is at Archive.org.
Was really being lazy and wanting the museum to offer a service to host dusty old and deceased artifacts and people...like museums do.
Was just trying to solve my own problem --- so will get myself a site to maintain or not bother with it on leaving Microsoft.
I have several domains including MyLifeBits.com that I can use.
g
-----Original Message-----
From: scc_active-bounces at computerhistory.org [mailto:scc_active-bounces at computerhistory.org] On Behalf Of Len Shustek
Sent: Tuesday, January 23, 2007 8:19 AM
To: Paul McJones; Larry Masinter
Cc: 'SCC at CHM'
Subject: Re: [SCC_Active_Members] Capturing information from the WWW
Paul's points are excellent. Archive.org is doing a great service,
and they've gotten considerably better at making acceptable copies of
some pretty fancy websites. But it breaks on some sophisticated
scripts (particularly Java-based) and most dynamically-generated
pages. We have most of our collections catalog online, for example,
and archive.org captures none of it.
-- Len
At 07:21 AM 1/23/2007, Paul McJones wrote:
>Here's a "random example": look for www.bobbemer.com at
>www.archive.org. It says the site's robots.txt blocked their spider.
>Bob Bemer had a very interesting web site that went offline a few
>months after he died, in 2004. Perhaps the Internet Archive actually
>has a copy, but who's to know; Al and I and various other people
>made copies. The reasons the Museum can and should maintain archives
>of web sites include:
>
> Having specific goals of what's worth keeping (curation);
> Having a commitment for long-term preservation;
> Having a commitment to work with authors of
> historically-relevant server-based web sites to mirror their
> underlying databases, not just to crawl their content;
> Etc.
>
>
>Paul
>
>Larry Masinter wrote:
>>This is really an area where there is already an organization
>>that archives public web sites and makes the archives available,
>>
>>http://www.archive.org/web/web.php
>>
>>Why should CHM do its own spidering? What am I missing?
>>
>>
>>Larry
>>
>>
>>_______________________________________________
>>SCC_active mailing list
>>SCC_active at computerhistory.org
>>http://mail.computerhistory.org/mailman/listinfo/scc_active
>>