FW: [SCC_Active_Members] Capturing information from the WWW
Bernard L. Peuto
blpeuto at peuto.com
Fri Jan 26 21:28:51 PST 2007
> -----Original Message-----
> From: Jim Manley [mailto:jim_manley at hotmail.com]
> Sent: Tuesday, January 23, 2007 12:25 AM
> To: scc_active-bounces at computerhistory.org
> Subject: RE: [SCC_Active_Members] Capturing information from the WWW
>
>
> Hi everyone,
>
> I'm not sure whether it's because of changes in robots.txt
> files, some other change in site/server configuration, or
> funding limitations combined with continued exponential
> growth of the WWW, but I've noticed that archive.org doesn't
> have current archives on some sites in which I have an
> interest (sometimes beyond our purposes here), with gaps
> going back years, in some cases.
>
> Also, Google has very publicly stated their interest in
> archiving, as well as indexing, all forms of information, and
> since they're also a very publicly-traded, for-profit
> company, it is possible that obtaining archives from them may
> incur financial charges at some point in the future. There
> are a number of other for-profit and private on-line archives
> (their existence is public, but access is restricted),
> particularly for rare documents, or those requiring special
> efforts to capture (e.g., highly-detailed historic maps,
> charts, engineering drawings, etc.) that contain extremely
> large amounts of content that the CHM may never have
> sufficient storage to acquire, beyond the intellectual
> property issues. Another repository for significant on-line
> and off-line material of potentially overwhelming detail and
> size is the library being contemplated at
> http://www.longnow.org , including the Longserver project at
> http://www.longserver.org , the FormatExchange project at
> http://www.formatexchange.org , and the historical and
> future-guesstimating Longview project at
> http://www.longnow.org/about/longview.php (the Longview tool
> is downloadable, BTW, and may be of use to the CHM, if it
> isn't already being used). Does anyone know if CHM has any
> regular interaction with the LongNow Foundation folks, BTW?
> I'll volunteer, if not, as I have some connections to folks
> there from my DARPA project days.
>
> Then, there is the ever-increasing number of sites that
> don't lend themselves to link-following archiving techniques,
> because their content is stored in internal database-oriented
> web servers that rely on specific queries (think about how
> you might try to archive Google itself - not that you would
> for that particular site, but think about smaller
> content-based sites that have a similar kind of public
> interface). There are also classes of web-based application
> servers that require user interaction and that will likely
> never be archivable without direct cooperation from the
> organizations that host them (e.g.,
> providing us a copy of their server-side software, which is
> highly unlikely, even years after it's no longer publicly
> accessible - perhaps the children/grandchildren of the
> younger members here will be around if/when these ever become
> available to us, assuming anyone can find the backups and
> there are still systems capable of reading them).
>
> There are more problems in Heaven and Earth than are dreamt
> of in our philosophy, Horatio :)
>
> All the Best,
> Jim
>
>
> ----------------------------------------
> > From: LMM at acm.org
> > To: aek at bitsavers.org
> > Subject: RE: [SCC_Active_Members] Capturing information from the WWW
> > Date: Mon, 22 Jan 2007 21:26:55 -0800
> > CC: scc_active at computerhistory.org
> >
> > This is really an area where there is already an organization that
> > archives public web sites and makes the archives available,
> >
> > http://www.archive.org/web/web.php
> >
> > Why should CHM do its own spidering? What am I missing?
> >
> >
> > Larry
>
>