FW: [SCC_Active_Members] Capturing information from the WWW

Fri Jan 26 21:28:51 PST 2007

 

> -----Original Message-----
> From: Jim Manley [mailto:jim_manley at hotmail.com] 
> Sent: Tuesday, January 23, 2007 12:25 AM
> To: scc_active-bounces at computerhistory.org
> Subject: RE: [SCC_Active_Members] Capturing information from the WWW
> 
> 
> Hi everyone,
> 
> I'm not sure whether it's because of changes in robots.txt 
> files, some other change in site/server configuration, or 
> funding limitations combined with continued exponential 
> growth of the WWW, but I've noticed that archive.org doesn't 
> have current archives on some sites in which I have an 
> interest (sometimes beyond our purposes here), with gaps 
> going back years, in some cases.
> 
> Also, Google has very publicly stated their interest in 
> archiving, as well as indexing, all forms of information, and 
> since they're also a very publicly-traded, for-profit 
> company, it is possible that obtaining archives from them may 
> incur financial charges at some point in the future.  There 
> are a number of other for-profit and private on-line archives 
> (their existence is public, but access is restricted), 
> particularly for rare documents, or those requiring special 
> efforts to capture (e.g., highly-detailed historic maps, 
> charts, engineering drawings, etc.) that contain extremely 
> large amounts of content that the CHM may never have 
> sufficient storage to acquire, quite apart from the intellectual 
> property issues.  Another repository for significant on-line 
> and off-line material of potentially overwhelming detail and 
> size is the library being contemplated at 
> http://www.longnow.org , including the Longserver project at 
> http://www.longserver.org , the FormatExchange project at 
> http://www.formatexchange.org , and the historical and 
> future-guesstimating Longview project at 
> http://www.longnow.org/about/longview.php (the Longview tool 
> is downloadable, BTW, and may be of use to the CHM, if it 
> isn't already being used).  Does anyone know if CHM has any 
> regular interaction with the LongNow Foundation folks, BTW?  
> I'll volunteer, if not, as I have some connections to folks 
> there from my DARPA project days.
> 
> Then, there is the ever-increasing number of sites that 
> don't lend themselves to link-following archiving techniques, 
> because their content is stored in internal database-oriented 
> web servers that rely on specific queries (think about how 
> you might try to archive Google itself - not that you would 
> for that particular site, but think about smaller 
> content-based sites that have a similar kind of public 
> interface).  There are also the classes of web-based 
> application servers which require user interaction and that 
> will likely never be able to be archived without direct 
> cooperation from the organizations that host them (e.g., 
> providing us a copy of their server-side software, which is 
> highly unlikely, even years after it's no longer publicly 
> accessible - perhaps the children/grandchildren of the 
> younger members here will be around if/when these ever become 
> available to us, assuming anyone can find the backups and 
> there are still systems capable of reading them).
> 
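The link-following limitation described above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the sample page, the example.org address, and the names LinkAndFormFinder and crawlable_links are all hypothetical and are not taken from archive.org's crawler or any other real one. A crawler of this kind can enqueue the URLs it finds in anchors, but a search form gives it only an endpoint, with no way to enumerate the queries that would expose the content stored behind it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndFormFinder(HTMLParser):
    """Collect <a href> targets and note any <form> actions on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []   # URLs a link-following crawler can enqueue
        self.forms = []   # query endpoints it can see but cannot enumerate

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "form":
            # A missing action means the form posts back to the page itself.
            self.forms.append(urljoin(self.base_url, attrs.get("action") or ""))


def crawlable_links(html, base_url):
    """Return (links, forms) found in one page of HTML."""
    finder = LinkAndFormFinder(base_url)
    finder.feed(html)
    return finder.links, finder.forms


# Hypothetical page: one static link plus a search box backed by a database.
SAMPLE_PAGE = """
<html><body>
  <a href="/about.html">About</a>
  <form action="/search" method="get">
    <input type="text" name="q">
  </form>
</body></html>
"""

links, forms = crawlable_links(SAMPLE_PAGE, "http://example.org/")
print("can follow:", links)
print("cannot enumerate:", forms)
```

Only the static "About" page ends up in the crawl queue here; everything reachable solely through the search box stays invisible unless the crawler is given queries to try or the site operator cooperates.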
> There are more problems in Heaven and Earth than are dreamt 
> of in our philosophy, Horatio :)
> 
> All the Best,
> Jim
> 
> 
> ----------------------------------------
> > From: LMM at acm.org
> > To: aek at bitsavers.org
> > Subject: RE: [SCC_Active_Members] Capturing information from the WWW
> > Date: Mon, 22 Jan 2007 21:26:55 -0800
> > CC: scc_active at computerhistory.org
> > 
> > This is really an area where there is already an organization that 
> > archives public web sites and makes the archives available,
> > 
> > http://www.archive.org/web/web.php
> > 
> > Why should CHM do its own spidering? What am I missing?
> > 
> > 
> > Larry
> 
> 