[SCC_Active_Members] Attic and Parlor recap (was Capturing information from the WWW)

Fri Jan 26 23:11:10 PST 2007

Len wrote:
> Paul's points are excellent.  Archive.org is doing a great service,
> and they've gotten considerably better at making acceptable copies of
> some pretty fancy websites.  But it breaks on some sophisticated
> scripts (particularly Java-based) and most dynamically-generated
> pages.  We have most of our collections catalog online, for example,
> and archive.org captures none of it.
> At 07:21 AM 1/23/2007, Paul McJones wrote:
> >Here's a "random example": look for www.bobbemer.com at
> >www.archive.org. It says the site's robots.txt blocked their spider.
> >Bob Bemer had a very interesting web site that went offline a few
> >months after he died, in 2004. Perhaps the Internet Archive actually
> >has a copy, but who's to know; Al and I and various other people
> >made copies. The reasons the Museum can and should maintain archives
> >of web sites include:
> >
> >    Having specific goals of what's worth keeping (curation);
> >    Having a commitment for long-term preservation;
> >    Having a commitment to work with authors of
> > historically-relevant server-based web sites to mirror their
> > underlying databases, not just to crawl their content;
> >    Etc.

Something that both Len and Paul touch on above, and which I know that
Al has mentioned in the past too, is that the website is not
always exactly the same as the archive contents.

This is a good thing for users if done nicely. In my "Attic and
Parlor" presentation last May I tried to lay out a general framework
for web-accessible archives:

1. The web can be used as a "presentation layer" for accessing an
archive.

2. The details of the presentation layer will depend much on the
original formatting of the data.

3. The presentation layer should translate the original formatting
of the data into something at least somewhat viewable and browsable
over the web.

4. Lots of interesting stuff [to SCC, to CHM, to the world in general]
from decades ago was recorded in funky formats that are not exactly
web-accessible. Examples: Non-ASCII character sets. Word or character
lengths larger than today's popular 8-bit byte. Even filesystems
that are not accessed by filenames, but are content-addressable, etc.

5. Examples of the web-specific translation might include hyperlinking
to other similar documents in the archive (example: other files in
the archive with the same filename), translation from line-printer
formatting to appropriate HTML, and some sort of framing
window that shows the place of the document in the archive in
the archivist's mindset.

6. A really well-indexed and cross-hyperlinked archive would link
to information outside that particular archive that is relevant.
This is a difficult goal but there is knowledge in the world
(e.g. archive.org, Google, etc.) on how to do this both automatically
and with relevance. Everybody will disagree about the automaticity
and relevance of any particular method.

7. In the "Attic and the Parlor" mindset, a well-curated parlor
can link to the deep archives on the web using consistent,
rarely-changing URLs.

Combining #6 and #7, knowing which Parlors refer to which files in
the Attic is important to the people who try to organize the
attic. Looking at webserver logs with simple tools helps me do
this.
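
A minimal sketch of such a simple log tool, in Python. It assumes the
webserver writes the common "combined" access-log format (with a
referrer field). The prefix "/attic/" and the host name
"attic.example.org" are hypothetical placeholders, not names from any
real archive. It counts which outside hosts (Parlors) linked to which
Attic files:

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed input: combined log format, i.e.
#   host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]*\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)"'
)


def parlor_links(lines, attic_prefix="/attic/", own_host="attic.example.org"):
    """Count (referring host, attic path) pairs for successful requests.

    attic_prefix and own_host are hypothetical defaults for illustration.
    """
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or not m["status"].startswith("2"):
            continue  # unparseable line, or not a 2xx response
        path = m["path"]
        if not path.startswith(attic_prefix):
            continue  # not a request for an Attic file
        host = urlsplit(m["referer"]).hostname
        if host is None or host == own_host:
            continue  # no referrer ("-"), or a link from the Attic itself
        counts[(host, path)] += 1
    return counts
```

Calling parlor_links(open("access.log")) and printing
counts.most_common(20) would list the most-followed Parlor-to-Attic
links. Referrer data is only as good as what browsers choose to send,
so the counts are a hint, not a census.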

8. Links to tools and the "raw bits" as necessary and appropriate.
The tools used to build the website should be provided in source
form, along with the "raw bits", to aid in mirroring the archive
contents (as opposed to simple mirroring of the presentation layer).
The "raw bits" are often already there at many web-based archives
because these are used in the various common simulator/emulators out
there. In other cases the "raw bits" are in a more abstract form
(more complicated than a raw disk image or raw tape image).

Combining #8 and #1-#5, the construction of emulators which
understand the "raw bits", as well as tools that live outside
the emulation world and more in the translation world, are things
that a good archive/presentation website will often consider
primary goals. There is real dynamism possible: interest
in old architectures, along with tools and emulators, can quickly
snowball into a world-wide effort to share and develop emulators
and tools for the raw bits. It is exciting to be a part of these efforts.
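
To make the "translation world" concrete, here is a small Python
sketch of two such tools. Both rest on assumptions that are not from
any particular archive: the first assumes 36-bit words packed with six
6-bit characters in the DEC SIXBIT code (character = code + 40 octal
in ASCII), high-order character first; the second assumes line-printer
output where a form feed starts a new page. Other machines and
listings will need other tables and rules.

```python
import html


def sixbit_word_to_text(word):
    """Decode one 36-bit word holding six 6-bit characters.

    Assumes the DEC SIXBIT code, high-order character first.
    """
    if not 0 <= word < (1 << 36):
        raise ValueError("expected a 36-bit value")
    chars = []
    for shift in range(30, -1, -6):
        code = (word >> shift) & 0o77    # next 6-bit field
        chars.append(chr(code + 0o40))   # SIXBIT 0 maps to ASCII space
    return "".join(chars)


def lineprinter_to_html(text):
    """Wrap line-printer text in <pre> blocks, one per printed page.

    Assumes a form feed marks a page break; overstrikes and carriage
    control columns are not handled in this sketch.
    """
    pages = text.split("\f")
    blocks = ['<pre class="page">{}</pre>'.format(html.escape(p)) for p in pages]
    return "\n<hr>\n".join(blocks)
```

For example, sixbit_word_to_text(0o504554545700) gives "HELLO "
(with the trailing space that pads the word). The point of keeping
such tools in source form next to the raw bits is that the HTML can
always be regenerated, while the reverse is not true.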

9. It is important to consider that the web has only been around
for a short time compared with the material being archived, and that
other presentation layers may become relevant or may already be more
relevant. Trivial example already true: lots of documents are
exchanged not in HTML but in PDF. Google is ahead of us in at least
some cases by being able to crawl non-HTML document formats, but
despite their efforts some file formats are a lot harder to
crawl/search than HTML.

The 9 points above were what I was trying to push across at
"The Attic and the Parlor" last May. I try to adhere to the
9 points in the archives I maintain, but I do not reach a level
of personal satisfaction in every single respect, and I must
do even worse in how others view my archives.

Larry asked:

>>>Why should CHM do its own spidering? What am I missing?

The CHM/SCC should spider the various archives already out there
with specific attention to my item #8 above: having both
the raw bits (for the upcoming day when the web is an anachronism
and not the latest hippest thing - some in Silicon Valley
say we reached this stage years ago!) and the tools. As Al
and others have pointed out, the various existing web spiders
do not necessarily understand that the "raw bits" and tools are
the basis of each archive, and that the processed documents
are just a presentation layer.

One could imagine (and I just barely touched on this last May;
I'm not sure anyone in the audience realized what I was
trying to say) a list of CHM/SCC-suggested tags which are
content-rich (as opposed to presentation-layer type stuff)
to be used in web-accessible archives and which are relevant
to finding the following:

A: Raw bits
B: Tools
C: Context in that archive
D: Context in other "Attics" and "Parlors"
E: Relevance to the world
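
As an illustration only, here is one way the A-E list could be carried
in a web page, sketched in Python. No such CHM/SCC tag list exists
yet: the "attic.*" tag names, the field names, and the choice of HTML
meta elements are all hypothetical. A spider that knew the names could
pull out the raw bits and tools without having to guess at the
presentation layer.

```python
import html
from dataclasses import dataclass, field


@dataclass
class ArchiveTags:
    """Hypothetical content tags for one document in an archive."""
    raw_bits: list = field(default_factory=list)          # A: URLs of raw images
    tools: list = field(default_factory=list)             # B: URLs of tool sources
    archive_context: str = ""                             # C: place in this archive
    related_archives: list = field(default_factory=list)  # D: other Attics/Parlors
    relevance: str = ""                                   # E: why it matters

    def to_meta(self):
        """Render the tags as HTML <meta> elements, one per line."""
        pairs = [("attic.raw-bits", u) for u in self.raw_bits]
        pairs += [("attic.tool", u) for u in self.tools]
        if self.archive_context:
            pairs.append(("attic.context", self.archive_context))
        pairs += [("attic.related", u) for u in self.related_archives]
        if self.relevance:
            pairs.append(("attic.relevance", self.relevance))
        return "\n".join(
            '<meta name="{}" content="{}">'.format(name, html.escape(value, quote=True))
            for name, value in pairs
        )
```

A page generator would call to_meta() once per document and place the
result in the page head; empty fields simply produce no tag.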

As a follow-up: I really enjoyed casting the archives I maintain
into the "Attic and the Parlor" context and meeting those who
run other attics and parlors last May. Most of what I write
above I had never really attempted to say, write down, or present
to the world all in one place. I hope the SCC and
CHM continue the "Attic and the Parlor" context and expand
to other meetings of minds in other frameworks too.

Tim.