[SCC_Active_Members] Capturing information from the WWW

Tim Shoppa shoppa at trailing-edge.com
Mon Jan 22 14:12:19 PST 2007


"H.M. Gladney" <hgladney at pacbell.net> wrote:
> Relative to the SPG emphasis on capturing stuff, doing so with a view to
> classifying, accessioning, obtaining authorization later, etc., it will from
> time to time be of interest to capture a large set of files in the directory
> tree hanging from some interesting Web Page.
>
> There probably are several available tools to accomplish this.  In case you
> are interested and have not found such a tool, the current note identifies
> one that I started using a few days ago and find convenient.
>
> It is ShareWare called SB WebCamCorder and is available via
> http://www.sb-software.com/.  You are permitted to use it free of charge,
> but (in the ShareWare spirit) are encouraged to support the developer at the
> rate of $26.95.

I strongly encourage anyone using such tools to do so in a manner
considerate of the server's network bandwidth and of any 'robots.txt'
that limits spidering.
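
For reference, robots.txt is just a plain-text file at the root of a
site that tells automated clients what they may and may not fetch.
A purely illustrative example (not taken from any real site):

    User-agent: *
    Disallow: /private/
    Disallow: /cgi-bin/

A considerate tool fetches this file first and skips the disallowed
paths.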

Looking at the online documentation for "SB WebCamCorder", I saw
nothing indicating that it obeys robots.txt (which doesn't mean it
won't), but I have banned roughly 50 spiders from my sites because
they do not obey robots.txt.

A commonly available, high-quality, open-source tool that does all
of the above and more is GNU Wget: http://www.gnu.org/software/wget/
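
As a rough sketch of a considerate capture (the URL below is only a
placeholder), something like this pulls down a directory tree while
pausing between requests and limiting bandwidth:

    wget --recursive --no-parent --wait=2 --limit-rate=100k \
         http://www.example.org/some/directory/

--no-parent keeps the retrieval inside the starting directory, and
wget honors robots.txt by default when recursing.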

Tim.
