Batch downloading of the techpub library...

ZoontF
Posts: 332
Joined: Fri Nov 07, 2003 2:07 pm
Location: Middle o' Vermont

by ZoontF » Thu Dec 09, 2004 8:17 pm

Hi all,

ZaFunk mentioned that he and some other folks had wondered how to download all of the PDFs on techpubs.

I've taken a crack at the problem, and it works for me. But there is some evidence that it may not work properly for others. Here is the command.

Code: Select all

wget -r --accept="*.pdf,download.cgi*" \
--reject="browse.cgi,summary.cgi,init.cgi,help.cgi,feedback.cgi,shownew.cgi,listdocs.cgi" \
--domains=techpubs.sgi.com \
"http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?db=bks&coll=hdwr&pth=ALL" \
2> log.txt &

It works with "wget 1.9.1" on Debian Linux on my laptop, and "wget 1.9.1" on my Octane with IRIX 6.5.22m.

This command will specifically attack the hardware PDFs, and not the IRIX software, Linux, or Windows PDFs. To get those, the URL has to be changed.

I haven't done extensive testing. But, for me, it first runs for a while downloading all of the HTML pages that link to the PDFs, and only then goes back and starts getting the PDFs themselves. It may take 10-20 minutes before it gets to the PDFs.

If others who are interested could give the command a try, then we can figure out what problems, if any, exist. As written, the command redirects wget's output to a log file, which you can monitor with "tail -f log.txt".
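
For keeping an eye on progress beyond the log, a small helper along these lines may be useful. It is only a sketch and assumes that the saved files keep a .pdf extension; pdf_count is a made-up name, not part of wget.

```shell
#!/bin/sh
# Count the PDFs saved so far under a download directory (default: current).
# Assumption: wget stores the documents with a .pdf extension.
pdf_count() {
    find "${1:-.}" -type f -name '*.pdf' | wc -l | tr -d ' '
}

# Report the count and the total disk usage in kilobytes.
echo "PDFs so far: $(pdf_count .)"
du -sk . | awk '{ print "Disk usage (KB): " $1 }'
```

Run it from the directory where wget was started; without "-nd" the files end up in a techpubs.sgi.com subdirectory, which find will descend into.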

The URLs for the different techpub sections are:
-------------------------------------------------------
Hardware
http://techpubs.sgi.com/library/tpl/cgi ... wr&pth=ALL

IRIX 6.5,6.4,6.3,6.2,5.3
http://techpubs.sgi.com/library/tpl/cgi ... 50&pth=ALL
http://techpubs.sgi.com/library/tpl/cgi ... ks&pth=ALL
http://techpubs.sgi.com/library/tpl/cgi ... ks&pth=ALL
http://techpubs.sgi.com/library/tpl/cgi ... ks&pth=ALL
http://techpubs.sgi.com/library/tpl/cgi ... ks&pth=ALL

Linux
http://techpubs.sgi.com/library/tpl/cgi ... ks&pth=ALL

Windows
http://techpubs.sgi.com/library/tpl/cgi ... ks&pth=ALL
---------------------------

You should be able to put all of these links into a text file, and use that file for input to wget.

For example:

If a file "techpubs.txt" contains...

Code: Select all

http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?db=bks&coll=hdwr&pth=ALL
http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?db=bks&coll=0650&pth=ALL
http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?coll=0640&db=bks&pth=ALL
http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?coll=0630&db=bks&pth=ALL
http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?coll=0620&db=bks&pth=ALL
http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?coll=0530&db=bks&pth=ALL
http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?coll=linux&db=bks&pth=ALL
http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?coll=nt&db=bks&pth=ALL

one entry per line, then the following wget command will use it as input, and as per the "-nd" option, dump all PDFs into the current directory.

Code: Select all

wget -r --accept="*.pdf,download.cgi*" \
--reject="browse.cgi,summary.cgi,init.cgi,help.cgi,feedback.cgi,shownew.cgi,listdocs.cgi" \
--domains=techpubs.sgi.com -nd -i techpubs.txt 2> log.txt &
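
Rather than typing the list by hand, techpubs.txt could also be generated with a short loop. This is a sketch, not a tested recipe against the live site: the collection codes are the ones from the URLs above, and it assumes the CGI does not care about parameter order (the list above already mixes both orders).

```shell
#!/bin/sh
# Build techpubs.txt with one browse URL per collection.
# Collection codes are copied from the URLs listed in this post.
base="http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi"

# Start with an empty file, then append one line per collection.
: > techpubs.txt
for coll in hdwr 0650 0640 0630 0620 0530 linux nt; do
    printf '%s?coll=%s&db=bks&pth=ALL\n' "$base" "$coll" >> techpubs.txt
done

# Show how many URLs were written.
wc -l < techpubs.txt
```

To fetch only some of the collections, trim the list in the for loop before running it.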


If someone else already figured this out, sorry for duplicating effort.

jan-jaap
Donor
Posts: 4953
Joined: Thu Jun 17, 2004 11:35 am
Location: Wijchen, The Netherlands

by jan-jaap » Fri Dec 10, 2004 1:20 am

I did something like that a few months ago when I heard rumours that all pre-6.5 material would be removed. My script wasn't very sophisticated, and (accidentally) also downloaded all the manpages, so it ran for a day and a half (on a 4Mbit connection :oops: ).

I ended up with ~2.4GB of PDFs. Of course, the pre-6.5 PDFs are still there.

Another worthy target is the IRIX developer toolbox. This used to be toolbox.sgi.com, but has recently been moved to access.sgi.com, and despite several complaints, downloads are still broken. Fortunately, the individual files of an archive can still be downloaded, so I hacked something to do that. There's about 1.4GB of information to be downloaded there. Be quick, before somebody at SGI decides to remove, rather than fix it.

In general, I try to have a copy of everything I can lay my hands on, because it's going to disappear eventually, and there have been too many times that I cursed myself because I didn't download the information when I still could...

ZoontF
Posts: 332
Joined: Fri Nov 07, 2003 2:07 pm
Location: Middle o' Vermont

by ZoontF » Fri Dec 10, 2004 3:33 am

jan-jaap wrote:In general, I try to have a copy of everything I can lay my hands on, because it's going to disappear eventually, and there have been too many times that I cursed myself because I didn't download the information when I still could...


Yeah, sadly one of the web's greatest features - its dynamic and lively content - also gives rise to one of its most frustrating problems - data entropy.

When I studied computer ethics we covered a concept called "the datasphere", which loosely was the collection of all information about everything everywhere. The prime directives of the ethical system for the datasphere involved never destroying data, unless it was massively redundant, and always trying to increase the amount of data we have. This is not meant just in terms of computers, but everywhere. Killing a person is bad because, data-wise, there is only one of that person, and now the data-collection that was that person is gone. I'm not doing it justice, but the idea hopefully gets through - we don't want to lose stuff; no matter how seemingly trivial it is to some people, it is always useful in the end.

Unfortunately, this doesn't mesh well with my pocketbook and my disk capacity when it comes to the net :)

Oh well... we do what we can.

I've been kicking myself because I never made a local mirror of any of the three major NeXTSTEP software archives, and now I can't seem to get through to any of them. How will I feed my NeXTStation?

Data sphere Paper

zafunk
Posts: 1060
Joined: Thu Feb 12, 2004 11:51 pm
Location: Victoria, BC, Canada

by zafunk » Fri Dec 10, 2004 5:43 am

Just a note here. When I tried Zoontf's solution under IRIX, I found that the "--accept" parameter caused wget to return an answer of "no match". By removing it, I successfully mirrored a copy of all the PDF/HTML files that I was after.

Thanks Zoontf!

Hakimoto
Moderator
Posts: 2648
Joined: Sun Mar 30, 2003 4:29 am
Location: Nijmegen, Netherlands, Europe

by Hakimoto » Fri Dec 10, 2004 9:35 am

Cool... I used to hand pick stuff out.. but then quite a few amongst us will always keep their entire mirrors, hm?

Any size estimate on the whole techpubs? 5 GB?
The Bandito wrote:In a few years, no doubt, you'll be able to buy a computer,
software and operating system that will match the capabilities
of your current Amiga at about the price you paid for the
Amiga way back when. But you can smile to yourself, knowing
that you were touching the future years before the rest of
the world. And that other computers and operating systems
will do with brute force what the Amiga did years before with
grace, elegance and style.


Eroteme.ch - my end of the internet...

sum][one
Posts: 573
Joined: Fri Jun 06, 2003 4:25 pm
Location: Italy

by sum][one » Fri Dec 10, 2004 9:59 am

Seems like this doesn't work with my current build on the Challenge (1.8.0).. dunno which option might not be implemented in the old release.. I had no time to check the problems.. it just turned out that it doesn't work.

i'll try with a more recent release cause i was also looking for this.. thanks :)
----
:: jean-claude
:: mimgfx dot com
----

Revlef
Posts: 108
Joined: Fri Mar 19, 2004 12:25 pm
Location: Arnhem, the Netherlands

by Revlef » Fri Dec 10, 2004 12:52 pm

Hi,

This is great, I always wanted all the SGI PDFs about hardware and software, but getting them was too time consuming.

So I tried:

Code: Select all

wget -r --accept="*.pdf,download.cgi*" \
--reject="browse.cgi,summary.cgi,init.cgi,help.cgi,feedback.cgi,shownew.cgi,listdocs.cgi" \
--domains=techpubs.sgi.com \
"http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?db=bks&coll=hdwr&pth=ALL" \
"http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?coll=0640&db=bks&pth=ALL" \
"http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?coll=0630&db=bks&pth=ALL" \
"http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?coll=0620&db=bks&pth=ALL" \
"http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?coll=0530&db=bks&pth=ALL" \
2> log.txt &



great work ! it drops all the pdf files in the current directory.
i have neko_wget-1.9.1.tardist installed

great work ZoontF ! :D

GeneratriX
Posts: 4250
Joined: Tue Oct 21, 2003 2:07 am
Location: Rosario / Santa Fe / República Argentina

Ultra-Lazy...

by GeneratriX » Fri Dec 10, 2004 3:29 pm

Well, let's see...

Assuming that you're an Ultra-Lazy man about all those things not strictly related to your area of interest (for example, networks in my case!)... and always swamped with work...

What would you do if your SGI box is behind a proxy?

I can recall some kind of modifiers for wget, something like '--proxy' or similar, but I never tried them in practice... and I'm very interested in getting my own "Developers ToolBox" local mirror, including ALL of it, even the less relevant files! :D

Anyone helping with the full command line for this one? :lol:

Thanks in advance!
Cheers! ;)

ZoontF
Posts: 332
Joined: Fri Nov 07, 2003 2:07 pm
Location: Middle o' Vermont

by ZoontF » Fri Dec 10, 2004 11:25 pm

Hmmmm.... I do not have a proxy server set up right now to test this. It is a good question. I'll see if I can rig one up to test this stuff out. Out of curiosity, what kind of proxy server are you running?
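
For what it's worth, wget normally picks up a proxy from the http_proxy environment variable, so one untested possibility is to export that before running the same command. The proxy host and port below are placeholders (proxy.example.com:3128 is hypothetical), and fetch_hardware_pdfs is just a wrapper name made up for this sketch so that nothing is downloaded until you call it.

```shell
#!/bin/sh
# Hypothetical proxy address; substitute your own host and port.
http_proxy="http://proxy.example.com:3128/"
export http_proxy

# Same wget invocation as in the first post, wrapped in a function so that
# nothing is fetched until the function is called explicitly.
fetch_hardware_pdfs() {
    wget -r --accept="*.pdf,download.cgi*" \
        --reject="browse.cgi,summary.cgi,init.cgi,help.cgi,feedback.cgi,shownew.cgi,listdocs.cgi" \
        --domains=techpubs.sgi.com \
        "http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?db=bks&coll=hdwr&pth=ALL" \
        2> log.txt
}

echo "wget will use proxy: $http_proxy"
```

Calling "fetch_hardware_pdfs &" then starts the download in the background, as in the first post. Proxies that need a user name and password are not covered here.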

zone
Posts: 106
Joined: Mon Nov 15, 2004 3:43 am
Location: /usr/people/zone

by zone » Sat Dec 11, 2004 8:40 am

ZoontF wrote:
Oh well... we do what we can.

I've been kicking myself because I never made a local mirror of any of the three major NeXTSTEP software archives, and now I can't seem to get through to any of them. How will I feed my NeXTStation?


This is rather sad... for good or bad I never stepped into NeXT... but I'm highly sympathetic here, since this may happen (within our lifetime) to all of us someday with SGI and IRIX (at least on workstations, if not on big systems)...
Therefore I just keep on collecting stuff, making backups, burning CDs, etc... in short, building a personal knowledge base to be prepared for the (Biblical) day when IRIX on SGI is gone.
In my case this goes well beyond HW/SW issues, so... I even have a strategy for building alternative solar- and wind-powered energy sources with batteries to run *an emergency fraction* of my SGIs once the oil reserves of Iraq have been sucked dry and we face a Dark Age with possibly severe political restrictions, not only on power consumption (and general access to resources) as *free citizens*, but quite likely on all manifestations of individual freedom as well...
That said, IMHO it's everyone's responsibility to learn, collect, preserve and preferably develop (or contribute to new developments) as much as one can.
Thanks to Neko we have this insightful (central) meeting point to exchange and enhance our knowledge, code and sources for our preferred platform. But I believe that everyone should take as much knowledge as she/he can bear. I think that, in perspective, Neko as a valuable resource should be wget-mirrored/distributed as well, just in case...

Not that I'm pessimistic, but you may have an idea of what kind of world we live in nowadays...

ZoontF
Posts: 332
Joined: Fri Nov 07, 2003 2:07 pm
Location: Middle o' Vermont

by ZoontF » Sat Dec 11, 2004 11:30 am

Yes, I have dreamed of what it would take to mirror Nekochan as well...

zone
Posts: 106
Joined: Mon Nov 15, 2004 3:43 am
Location: /usr/people/zone

by zone » Sat Dec 11, 2004 12:26 pm

ZoontF wrote:Yes, I have dreamed of what it would take to mirror Nekochan as well...


I have no idea, but a dual Origin and an O2 *on steroids* ;) (if the Origins are going down) is the current setup, as it seems... I think it also requires Good Karma and a lot of Good Will... that may be part of the recipe...
Neko, any additions, comment please.

nekonoko
Site Admin
Posts: 8145
Joined: Thu Jan 23, 2003 1:31 am
Location: Pleasanton, California

by nekonoko » Sat Dec 11, 2004 4:33 pm

zone wrote:Neko, any additions, comment please.


The downloads are mirrored at several sites already but the forum databases can't be directly mirrored due to security concerns.
Twitter: @neko_no_ko
IRIX Release 4.0.5 IP12 Version 06151813 System V
Copyright 1987-1992 Silicon Graphics, Inc.
All Rights Reserved.

zone
Posts: 106
Joined: Mon Nov 15, 2004 3:43 am
Location: /usr/people/zone

by zone » Sat Dec 11, 2004 5:39 pm

nekonoko wrote:
zone wrote:Neko, any additions, comment please.


The downloads are mirrored at several sites already but the forum databases can't be directly mirrored due to security concerns.


I didn't think of mirroring the forum as a database; that would be a mess, if not a nightmare, to maintain. What might be nice to have archived someday is, let's say, yearly or monthly archives in a simple format, no database stuff; it could be just a huge text file, pretty much in the fashion of the old SGI FAQs in the newsgroups... something that would be a simple self-contained tar(.gz) file, searchable with plain find (C^F) in Jot and/or NEdit.

Downloads are cool and it's nice to have 'em mirrored; however, the forums are a real jewel as a knowledge base, and (maybe even) more precious, since they may be the essence of the SGI community and a great source of wisdom on the subject... I lost interest in posting to newsgroups a long time ago, once I discovered the spam penalty... So I became a lurker there, contacting people in need of answers directly instead of posting a solution/suggestion to a public board like the newsgroups... Furthermore, Google.groups NEW! unfortunately looks crippled nowadays... In the past few days I rediscovered the joy of contributing to public forums thanks to the nekochan forums, and that's nice indeed... Well, I would like to see those eight or nine leaves of the comp.sys.sgi newsgroups stay alive, but I'm afraid that this *groups NEW!* *extend and embrace* interface may make me lose interest there, even as a lurker... I don't get it, and I can't see what is really *improved* there beyond hiding the mail addresses of contributors, with search and navigation crippled... So I may decide at some point just not to waste my time with it... sad, but true.
OK, back to nekochan: this may become a better place than comp.sys.sgi.*, if it isn't already; and since it may (and will) grow into an amazing resource and community... it would be nice to have backups, archives, etc. distributed. It's not an issue if Neko has a crash lasting a few days, as we had recently with the Origins, but imagine Neko just disappearing overnight... like Milton's VW320 site... that wouldn't be fun... What's more, while we're on Milton: I sent him an email a few weeks ago, 'coz we were in contact, last time in July I think, but I got no answer from him... No big deal, you may say, but I'm asking myself: what happened to Milton, is everything OK, is he alive (I don't even like to think about him that way, but...)... Yeah, if anyone can tell me that Milton is fine, alive & kicking and involved in a new project, e.g. a farm of kangaroos, or even termites, somewhere at the South Pole... well, I would be glad to hear it.
The same applies here: people get in touch and make contacts over a very narrow bandwidth, exchanging a couple of thoughts and finding themselves in a special kind of contact. In today's shitty web this may be (is) a nice oasis to relax and find someone who understands what you're talking about, not to mention that you may also learn something there... The issue is that it doesn't take terrorists, Dubya, or even some great natural disaster... shit may happen... and if anything is valuable, the community here certainly is. Besides code, binaries, docs and everything else that can be *burned'n'archived* as hard copy, there will always be greater things, like *being there*, that are not always (completely) archivable... But what I'm getting at: a kind of public/community-distributed backup wouldn't be bad... There are many ways this could be accomplished, and we may (or should) put it to a public vote to see if there is interest in having the forums and/or other parts of nekochan.net offered on a regular basis for mirroring, in the sense of a net.backup.

Neko, sorry, I completely forgot to ask: are there any copyright or other IP-related issues that could cause problems with mirroring/distributing/providing nekochan.net, beyond a fair-use/personal-use-only policy?
Last edited by zone on Sat Dec 11, 2004 6:01 pm, edited 1 time in total.

nekonoko
Site Admin
Posts: 8145
Joined: Thu Jan 23, 2003 1:31 am
Location: Pleasanton, California

by nekonoko » Sat Dec 11, 2004 5:48 pm

zone wrote:I didn't think of mirroring the forum as a database; that would be a mess, if not a nightmare, to maintain. What might be nice to have archived someday is, let's say, yearly or monthly archives in a simple format, no database stuff; it could be just a huge text file, pretty much in the fashion of the old SGI FAQs in the newsgroups... something that would be a simple self-contained tar(.gz) file, searchable with plain find (C^F) in Jot and/or NEdit.


Sure, if someone can put together a tool to do that I don't see a problem with running it once a month or so.
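
As a starting point for such a tool, here is a rough sketch of the packaging and searching half only. It assumes the posts have already been dumped to plain-text files, one directory per month (how to get them out of the forum database is left open), and that tar supports -O for extracting to standard output, as GNU tar does. make_archive, search_archive and the 2004-12 layout are made-up names.

```shell
#!/bin/sh
# Bundle a directory of plain-text thread dumps into one .tar.gz.
# usage: make_archive <dir-of-text-files> <output.tar.gz>
make_archive() {
    (cd "$(dirname "$1")" && tar -cf - "$(basename "$1")") | gzip > "$2"
}

# Search every file in the archive without unpacking it to disk.
# Assumption: tar understands -O (write extracted files to stdout).
# usage: search_archive <archive.tar.gz> <pattern>
search_archive() {
    gzip -dc "$1" | tar -xOf - | grep -n -- "$2"
}
```

For example, "make_archive forum/2004-12 2004-12.tar.gz" followed by "search_archive 2004-12.tar.gz wget" would list the matching lines. The line numbers refer to the concatenated text, not to individual files.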

