Advice needed on techpubs access: designing a sqlite database


by dexter1 » Fri May 13, 2016 3:28 am

To re-host the techpubs documentation in some form or another, I've been thinking about building a sqlite database for all the documents I've gobbled up using Jan-Jaap's scripts. That way I can write a script which retrieves the correct pdf when you enter a title or keyword or, the horror, a PHP web-frontend-monstrosity.

This database should store basic information for each SGI document part number: title, directory location, pdf hash and size, html.tgz hash and size (if available), and keywords.
Filling the database is not that hard and I'm in the process of writing a python script to automate this. Jan-Jaap's shell script and C program already provide an index.txt which lists the part numbers together with the document titles, which is really helpful. Alternatively, pdftotext can convert the first page into something closely resembling a document title.
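
A rough sketch of the pdftotext fallback, assuming poppler's `pdftotext` is on the PATH. The helper names `first_page_text` and `guess_title` are made up for illustration, and taking the first couple of non-empty lines is only a guess at what a title page looks like, so the result would still need eyeballing:

```python
import subprocess

def first_page_text(pdf_path):
    """Return the text of page 1 via pdftotext (-f/-l select the page range, '-' means stdout)."""
    result = subprocess.run(
        ["pdftotext", "-f", "1", "-l", "1", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout

def guess_title(page_text, max_lines=2):
    """Join the first few non-empty lines of the page as a title guess (heuristic, not reliable)."""
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    return " ".join(lines[:max_lines])
```

Where index.txt already has a title for a part number, that is probably the better source; this would only fill the gaps.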

I've created the following layout:

Code:

docs table:
partnr title location pdfsize pdfhash pages hastgz tgzsize tgzhash
text   text  text     int     text    int   int    int     text

keywords table:
 id partnr keyword
int text   text

The docs table has one unique entry per SGI document part number. Hashes are straight-up md5sum or any other flavor which is convenient. The keywords table pairs part numbers with keyword strings.
The datatypes are in sqlite format. There are no BOOLs or NVARCHAR() constructs, since sqlite has no separate boolean type and ignores declared string lengths anyway.
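
For reference, the layout above could be created from python's built-in sqlite3 module roughly like this. The PRIMARY KEY, the foreign key and the index on keyword are additions for illustration and not part of the layout as posted; `open_db` is a made-up helper name:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS docs (
    partnr   TEXT PRIMARY KEY,           -- one row per part number
    title    TEXT NOT NULL,
    location TEXT,
    pdfsize  INTEGER,
    pdfhash  TEXT,
    pages    INTEGER,
    hastgz   INTEGER NOT NULL DEFAULT 0, -- 0/1, sqlite has no boolean type
    tgzsize  INTEGER,
    tgzhash  TEXT
);
CREATE TABLE IF NOT EXISTS keywords (
    id      INTEGER PRIMARY KEY,         -- assigned automatically when omitted
    partnr  TEXT NOT NULL REFERENCES docs(partnr),
    keyword TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS keywords_keyword ON keywords(keyword);
"""

def open_db(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # sqlite only enforces REFERENCES when asked
    conn.executescript(SCHEMA)
    return conn
```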

A single manual entry should look like this in the two tables:

Code:

docs table:
"007-2849-004", "Indigo 2 IMPACT Workstation Owner’s Guide", "manuals_hdwr/2000/007-2849-004/", 5862232, "6c436c40325fc2f9b0711b4f9e142ef2", 380, 1, 1251967, "8d1afbc539ec73857574ca2fe86ed693"

keywords table:
1, "007-2849-004", "indigo2"
2, "007-2849-004", "impact"
3, "007-2849-004", "owner"
4, "007-2849-004", "hardware"
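
A minimal sketch of loading that example entry and looking it up again by keyword or title fragment, using an in-memory database. The function name `find_docs` and the choice to match keywords exactly but titles by substring are assumptions, not a settled design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE docs (partnr TEXT PRIMARY KEY, title TEXT, location TEXT,
                   pdfsize INTEGER, pdfhash TEXT, pages INTEGER,
                   hastgz INTEGER, tgzsize INTEGER, tgzhash TEXT);
CREATE TABLE keywords (id INTEGER PRIMARY KEY, partnr TEXT, keyword TEXT);
""")

# The example manual from the post, passed as bound parameters.
conn.execute(
    "INSERT INTO docs VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("007-2849-004", "Indigo 2 IMPACT Workstation Owner’s Guide",
     "manuals_hdwr/2000/007-2849-004/", 5862232,
     "6c436c40325fc2f9b0711b4f9e142ef2", 380, 1, 1251967,
     "8d1afbc539ec73857574ca2fe86ed693"),
)
conn.executemany(
    "INSERT INTO keywords (partnr, keyword) VALUES (?, ?)",
    [("007-2849-004", kw) for kw in ("indigo2", "impact", "owner", "hardware")],
)

def find_docs(conn, term):
    """Return (partnr, title, location) rows whose keyword equals term or whose title contains it."""
    return conn.execute(
        """SELECT DISTINCT d.partnr, d.title, d.location
           FROM docs d LEFT JOIN keywords k ON k.partnr = d.partnr
           WHERE k.keyword = ? COLLATE NOCASE OR d.title LIKE ?
           ORDER BY d.partnr""",
        (term, f"%{term}%"),
    ).fetchall()
```

Calling `find_docs(conn, "impact")` returns the single row for 007-2849-004, and the location column is what a retrieval script would use to find the pdf on disk.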


I have a few comments and questions:

- There are language variants of certain documents, so the SGI part number format isn't fixed to ???-????-??? but may also be ???-????-???JP or a three-letter variant thereof. This needs to be taken into account. Should I add a language column to the docs table, or shall I just list the language-specific documents separately?

- I am not sure if all .html.tgz documents have a unique .pdf counterpart. Maybe Jan-Jaap or any other techpubs collector can comment on this. If there are exceptions, I might have to change the hastgz flag into a format code: 1 = has .html.tgz only, 2 = has .pdf only, 3 = has both. That would also allow 0 for a manual with no digital form at all, which is kind of important if we discover old documents.

- Are SGI document part numbers unique in themselves, i.e. does every SGI document with a specific part number have the same hash? I have seen SGI orphan documents by simply deleting them and replacing them with a new document, as I reported in viewtopic.php?t=6345 . At least SGI bumped the last three digits to -003 in that particular case. Jan-Jaap's script does take orphaned documents into account, but I'm not sure if there are variants of documents floating around with different hashes.
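
To make the three questions above a bit more concrete, here is a small sketch of how each could be handled in the import script. It assumes the ? positions are always digits and the language suffix is two or three capital letters; `split_partnr`, `Formats` and `conflicting_partnrs` are names invented for this example:

```python
import re
from collections import defaultdict
from enum import IntFlag

# Question 1: pull an optional language suffix off the part number.
PARTNR_RE = re.compile(r"^(\d{3})-(\d{4})-(\d{3})([A-Z]{2,3})?$")

def split_partnr(partnr):
    """Split e.g. '007-2849-004JP' into base number, revision and language suffix."""
    m = PARTNR_RE.match(partnr.strip())
    if m is None:
        raise ValueError(f"unexpected part number format: {partnr!r}")
    prefix, number, revision, lang = m.groups()
    return {"base": f"{prefix}-{number}", "revision": revision, "lang": lang}

# Question 2: the 0/1/2/3 code is really two independent bits.
class Formats(IntFlag):
    NONE = 0  # known manual, no digital form
    TGZ = 1   # .html.tgz available
    PDF = 2   # .pdf available

# Question 3: find part numbers that turn up with more than one hash.
def conflicting_partnrs(copies):
    """copies: iterable of (partnr, hash) pairs, e.g. collected from several mirrors."""
    seen = defaultdict(set)
    for partnr, digest in copies:
        seen[partnr].add(digest)
    return {p: sorted(h) for p, h in seen.items() if len(h) > 1}
```

Stored as an integer, `Formats.PDF | Formats.TGZ` is 3, so the numbering matches the codes proposed above and a further format could later take the next bit (4). If `conflicting_partnrs` ever returns something, the part number alone is not a safe primary key and the hash would have to become part of it.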

I'll keep plodding away on this, but the database design is kind of an important step, hence my call for input. I don't want to redesign it many times and I'm not a database guru, so help is appreciated.

by jan-jaap » Fri May 13, 2016 4:52 am

That mirror script of mine was really a rather unsophisticated hack. But it worked and that's something. I wanted to avoid (re)downloading files when they are in multiple categories so that's why it used a pool + hard link concept. Anything in the pool but not in any collection must then be orphaned.
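
The pool + hard link idea can be sketched in a few lines of python. This is not the original mirror script, just an illustration of the concept, assuming a flat pool directory of files on a filesystem that supports hard links; both function names are made up:

```python
import os

def link_into_collection(pool_dir, collection_dir, filename):
    """Hard-link pool_dir/filename into collection_dir instead of storing a second copy."""
    os.makedirs(collection_dir, exist_ok=True)
    target = os.path.join(collection_dir, filename)
    if not os.path.exists(target):
        os.link(os.path.join(pool_dir, filename), target)
    return target

def orphans(pool_dir):
    """Pool files with a link count of 1 are not referenced by any collection."""
    return sorted(
        name for name in os.listdir(pool_dir)
        if os.stat(os.path.join(pool_dir, name)).st_nlink == 1
    )
```

Each file is downloaded once into the pool, every collection it belongs to gets a link, and whatever is left with a single link is an orphan.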

I later found that the techpubs site had a booklist.txt file for every collection (os_version / category):

Code:

<COLLECTION>
<BOOK NAME="ASO_UG" ALIAS="Audio/Serial Option User's Guide" SGITYPE="sgi_html" SGIVERSION="007-2645-001" SGIBKADDR="techpubs@sgi.com" SGIGROUP="HTML">
</BOOK>
<BOOK NAME="ChallS_OG" ALIAS="CHALLENGE S Server Owner's Guide" SGITYPE="sgi_html" SGIVERSION="007-2314-002, 11,94 " SGIBKADDR="techpubs@sgi.com" SGIGROUP="HTML">
</BOOK>
<BOOK NAME="ClrC_AG" ALIAS="CASEVision/ClearCase Administration Guide" SGITYPE="sgi_html" SGIVERSION="007-1774-020 6/94" SGIBKADDR="techpubs@sgi.com" SGIGROUP="HTML">
</BOOK>
<BOOK NAME="Diskless_AG" ALIAS="Diskless Workstation Administration Guide" SGITYPE="sgi_html" SGIVERSION="007-0855-030, 08/93" SGIBKADDR="techpubs@sgi.com" SGIGROUP="HTML">
</BOOK>

....

<BOOK NAME="XFS_AG" ALIAS="Getting Started With XFS Filesystems" SGITYPE="sgi_html" SGIVERSION="007-2549-001" SGIBKADDR="techpubs@sgi.com" SGIGROUP="HTML">
</BOOK>

</COLLECTION>

(This is for IRIX 5.3 / SGI_Admin.) So clearly there was a more efficient way to mirror the site than traversing the whole thing 'wget' style. I have these booklist.txt files. Maybe these tags can also be used to categorize things? I wouldn't toss part# and release date into one field, though.
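
As an illustration of pulling those tags apart, here is a sketch that reads the BOOK entries with regular expressions and splits SGIVERSION into part number and release date. It only assumes what is visible in the excerpt above (double-quoted attributes, a ???-????-??? part number at the start of SGIVERSION); the dict keys and the function name are invented, and the date is kept as the raw string because the formats vary:

```python
import re

BOOK_RE = re.compile(r"<BOOK\s+([^>]*)>")     # opening tags only, </BOOK> does not match
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
PARTNR_RE = re.compile(r"\d{3}-\d{4}-\d{3}")

def parse_booklist(text):
    """Return one dict per BOOK entry with part number and release date separated."""
    books = []
    for tag in BOOK_RE.finditer(text):
        attrs = dict(ATTR_RE.findall(tag.group(1)))
        version = attrs.get("SGIVERSION", "")
        m = PARTNR_RE.search(version)
        partnr = m.group(0) if m else None
        # Whatever follows the part number ("11,94", "6/94", "08/93") is kept as-is.
        released = version[m.end():].strip(" ,") if m else version.strip()
        books.append({
            "name": attrs.get("NAME"),
            "title": attrs.get("ALIAS"),
            "partnr": partnr,
            "released": released or None,
        })
    return books
```

For the CHALLENGE S entry this yields part number 007-2314-002 and release string "11,94"; entries without a date get `None`.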

On a more general note: TechPubs wasn't the only source of SGI documentation. You could say it was the online equivalent of the IRIS Insight / InfoSearch you've got on every IRIX system since IRIX v4 or so. At some point in the 6.5.x lifecycle they switched from Insight to an HTML (browser-based) system. It would even convert your existing Insight manuals to HTML. And InfoSearch will make your manpages and relnotes available as a 'website'.

So, given a comprehensive (ahem) collection of IRIX installation media you have this available:
IRIX v3.x: manpages + relnotes -> HTML
IRIX v4.x: manpages + relnotes -> HTML + Manuals in Insight format -> HTML
IRIX v5.x: manpages + relnotes -> HTML + Manuals in Insight format -> HTML + TechPubs manuals
IRIX v6.x: manpages + relnotes -> HTML + Manuals in Insight format -> HTML + TechPubs manuals

Techpubs had manuals beyond what you get with IRIX media. TechPubs had manpages and relnotes too. I have them, but 'wget' mirrored style, so they would have to be "unprocessed" to reproduce the original text documents.

But if you toss everything into the mix, you can create extra functionality, e.g. navigate between multiple versions of a single document.
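
One way to read "multiple versions of a single document" is by part number: treat the first two fields as the document and the last three digits as the revision. That is an assumption for this sketch, not something the site guaranteed, and the helper names are invented:

```python
from collections import defaultdict

def group_revisions(partnrs):
    """Map each base number (first two fields) to its part numbers, lowest revision first."""
    groups = defaultdict(list)
    for partnr in partnrs:
        base, _, revision = partnr.rpartition("-")
        groups[base].append((revision, partnr))
    return {base: [p for _, p in sorted(revs)] for base, revs in groups.items()}

def neighbours(partnrs, partnr):
    """Return (previous, next) revision of partnr, with None at either end of the chain."""
    chain = group_revisions(partnrs)[partnr.rpartition("-")[0]]
    i = chain.index(partnr)
    prev = chain[i - 1] if i > 0 else None
    nxt = chain[i + 1] if i + 1 < len(chain) else None
    return prev, nxt
```

A front-end could use `neighbours` to put "older" and "newer" links on each document page, regardless of which source a given revision came from.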

I've been toying with the idea of reviving something like this. But I don't think putting a real IRIX InfoSearch server online in 2016 is a sane idea.

by jan-jaap » Fri May 13, 2016 5:22 am

dexter1 wrote: - I am not sure if all .html.tgz documents have a unique .pdf counterpart. Maybe Jan-Jaap or any other techpubs collector can comment on this. If there are exceptions, I might have to change the hastgz flag into a format code: 1 = has .html.tgz only, 2 = has .pdf only, 3 = has both. That would also allow 0 for a manual with no digital form at all, which is kind of important if we discover old documents.

There are many PDFs which do not have a corresponding .tgz. But the other way around happens too, e.g. 007-1733-040 is html only. You might consider "Insight" a third format.

by tingo » Wed Jun 08, 2016 4:47 am

Quick question: why wouldn't a wiki (MediaWiki?) be a suitable way to host the info (and maybe the PDF files too)?
Torfinn


by dexter1 » Wed Jun 08, 2016 6:44 am

Quick answer: there are about 6329 pdf and 3480 html.(t)gz documents in the techpubs archive. Well, at least in my copy.

If you want to host such a large collection of documents in a cheap and timely fashion, you need something lightweight that is easy on the database or webserver. A few static pages with title, part number and a pdf download link or a link to the html documents are quickly made by a script.

This is exactly what Jan-Jaap did.
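
To show how little is needed for the static-page route, here is a rough sketch that renders one index page from a docs table. It is not Jan-Jaap's actual script; the file naming `<location><partnr>.pdf` is a guess for illustration and `render_index` is a made-up name:

```python
import html
import sqlite3

def render_index(conn):
    """Build a single static HTML page listing title, part number and pdf link per document."""
    rows = conn.execute(
        "SELECT partnr, title, location FROM docs ORDER BY title"
    ).fetchall()
    items = []
    for partnr, title, location in rows:
        # Assumed layout: the pdf sits in its location directory, named after the part number.
        href = html.escape(f"{location}{partnr}.pdf", quote=True)
        items.append(
            f'<li><a href="{href}">{html.escape(title)}</a> ({html.escape(partnr)})</li>'
        )
    return (
        "<!DOCTYPE html>\n<html><body>\n<ul>\n"
        + "\n".join(items)
        + "\n</ul>\n</body></html>\n"
    )
```

The output can be written to a file once per import run and served by any plain webserver, with no database access at request time.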

A mediawiki entry for every document would be really nice to have, especially for tracking the history and revisions of certain documents, but there aren't enough people to maintain such a huge number of pages. Mediawiki can host pdf documents, but it's less clear what to do with the html pages. And how would you automate loading these files into the mediawiki database, and what metadata would you put there? These are significant problems with that approach.

by gijoe77 » Tue Aug 02, 2016 1:49 am

When did techpubs go down? I just noticed while trying to get the doc to connect my Octane to my VBOB, grrrrrrrr!! I meant to pull a local mirror of techpubs because I knew it was going to go away at some point, but once again I'm an hour late and a dollar short...

edit: ok found the links for the mirrors in a different thread - thanks JJ!

