This database should store basic information for each SGI document part-number, like title, directory location, pdf hashes and sizes, html.tgz hashes and sizes (if available), and keywords.
Filling the database is not that hard, and I'm in the process of creating a Python script to automate this. Jan-Jaap's shell script and C program already have an index.txt which lists the part numbers together with the document titles, which is really helpful. Alternatively, pdftotext can convert the first page into something closely resembling a document title.
I've created the following layout:
docs:
partnr  title  location  pdfsize  pdfhash  pages  hastgz  tgzsize  tgzhash
text    text   text      int      text     int    int     int      text

keywords:
id   partnr  keyword
int  text    text
The docs table holds one unique entry per SGI document part number. Hashes are straight-up md5sum output, or any other flavor which is convenient. The keywords table pairs part numbers with keyword strings.
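As a rough sketch of how the pdfsize/pdfhash and tgzsize/tgzhash columns could be filled, the helper below computes an MD5 digest and a byte size for one file. The function name `file_md5_and_size` and the chunk size are my own placeholders, not part of any existing script.

```python
import hashlib
import os


def file_md5_and_size(path, chunk_size=1 << 20):
    """Return (md5 hex digest, size in bytes) for the file at path.

    Hypothetical helper: the name and chunk size are illustrative only.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so a large PDF does not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest(), os.path.getsize(path)
```

The digest should match what `md5sum` prints for the same file, so hashes from a shell script and from Python can be compared directly.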
The datatypes are in SQLite format. I haven't used BOOL or NVARCHAR() constructs, since SQLite doesn't care: it maps declared types onto a handful of type affinities and does not enforce length limits.
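For reference, this is roughly what the layout looks like as SQLite DDL, created through Python's standard sqlite3 module. Column names and types follow the layout above; the PRIMARY KEY on partnr, the REFERENCES clause, and the name `create_db` are assumptions on my part and can be dropped.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS docs (
    partnr   TEXT PRIMARY KEY,  -- assumption: one row per part number
    title    TEXT,
    location TEXT,
    pdfsize  INTEGER,
    pdfhash  TEXT,
    pages    INTEGER,
    hastgz   INTEGER,           -- used as a flag, SQLite has no real BOOL
    tgzsize  INTEGER,
    tgzhash  TEXT
);
CREATE TABLE IF NOT EXISTS keywords (
    id      INTEGER PRIMARY KEY,
    partnr  TEXT REFERENCES docs(partnr),  -- assumption: link to docs
    keyword TEXT
);
"""


def create_db(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Calling `create_db("techpubs.db")` would create the file on first use; the file name here is only an example.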
A single manual entry should look like this in the two tables:
docs:
"007-2849-004", "Indigo 2 IMPACT Workstation Owner’s Guide", "manuals_hdwr/2000/007-2849-004/", 5862232, "6c436c40325fc2f9b0711b4f9e142ef2", 380, 1, 1251967, "8d1afbc539ec73857574ca2fe86ed693"

keywords:
1, "007-2849-004", "indigo2"
2, "007-2849-004", "impact"
3, "007-2849-004", "owner"
4, "007-2849-004", "hardware"
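To show how the two tables work together, here is a small sketch that loads the example entry into an in-memory SQLite database and looks it up by keyword. The table definitions are a compact version of the layout, and `find_by_keyword` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE docs (partnr TEXT, title TEXT, location TEXT, pdfsize INT,
                   pdfhash TEXT, pages INT, hastgz INT, tgzsize INT,
                   tgzhash TEXT);
CREATE TABLE keywords (id INTEGER PRIMARY KEY, partnr TEXT, keyword TEXT);
""")

# The example manual entry from the post.
doc = ("007-2849-004", "Indigo 2 IMPACT Workstation Owner’s Guide",
       "manuals_hdwr/2000/007-2849-004/", 5862232,
       "6c436c40325fc2f9b0711b4f9e142ef2", 380, 1, 1251967,
       "8d1afbc539ec73857574ca2fe86ed693")
conn.execute("INSERT INTO docs VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", doc)

# Leaving out id lets SQLite number the keyword rows 1, 2, 3, 4.
conn.executemany(
    "INSERT INTO keywords (partnr, keyword) VALUES (?, ?)",
    [(doc[0], kw) for kw in ("indigo2", "impact", "owner", "hardware")])


def find_by_keyword(conn, keyword):
    """Return (partnr, title) for every document tagged with keyword."""
    rows = conn.execute(
        "SELECT d.partnr, d.title FROM docs d "
        "JOIN keywords k ON k.partnr = d.partnr "
        "WHERE k.keyword = ? ORDER BY d.partnr", (keyword,))
    return rows.fetchall()
```

With this in place, `find_by_keyword(conn, "impact")` returns the one Indigo 2 manual, and a keyword that was never entered returns an empty list.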
I have a few comments and questions:
- There are language variants of certain documents, so the SGI part number format isn't fixed to ???-????-???, but may also be ???-????-???JP or a three-letter variant thereof. This needs to be taken into account. Should I add a language column to the docs table, or shall I just list the language-specific documents separately?
- I am not sure whether every .html.tgz document has a unique .pdf counterpart. Maybe Jan-Jaap or any other techpubs collector can comment on this. If there are exceptions, I might have to change the hastgz flag into a multi-valued one: 1 = has .html.tgz only, 2 = has .pdf only, 3 = has both. That would also allow 0 for a manual with no digital form at all, which is kind of important if we discover old documents.
- Are SGI document part numbers unique in themselves? In other words, does every SGI document with a specific part number have the same hash? I have seen SGI orphan documents by simply deleting them and replacing them with a new document, as I reported in viewtopic.php?t=6345 . At least SGI bumped the last three digits to -003 in that particular case. Jan-Jaap's script does take orphaned documents into account, but I'm not sure if there are variants of documents floating around with different hashes.
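Two of the points above can be sketched in a few lines of Python. This assumes that each "?" in the part number is a decimal digit and that a language variant is two or three uppercase letters stuck on the end; I haven't verified either against the full collection. The names `parse_partnr`, `availability`, `HAS_TGZ` and `HAS_PDF` are made up for illustration.

```python
import re

# Assumption: "?" means a digit, and the optional language suffix is
# two or three uppercase letters (for example "JP").
PARTNR_RE = re.compile(r"^(\d{3}-\d{4})-(\d{3})([A-Z]{2,3})?$")


def parse_partnr(partnr):
    """Split a part number into (base, last three digits, language suffix).

    The suffix is an empty string when there is no language variant.
    """
    m = PARTNR_RE.match(partnr)
    if m is None:
        raise ValueError("unrecognised part number: %r" % (partnr,))
    base, last3, lang = m.groups()
    return base, last3, lang or ""


# The proposed 0/1/2/3 values behave like two independent bit flags.
HAS_TGZ = 1
HAS_PDF = 2


def availability(has_tgz, has_pdf):
    """Return 0 (no digital form), 1 (tgz only), 2 (pdf only) or 3 (both)."""
    return (HAS_TGZ if has_tgz else 0) | (HAS_PDF if has_pdf else 0)
```

Splitting the suffix off at import time would make both options workable: the language can go into its own column, or the full string can stay in partnr and be listed separately. Treating the availability value as bit flags means a query like `hastgz & 2` picks out everything that has a PDF.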
I'll keep plodding away on this, but the database design is kind of an important step, hence my call for input. I don't want to redesign it many times and I'm not a database guru, so help is appreciated.