I powered the machine on and went for a nap (and forgot to actually boot it, so it sat at a prompt for five hours). When I got up and booted the system ten minutes ago, it went down before the OS could do anything.
Starting up the system in single user mode...
Loading dksc(0,1,8)/sash: 896+111372+16725+3848 entry: 0xa80000001a64791c
3938268+850980 entry: 0xa8000000000076e0
PANIC: Bad IO Adaptor type 32 slot 3 adap 2
Status register: 0xa2<IPL=8,KX,UX,MODE=KERNEL>
Cause register: 0xa808<CE=0,IP8,IP6,IP4,EXC=RMISS>
Exception PC: 0xa800000000006064, Exception RA: 0xa8000000000060bc
Read TLB miss exception, bad address: 0x3a0
*** Error/TimeOut Interrupt(s) Pending: 0x1000 ==
Parity error on data from D-chip [15:0]
VID #0's ARCS PDA: &pda 0xa800000001898ba0, &regs 0xa80000000189f3a0, magic 0
vid 0, pid 8, init_sp 0x0, fault_sp 0xa800000001996120, stack_mode 1
mode_sv 0, EPC_sv 0xa800000000006064, AT_sv 0x0, badvaddr_sv 0x3a0
ErrEPC_sv 0x640420000242400, CacheErr_sv 0x1de1ff1c, cause_sv 0xa808, v0_sv 0x
SP_sv 0xa8000000003ae200, SR_sv 0xa2, exc_sv 0x4, return_addr_sv 0xa8000000000
notfirst 0x1, firstEPC 0xa800000000006064, nofault 0x0
PANIC: Unexpected exception
Hmm. I powered down and went to reseat the IO4. The main regulator blocks seemed excessively hot (nearly burned my hand) but the PROM didn't record temperature issues. After reseating the IO4 it started to boot, then it tanked again.
++FRU ANALYSIS BEGIN
++ FRU Analysis Summary
++ IO4 BOARD
++ IO4 board in slot 3: 70% confidence.
++FRU ANALYSIS END
HARDWARE ERROR STATE:
+ IP25 in slot 2
+ CC in Cpu Slot 2, cpu 0
+ CC ERTOIP Register: 0x2000
+ 13:Parity error on data from D-chip [31:16]
+ CC Error Address Register: 0x5046bbd8020
+ cause: read response error(1)
+ address: 0x46bbd8020
+ IO4 board in slot 3
+ IA IBUS Error Register: 0x50800
+ 11: PIO ReadResponse Data Error
+ 18..16: IOA number of Transaction: 5 (DANG)
+ IA EBUS Error Register: 0x2
+ 1: My DATA_ERROR Received
DOUBLE PANIC: CPU 0: TLBMISS: KERNEL FAULT
PC: 0xa8000000000fe9e8 ep: 0xa8000000003ae098
EXC code:8, `Read TLB Miss '
Bad addr: 0xc0000fc000000000, cause: 0xa008<CE=0
Reboot started from CPU 0
Could this be a thermal issue (I am running an extra fan in the shop right now to keep air moving around)? Or, when the -12V had a tant blow a month or so ago, might that have caused something else to go screwy?
Edited: I pulled the system apart and reseated everything, including the RAM, before restarting with the bare three boards. The MC3 started making a squeal and put the machine into POKA FAIL mode, so I switched it out with a spare (Thank you TriOx!) and was able to boot back up into single user and eventually bring the entire system up. I'm also running with the door open and a barn fan in front of it in the ol' Crimson style!
Still curious as to what might have happened. I'll try adding boards and reinstalling memory and see if the problem comes back.
EDIT 2: System came up with all the RAM installed (though the slots were dirty, so it took a bit of "persuasion"). Now attempting to reinstall the mezz boards.
EDIT 3: I think the second problem was that IRIX does not like the ASO moving around after it's been installed and the system reconfigured. Once I switched the mezz boards around (I must have shuffled them when I repaired the VCAM) the system came up reliably. Seeing how none of the previous errors were related to the ATM, Sirius or video boards, we should be okay to reinstall everything else. I still don't know what caused the initial failure though.
EDIT 4: The system is completely up now. Crisis averted.
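A side note for anyone trying to read the Cause register values in the dumps above: here is a minimal sketch that splits one into its fields. It assumes the standard MIPS R4000-family Cause layout (ExcCode in bits 6..2, IP in bits 15..8, CE in bits 29..28) applies here, and that the dump numbers the interrupt-pending bits from 1. The second point is only inferred from the first dump's annotation. The name `decode_cause` is made up for illustration.

```python
# Sketch only: assumes the R4000-family Cause register layout.
# decode_cause is a hypothetical helper, not part of any SGI tool.

# Standard MIPS exception codes (subset)
EXC_NAMES = {
    0: "Int",   # interrupt
    1: "Mod",   # TLB modification
    2: "TLBL",  # TLB miss on load or instruction fetch (a read miss)
    3: "TLBS",  # TLB miss on store
    4: "AdEL",  # address error on load or fetch
    5: "AdES",  # address error on store
    6: "IBE",   # bus error on instruction fetch
    7: "DBE",   # bus error on data access
}

def decode_cause(cause):
    """Split a Cause register value into CE, pending interrupts and ExcCode."""
    exc_code = (cause >> 2) & 0x1F   # bits 6..2
    ip_field = (cause >> 8) & 0xFF   # bits 15..8
    ce = (cause >> 28) & 0x3         # bits 29..28
    # Assumption: numbered from 1 to match the "IP8,IP6,IP4" style in the dump.
    pending = [n + 1 for n in range(8) if ip_field & (1 << n)]
    return {
        "ce": ce,
        "pending": pending,
        "exc_code": exc_code,
        "exc_name": EXC_NAMES.get(exc_code, "unknown"),
    }

if __name__ == "__main__":
    print(decode_cause(0xA808))  # value from the first panic
    print(decode_cause(0xA008))  # value from the double panic
```

Under those assumptions, both values decode to exception code 2 (a read TLB miss), which agrees with what the panic messages themselves report. So the Cause register adds nothing new beyond which interrupt lines were pending at the time.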