Today my O350 did something it had never done in a year of running 24/7: it crashed.
After a reset from the L2, SYSLOG reveals:
Code:
Feb 8 14:24:23 6D:speedo sn0log: The following are messages stored in the flashlog from a previous system boot.
Feb 8 14:24:23 6D:speedo sn0log: Flashlog for /hw/module/001c01/node/hub/mon
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: HARDWARE ERROR STATE:
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + Errors on node Nasid 0x0 (0)
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + IO Board in /hw/module/001c01/io widget: 0xf serial:
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + Bridge ASIC errors:
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + Bridge interrupt status register: 0x5000
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + INT_N status: 0x0
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + 12: PCI device reported parity error
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + PCI Error Upper Address Register: 0xb360001
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + PCI Error Lower Address Register: 0x520bb680
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + 14: PCI Bridge detected parity error
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + PCI Error Upper Address Register: 0xb360001
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + PCI Error Lower Address Register: 0x520bb680
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + IO Board in /hw/module/001c01/io widget: 0xf serial:
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + Bridge ASIC errors:
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + Bridge interrupt status register: 0x5000
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + INT_N status: 0x0
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + 12: PCI device reported parity error
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + PCI Error Upper Address Register: 0xb360001
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + PCI Error Lower Address Register: 0x520bb680
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + 14: PCI Bridge detected parity error
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + PCI Error Upper Address Register: 0xb360001
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + PCI Error Lower Address Register: 0x520bb680
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + Errors on node Nasid 0x1 (1)
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + IP35 in /hw/module/001c02/node [serial number MTA291]
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + BEDROCK signalled following errors.
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + BEDROCK PI 1 Error Interrupt Register: 0x100000
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: + 20: CPU B received uncorrectable error during uncached load
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: End Hardware Error State
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: ++FRU ANALYSIS BEGIN
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: No rules triggered: Insufficient data
Feb 8 14:24:23 5D:speedo sn0log: C Fatal:
Feb 8 14:24:23 5D:speedo sn0log: Timeout Histogram is empty.
Feb 8 14:24:23 5D:speedo sn0log:
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: ++FRU ANALYSIS END
Feb 8 14:24:23 5D:speedo sn0log: C Fatal: PANIC: CPU 2: PCI Bridge Error interrupt killed the system
Feb 8 14:24:23 5D:speedo sn0log: C Fatal:
Feb 8 14:24:23 5D:speedo sn0log: Dumping to /hw/module/001c01/IXbrick/xtalk/15/pci-x/0/3/scsi_ctlr/0/target/1/lun/0/disk/partition/1/block at block 0, space: 0x2000 pages
Feb 8 14:24:23 6D:speedo sn0log: End of flashlog for /hw/module/001c01/node/hub/mon
Feb 8 14:24:23 6D:speedo sn0log: End of flashlog messages.
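As a sanity check on the numbers above, here is a minimal Python sketch that decodes the register values into bit positions. The helper name set_bits is made up for illustration; only the register values and the widget number come from the log. It shows that 0x5000 is exactly bits 12 and 14 (the two PCI parity errors listed), that 0x100000 is bit 20 (the CPU B uncached load error), and that widget 0xf is 15 in decimal, the same number as the xtalk/15 component of the dump path. Whether that numeric match means the same device is my assumption, not something the log states.

```python
def set_bits(value):
    """Return the positions of the bits set in value, lowest first."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]

# Bridge interrupt status register from the log
print(set_bits(0x5000))

# BEDROCK PI 1 Error Interrupt Register from the log
print(set_bits(0x100000))

# Widget number from the log, converted to decimal
print(int("0xf", 16))
```

So the two status registers carry no bits beyond the errors the log already spells out.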
At the time of the crash, I was busy copying a large TAR file over NFS to the system. Network is on an NC7770 in module1, disks are attached to an LS1064 in module2. I was about 40GB into the file copy.
I'm not using the Ethernet port of the IO9, and while the system disk is attached to the IO9, it should have been more or less idle at the time of the crash. It certainly managed to dump to it, for what it's worth.
I have since copied another 50GB file over to the system; no problems. Should I start shopping around for an IO9, or am I simply the victim of a cosmic ray or other singularity?
Oh, and what's widget 0xf on the IO board?
_________________
Now this is a deep dark secret, so everybody keep it quiet 
It turns out that when reset, the WD33C93 defaults to a SCSI ID of 0, and it was simpler to leave it that way... -- Dave Olson, in comp.sys.sgi
Currently in commercial service: (2x)
In the museum: almost every MIPS/IRIX system.
Wanted: GM1 board for Professional Series GT graphics (030-0076-003, 030-0076-004)