O350 crashed with PCI parity error

jan-jaap
Posts: 4191
Joined: Thu Jun 17, 2004 11:35 am
Location: Wijchen, The Netherlands

Post by jan-jaap » Wed Feb 08, 2012 6:45 am

Today my O350 did something it had never done while running 24/7 for the last year: it crashed.

After a reset from the L2, SYSLOG reveals:

Code:

Feb  8 14:24:23 6D:speedo sn0log: The following are messages stored in the flashlog from a previous system boot.
Feb  8 14:24:23 6D:speedo sn0log: Flashlog for /hw/module/001c01/node/hub/mon
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: HARDWARE ERROR STATE:
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +  Errors on node Nasid 0x0 (0)
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +    IO Board in /hw/module/001c01/io widget: 0xf serial: 
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +      Bridge ASIC errors:
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        Bridge interrupt status register: 0x5000
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +          INT_N status: 0x0
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +          12: PCI device reported parity error
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        PCI Error Upper Address Register: 0xb360001
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        PCI Error Lower Address Register: 0x520bb680
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +          14: PCI Bridge detected parity error
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        PCI Error Upper Address Register: 0xb360001
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        PCI Error Lower Address Register: 0x520bb680
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +    IO Board in /hw/module/001c01/io widget: 0xf serial: 
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +      Bridge ASIC errors:
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        Bridge interrupt status register: 0x5000
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +          INT_N status: 0x0
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +          12: PCI device reported parity error
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        PCI Error Upper Address Register: 0xb360001
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        PCI Error Lower Address Register: 0x520bb680
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +          14: PCI Bridge detected parity error
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        PCI Error Upper Address Register: 0xb360001
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        PCI Error Lower Address Register: 0x520bb680
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +  Errors on node Nasid 0x1 (1)
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +    IP35 in /hw/module/001c02/node [serial number MTA291]
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +      BEDROCK signalled following errors.
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        BEDROCK PI 1 Error Interrupt Register: 0x100000
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +          20: CPU B received uncorrectable error during uncached load
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: End Hardware Error State
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: ++FRU ANALYSIS BEGIN
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: No rules triggered:  Insufficient data
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: 
Feb  8 14:24:23 5D:speedo sn0log: Timeout Histogram is empty.
Feb  8 14:24:23 5D:speedo sn0log: 
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: ++FRU ANALYSIS END
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: PANIC: CPU 2: PCI Bridge Error interrupt killed the system
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: 
Feb  8 14:24:23 5D:speedo sn0log: Dumping to /hw/module/001c01/IXbrick/xtalk/15/pci-x/0/3/scsi_ctlr/0/target/1/lun/0/disk/partition/1/block at block 0, space: 0x2000 pages
Feb  8 14:24:23 6D:speedo sn0log: End of flashlog for /hw/module/001c01/node/hub/mon
Feb  8 14:24:23 6D:speedo sn0log: End of flashlog messages.
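
For anyone decoding these registers by hand: the numbered lines under each register in the log are simply the bits that are set in the value printed above them. A minimal Python sketch of that decoding follows. The helper name `set_bits` is made up for illustration, and the bit descriptions are copied from the log itself, not taken from any SGI header.

```python
def set_bits(value):
    """Return the positions of all 1 bits in value, lowest bit first."""
    return [pos for pos in range(value.bit_length()) if (value >> pos) & 1]


# Register values and bit descriptions copied from the flashlog above.
# Only the bits that actually appear in the log are listed here.
REGISTERS = {
    "Bridge interrupt status register": (0x5000, {
        12: "PCI device reported parity error",
        14: "PCI Bridge detected parity error",
    }),
    "BEDROCK PI 1 Error Interrupt Register": (0x100000, {
        20: "CPU B received uncorrectable error during uncached load",
    }),
}

for name, (value, descriptions) in REGISTERS.items():
    print(f"{name}: {value:#x}")
    for pos in set_bits(value):
        # Fall back to a placeholder for bits the log does not describe.
        print(f"  {pos}: {descriptions.get(pos, '(not described in the log)')}")
```

So 0x5000 is just bits 12 and 14, and 0x100000 is bit 20, which matches what the kernel printed.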


At the time of the crash, I was busy copying a large TAR file over NFS to the system. Network is on an NC7770 in module1, disks are attached to an LS1064 in module2. I was about 40GB into the file copy.

I'm not using the ethernet port of the IO9, and while the system disk is attached to the IO9, it should have been more or less idle at the time of the crash. It certainly managed to dump to it, for what it's worth.

I have since copied another 50GB file over to the system; no problems. Should I start shopping around for an IO9, or am I simply the victim of a cosmic ray or other singularity?

Oh, and what's widget 0xf on the io board?
Now this is a deep dark secret, so everybody keep it quiet :)
It turns out that when reset, the WD33C93 defaults to a SCSI ID of 0, and it was simpler to leave it that way... -- Dave Olson, in comp.sys.sgi

Currently in commercial service: Image :Onyx2:(2x) :O3x02L:
In the museum: almost every MIPS/IRIX system.
Wanted: GM1 board for Professional Series GT graphics (030-0076-003, 030-0076-004)

[[C|-|E]]
Posts: 417
Joined: Wed Mar 02, 2011 1:37 am
Location: London - UK

Re: O350 crashed with PCI parity error

Post by [[C|-|E]] » Wed Feb 08, 2012 9:49 am

Hmm... If I were you, I would wait and see what happens. The problem may never recur, and you may be worrying for no reason :).
Image _Betty Blue_
R12000A 400 Mhz; 1 Gb RAM; 72 Gb 15K HDD; IRIX 6.5.29
CrystalEyes; Dial Box; O2Cam "ZEYE"; external Toshiba SD-M1711 DVD-ROM; Octane speakers;
Lock bar; SGI microphone.
Mods: PSU Noctua fan; internal Toshiba SD-M1401 DVD-ROM; Adaptec AIC-7880P SCSI card.

_REKIEM_I7_
Seasonic X 1250W PSU / Intel I7 2600k 4 x 5,00 Ghz / 2 x Gainward 2Gb GTX 560Ti Phantom 2 / 32 Gb DDR3 / Intel x25-M 160 Gb SSD and 10 extra Tb
_REKIEM_T5400_
875W PSU / 2 x Intel Xeon Harpertown 4 x 3,33 Ghz / 1 x EVGA Geforce 4Gb GTX 980 Supercloked / 32 Gb DDR2 667 ECC / Samsung 840 Series Pro 128GB SSD and 3 extra Tb
_Raspberry Pi_
:|

hhoffman
Posts: 71
Joined: Fri Apr 01, 2011 7:45 am

Re: O350 crashed with PCI parity error

Post by hhoffman » Wed Feb 08, 2012 11:59 am

Interesting. Maybe you remember: I had a similar crash while copying around 200 GB over NFS to my Tezro.
http://forums.nekochan.net/viewtopic.php?f=3&t=16726049

I think it is an NFS issue.

jan-jaap
Posts: 4191
Joined: Thu Jun 17, 2004 11:35 am
Location: Wijchen, The Netherlands

Re: O350 crashed with PCI parity error

Post by jan-jaap » Wed Feb 08, 2012 12:19 pm

I tried a 50GB file using FTP; that worked. I have another 200GB file waiting, which I'll try tomorrow with both NFS and FTP.

I'm not sure I have ever used NFS for such large transfers; normally I use either Samba or FTP.

miod
Posts: 348
Joined: Fri Oct 09, 2009 2:44 am
Location: Orgerus (France)

Re: O350 crashed with PCI parity error

Post by miod » Wed Feb 08, 2012 1:15 pm

jan-jaap wrote:Oh, and what's widget 0xf on the io board?

It's the PIC PCI-X controller for the on-board IO9 and the PCI slots.
:Indigo:R4000 :Indigo:R4000 :Indigo:R4000 :Indigo2:R4400 :Indigo2IMP:R4400 :Indigo2:R8000 :Indigo2IMP:R10000 :Indy:R4000PC :Indy:R4000SC :Indy:R4600 :Indy:R5000SC :O2:R5000 :O2:RM7000 :Octane:2xR10000 :Octane:R12000 :O200:2xR12000 :O200: - :O200:2x2xR10000 :Fuel:R16000 :O3x0:4xR16000 :A350:
among more than 150 machines : Apollo, Data General, Digital, HP, IBM, MIPS before SGI, Motorola, NeXT, SGI, Solbourne, Sun...

miod
Posts: 348
Joined: Fri Oct 09, 2009 2:44 am
Location: Orgerus (France)

Re: O350 crashed with PCI parity error

Post by miod » Wed Feb 08, 2012 1:24 pm

jan-jaap wrote:Should I start shopping around for an IO9, or am I simply the victim of a cosmic ray or other singularity?

I'd vote for the cosmic ray.

The PIC PCI address error register doesn't make much sense: it is a 64-bit memory space address which does not look even remotely like it falls within the range of addresses IRIX (or the PROM) would set up any device with.

So this looks to me like a bogus address, used as part of a bogus DMA transfer. Of course, if you were not using any device on this module at the time the machine panicked, this is quite suspicious.

rwengerter
Posts: 87
Joined: Mon Nov 22, 2010 12:02 am
Location: Northern Bavaria, Germany

Re: O350 crashed with PCI parity error

Post by rwengerter » Thu Feb 09, 2012 3:11 pm

From PCI 2.3 spec:

"The following requirements also apply when the 64-bit extensions are used.
During address and data phases, parity covers AD[31::00] and C/BE[3::0]# lines
regardless of whether or not all lines carry meaningful information.
...
Parity is generated according to the following rules:
• Parity is calculated the same on all PCI transactions regardless of the type or form.
• The number of "1"s on AD[31::00], C/BE[3::0]#, and PAR equals an even number.
• Parity generation is not optional; it must be done by all PCI-compliant devices."
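
The even-parity rule quoted above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not of how any particular bridge implements it. The function names are made up, the AD value is the lower address register from the log (used simply as a convenient 32-bit number), and the C/BE# value is invented.

```python
def ones(value):
    """Count the 1 bits in a non-negative integer."""
    return bin(value).count("1")


def pci_par(ad, cbe):
    """Return the PAR bit for one 32-bit phase.

    ad  -- value on AD[31::00]
    cbe -- signal levels on C/BE[3::0]#
    PAR is chosen so that the ones on AD, C/BE# and PAR add up to an even number.
    """
    if not 0 <= ad <= 0xFFFFFFFF or not 0 <= cbe <= 0xF:
        raise ValueError("ad must fit in 32 bits and cbe in 4 bits")
    return (ones(ad) + ones(cbe)) & 1


def parity_ok(ad, cbe, par):
    """True if the received AD, C/BE# and PAR carry an even number of ones."""
    return (ones(ad) + ones(cbe) + par) % 2 == 0


ad, cbe = 0x520BB680, 0x7   # AD taken from the log; the C/BE# value is made up
par = pci_par(ad, cbe)
print("PAR =", par, "consistent:", parity_ok(ad, cbe, par))
print("one AD line flipped, still OK?", parity_ok(ad ^ 0x10, cbe, par))
print("two AD lines flipped, still OK?", parity_ok(ad ^ 0x11, cbe, par))
```

A single flipped line (a marginal contact, for instance) always breaks the even count and is reported. Two flipped lines cancel out, so parity only catches an odd number of bit errors.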

In most cases, when a PCI card that has worked correctly for months reports a parity error, a repeat can be prevented by
cleaning the card's gold finger contacts and checking that the card is seated properly.
I recommend alcohol for cleaning the gold finger contacts.
:Fuel: 600 MHz, 2 GB RAM, 72 GB 15k RPM HD
:O2: 180 MHz

jan-jaap
Posts: 4191
Joined: Thu Jun 17, 2004 11:35 am
Location: Wijchen, The Netherlands

Re: O350 crashed with PCI parity error

Post by jan-jaap » Fri Feb 10, 2012 1:25 am

Whatever it was, it appears to be a spurious error:

* FTPed some 250GB of data from a Linux client to the O350 (server): no problems, and it achieves nearly the line speed of the Gbit network (the disk systems can keep up at both ends).

* Transferred the same 250GB using NFS. Debian 5 Linux client, mount options 'vers=3,rsize=32768,wsize=32768'. It works, but it is very slow: 200GB took 18 hours, 17 minutes and 35 seconds, so that's only ~3MB/s :shock:

* Transferred several dozen GB by SMB to the O350. Works pretty well too (~ 85 - 90MB/s, samba 3.6.3 on the O350, Windows 7 client).

If anyone knows the magic NFS arguments to speed up NFS between Linux and IRIX I'd like to hear from you, otherwise I'll forget about NFS. Not for fear of crashing the 350, but because I will be dead before the files are transferred :lol:
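
The ~3MB/s figure is easy to check. A small Python sketch of the arithmetic, assuming decimal units (1GB = 1000MB); the function name is made up for illustration.

```python
def throughput_mb_per_s(gigabytes, hours, minutes, seconds):
    """Average transfer rate in MB/s, using decimal units (1 GB = 1000 MB)."""
    elapsed = hours * 3600 + minutes * 60 + seconds
    return gigabytes * 1000 / elapsed


# The NFS run reported above: 200GB in 18 hours, 17 minutes and 35 seconds.
nfs_rate = throughput_mb_per_s(200, 18, 17, 35)
print(f"NFS: {nfs_rate:.2f} MB/s")

# For comparison, how long the same 200GB would take at the low end
# of the SMB rate reported above (85MB/s).
smb_minutes = 200 * 1000 / 85 / 60
print(f"SMB at 85 MB/s: about {smb_minutes:.0f} minutes")
```

That is roughly 3MB/s over NFS, against well under an hour for the same amount of data over SMB.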

hamei
Posts: 10216
Joined: Tue Feb 24, 2004 4:10 pm
Location: over the rainbow

Re: O350 crashed with PCI parity error

Post by hamei » Fri Feb 10, 2012 1:52 am

jan-jaap wrote:If anyone knows the magic NFS arguments to speed up NFS between Linux and IRIX I'd like to hear from you

While you're at it, tips for Windows to Irix via NFS would be good also. Irix <-> Solaris NFS is fast.

Just change everything over to CXFS ?

hhoffman
Posts: 71
Joined: Fri Apr 01, 2011 7:45 am

Re: O350 crashed with PCI parity error

Post by hhoffman » Fri Feb 10, 2012 5:32 am

With IRIX client and CentOS server, I have around 90MB/s in both directions, using:

Code:

exports on CentOS side: rw,async,no_root_squash
mount options IRIX side:  rw,rsize=8192,wsize=8192



hamei wrote:Just change everything over to CXFS ?

Are you using CXFS?

jan-jaap
Posts: 4191
Joined: Thu Jun 17, 2004 11:35 am
Location: Wijchen, The Netherlands

Re: O350 crashed with PCI parity error

Post by jan-jaap » Fri Feb 10, 2012 6:19 am

hhoffman wrote:With IRIX client and CentOS server, I have around 90MB/s in both directions, using:

Code:

exports on CentOS side: rw,async,no_root_squash
mount options IRIX side:  rw,rsize=8192,wsize=8192

You're using a Linux server and an IRIX client; for me it's the other way around. But I'll try to fiddle a bit with the block size.

hamei wrote:Just change everything over to CXFS ?

Not an option. First of all, CXFS costs real $$$. Second, the Linux clients are buildbots, test systems, etc.: an easily discardable, rather volatile bunch. I don't want to have to deal with complicated things like CXFS here.

A dedicated subnet with jumbo frames would be a good idea, although both SMB and FTP can get close to wire speed using standard 1500-byte frames.
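
A back-of-the-envelope check of how much jumbo frames can add on gigabit Ethernet, as a Python sketch. The assumptions are mine, not measurements from this setup: plain IPv4 and TCP headers with no options, the standard per-frame Ethernet overhead (preamble, header, FCS, inter-frame gap), no VLAN tag, and no protocol overhead from SMB, FTP or NFS themselves.

```python
LINE_RATE_MB_S = 125.0            # 1 Gbit/s expressed in MB/s (10**6 bytes per second)
WIRE_OVERHEAD = 8 + 14 + 4 + 12   # preamble+SFD, Ethernet header, FCS, inter-frame gap
IP_TCP_HEADERS = 20 + 20          # IPv4 + TCP headers without options (assumption)


def max_tcp_payload_rate(mtu):
    """Upper bound on TCP payload throughput in MB/s for a given MTU."""
    payload = mtu - IP_TCP_HEADERS   # application bytes carried per frame
    on_wire = mtu + WIRE_OVERHEAD    # bytes of wire time each frame occupies
    return LINE_RATE_MB_S * payload / on_wire


for mtu in (1500, 9000):
    print(f"MTU {mtu}: at most {max_tcp_payload_rate(mtu):.1f} MB/s of payload")
```

Under these assumptions the ceiling moves from just under 119MB/s to about 124MB/s, so jumbo frames buy a few percent of raw bandwidth at best. Any larger gain would have to come from reduced per-packet CPU load rather than from the wire.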

kubatyszko
Posts: 352
Joined: Sat Nov 12, 2011 3:18 am
Location: Tokyo

Re: O350 crashed with PCI parity error

Post by kubatyszko » Fri Feb 10, 2012 8:54 pm

Try compiling something and see if it causes the box to crash;
my Octane2 starts to crash when I compile Emacs, for example.
In my case I suspect the SCSI controller: it could be due to a high number of I/Os or something related.
[click for links to hinv] JP: :Fuel: |:O2: | :Indy: || PL: [ :Fuel: :O2: :O2+: :Indy: ]

ramq
Posts: 771
Joined: Fri Jan 25, 2008 6:06 am
Location: Sweden

Re: O350 crashed with PCI parity error

Post by ramq » Sat Feb 11, 2012 5:57 am

Since I have no IRIX hardware nearby at the moment I can't verify this, but have you tried the async option?
:O3200: :Fuel: :Indy: :O3x02L:

Adrenaline
Posts: 542
Joined: Thu Feb 10, 2005 12:37 pm
Location: Laurel, MD USA

Re: O350 crashed with PCI parity error

Post by Adrenaline » Sat Feb 11, 2012 8:35 am

jan-jaap wrote:
hhoffman wrote:With IRIX client and CentOS server, I have around 90MB/s in both directions, using:

Code:

exports on CentOS side: rw,async,no_root_squash
mount options IRIX side:  rw,rsize=8192,wsize=8192

You're using a Linux server and an IRIX client, for me it's the other way around. But I'll try to fiddle a bit with the block size.

Just change everything over to CXFS ?

Not an option. First of all, CXFS costs real $$$, second, the Linux clients are buildbots, test systems etc. Easily discardable, rather volatile bunch. I don't want to have to deal with complicated things like CXFS here.

A dedicated subnet with jumbo frames would be a good idea, although both SMB and FTP can get close to wire speed using standard 1500byte frames.


Agreed. I get 115-120MB/s over SMB with 9KB jumbo frames between my Nexenta CE SAN and my Windows 7 workstation.
:Indigo: 33mhz R3k/48mb/XS24 :Indy: 150mhz R4400/256mb/XL24 :Fuel: 600mhz R14kA/2gb/V10 Image 8x1.4ghz Itanium 2/8GB :O3x08R: 32x600mhz R14kA/24GB :Tezro: 4x700mhz R16k/8GB/V12/DCD/SAS/FC/DM5 (2x) :O3x0: 4x700mhz R16k/4GB :PrismDT: 2x1.6ghz 8mb/12gb/SAS/2xFGL

jan-jaap
Posts: 4191
Joined: Thu Jun 17, 2004 11:35 am
Location: Wijchen, The Netherlands

Re: O350 crashed with PCI parity error

Post by jan-jaap » Fri Feb 24, 2012 6:02 am

ramq wrote:Since I have no IRIX hardware nearby at the moment I can't verify this, but have you tried the async option?

Nope, I didn't set async on the server (IRIX) side. Probably worth investigating :mrgreen:

The system has been humming along 24/7 (as usual) for the past two weeks. It must have transferred a couple of TB over the network, and the PID counter is well on its way to 3 million. I guess it really was a spurious event.