Nekochan Net


All times are UTC - 8 hours


Posted: Wed Feb 08, 2012 6:45 am

Joined: Thu Jun 17, 2004 10:35 am
Posts: 3829
Location: Wijchen, The Netherlands
Today my O350 did something it had never done in a year of running 24/7: it crashed.

Reset from the L2, SYSLOG reveals:
Code:
Feb  8 14:24:23 6D:speedo sn0log: The following are messages stored in the flashlog from a previous system boot.
Feb  8 14:24:23 6D:speedo sn0log: Flashlog for /hw/module/001c01/node/hub/mon
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: HARDWARE ERROR STATE:
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +  Errors on node Nasid 0x0 (0)
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +    IO Board in /hw/module/001c01/io widget: 0xf serial: 
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +      Bridge ASIC errors:
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        Bridge interrupt status register: 0x5000
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +          INT_N status: 0x0
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +          12: PCI device reported parity error
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        PCI Error Upper Address Register: 0xb360001
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        PCI Error Lower Address Register: 0x520bb680
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +          14: PCI Bridge detected parity error
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        PCI Error Upper Address Register: 0xb360001
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        PCI Error Lower Address Register: 0x520bb680
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +    IO Board in /hw/module/001c01/io widget: 0xf serial: 
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +      Bridge ASIC errors:
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        Bridge interrupt status register: 0x5000
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +          INT_N status: 0x0
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +          12: PCI device reported parity error
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        PCI Error Upper Address Register: 0xb360001
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        PCI Error Lower Address Register: 0x520bb680
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +          14: PCI Bridge detected parity error
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        PCI Error Upper Address Register: 0xb360001
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        PCI Error Lower Address Register: 0x520bb680
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +  Errors on node Nasid 0x1 (1)
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +    IP35 in /hw/module/001c02/node [serial number MTA291]
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +      BEDROCK signalled following errors.
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +        BEDROCK PI 1 Error Interrupt Register: 0x100000
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: +          20: CPU B received uncorrectable error during uncached load
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: End Hardware Error State
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: ++FRU ANALYSIS BEGIN
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: No rules triggered:  Insufficient data
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: 
Feb  8 14:24:23 5D:speedo sn0log: Timeout Histogram is empty.
Feb  8 14:24:23 5D:speedo sn0log: 
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: ++FRU ANALYSIS END
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: PANIC: CPU 2: PCI Bridge Error interrupt killed the system
Feb  8 14:24:23 5D:speedo sn0log: C Fatal: 
Feb  8 14:24:23 5D:speedo sn0log: Dumping to /hw/module/001c01/IXbrick/xtalk/15/pci-x/0/3/scsi_ctlr/0/target/1/lun/0/disk/partition/1/block at block 0, space: 0x2000 pages
Feb  8 14:24:23 6D:speedo sn0log: End of flashlog for /hw/module/001c01/node/hub/mon
Feb  8 14:24:23 6D:speedo sn0log: End of flashlog messages.


At the time of the crash, I was busy copying a large TAR file over NFS to the system. Network is on an NC7770 in module1, disks are attached to an LS1064 in module2. I was about 40GB into the file copy.

I'm not using the ethernet port of the IO9, and while the system disk is attached to the IO9, it should have been more or less idle at the time of the crash. It certainly managed to dump to it, for what it's worth.

I have since copied another 50GB file over to the system; no problems. Should I start shopping around for an IO9, or am I simply the victim of a cosmic ray or other singularity?

Oh, and what's widget 0xf on the io board?
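
For readers following the register dumps in the flashlog, here is a minimal Python sketch that lists which bits are set in a status word. The bit descriptions are copied from the log lines themselves; the helper names `set_bits` and `describe` are made up for illustration and are not part of any SGI tool.

```python
def set_bits(value):
    """Return the positions of all bits set in value, lowest first."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]

# Bit descriptions taken verbatim from the flashlog; nothing else is assumed.
BRIDGE_INT_STATUS = {
    12: "PCI device reported parity error",
    14: "PCI Bridge detected parity error",
}
BEDROCK_PI_ERR_INT = {
    20: "CPU B received uncorrectable error during uncached load",
}

def describe(value, names):
    """Pair every set bit with its description, or 'unknown' if not listed."""
    return [f"{bit}: {names.get(bit, 'unknown')}" for bit in set_bits(value)]

print(describe(0x5000, BRIDGE_INT_STATUS))     # Bridge interrupt status register
print(describe(0x100000, BEDROCK_PI_ERR_INT))  # BEDROCK PI 1 Error Interrupt Register
```

This only confirms that 0x5000 is exactly bits 12 and 14, and 0x100000 is exactly bit 20, matching the numbered lines the log prints under each register.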

_________________
Now this is a deep dark secret, so everybody keep it quiet :)
It turns out that when reset, the WD33C93 defaults to a SCSI ID of 0, and it was simpler to leave it that way... -- Dave Olson, in comp.sys.sgi

Currently in commercial service: Image :Onyx2:(2x) :O3x02L:
In the museum: almost every MIPS/IRIX system.
Wanted: GM1 board for Professional Series GT graphics (030-0076-003, 030-0076-004)


Posted: Wed Feb 08, 2012 9:49 am

Joined: Wed Mar 02, 2011 1:37 am
Posts: 408
Location: London - UK
Hmm... If I were you, I would wait and see what happens. The problem may or may not come back, and maybe you are worrying for no reason :).

_________________
Image _Betty Blue_
R12000A 400 Mhz; 1 Gb RAM; 72 Gb 15K HDD; IRIX 6.5.29
CrystalEyes; Dial Box; O2Cam "ZEYE"; external Toshiba SD-M1711 DVD-ROM; Octane speakers;
Lock bar; SGI microphone.
Mods: PSU Noctua fan; internal Toshiba SD-M1401 DVD-ROM; Adaptec AIC-7880P SCSI card.

_REKIEM_I7_
Seasonic X 1250W PSU / Intel I7 2600k 4 x 5,00 Ghz / 2 x Gainward 2Gb GTX 560Ti Phantom 2 / 32 Gb DDR3 / Intel x25-M 160 Gb SSD and 10 extra Tb
_Lazarus_
2 x Intel Xeon MP Gallatin 3,00 Ghz with 4 MB cache / Zotac 512Mb GT430 / 12 Gb DDR266 ECC / 4 x Maxtor Atlas 146GB 10K V U320


Posted: Wed Feb 08, 2012 11:59 am

Joined: Fri Apr 01, 2011 6:45 am
Posts: 71
Interesting. Maybe you remember: I had a similar crash while copying around 200 GB over NFS to my Tezro.
http://forums.nekochan.net/viewtopic.php?f=3&t=16726049

I think it is an NFS issue.


Posted: Wed Feb 08, 2012 12:19 pm

Joined: Thu Jun 17, 2004 10:35 am
Posts: 3829
Location: Wijchen, The Netherlands
I tried a 50GB file using FTP; that worked. I have another 200GB file waiting, which I'll try tomorrow with both NFS and FTP.

I'm not sure I have ever used NFS for such large transfers; normally I use either Samba or FTP.



Posted: Wed Feb 08, 2012 1:15 pm

Joined: Fri Oct 09, 2009 1:44 am
Posts: 251
Location: Orgerus (France)
jan-jaap wrote:
Oh, and what's widget 0xf on the io board?

It's the PIC PCI-X controller for the on-board IO9 and the PCI slots.

_________________
:Indigo:R4000 :Indigo2:R4400 :Indigo2IMP:R4400 :Indigo2:R8000 :Indigo2IMP:R10000 :Indy:R4000PC :Indy:R4000SC :Indy:R4600 :Indy:R5000SC :O2:R5000 :O2:RM7000 :Octane:2xR10000 :Octane:R12000 :O200:2xR12000 :O200: - :O200:2x2xR10000 :Fuel:R16000 :A350:
among more than 150 machines : Apollo, Data General, Digital, HP, IBM, MIPS before SGI, Motorola, NeXT, SGI, Solbourne, Sun...


Posted: Wed Feb 08, 2012 1:24 pm

Joined: Fri Oct 09, 2009 1:44 am
Posts: 251
Location: Orgerus (France)
jan-jaap wrote:
Should I start shopping around for an IO9, or am I simply the victim of a cosmic ray or other singularity?

I'd vote for the cosmic ray.

The PIC PCI error address register doesn't make much sense: it holds a 64-bit memory space address that does not look even remotely within the range of addresses IRIX (or the PROM) would set up any device with.

So this looks like a bogusly generated address to me, used as part of a bogus DMA transfer. Of course, if you were not using any device on this module at the time the machine panicked, this is quite suspicious.
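
To make the address in question concrete, here is a small Python sketch that joins the two register values from the flashlog into one 64-bit number. It assumes, purely for illustration, that the whole upper register holds address bits; on the real hardware part of that register may encode other fields, so treat the result as a rough reading rather than a decoded bus address. The function name `combine` is hypothetical.

```python
def combine(upper, lower):
    """Naively join two 32-bit register values into one 64-bit value."""
    return ((upper & 0xFFFFFFFF) << 32) | (lower & 0xFFFFFFFF)

# Values from the flashlog:
#   PCI Error Upper Address Register: 0xb360001
#   PCI Error Lower Address Register: 0x520bb680
addr = combine(0x0B360001, 0x520BB680)

print(f"0x{addr:016x}")
print("above the 32-bit range:", addr > 0xFFFFFFFF)
```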



Posted: Thu Feb 09, 2012 3:11 pm

Joined: Mon Nov 22, 2010 12:02 am
Posts: 72
Location: Northern Bavaria, Germany
From PCI 2.3 spec:

"The following requirements also apply when the 64-bit extensions are used.
During address and data phases, parity covers AD[31::00] and C/BE[3::0]# lines
regardless of whether or not all lines carry meaningful information.
...
Parity is generated according to the following rules:
• Parity is calculated the same on all PCI transactions regardless of the type or form.
• The number of "1"s on AD[31::00], C/BE[3::0]#, and PAR equals an even
number.
• Parity generation is not optional; it must be done by all PCI-compliant devices."

In most cases, a PCI parity error on a card that had been working correctly for months can be kept from recurring by cleaning the card's gold finger contacts and checking that the card is seated properly.
I recommend alcohol for cleaning the gold finger contacts.
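
The even-parity rule quoted above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a model of bus timing: `pci_par` and `parity_ok` are hypothetical helper names, `cbe` is the raw 4-bit value on the C/BE[3::0]# lines, and the sample values are arbitrary bit patterns.

```python
def pci_par(ad, cbe):
    """Return the PAR bit that makes the number of 1s on AD[31::00],
    C/BE[3::0]# and PAR an even number."""
    ones = bin(ad & 0xFFFFFFFF).count("1") + bin(cbe & 0xF).count("1")
    return ones & 1  # 1 only when the count so far is odd

def parity_ok(ad, cbe, par):
    """Check a received (AD, C/BE#, PAR) triple against the rule."""
    return pci_par(ad, cbe) == (par & 1)

# Arbitrary sample pattern, chosen only to have something to count.
ad, cbe = 0x520BB680, 0x7
par = pci_par(ad, cbe)

print(parity_ok(ad, cbe, par))      # the generated PAR passes the check
print(parity_ok(ad ^ 1, cbe, par))  # a single flipped AD bit fails it
```

A single flipped line is always caught by this scheme, while two flipped lines in the same phase cancel out and go unnoticed.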

_________________
:Fuel: 600 MHz, 2 GB RAM, 72 GB 15k RPM HD
:O2: 180 MHz


Posted: Fri Feb 10, 2012 1:25 am

Joined: Thu Jun 17, 2004 10:35 am
Posts: 3829
Location: Wijchen, The Netherlands
Whatever it was, it appears to be a spurious error:

* FTPed some 250GB of data from a Linux (client) to the O350 (server): no problems, achieves nearly line speed of the Gbit network (disk systems can keep up at both ends).

* Transferred the same 250GB using NFS. Linux Debian 5 client, mount options 'vers=3,rsize=32768,wsize=32768'. It works, but is very slow. 200GB took 18 hours, 17 minutes and 35 seconds, so that's only ~ 3MB/s :shock:

* Transferred several dozen GB by SMB to the O350. Works pretty well too (~ 85 - 90MB/s, samba 3.6.3 on the O350, Windows 7 client).

If anyone knows the magic NFS arguments to speed up NFS between Linux and IRIX I'd like to hear from you, otherwise I'll forget about NFS. Not for fear of crashing the 350, but because I will be dead before the files are transferred :lol:
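
The ~3MB/s figure for the NFS run can be checked with a short Python sketch. It assumes decimal units (1 GB = 1000 MB), which the post does not state; the function name is hypothetical.

```python
def throughput_mb_per_s(gigabytes, hours, minutes, seconds):
    """Average transfer rate in MB/s, assuming 1 GB = 1000 MB."""
    elapsed = hours * 3600 + minutes * 60 + seconds
    return gigabytes * 1000 / elapsed

# 200GB in 18 hours, 17 minutes and 35 seconds, as reported for NFS.
rate = throughput_mb_per_s(200, 18, 17, 35)
print(f"{rate:.2f} MB/s")  # prints "3.04 MB/s"
```

With binary units the result moves only slightly (about 3.1 MiB/s), so the conclusion is the same either way.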



Posted: Fri Feb 10, 2012 1:52 am

Joined: Tue Feb 24, 2004 4:10 pm
Posts: 9444
jan-jaap wrote:
If anyone knows the magic NFS arguments to speed up NFS between Linux and IRIX I'd like to hear from you

While you're at it, tips for Windows to IRIX via NFS would be good also. IRIX <-> Solaris NFS is fast.

Just change everything over to CXFS ?


Posted: Fri Feb 10, 2012 5:32 am

Joined: Fri Apr 01, 2011 6:45 am
Posts: 71
With IRIX client and CentOS server, I have around 90MB/s in both directions, using:
Code:
exports on CentOS side: rw,async,no_root_squash
mount options IRIX side:  rw,rsize=8192,wsize=8192



Quote:
Just change everything over to CXFS ?

Are you using CXFS?


Posted: Fri Feb 10, 2012 6:19 am

Joined: Thu Jun 17, 2004 10:35 am
Posts: 3829
Location: Wijchen, The Netherlands
hhoffman wrote:
With IRIX client and CentOS server, I have around 90MB/s in both directions, using:
Code:
exports on CentOS side: rw,async,no_root_squash
mount options IRIX side:  rw,rsize=8192,wsize=8192

You're using a Linux server and an IRIX client; for me it's the other way around. But I'll try to fiddle a bit with the block size.

Quote:
Just change everything over to CXFS ?

Not an option. First of all, CXFS costs real $$$; second, the Linux clients are buildbots, test systems, etc.: an easily discardable, rather volatile bunch. I don't want to have to deal with complicated things like CXFS here.

A dedicated subnet with jumbo frames would be a good idea, although both SMB and FTP can get close to wire speed using standard 1500-byte frames.



Posted: Fri Feb 10, 2012 8:54 pm

Joined: Sat Nov 12, 2011 3:18 am
Posts: 330
Location: Tokyo
Try compiling something and see if it crashes the box.
My Octane2 starts to crash when I compile emacs, for example.
I suspect that in my case it's the SCSI controller; it could be due to a high number of I/Os or something related.

_________________
[click for links to hinv] JP: :Fuel: |:Octane2: |:O2: | :Indy: || PL: [ :Fuel: :O2: :O2+: :Indy: ]


Posted: Sat Feb 11, 2012 5:57 am

Joined: Fri Jan 25, 2008 6:06 am
Posts: 767
Location: Sweden
Since I have no IRIX hardware nearby at the moment I can't verify this, but have you tried the async option?

_________________
:O3200: :Fuel: :Indy: :O3x02L:


Posted: Sat Feb 11, 2012 8:35 am

Joined: Thu Feb 10, 2005 12:37 pm
Posts: 490
Location: Laurel, MD USA
jan-jaap wrote:
hhoffman wrote:
With IRIX client and CentOS server, I have around 90MB/s in both directions, using:
Code:
exports on CentOS side: rw,async,no_root_squash
mount options IRIX side:  rw,rsize=8192,wsize=8192

You're using a Linux server and an IRIX client; for me it's the other way around. But I'll try to fiddle a bit with the block size.

Quote:
Just change everything over to CXFS ?

Not an option. First of all, CXFS costs real $$$; second, the Linux clients are buildbots, test systems, etc.: an easily discardable, rather volatile bunch. I don't want to have to deal with complicated things like CXFS here.

A dedicated subnet with jumbo frames would be a good idea, although both SMB and FTP can get close to wire speed using standard 1500-byte frames.


Agreed. I get 115-120 MB/s over SMB with 9 KB jumbo frames between my Nexenta CE SAN and my Windows 7 workstation.

_________________
:Indigo: 33mhz R3k/48mb/XS24 :Indy: 150mhz R4400/256mb/XL24 :Fuel: 600mhz R14kA/2gb/V10 Image 8x1.4ghz Itanium 2/8GB :O3x08R: 32x600mhz R14kA/24GB :Tezro: 4x700mhz R16k/8GB/V12/DCD/SAS/FC/DM5 (2x) :O3x0: 4x700mhz R16k/4GB :PrismDT: 2x1.6ghz 8mb/12gb/SAS/2xFGL


Posted: Fri Feb 24, 2012 6:02 am

Joined: Thu Jun 17, 2004 10:35 am
Posts: 3829
Location: Wijchen, The Netherlands
ramq wrote:
Since I have no IRIX hardware nearby at the moment I can't verify this, but have you tried the async option?

Nope, I didn't set async on the server (IRIX) side. Probably worth investigating :mrgreen:

The system has been humming along 24/7 (as usual) for the past two weeks. It must have transferred a couple of TB over the network, and the PID counter is well on its way to 3 million. I guess it really was a spurious event.
