Onyx/Tezro - Error: ERROR: SCAN:scan error - unable to reset scan hardware

jwhat
Posts: 304
Joined: Sat Aug 09, 2003 6:25 pm
Location: Australia

Onyx/Tezro - Error: ERROR: SCAN:scan error - unable to reset scan hardware

Post by jwhat » Fri Oct 20, 2017 12:40 am

Hi Nekochaners,

For no apparent reason, I am now getting a boot-up failure on one of my Onyx nodes.

This is similar to what was reported in this thread: viewtopic.php?f=3&t=16731353&p=7396316&hilit=scan+error#p7396316

I'm not sure, but the speculation there was that this was due to a voltage regulator problem, so I am posting the L1 "env" results here:

Code:

001c01-L1>* power up
001c01 ERROR: SCAN:scan error - unable to reset scan hardware
001c01-L1>env
Environmental monitoring is enabled and running.

Description    State       Warning Limits     Fault Limits       Current
-------------- ----------  -----------------  -----------------  -------
          1.8V    Enabled  10%   1.62/  1.98  20%   1.44/  2.16    1.78
           12V <not present>
        12V #2    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.06
          3.3V    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.32
        12V IO    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.06
        5V AUX    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.07
      3.3V AUX    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.30
    PCI 5V AUX    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.07
      PCI 3.3V    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.30
      PCI 2.5V    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.50
        PCI 5V    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    4.94
  XIO 12V BIAS <not present>
        XIO 5V <not present>
      XIO 2.5V <not present>
  XIO 3.3V AUX <not present>
 IP59 3.3V AUX    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.30
   IP59 5V AUX    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.07
      IP59 12V    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.06
     IP59 VCPU    Enabled  10%   1.15/  1.40  20%   1.02/  1.53    1.27
     IP59 SRAM    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.50
     IP59 1.5V    Enabled  10%   1.35/  1.65  20%   1.20/  1.80    1.49

Description     State       Warning RPM  Current RPM
--------------- ----------  -----------  -----------
FAN  0  EXHST 1    Enabled         2160         2327
FAN  1  EXHST 2    Enabled         2160         2343
FAN  2       PS    Enabled         3200         4753
FAN  3    PCI 1    Enabled         2160         2743
FAN  4    PCI 2    Enabled         2160         2986
FAN  5  N0 LEFT    Enabled         2160         3125
FAN  6  N0 CNTR    Enabled         2160         3030
FAN  7 N0 RIGHT    Enabled         2160         3260

                              Advisory   Critical   Fault      Current
Description       State       Temp       Temp       Temp       Temp       
----------------- ----------  ---------  ---------  ---------  --------- 
 0 INTERFACE 0       Enabled   31C/ 87F   48C/118F   55C/131F    1C/ 33F
 1 INTERFACE 1       Enabled   31C/ 87F   48C/118F   55C/131F    3C/ 37F
 2 INTERFACE 2       Enabled   31C/ 87F   48C/118F   55C/131F   11C/ 51F
 3 PCI RISER         Enabled   31C/ 87F   48C/118F   55C/131F   12C/ 53F
 4 ODYSSEY        <not present>
 5 NODE              Enabled   31C/ 87F   48C/118F   55C/131F   10C/ 50F
 6 BEDROCK           Enabled   31C/ 87F   48C/118F   55C/131F   -4C/ 24F

001c01-L1>* power down
001c01-L1>

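Since the speculation was about a voltage regulator, one quick check is whether any rail in the "env" output sits outside its warning or fault band. Here is a minimal Python sketch, assuming the column layout shown in the capture above; the function name and the regular expression are my own and not part of any SGI tool:

```python
import re

# Matches rows such as:
#   "          1.8V    Enabled  10%   1.62/  1.98  20%   1.44/  2.16    1.78"
ROW = re.compile(
    r"^\s*(?P<name>.+?)\s+Enabled\s+"
    r"\d+%\s+(?P<wlo>[\d.]+)/\s*(?P<whi>[\d.]+)\s+"
    r"\d+%\s+(?P<flo>[\d.]+)/\s*(?P<fhi>[\d.]+)\s+"
    r"(?P<cur>[\d.]+)\s*$"
)

def check_rails(env_text):
    """Return (name, current, status) for every enabled voltage rail."""
    results = []
    for line in env_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # headers, "<not present>" rows, fan and temperature tables
        cur = float(m["cur"])
        if not float(m["flo"]) <= cur <= float(m["fhi"]):
            status = "FAULT"
        elif not float(m["wlo"]) <= cur <= float(m["whi"]):
            status = "WARNING"
        else:
            status = "ok"
        results.append((m["name"], cur, status))
    return results

# A few rows copied from the capture above.
SAMPLE = """\
          1.8V    Enabled  10%   1.62/  1.98  20%   1.44/  2.16    1.78
           12V <not present>
        12V #2    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.06
     IP59 VCPU    Enabled  10%   1.15/  1.40  20%   1.02/  1.53    1.27
"""

for name, cur, status in check_rails(SAMPLE):
    print(f"{name:>12}  {cur:6.2f}  {status}")
```

For the output posted above, every enabled rail is inside its warning band, so the "env" readings on their own do not point at a regulator, at least at the moment "env" was run.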

The only change I have made to the machine since the last boot was the removal of some PCI cards.

See the L1 "pci" results:

Code:

001c01-L1>pci
Bus Slot Slot Stat Bus Stat  Power Mode/Speed
--- ---- --------- --------- ----- ----------
  1    1 0x70 0x01      0x00   15W PCI  33MHz
  1    2 0x80 0x0f      0x00  none PCI  33MHz
  2    1 0x00 0x00      0x00  7.5W PCI  33MHz
  2    2 0x00 0x00      0x00  7.5W PCI  33MHz
001c01-L1>


Could anyone offer a view on what the problem is here?

Thank you.

Cheers from Australia,

jwhat.
jwhat - ask questions, provide answers

jwhat
Posts: 304
Joined: Sat Aug 09, 2003 6:25 pm
Location: Australia

Re: Onyx/Tezro - Error: ERROR: SCAN:scan error - unable to reset scan hardware

Post by jwhat » Sun Oct 22, 2017 2:55 am

Hi Nekochaners,

I have not found the root cause of this error, but I have found a work-around...

The steps I took to try to resolve the problem were:

1. Removed all the non-IO9 PCI cards, as the only thing I had changed was the installation of a PCI card.

I still got the error message.

So I went from starting up via the L1 "* power up" command to using the power button on the front.

I found that holding the power button in for some time during startup appeared to get past the error (which also appears on the L1 display).

Once I got the machine to boot I did PROM: update and then rebooted.

On a hot reboot the machine always comes up OK, but on a cold boot from the L1 I still get the "scan error".

2. I then re-added the PCI cards (M-Audio and RAD audio).

Again I got the error on an L1-based startup but was able to work around it via the power button. All the cards are now identified, and the L1 "pci" command shows that the bus speed for the IO9 is 66MHz, while the 2 audio cards are at 33MHz.
When I got the original "scan error", all PCI slots were reporting as 33MHz.

So until I do some further review of the logs to try to pinpoint the problem, the work-around is the current approach.

Cheers,

jwhat
jwhat - ask questions, provide answers

jwhat
Posts: 304
Joined: Sat Aug 09, 2003 6:25 pm
Location: Australia

Re: Onyx/Tezro - Error: ERROR: SCAN:scan error - unable to reset scan hardware

Post by jwhat » Sat Nov 04, 2017 4:46 pm

Hi Nekochaners,

It seems this error is a bit of a mystery...

I am still trying to diagnose the cause.

Here are the outputs from the L1:

Code:

001c01-L1>serial all

Data                            Location      Value
------------------------------  ------------  --------
Local System Serial Number      NVRAM         M2004476
Reference System Serial Number  Attached C    M2004476
Local Brick Serial Number       EEPROM        REF634
Reference Brick Serial Number   NVRAM         REF634


EEPROM      Product Name    Serial         Part Number           Rev  T/W   
----------  --------------  -------------  --------------------  ---  ------
INTERFACE   2U_INT_53       REF634         030_1809_006          D    000000
IO9         IO9             NZT098         030_1771_006          A    00   
ODYSSEY     no hardware detected
RISER       2U_RISER        NSG552         030_1808_006          C    00   
NODE        IP59_4CPU       RAG401         030_1989_003          C    00   
SNOWBALL    no hardware detected
PS 1        DPS-500EBE      XPD0710000209  060-0178-003          S4
PS 2        DPS-500EBE      XPD0710000057  060-0178-003          S4

EEPROM     JEDEC-SPD Info           Part Number        Rev  Speed  SGI     
---------- ------------------------ ------------------ ---- ------ --------
DIMM 0     7F9800000000000000000000 KSG-ON3000/2048    0000  10.0  N/A     
DIMM 2     7F94FFFFFFFFFFFFB953E80D SM57228DSGI100M    00FF   8.0  N/A     
DIMM 4     7F94FFFFFFFFFFFF81FCD01D SM57228DSGI100M    00FF   8.0  N/A     
DIMM 6     7F94FFFFFFFFFFFF0EFCD01D SM57228DSGI100M    00FF   8.0  N/A     
DIMM 1     7F9800000000000000000000 KSG-ON3000/2048    0000  10.0  N/A     
DIMM 3     7F94FFFFFFFFFFFFD953E80D SM57228DSGI100M    00FF   8.0  N/A     
DIMM 5     7F94FFFFFFFFFFFF0BEF180D SM57228DSGI100M    00FF   8.0  N/A     
DIMM 7     7F94FFFFFFFFFFFF9D70980D SM57228DSGI100M    00FF   8.0  N/A     

001c01-L1>power
Supply          State Voltage    Margin  Value
--------------  ----- ---------  ------- -----
          1.8V    off    0.000V   normal     0
           12V    off    0.063V      N/A
        12V #2    off    0.063V      N/A
          3.3V     NC    0.138V   normal     0
        12V IO     NC    0.063V      N/A
        5V AUX     NC    5.096V      N/A
      3.3V AUX     NC    3.302V      N/A
    PCI 5V AUX     NC    5.096V      N/A
      PCI 3.3V     NC    0.138V      N/A
      PCI 2.5V    off    0.000V   normal     0
        PCI 5V    off    0.000V   normal     0
  XIO 12V BIAS     <not present>
        XIO 5V     <not present>
      XIO 2.5V     <not present>
  XIO 3.3V AUX     <not present>
 IP59 3.3V AUX     NC    3.302V      N/A
   IP59 5V AUX     NC    5.070V      N/A
      IP59 12V     NC    0.063V      N/A
     IP59 VCPU    off    0.000V   normal    15
     IP59 SRAM    off    0.000V   normal     0
     IP59 1.5V    off    0.000V   normal     0

001c01-L1>* power up

001c02 ATTN: Cooling system stabilized
001c01 ERROR: SCAN:scan error - unable to reset scan hardware
001c01-L1>
001c01-L1>log
11/02/17 06:24:32 TODO: enable vsc055 fan interrupts
11/02/17 06:24:32 TODO: enable vsc055 fan interrupts
11/02/17 06:24:32 TODO: enable vsc055 fan interrupts
11/02/17 06:24:32 TODO: enable vsc055 fan interrupts
11/02/17 06:24:45 L1 booting 1.30.6
11/02/17 06:24:46 ChiServ IP59
11/02/17 06:24:46 Checking for Type
11/02/17 06:24:46  -- ChiServ Type set
11/02/17 06:24:48 TODO: enable vsc055 fan interrupts
11/02/17 06:24:48 TODO: enable vsc055 fan interrupts
11/02/17 06:24:48 TODO: enable vsc055 fan interrupts
11/02/17 06:24:48 USB0: waiting on open
11/02/17 06:27:28 SMP unregistering events
11/02/17 06:27:28 UNREG: 30006834 0 4
11/02/17 06:27:29 SMP-R: UART:UART_NO_CONNECTION
11/02/17 06:27:40 B2BR: UART:UART_BREAK_RECEIVED
11/02/17 06:27:40 B2B[1] PPP error UART:UART_BREAK_RECEIVED
11/02/17 06:27:40 B2BR-CTC port: IRouter:read failed - read error
11/04/17 16:48:53 L1 booting 1.30.6
11/04/17 16:48:54 ChiServ IP59
11/04/17 16:48:54 Checking for Type
11/04/17 16:48:54  -- ChiServ Type set
11/04/17 16:48:56 TODO: enable vsc055 fan interrupts
11/04/17 16:48:56 TODO: enable vsc055 fan interrupts
11/04/17 16:48:56 TODO: enable vsc055 fan interrupts
11/04/17 16:48:56 USB0: waiting on open
11/04/17 16:49:09 SMP unregistering events
11/04/17 16:49:09 UNREG: 30006834 0 4
11/04/17 16:49:10 SMP-R: UART:UART_NO_CONNECTION
11/04/17 16:54:08 power up (COMMAND)
11/04/17 16:54:08 TODO: enable vsc055 fan interrupts
11/04/17 16:54:08 TODO: enable vsc055 fan interrupts
11/04/17 16:54:08 TODO: enable vsc055 fan interrupts
11/04/17 16:54:12 TODO: enable vsc055 fan interrupts
11/04/17 16:54:12 TODO: enable vsc055 fan interrupts
11/04/17 16:54:12 TODO: enable vsc055 fan interrupts
11/04/17 16:54:12 TODO: enable vsc055 fan interrupts
11/04/17 16:54:12 TODO: enable vsc055 fan interrupts
11/04/17 16:54:12 TODO: enable vsc055 fan interrupts
001c01-L1>
001c01-L1>fan
fan(s) are on.
fan 0 EXHST 1  rpm 2327
fan 1 EXHST 2  rpm 2327
fan 2 PS       rpm 4753
fan 3 PCI 1    rpm 2721
fan 4 PCI 2    rpm 2986
fan 5 N0 LEFT  rpm 3157
fan 6 N0 CNTR  rpm 3061
fan 7 N0 RIGHT rpm 3260
001c01-L1>
001c01-L1>pci
Bus Slot Slot Stat Bus Stat  Power Mode/Speed
--- ---- --------- --------- ----- ----------
  1    1 0x70 0x01      0x00   15W PCI  33MHz
  1    2 0xa0 0x0f      0x00  none PCI  33MHz
  2    1 0x00 0x00      0x00  7.5W PCI  33MHz
  2    2 0x00 0x00      0x00  7.5W PCI  33MHz
001c01-L1>


Since the last time I used the machine I have added a second power supply, so the "power" info now shows the second 12V supply.
I will read up further on the L1 to see if I can get more verbose output to help pinpoint the source of the problem.
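
When reading the L1 log, it helps to strip out the repeated "TODO: enable vsc055 fan interrupts" lines so that the remaining events stand out. Here is a rough Python sketch, assuming the "MM/DD/YY HH:MM:SS message" layout seen in the log capture above; the helper name is my own:

```python
from datetime import datetime

NOISE = "TODO: enable vsc055 fan interrupts"

def interesting_events(log_text, noise=NOISE):
    """Yield (timestamp, message) pairs, skipping the repeated noise line."""
    for line in log_text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) < 3:
            continue  # prompts and blank lines
        try:
            stamp = datetime.strptime(f"{parts[0]} {parts[1]}", "%m/%d/%y %H:%M:%S")
        except ValueError:
            continue  # not a timestamped log line
        message = parts[2].strip()
        if message != noise:
            yield stamp, message

# A few lines copied from the log capture above.
SAMPLE = """\
11/04/17 16:48:53 L1 booting 1.30.6
11/04/17 16:48:56 TODO: enable vsc055 fan interrupts
11/04/17 16:49:10 SMP-R: UART:UART_NO_CONNECTION
11/04/17 16:54:08 power up (COMMAND)
11/04/17 16:54:08 TODO: enable vsc055 fan interrupts
001c01-L1>
"""

for stamp, message in interesting_events(SAMPLE):
    print(stamp.isoformat(sep=" "), message)
```

In the log excerpt above, nothing but the fan TODO lines follows the "power up (COMMAND)" entry, so the scan error itself does not show up in this log.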

Also, does anyone know why some RAM displays as 8.0 and some as 10.0?

In the hinv all RAM is reported as being "Premium".
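
On the speed question, grouping the DIMM rows by part number at least shows that the Speed column tracks the module type rather than the slot. Here is a small Python sketch, assuming the whitespace-separated layout of the "serial all" DIMM table above; the function name is my own, and what the Speed value actually measures is not something this output confirms:

```python
from collections import defaultdict

def speeds_by_part(dimm_rows):
    """Map part number -> {speed: [slot numbers]} from 'serial all' DIMM rows."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in dimm_rows.splitlines():
        fields = row.split()
        # Expected fields: "DIMM", slot, SPD info, part number, rev, speed, SGI
        if len(fields) < 6 or fields[0] != "DIMM":
            continue
        slot, part, speed = int(fields[1]), fields[3], fields[5]
        groups[part][speed].append(slot)
    return {part: dict(by_speed) for part, by_speed in groups.items()}

# Four rows copied from the capture above.
SAMPLE = """\
DIMM 0     7F9800000000000000000000 KSG-ON3000/2048    0000  10.0  N/A
DIMM 2     7F94FFFFFFFFFFFFB953E80D SM57228DSGI100M    00FF   8.0  N/A
DIMM 1     7F9800000000000000000000 KSG-ON3000/2048    0000  10.0  N/A
DIMM 3     7F94FFFFFFFFFFFFD953E80D SM57228DSGI100M    00FF   8.0  N/A
"""

for part, by_speed in speeds_by_part(SAMPLE).items():
    print(part, by_speed)
```

For the full table above, the two KSG-ON3000/2048 modules (DIMM 0 and 1) report 10.0 and the six SM57228DSGI100M modules report 8.0.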

Cheers from Australia.


jwhat.
jwhat - ask questions, provide answers

mopar5150
Posts: 557
Joined: Tue Apr 24, 2012 6:02 pm
Location: Palm Springs, CA

Re: Onyx/Tezro - Error: ERROR: SCAN:scan error - unable to reset scan hardware

Post by mopar5150 » Sun Nov 05, 2017 10:20 am

Try a different PCI riser, as I have seen this error when switching between PCI and AGP risers in the same chassis. There is something different between the two, and in the past, when I put the AGP riser back into the chassis it came from, the scan error went away.
If the thing isn't on fire it's a software problem.

:Tezro: :O3x0: :A350:

jwhat
Posts: 304
Joined: Sat Aug 09, 2003 6:25 pm
Location: Australia

Re: Onyx/Tezro - Error: ERROR: SCAN:scan error - unable to reset scan hardware

Post by jwhat » Mon Nov 06, 2017 1:09 am

Hi John,

Thanks for the tip, I was hoping to avoid re-swapping parts...

I will leave it until the weekend.

In the meantime I have been using the Linux L3 controller to run diagnostics, to see if it comes up with anything.

Cheers from Australia.

jwhat.
jwhat - ask questions, provide answers

jwhat
Posts: 304
Joined: Sat Aug 09, 2003 6:25 pm
Location: Australia

Re: Onyx/Tezro - Error: ERROR: SCAN:scan error - unable to reset scan hardware

Post by jwhat » Tue Nov 07, 2017 4:37 am

Hi Nekochaners,

Good news (for me anyway ;-) )!

Today was a holiday in Australia (Melbourne Cup), so after mostly inconclusive testing with the L3 software diagnostics I decided I may as well complete the IRIX update to 6.5.29 and apply the corresponding new L1 software to the 4x1GHz and 4x800MHz chassis.

The IRIX update was simple and smooth. I then archived the 6.5.29 /usr/sbin/flashsc and /usr/cpu/firmware/l1.bin files for posterity and applied the L1 update via IRIX:
"./flashsc --sc -f l1.bin all".

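For anyone wanting to keep copies of flashsc and l1.bin in the same way, recording a checksum next to each copy makes it easy to confirm later that the archived image is intact before flashing it again. Here is a minimal Python sketch; it assumes Python 3 is available on whatever machine holds the copies (which may well not be the IRIX box itself), and the archive directory name and function name are hypothetical:

```python
import hashlib
import shutil
from pathlib import Path

def archive_with_checksums(files, dest_dir):
    """Copy each file into dest_dir and write a SHA256SUMS manifest beside them."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    lines = []
    for src in map(Path, files):
        target = dest / src.name
        shutil.copy2(src, target)  # copy2 also preserves timestamps
        digest = hashlib.sha256(target.read_bytes()).hexdigest()
        lines.append(f"{digest}  {target.name}")
    (dest / "SHA256SUMS").write_text("\n".join(lines) + "\n")
    return lines

if __name__ == "__main__":
    # Source paths are the ones named in the post; the destination is hypothetical.
    sources = ["/usr/sbin/flashsc", "/usr/cpu/firmware/l1.bin"]
    if all(Path(p).exists() for p in sources):
        for line in archive_with_checksums(sources, "l1-archive-6.5.29"):
            print(line)
```
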
While applying the flash via IRIX I also had the L2 emulation connected via the USB L1 port.

For the first time (after around 10 flash attempts with prior versions), the "all" directive went smoothly and the update was applied across both machines.

Then, via the L2 emulator, I changed the default image for rack/slots 1.1 and 1.2:
"1.1 flash default a" and
"1.2 flash default b"
as the chassis have different default images.

Then I rebooted the L1s (again via the L2):
"l1 reboot_l1"

Then I powered up using the L2 emulator GUI's Power Up button.

The machine now boots up cleanly... no more "scan error".

I have now done a number of stops and restarts via the L2 emulator and via the L1 connection to the Console port, both with and without the redundant power supply on the problem chassis (having dual power supplies stopped the front-panel button work-around from working).

I expect mopar was right about what could have been a contributing cause, but I am glad to report that first going back to old L1 versions to get the same L1 across all machines, and then moving to newer versions, resolved the problem.

Both boxes are now running IRIX 6.5.29 with L1 1.40.5.

Cheers from Australia.

jwhat.
jwhat - ask questions, provide answers

