Chapter 7. Basic Troubleshooting

This chapter contains hardware-specific information that can be helpful if you are having trouble with your SGI 2100 system. This information is provided in addition to the Module System Controller (MSC) information provided in the previous chapter.

This chapter is intended to give you some basic guidelines to help keep your hardware and the software that runs on it in good working order.

General Guidelines

To keep your system in good running order, follow these guidelines:

  • Do not enclose the system in a small, poorly ventilated area (such as a closet), crowd other large objects around it, or drape anything (such as a jacket or blanket) over the system.

  • Do not connect cables or add other hardware components while the system is turned on.

  • Do not leave the front panel key switch in the diagnostic position.


    Note: There is clearance provided for the front panel to close while a key is inserted into the MSC. However, the door may snag on any additional keys you have attached to the MSC's main key.


  • Do not lay the system on its side.

  • Do not power off the system frequently; leave it running over nights and weekends, if possible. If a system console terminal is installed, it can be powered off when it is not being used.

  • Do not place liquids, food, or extremely heavy objects on the system.

  • Ensure that all cables are plugged in completely.

  • Ensure that the system has power surge protection.

Operating Guidelines

When your system is up and running, follow these operational guidelines:

  • Do not turn off power to a system that is currently started up and running software.

  • Do not use the root account unless you are performing administrative tasks.

  • Make regular backups (weekly for the whole system, nightly for individual users) of all information.

  • Keep two sets of backup tapes to ensure the integrity of one set while doing the next backup.

  • Protect the root account with a password:

      • Check for root UID = 0 accounts (for example, diag) and set passwords for these accounts.

      • Consider giving passwords to courtesy accounts such as guest and lp.

      • Look for empty password fields in the /etc/passwd file.
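The account checks above can be sketched as a short shell function. This is a minimal sketch, not a supplied tool: audit_passwd is a hypothetical helper name, and it assumes that passwords are stored in the passwd file itself (on a system that uses a shadow password file, the password field holds only a placeholder and this check is not meaningful).

```shell
#!/bin/sh
# audit_passwd: hypothetical helper, not an IRIX command.
# Reports accounts with an empty password field and accounts
# other than root that have UID 0.
audit_passwd() {
    file="${1:-/etc/passwd}"
    # passwd fields are colon-separated: name:password:UID:GID:...
    awk -F: '
        $2 == ""                { print "empty password: " $1 }
        $3 == 0 && $1 != "root" { print "UID 0 account: " $1 }
    ' "$file"
}
```

Run it as root with no argument to examine /etc/passwd, or pass a copy of the file as the first argument. Any account it reports is a candidate for a new password.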

If your system behaves marginally or appears faulty, first do a physical inspection using the checklist below. If all of the connections seem solid, go to the previous chapter and use the MSC to try to isolate the problem. If the problem persists, run the diagnostic tests from the System Maintenance menu or PROM Monitor. See the IRIX Admin: System Configuration and Operation manual for more information about diagnostic tests.

If this does not help, contact your system administrator or service provider.

Check every item on this list:

  • The terminal and MSC power switches are turned on.

  • The main system power switch is not turned to off.

  • The fans are running and the fan inlets/outlets are not blocked.

  • The MSC display does not show a fault message or warning.

Before you continue, shut down the system and turn off the power.

Check all of the following cable connections:

  • The terminal power cable is securely connected to the terminal at one end and the power source at the other end.

  • The system power cable is securely connected to the main unit at one end and plugged into the proper AC outlet at the other end.

  • The Ethernet cable is connected to the connector port labeled Ethernet.

  • Serial port cables are plugged securely into their corresponding connectors.

  • All cable routing is safe from foot traffic.

If you find any problems with hardware connections, correct them and turn on the power to the main unit. The MSC may help to determine if internal system problems exist.

Power Supply Problems

The power supply in your SGI 2100 is not considered an end-user replaceable component. However, you can make certain basic checks to determine whether a system problem is related directly to the power supply.

If the system will not power on at all, check the following:

  • Confirm that the system circuit breaker is up (in the On position).

  • Check to make sure the power cable is firmly plugged in at both the system connector and the wall socket.

  • Remove the front cover and confirm that the cable connecting the power supply to the fan tray is secure.

In some cases the power supply may be unable to supply enough voltage to meet system requirements. When the MSC indicates a power supply related problem, you can remove the front cover and check the status of the three LEDs on the front of the power supply. For help on properly removing the front cover, see “Removing the System's Plastic Covers” in Chapter 3.

The Amber (Yellow) LED

The amber LED on the power supply (also known as the AC_OK indicator) lights when the AC input voltage is applied and the system circuit breaker is in the On position.

If the amber LED is not lit, you should check the following:

  • The AC outlet

  • The system power cord and power switch

  • The fan tray to power supply cable

If none of these items is a problem, check the other LEDs on the power supply for any indications.

The Green LED

The green LED indicator (also known as the Power Good indicator) lights when power supply outputs are within specification.

If this LED starts to blink on and off, it is a warning that the supply is overloaded. This may indicate a condition such as a 110 volt system that is overloaded with too many node boards or other options. In this case, contact your service provider for information and assistance.

The Red LED

The red LED (also known as the Fault indicator) lights whenever the power supply shuts off because of insufficient air flow, or when a system overtemperature shutdown occurs.

A blinking red LED indicates an undervoltage condition: the supply has dropped below acceptable limits in the +3.45, +5, or +12 volt range. The supply can be reset by power-cycling the system. Because this could be a symptom of other problems, contact your service provider for additional information.
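The LED descriptions in this section can be collected into a single lookup. The following is only a sketch that restates the text above: psu_led_hint is a hypothetical helper name, and LED states that this chapter does not describe (for example, a green LED that is off) are reported as such rather than guessed at.

```shell
#!/bin/sh
# psu_led_hint: hypothetical helper that restates the LED
# descriptions in this chapter. Usage: psu_led_hint COLOR STATE
psu_led_hint() {
    case "$1:$2" in
        amber:on)    echo "AC input present and circuit breaker on" ;;
        amber:off)   echo "check AC outlet, power cord and switch, fan tray cable" ;;
        green:on)    echo "power supply outputs within specification" ;;
        green:blink) echo "supply overloaded; contact service provider" ;;
        red:on)      echo "supply shut off: insufficient air flow or overtemperature" ;;
        red:blink)   echo "undervoltage; power-cycle to reset, contact service provider" ;;
        *)           echo "state not described in this chapter"; return 1 ;;
    esac
}
```

For example, psu_led_hint red blink prints the undervoltage hint.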

Crash Recovery

To minimize data loss from a system crash, back up your system daily and verify the backups. Often a graceful recovery from a crash depends upon good backups.

Your system may have crashed if it fails to boot or respond normally to input devices such as the keyboard. The most common form of system crash is terminal lockup—your system fails to accept any commands from the keyboard. Sometimes when a system crashes, data is damaged or lost.

Before going through a crash recovery process, check your terminal configuration and cable connections. If everything is in order, try accessing the system remotely from another workstation or from the system console terminal (if present).

If none of the solutions in the previous paragraphs is successful, you can fix most problems that occur when a system crashes by using the methods described in the following paragraphs. You can prevent additional problems by recovering your system properly after a crash.

The following sections present several ways to recover your system from a crash. The simplest method, rebooting the system, is presented first. If that fails, go on to the next method, and so on. These sections are an overview of the different crash recovery methods.

Rebooting the System

Rebooting usually fixes problems associated with a simple system crash.

Restoring System Software

If you do not find a simple hardware connection problem and you cannot reboot the system, a system file might be damaged or missing. In this case, you need to copy system files from the installation source to your hard disk. Some site-specific information might be lost.

Restoring From Backup Tapes

If restoring system software fails to recover your system fully, you must restore from backup tapes. Complete and recent backup tapes contain copies of important files. Some user- and site-specific information might be lost. Read the following section for information on file restoration.

Restoring a Filesystem From the System Maintenance Menu

If your root filesystem is damaged and your system cannot boot, you can restore your system from the System Maintenance Menu. This is the menu that appears when you interrupt the boot sequence before the operating system takes over the system. To perform this recovery, you need two different tapes: your system backup tape and a bootable tape with the miniroot.

If a backup tape is to be used with the System Recovery option of the System Maintenance Menu, it must have been created with the System Manager or with the Backup command, and it must be a full system backup: one that begins in the root directory (/) and contains all the files and directories on your system. Although the Backup command is a front-end interface to the bru command, Backup also writes the disk volume header on the tape so that the “System Recovery” option can reconstruct the boot blocks. Other backup tools do not write this information to the tape. For information on creating the system backup, see the IRIX Admin: Backup, Security, and Accounting manual.

If you do not have a full system backup made with the Backup command or System Manager, and your root or usr filesystems are so badly damaged that the operating system cannot boot, you have to reinstall your system.

If you need to reinstall the system to read your tapes, install a minimal system configuration and then read your full system backup (made with any backup tool you prefer) over the freshly installed software.

This procedure should restore your system to its former state.


Caution: Existing files of the same pathname on the disk are overwritten during a restore operation, even if they are more recent than the files on tape.


  1. Start the system and you should see a message like the following:

    Starting up the system....
    To perform system maintenance instead, press <Esc>
    

  2. Press the <Esc> key. You see the following menu:

    System Maintenance Menu

    1 Start System

    2 Install System Software

    3 Run Diagnostics

    4 Recover System

    5 Enter Command Monitor

  3. Enter the numeral 4 and press <Return>. You see this message:

    System Recovery...

    Press Esc to return to the menu.

    After a few moments, you see the message:

    Insert the installation tape, then press <Enter>:

  4. Insert your bootable tape and press the <Enter> key. You see some messages while the miniroot is loaded. Next you see the message:

    Copying installation program to disk....
    

    Several lines of dots appear on your screen while this copy takes place.

  5. You see this message:

    CRASH RECOVERY
    You may type sh to get a shell prompt at most questions.
    Remote or local restore: ([r]emote, [l]ocal): [l]
    

  6. Press <Enter> for a local restoration. If your tape drive is on another system accessible by the network, press r and then <Enter>. You are prompted for the name of the remote host and the name of the tape device on that host. If you press <Enter> to select a local restoration, you see this message:

    Enter the name of the tape device: [/dev/tape] 
    

    You may need to enter the exact device name of the tape device on your system, since the miniroot may not recognize the link to the convenient /dev/tape filename. As an example, if your tape drive is drive #6 on your integral SCSI bus (bus 0), the most likely device name is /dev/rmt/tps0d6nr. If it is drive #3, the device is /dev/rmt/tps0d3nr.

    The system prompts you to insert the backup tape. When the tape has been read back onto your system disk, you are prompted to reboot your system.
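The two device names given in step 6 follow a visible pattern: tps, the SCSI bus number, d, the drive number, and the suffix nr. The sketch below builds a name from that pattern. It is an inference from those two examples only; tape_device is a hypothetical helper name, and the device files actually present on your system may differ with the drive type, so confirm the result against a listing of /dev/rmt before relying on it.

```shell
#!/bin/sh
# tape_device: hypothetical helper. Builds a tape device path
# following the pattern of the two examples in step 6.
# Usage: tape_device BUS DRIVE
tape_device() {
    bus="$1"
    drive="$2"
    printf '/dev/rmt/tps%sd%snr\n' "$bus" "$drive"
}
```

For example, tape_device 0 6 prints /dev/rmt/tps0d6nr, the name used above for drive 6 on bus 0.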

Recovery After System Corruption

From time to time you may experience a system crash caused by file corruption. Systems cease operating (“crash”) for a variety of reasons. Most common are software crashes, followed by power failures of some sort, and least common are actual hardware failures. Regardless of the type of system crash, if your system files are lost or corrupted, you may need to recover your system from backups to its pre-crash configuration.

Once you repair or replace any damaged hardware, you are ready to recover the system. Regardless of the nature of your crash, you should refer to the information in the section “Restoring a Filesystem from the System Maintenance Menu” in the IRIX Admin: Backup, Security, and Accounting manual.

The System Maintenance Menu recovery command is designed for use as a full backup system recovery. After you have done a full restore from your last complete backup, you may restore newer files from incremental backups at your convenience. This command is designed to be used with archives made using the Backup utility or through the System Manager. The System Manager is described in detail in the Personal System Administration Guide. System recovery from the System Maintenance Menu is not intended for use with the tar, cpio, dd, or dump utilities. You can use these other utilities after you have recovered your system.

You may also be able to restore filesystems from the miniroot. For example, if your root filesystem has been corrupted, you may be able to boot the miniroot, unmount the root filesystem, and then use the miniroot version of restore, xfs_restore, bru, cpio, or tar to restore your root filesystem. Refer to the reference (man) pages on these commands for details on their application.

Refer to the IRIX Admin: System Configuration and Operation manual for instructions on good general system administration practices.

MSC Shutdown

Under specific circumstances the MSC may shut down the system. Usually this occurs when the operating environment becomes too warm due to fan failure, high ambient temperatures, or a combination of the two.

The MSC automatically shuts down the system and lights the “Over Temperature Fault” LED if any of the following situations occur:

  • Failure of two or more of the system's nine fans.

  • Failure of one fan plus a high ambient temperature.

  • Failure of any (critical) fan directly responsible for cooling the power supply or a router board.

  • An unacceptably high ambient temperature.

Only the last situation can be dealt with completely by the end user. The first three require a service call by a qualified support technician.
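The four shutdown conditions can be written as a small decision function, which may help when reasoning about which combination of faults you are seeing. This is a model of the list above, not the MSC's own logic: msc_would_shut_down is a hypothetical name, and the three ambient levels (normal, high, unacceptable) are an assumption made for illustration, since this chapter does not give temperature thresholds.

```shell
#!/bin/sh
# msc_would_shut_down: hypothetical model of the four conditions above.
# Arguments:
#   $1  number of failed fans (0 to 9)
#   $2  yes if a critical fan has failed, otherwise no
#   $3  ambient level: normal, high, or unacceptable (assumed labels)
# Returns 0 if the conditions call for a shutdown, 1 otherwise.
msc_would_shut_down() {
    failed_fans="$1"
    critical_failed="$2"
    ambient="$3"
    if [ "$failed_fans" -ge 2 ]; then
        return 0    # two or more fans failed
    elif [ "$failed_fans" -ge 1 ] && [ "$ambient" != normal ]; then
        return 0    # one fan failed plus a high ambient temperature
    elif [ "$critical_failed" = yes ]; then
        return 0    # a critical fan failed
    elif [ "$ambient" = unacceptable ]; then
        return 0    # ambient temperature alone is too high
    fi
    return 1
}
```

For example, msc_would_shut_down 1 no high succeeds (a shutdown is expected), while msc_would_shut_down 1 no normal does not.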

Fixing the MSC Shutdown

If you determine that a critical fan or fans have failed, you should immediately place a service call. The system is not usable until the faulty fan(s) are replaced.

If the problem involves the combined failure of a single non-critical fan and a high ambient temperature, you should place a service call. You may be able to keep the system running by lowering the ambient temperature of the operating environment while waiting for service.

To lower the ambient temperature around the system, try these methods:

  • Lower the air conditioning temperature.

  • Move the system to a cooler environment.

  • Use one or more portable fans to circulate more air around the system.

  • Use a portable air-conditioner to lower the temperature of the system.

If the problem is simply a high ambient temperature, you will need to either lower the work environment temperature, or move the system to an area with a lower ambient temperature.

Hardware Graph and hinv Commands

If you are having trouble determining what options and standard components are installed in your SGI 2100, you may wish to use one or several of the commands listed in the next sections.

Hardware Graph Information

The hardware graph is a tool for inventorying the I/O devices of the SGI 2100 system. Unlike hinv, the hardware graph is a UNIX® filesystem, whose branching character accommodates the possibility of multiple nodes, each with multiple I/O devices of several types. The hardware graph keeps track of information in the kernel that is associated with the hardware.

Most of the hardware graph directories are much like their /dev counterparts, but module numbers are persistent across reboots and hardware changes (until you change the module numbers).

To see the hardware graph, use the ls command. For example:

# ls /hw
console    mem       module     rdisk     ttys      scsi_ctlr   unknown
disk       kmem      mmem       null      scsi      ttys        zero

In this output, module, rdisk, ttys, scsi, and scsi_ctlr are subdirectories containing files. For example:

# ls /hw/ttys
tty4d1  tty4f1  tty4m1  ttyc1   ttyd1   ttyf1   ttym1
tty4d2  tty4f2  tty4m2  ttyc2
# ls /hw/scsi
sc1d2l0
# ls /hw/rdisk
dks1d2s0       dks1d2vh       root           volume_header
dks1d2s1       dks1d2vol      swap
# ls /hw/scsi_ctlr
0  1

To determine I/O devices within a system, follow the directory structure. For example:

# ls /hw/module/1/slot/n4/node/link/cpu
0  1
# ls /hw/module/1/slot/n4/node/link/xtalk
0
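Following the directory structure by hand becomes tedious on a system with several modules and node boards. The loop below is a minimal sketch that lists every CPU entry under the graph in one pass. list_node_cpus is a hypothetical helper name, and the path layout it expects (module, slot, node/link/cpu) is taken from the single example above, so other systems or slot types may need a different pattern.

```shell
#!/bin/sh
# list_node_cpus: hypothetical helper. Lists CPU entries found under
# module/<m>/slot/<s>/node/link/cpu in the hardware graph.
# Usage: list_node_cpus [GRAPH_ROOT]   (GRAPH_ROOT defaults to /hw)
list_node_cpus() {
    root="${1:-/hw}"
    for cpu in "$root"/module/*/slot/*/node/link/cpu/*; do
        # When nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$cpu" ] || continue
        # Print the path relative to the graph root.
        echo "${cpu#"$root"/}"
    done
}
```

Run with no argument, it walks /hw and prints one line per CPU, such as module/1/slot/n4/node/link/cpu/0.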

hinv Information

Use the hinv command to obtain basic information regarding the general configuration of your system. You should see output similar to the following (although it varies from system to system, depending upon how each system is configured):

# hinv
System SGI-IP27
4 250 MHZ IP27 Processors
Main memory size: 512 Mbytes
Integral SCSI controller 0
Integral SCSI controller 1
Integral Fast Ethernet
IOC3 serial port
Disk drive: unit 1 on SCSI Controller 0, (dksc(0,1,0))

>> hinv -v
IP27 Node Board, Module 1, Slot n1
    ASIC HUB Rev 3, 100 MHz, (nasid 0)
    Processor A: 250 MHz R10000, Rev 3.4, 4M 250MHz secondary cache, (cpu 0)
      R10000FPC  Rev 0
    Processor B: 250 MHz R10000, Rev 3.4, 4M 250MHz secondary cache, (cpu 1)
      R10000FPC  Rev 0
    Memory on board, 64 MBytes (Standard)
      Bank 0, 64 MBytes (Standard) <-- (Physical Bank 0)
IP27 Node Board, Module 1, Slot n2
    ASIC HUB Rev 5, 100 MHz, (nasid 1)
    Processor A: 250 MHz R10000, Rev 3.4, 4M 250MHz secondary cache, (cpu 2)
      R10000FPC  Rev 0
    Processor B: 250 MHz R10000, Rev 3.4, 4M 250MHz secondary cache, (cpu 3)
      R10000FPC  Rev 0
    Memory on board, 512 MBytes (Standard)
      Bank 0, 128 MBytes (Standard) <-- (Physical Bank 0)
      Bank 1, 128 MBytes (Standard)
      Bank 2, 128 MBytes (Standard)
      Bank 3, 128 MBytes (Standard)
BASEIO IO Board, Module 1, Slot io1
    ASIC BRIDGE Rev 3, (widget 8)
    adapter PCI-SCSI Rev 5, (pci id 0)
        peripheral SCSI DISK, ID 1, SGI IBM DORS-32160W
    adapter PCI-SCSI Rev 5, (pci id 1)
    adapter IOC3 Rev 1, (pci id 2)
        controller multi function SuperIO
        controller Ethernet Rev 1
ASIC XBOW Rev 2, on midplane of Module 1
ASIC XBOW Rev 2, on midplane of Module 1
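When you only need the processor count and memory size, the short hinv listing can be reduced to two lines with awk. The following is a sketch under assumptions: hinv_summary is a hypothetical helper name, and the patterns match only the wording of the sample output above (a line ending in "Processors" and a line beginning "Main memory size:"), which may differ on other configurations.

```shell
#!/bin/sh
# hinv_summary: hypothetical helper. Reads hinv-style text on
# standard input and prints the processor count and memory size.
hinv_summary() {
    awk '
        /Processors$/        { cpus = $1 }          # e.g. "4 250 MHZ IP27 Processors"
        /^Main memory size:/ { mem = $4 " " $5 }    # e.g. "Main memory size: 512 Mbytes"
        END { print "cpus=" cpus; print "memory=" mem }
    '
}
```

On a running system the intended use is hinv | hinv_summary. The function also works on a saved copy of the listing, which is convenient when you are comparing configurations.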