Chapter 7. Basic Troubleshooting

This chapter contains hardware-specific information that can be helpful if you are having trouble with your Silicon Graphics Onyx2 graphics rack system.

It is intended to give you some basic guidelines to help keep your hardware and the software that runs on it in good working order.

General Guidelines

To keep your system in good running order, follow these guidelines:

  • Do not enclose the system in a small, poorly ventilated area (such as a closet), crowd other large objects around it, or drape anything (such as a jacket or blanket) over the system.

  • Do not connect cables or add other hardware components while the system is turned on.

  • Do not leave either the graphics or compute module front panel key switches in the diagnostic position during normal operation.

  • Do not power off the system frequently; leave it running overnight and on weekends, if possible. If a system console terminal is installed, it can be powered off when it is not being used.

  • Do not place liquids, food, or extremely heavy objects on the system or keyboard.

  • Ensure that all cables are plugged in completely.

  • Ensure that the system has power surge protection.

Operating Guidelines

When your system is up and running, follow these operational guidelines:

  • Do not turn off power to a system that is currently started up and running software.

  • Do not use the root account unless you are performing administrative tasks.

  • Make regular backups (weekly for the whole system, nightly for individual users) of all information.

  • Keep two sets of backup tapes to ensure the integrity of one set while doing the next backup.

  • Protect the root account with a password.

  • Check for accounts with root privileges (UID = 0), such as diag, and set passwords for these accounts.

  • Consider giving passwords to courtesy accounts such as guest and lp.

  • Look for empty password fields in the /etc/passwd file.
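
The account checks in the last few items can be scripted. The following is a minimal Python sketch, not a tool supplied with the system. It assumes the usual colon-separated passwd layout (name, password, UID as the first three fields); the function name audit_passwd and the sample entries are hypothetical and exist only for illustration. If your system keeps passwords in a separate shadow file, the password field in /etc/passwd holds a placeholder, and the empty-field check would need to be run against that file instead.

```python
def audit_passwd(text):
    """Return (empty_password_accounts, uid_zero_accounts) from passwd-format text."""
    empty, uid_zero = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 3:
            continue  # malformed entry; skip rather than guess
        name, password, uid = fields[0], fields[1], fields[2].strip()
        if password == "":
            empty.append(name)  # no password set for this account
        if uid.isdigit() and int(uid) == 0:
            uid_zero.append(name)  # account has root privileges
    return empty, uid_zero


# Hypothetical sample entries, not taken from a real system.
SAMPLE = """\
root:x:0:0:Super-User:/:/bin/sh
diag::0:996:Hardware Diagnostics:/usr/diags:/bin/csh
guest::998:998:Guest Account:/usr/people/guest:/bin/csh
lp:x:9:9:Print Spooler Owner:/var/spool/lp:/bin/sh
"""

if __name__ == "__main__":
    empty, uid_zero = audit_passwd(SAMPLE)
    print("Empty password field:", empty)
    print("UID 0 accounts:", uid_zero)
```

To audit a real file, read its contents and pass the text to audit_passwd; the function only reports accounts and changes nothing.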

If the behavior of your system is marginal or faulty, first do a physical inspection using the checklist below. If all of the connections seem solid, go to Chapter 6 and use the System Controllers to try to isolate the problem. If the problem persists, run the diagnostic tests from the System Maintenance menu or PROM Monitor. See the IRIX Admin: System Configuration and Operation manual for more information about diagnostic tests.

Check every item on this list:

  • The terminal, PDU, and module System Controller (MSC) power switches are turned on.

  • The module chassis power switches are all in the On position.

  • The fans are running and the fan inlets/outlets are not blocked.

  • The multimodule System Controller (MMSC) display does not show a fault message or warning.

Before you continue with power and I/O cable checks, shut down the system and turn off the power.

See “Powering Off the System” in Chapter 4 if you are unsure of the complete process for bringing the system down.

Check all of the following cable connections:

  • The terminal power cable is securely connected to the terminal at one end and the power source at the other end.

  • The power cables are securely connected to the main units at one end and plugged into the proper AC outlet at the other end.

  • The Ethernet cable is connected to the connector port labeled Ethernet.

  • Serial port cables are plugged in securely to their corresponding connectors.

  • All cable routing is safe from foot traffic.

If you find any problems with hardware connections, have them corrected before you restore power to the system. The MMSC may help to determine if internal system problems exist.

If these procedures do not help, contact your system administrator or service provider.

Module Power Supply Problems

The module power supplies in your graphics rack system are not considered end-user replaceable components. However, there are basic checks you can make to determine if a system problem is related directly to a module power supply.

If the module does not power on at all, check the following:

  • Confirm that the module's circuit breaker is up (in the On position).

  • Confirm that the power cable is firmly plugged in at both the system connector and the wall socket.

  • Remove the plastic front cover (facade) and confirm that the cable connecting the power supply to the fan tray is secure.

In some cases the module's power supply may be unable to supply enough voltage to meet system requirements. If the MSC indicates a power supply related problem, you can remove the front cover and check the status of the three LEDs on the front of the power supply. For help on properly removing the front cover, see “Removing a Module's Facade” in Chapter 5.

Amber (Yellow) LED

The amber LED on the power supply (also known as the AC_OK indicator) lights when the AC input voltage is applied and the system circuit breaker is in the On position.

If the amber LED is not lit, check the following:

  • AC outlets

  • system power cords and power switch

  • fan tray to power supply cable

If none of these items is a problem, check the other LEDs on the power supply for any indications.

Green LED

The green LED indicator (also known as the Power Good indicator) lights when power supply outputs are within specification.

If this LED starts to blink on and off, it is a warning that the supply is overloaded. In this case, contact your service provider for information and assistance.

Red LED

The red LED (also known as the Fault indicator) lights up whenever the power supply shuts off because of insufficient air flow, or when a system over-temperature shutdown occurs.

A blinking condition on this LED indicates that an undervoltage condition exists. It means that one of the supply outputs (+3.45, +5, or +12 volts) has dropped below acceptable limits. The supply can be reset by power-cycling the system. Note that this could be a symptom of other problems; contact your service provider for additional information.
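
The LED descriptions above amount to a small lookup table. The sketch below restates them in Python as a quick reference; it is only an illustration of the text in this section, and the names LED_MEANINGS and interpret_led are hypothetical. Only the color and state combinations described in this chapter are included.

```python
# Power supply LED states as described in this section.
LED_MEANINGS = {
    ("amber", "on"): "AC input voltage is applied and the circuit breaker is on.",
    ("amber", "off"): ("Check AC outlets, system power cords and power switch, "
                       "and the fan tray to power supply cable."),
    ("green", "on"): "Power supply outputs are within specification.",
    ("green", "blinking"): "Supply is overloaded; contact your service provider.",
    ("red", "on"): ("Supply has shut off because of insufficient air flow "
                    "or a system over-temperature shutdown."),
    ("red", "blinking"): ("Undervoltage on the +3.45, +5, or +12 volt output; "
                          "power-cycle to reset and contact your service provider."),
}


def interpret_led(color, state):
    """Return the documented meaning of an LED state, or a fallback message."""
    key = (color.strip().lower(), state.strip().lower())
    return LED_MEANINGS.get(key, "State not described in this chapter.")


if __name__ == "__main__":
    for color, state in [("amber", "off"), ("green", "blinking"), ("red", "on")]:
        print(f"{color} {state}: {interpret_led(color, state)}")
```

A table keyed on (color, state) keeps each documented condition on one line, so it is easy to compare against the LEDs while the front cover is removed.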

Crash Recovery

To minimize data loss from a system crash, back up your system daily and verify the backups. Often a graceful recovery from a crash depends upon good backups.

Your system may have crashed if it fails to boot or respond normally to input devices such as the keyboard. The most common form of system crash is terminal lockup—your system fails to accept any commands from the keyboard. Sometimes when a system crashes, data is damaged or lost.

Before going through a crash recovery process, check your terminal configuration and cable connections. If everything is in order, try accessing the system remotely from another workstation or from the system console terminal (if present).

If none of the solutions in the previous paragraphs is successful, you can fix most problems that occur when a system crashes by using the methods described in the following paragraphs. You can prevent additional problems by recovering your system properly after a crash.

The following list presents several ways to recover your system from a crash. The simplest method, rebooting the system, is presented first. If that fails, go on to the next method, and so on. Here is an overview of the different crash recovery methods:

Rebooting the System

Rebooting usually fixes problems associated with a simple system crash.

Restoring System Software

If you do not find a simple hardware connection problem and you cannot reboot the system, a system file might be damaged or missing. In this case, you need to copy system files from the installation tapes to your hard disk. Some site-specific information might be lost.

Restoring from Backup Tapes

If restoring system software fails to recover your system fully, you must restore from backup tapes. Complete and recent backup tapes contain copies of important files. Some user- and site-specific information might be lost. Read the following section for information on file restoration.

Restoring a Filesystem from the System Maintenance Menu

If your root filesystem is damaged and your system cannot boot, you can restore your system from the Recover System option on the System Maintenance Menu. This is the menu that appears when you interrupt the boot sequence before the operating system takes over the system. To perform this recovery, you need two things:

  • Access to a CD that contains the IRIX release on your system.

  • A full system backup tape, beginning in the root directory (/) and containing all the files and directories on your system, created using the Backup and Restore Manager.

If you do not have a full system backup made with the Backup command or Backup and Restore window—and your root or usr filesystems are so badly damaged that the operating system cannot boot—you have to reinstall your system software and then read your backup tapes (made with any backup tool you prefer) over the freshly installed software.

You may also be able to restore filesystems from the miniroot. For example, if your root filesystem has been corrupted, you may be able to boot the miniroot, unmount the root filesystem, and then use the miniroot versions of restore, xfs_restore, Restore, bru, cpio, or tar to restore your root filesystem.

To recover from system corruption using the Recover System option on the System Maintenance Menu, follow these steps:

  1. When you first start up your machine or press the Reset button on the system, this message appears:

    Starting up the system...
    

    Click the Stop for Maintenance button or press Esc to bring up the System Maintenance menu.

  2. Click the Recover System icon in the System Maintenance menu, or type 4.

    The System Recovery menu appears, or you see a graphical equivalent:

                             System Recovery...
     
                     Press <Esc> to return to the menu.
     
    1) Remote Tape  2) Remote Directory  3) Local CD-ROM  4) Local Tape  
     
    Enter 1-4 to select source type, <esc> to quit,
    or <enter> to start: 
    

  3. Enter the menu item number or click the appropriate drive icon for the IRIX release CD or software distribution directory you plan to use.


    Note: With the release of IRIX 6.x, the Remote Tape and Local Tape options on the System Recovery window are no longer usable because bootable (miniroot) software distribution tapes are no longer supported.


    • If you have a CD-ROM drive connected to your system, enter 3 or click the Local CD-ROM icon, then click Accept to start.

      You then see a notifier prompting you to insert the media into the drive. Insert the IRIX CD that came with your system, then click Continue.

    • You can use a drive that is connected to another system on the network. At the System Recovery menu, enter 2 or click the Remote Directory icon.

      When a notifier appears asking you for the remote hostname, type the system's name, a colon (:), and the full pathname of the CD-ROM drive, followed by /dist. For example, to access a CD-ROM drive on the system mars, you would type:

      mars:/CDROM/dist 
      

      Click Accept on the notifier window, then click Accept on the System Recovery window.

      On systems without graphics, you are prompted for the host as above, then you see this menu:

      1) Remote Tape 2)[Remote Directory] 3) Local CD-ROM 4) Local Tape  
            *a) Remote directory /CDROM/dist from server mars.
       
      Enter 1-4 to select source type, a to select the source, <esc> to quit,
      or <enter> to start: 
      

      Press Enter.

    • If you are using a remote software distribution directory, enter 2 or click the Remote Directory icon.

      When a notifier appears that asks you to enter the name of the remote host, type the system's name, a colon (:), and the full pathname of the software distribution directory. For example:

      mars:/dist/6.2 
      

      Click Accept on the notifier window, then click Accept on the System Recovery window.

      On systems without graphics, you are prompted for the host as above, then you see this menu:

      1) Remote Tape 2)[Remote Directory] 3) Local CD-ROM 4) Local Tape  
            *a) Remote directory /dist/6.2 from server mars.
       
      Enter 1-4 to select source type, a to select the source, <esc> to quit,
      or <enter> to start: 
      

      Press Enter.

  4. The system begins reading recovery and installation information from the CD or remote directory. It takes approximately five minutes to copy the information that it needs. After everything is copied to the system disk, you see messages including:

    ************************************************************
    *                                                          *
    *                    CRASH    RECOVERY                     *
    *                                                          *
    ************************************************************
     
    You may type  sh  to get a shell prompt at most questions
     
    Checking for tape devices
    

    The next message asks for the location of the tape drive that you will use to read your system backup tape. This is the tape you created before the system crash, using either the Backup and Restore tool on the System menu of the System Toolchest or the Backup(1) script.

  5. If you have a local tape device, you see this message:

    Restore will be from tapename.  OK? ([y]es, [n]o): [y]
    

    tapename is the name of the local tape device. Answer y if this is the correct tape drive and n if it is not.

  6. If you have a remote (network) tape device, no tape device was found, or you answered “no” to the question in the previous step, you see this message:

    Remote or local restore ([r]emote, [l]ocal): [l]
    

    • If you answer “remote,” you have chosen to restore from the network, and you are then asked to enter the following information: the hostname of the remote system, the name of the tape device on the remote system, the IP address of the remote system, and the IP address of your system. Each IP address must consist of two to four numbers, separated by periods, such as 192.0.2.1.

    • If you answer “local,” you have chosen a tape device that is connected to your system, and you are then asked to enter the name of the tape device.

  7. When you see the following message, insert your most recent full backup tape, then press Enter.

    Insert the first Backup tape in the drive, then 
    press (<enter>, [q]uit (from recovery), [r]estart):
    

  8. There is a pause while the program identifies the filesystems on the tape and attempts to mount those filesystems under /root. Then you see this message:

    Erase all old filesystems and make new ones (y, n, sh): [n]
    

    You have three choices:

    • Answer n for no. After additional prompts confirming the filesystems to be read, the files on the tape are extracted. The version of each file on the tape replaces the version, if any, on the disk even if the version on the disk is newer.

    • Answer y for yes. After additional confirming prompts and prompts about filesystem types, the system erases all of the filesystems and copies everything from your backup tape to the disk.

    • Answer sh to escape to a shell. You are now in the miniroot environment and can investigate the damage to the system or attempt to save files that have been created or modified since the backup tape was created. After exiting the shell, you have the opportunity to remake filesystems and/or read the backup tape.

  9. After reading the full backup tape, this prompt gives you the opportunity to read incremental backup tapes:

    Do you have incremental backup tapes to restore ([y]es, [n]o (none)): [n] 
    

    If you have an additional tape, insert it and answer y; otherwise, answer n.

  10. This prompt gives you the opportunity to reboot your system if recovery is complete, begin the crash recovery process again at the beginning, or re-read your first backup tape:

    Reboot, start over, or first tape again? ([r]eboot, [s]tart, [f]irst) [r] 
    

    If you are ready to reboot, answer r; otherwise, choose start or first.
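
Two of the entries typed during the procedure above follow fixed patterns: the remote source string in step 3 (system name, a colon, and a full pathname, with /dist appended for a CD-ROM drive) and the dotted IP addresses in step 6. The Python sketch below shows both patterns. It is an illustration only; the function names are hypothetical, and the address check tests nothing beyond the shape stated in step 6 (two to four numbers separated by periods), not whether the address is reachable or in range.

```python
def remote_source(host, path, is_cdrom=False):
    """Build the host:path string entered at the remote hostname prompt."""
    if not host:
        raise ValueError("host name is required")
    if not path.startswith("/"):
        raise ValueError("path must be a full pathname")
    base = path.rstrip("/")
    if is_cdrom:
        # For a remote CD-ROM drive, the distribution is under /dist.
        return f"{host}:{base}/dist"
    return f"{host}:{base or '/'}"


def looks_like_dotted_address(addr):
    """Check only the shape: two to four numbers separated by periods."""
    parts = addr.split(".")
    return 2 <= len(parts) <= 4 and all(
        p.isascii() and p.isdigit() for p in parts
    )


if __name__ == "__main__":
    print(remote_source("mars", "/CDROM", is_cdrom=True))  # mars:/CDROM/dist
    print(remote_source("mars", "/dist/6.2"))              # mars:/dist/6.2
    print(looks_like_dotted_address("192.0.2.1"))          # True
```

The examples reuse the host name mars and the paths shown in step 3; substitute the names used at your site.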

From time to time you may experience a system crash due to file corruption. Systems cease operating (“crash”) for a variety of reasons. Most common are software crashes, followed by power failures of some sort, and least common are actual hardware failures. Regardless of the type of system crash, if your system files are lost or corrupted, you may need to recover your system from backups to its pre-crash configuration.

Once you repair or replace any damaged hardware, you are ready to recover the system. Regardless of the nature of your crash, refer to the section “Restoring a Filesystem from the System Maintenance Menu” in the IRIX Admin: Backup, Security, and Accounting manual.

The System Maintenance Menu recovery command is designed for full system recovery from a full backup. After you have done a full restore from your last complete backup, you may restore newer files from incremental backups at your convenience. This command is designed to be used with archives made using the Backup(1) utility or through the System Manager, which is described in detail in the Personal System Administration Guide. System recovery from the System Maintenance Menu is not intended for use with the tar(1), cpio(1), dd(1), or dump(1) utilities. You can use these other utilities after you have recovered your system.

As noted earlier in this chapter, you may also be able to restore a corrupted root filesystem from the miniroot, using the miniroot version of restore, xfs_restore, bru, cpio, or tar. Refer to the man pages for these commands for details on their use.

Refer to the IRIX Admin: System Configuration and Operation manual for instructions on good general system administration practices.