SCD FY97 Annual Scientific Report

Machine Dependencies Committee

The Machine Dependencies Committee was formed in 1996 to determine the dependencies that exist in the SCD machine room. Its goals are to:
  1. Recover from a complete machine room shutdown or crash
  2. Certify and document the correct sequence of boot procedures for all machines and networks
  3. Determine the basic production environment for cases when complete recovery cannot be obtained
  4. Define the full production environment

The first step was to define the ongoing boot time dependencies of every system in the room. This effort incorporated a review of network dependencies, the definition of various inter-system dependencies, and analysis of how these dependencies change as systems are installed or removed.

Work completed in FY97

The committee generated a "supercomputing-centric" boot time dependencies list and diagram as an initial prototype for the development of a full, hierarchical computing environment boot time dependencies diagram. This initial draft attempts to identify predecessor/successor relationships by accounting for the prerequisites (or dependencies) of these aspects of the system:
  1. Facility and environmental infrastructure (power, chilled water, air conditioning, etc.)
  2. Networks and network infrastructure
  3. Phoenix systems (currently providing domain name service, license service, and user authentication service)
  4. Mass storage equipment
  5. Supercomputers
  6. DCE server (MSS metadata command service)
  7. File servers
  8. Special-purpose servers (job submission, dial-up, home directories, MIGS, etc.)
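The predecessor/successor relationships among these layers can be treated as a directed graph, and a valid boot sequence is then any topological ordering of that graph. The following is a minimal sketch in Python, not the committee's actual diagram: the layer names are shorthand for the eight items above, and every edge in `PREREQUISITES` is a hypothetical assumption except that the phoenix systems come up after the facility and network infrastructure.

```python
from graphlib import TopologicalSorter

# Hypothetical prerequisite map: each key lists the layers that must be up first.
# Only the phoenix prerequisites (facility, networks) are stated in the report;
# all other edges are illustrative guesses.
PREREQUISITES = {
    "facility": set(),
    "networks": {"facility"},
    "phoenix": {"facility", "networks"},
    "mass_storage": {"phoenix"},
    "supercomputers": {"phoenix", "mass_storage"},
    "dce_server": {"phoenix", "mass_storage"},
    "file_servers": {"phoenix"},
    "special_purpose_servers": {"phoenix", "file_servers"},
}


def boot_order(prereqs):
    """Return one valid boot sequence in which every layer follows its prerequisites.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(prereqs).static_order())


if __name__ == "__main__":
    for step, layer in enumerate(boot_order(PREREQUISITES), start=1):
        print(step, layer)
```

Several orderings can satisfy the same graph, so the sketch yields one valid sequence rather than a unique one. A circular dependency, which would make recovery from a cold start impossible, surfaces as a `CycleError` instead of a silently wrong order.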

The committee determined that dependencies, particularly for supercomputer systems, could be subcategorized as:

  1. Those required to power up the system and peripherals
  2. Those required to boot the system
  3. Those required to bring the system to a level where systems personnel could use the system (i.e., restricted access)
  4. Those required for full, user-level production
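These four subcategories behave like cumulative readiness levels: a system cannot reach a later level until the requirements of every earlier level are met. The sketch below illustrates that idea under stated assumptions. The level names mirror the list above, but the contents of `REQUIREMENTS` are hypothetical examples and do not come from the committee's list.

```python
from enum import IntEnum


class Level(IntEnum):
    OFF = 0
    POWERED = 1      # system and peripherals powered up
    BOOTED = 2       # operating system booted
    RESTRICTED = 3   # usable by systems personnel only
    PRODUCTION = 4   # full, user-level production


# Hypothetical requirements for one supercomputer; the names are illustrative only.
REQUIREMENTS = {
    Level.POWERED: {"power", "chilled_water"},
    Level.BOOTED: {"network"},
    Level.RESTRICTED: {"dns", "authentication"},
    Level.PRODUCTION: {"mass_storage", "file_servers"},
}


def highest_level(available):
    """Return the highest level whose requirements, and all earlier ones, are met."""
    reached = Level.OFF
    for level in sorted(REQUIREMENTS):
        if not REQUIREMENTS[level] <= available:
            break  # a missing prerequisite blocks this level and every later one
        reached = level
    return reached


if __name__ == "__main__":
    print(highest_level({"power", "chilled_water", "network"}).name)
```

Modeling the levels as ordered integers makes the "basic production environment" case easy to express: recovery can stop at `RESTRICTED` when the requirements for `PRODUCTION` cannot be satisfied.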

Phoenix system

The committee developed the phoenix system concept as a basic tool to be used in the recovery of the machine room. The phoenix machines are the starting point for recovery once power, cooling, and the network infrastructure have been certified to be functional. The services requiring continuous availability are those provided by the computers upon which all other systems in the NCAR environment depend. The phoenix system is the set of computers, defined by the Machine Dependencies Committee, that needs to be organized in a high-availability configuration.

The concept of hot spare backup systems, which was already in place for user authentication services, was extended to include one other critical service: the Domain Name System (DNS), which is needed for machine address resolution. These two critical services were moved onto two new systems (one active and one hot spare) that were purchased specifically for this purpose. The two independent machines in the phoenix system use the most up-to-date hardware and software. Each machine has mirrored disk drives and redundant network connections. This configuration prevents the failure of any single disk drive, network interface, or network segment from bringing down the phoenix system.

Procedures have been established to bring the hot spare into production quickly if the primary system fails. The phoenix systems are designed to be fully independent of any other computing systems, as shown in the "Boot time dependency diagram."

The Distributed Services Group plans to move other critical services (e.g., software license servers) onto the phoenix systems. The group is also making other critical services (e.g., NFS service and workstation boot service) more reliable by using multiple hot spare systems that can be brought into service quickly if the primary system fails.

Computer Production Group (CPG) contributions

CPG constructed the "Boot time dependency diagram" and updated the machine boot documentation supplied by the Machine Dependencies Committee. CPG used this information during a scheduled facility power-down on October 12, 1996. The uninterruptible power supply (UPS) allowed many of the machines to remain in production, but because air conditioning was not available, parts of the machine room began to heat up, and CPG staff had to power down machines in various parts of the room. The diagram served as a decision-making tool that helped the staff shut down machines throughout the room without affecting critical services.
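The kind of decision the diagram supports can be sketched as a graph traversal: starting from the critical services, collect everything they depend on, and treat the remaining machines as candidates for shutdown. The example below is illustrative only. The function names, the dependency map, and the choice of critical services in the test data are all assumptions, not a record of what CPG staff did.

```python
def required_by_critical(depends_on, critical):
    """Return the critical services plus everything they transitively depend on."""
    needed = set()
    stack = list(critical)
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed.add(node)
        stack.extend(depends_on.get(node, ()))
    return needed


def safe_to_shut_down(depends_on, critical):
    """Return systems that no critical service relies on, directly or indirectly."""
    all_nodes = set(depends_on)
    for deps in depends_on.values():
        all_nodes.update(deps)
    return all_nodes - required_by_critical(depends_on, critical)


if __name__ == "__main__":
    # Hypothetical dependency map: each system lists what it needs to run.
    depends_on = {
        "dns": {"network"},
        "authentication": {"network"},
        "mass_storage": {"network"},
        "supercomputer": {"dns", "mass_storage"},
    }
    print(sorted(safe_to_shut_down(depends_on, {"dns", "authentication"})))
```

The sketch answers only which systems can go down without affecting critical services. The order in which those systems are shut down would still need to respect their dependencies on one another.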

