Machine Dependencies Committee
The Machine Dependencies Committee was formed in 1996 to determine
the dependencies that exist in the SCD machine room in order to:
- Recover from a complete machine room shutdown or crash
- Certify and document the correct sequence of boot procedures
for all machines and networks
- Determine the basic production environment for cases when
complete recovery cannot be obtained
- Define the full production environment
The first step was to define the ongoing boot time dependencies
of every system in the room. This effort incorporated a review
of network dependencies, the definition of various inter-system
dependencies, and analysis of how these dependencies change as
systems are installed or removed.
Work completed in FY97
The committee generated a "supercomputing-centric" boot time
dependencies list and diagram as an initial prototype for the
development of a full, hierarchical computing environment boot
time dependencies diagram. This initial draft
attempts to identify predecessor/successor relationships by
accounting for the prerequisites (or dependencies) of these
aspects of the system:
- Facility and environmental infrastructure (power, chilled
water, air conditioning, etc.)
- Networks and network infrastructure
- Phoenix systems (currently providing domain name service,
license service, and user authentication service)
- Mass storage equipment
- Supercomputers
- DCE server (MSS metadata command service)
- File servers
- Special-purpose servers (job submission, dial-up, home
directories, MIGS, etc.)
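The predecessor/successor relationships described above can be treated as a directed graph and ordered with a topological sort. The following is a minimal sketch in Python, not the committee's actual diagram: the category names follow the list above, but every edge in DEPENDS_ON is an illustrative assumption, as is the boot_stages helper.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical edges for illustration only: each key depends on every
# item in its set. The real diagram's edges are not given in this report.
DEPENDS_ON = {
    "facility": set(),
    "network": {"facility"},
    "phoenix": {"facility", "network"},
    "mass_storage": {"phoenix"},
    "dce_server": {"phoenix"},
    "file_servers": {"phoenix"},
    "supercomputers": {"phoenix", "mass_storage", "dce_server"},
    "special_purpose_servers": {"phoenix", "file_servers"},
}


def boot_stages(depends_on):
    """Group systems into stages; a stage starts once all earlier ones are up."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose prerequisites are done
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    for number, stage in enumerate(boot_stages(DEPENDS_ON), start=1):
        print(number, ", ".join(stage))
```

Grouping by stage rather than producing a single flat order shows which systems could be brought up in parallel, and the cycle check flags any circular dependency introduced when systems are installed or removed.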
The committee determined that dependencies, particularly
for supercomputer systems, could be subcategorized as:
- Those required to power up the system and peripherals
- Those required to boot the system
- Those required to bring the system to a level where systems
personnel could use the system (i.e. restricted access)
- Those required for full, user-level production
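The four subcategories above form an ordered ladder: a system cannot reach a level until the prerequisites of that level and of every lower level are met. A minimal sketch of that idea follows. The Level names mirror the list, while the contents of REQUIREMENTS and the reachable_level helper are hypothetical and chosen only for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    OFF = 0
    POWERED = 1      # system and peripherals powered up
    BOOTED = 2       # operating system booted
    RESTRICTED = 3   # usable by systems personnel only
    PRODUCTION = 4   # full, user-level production


# Hypothetical prerequisites per level (assumed, not from the report).
REQUIREMENTS = {
    Level.POWERED: {"power", "chilled_water"},
    Level.BOOTED: {"network"},
    Level.RESTRICTED: {"dns", "authentication"},
    Level.PRODUCTION: {"mass_storage", "license_service"},
}


def reachable_level(available):
    """Return the highest level whose prerequisites, and those of all
    lower levels, are contained in the set of available services."""
    level = Level.OFF
    for candidate in sorted(REQUIREMENTS):
        if not REQUIREMENTS[candidate] <= available:
            break  # a missing prerequisite blocks this and all higher levels
        level = candidate
    return level


if __name__ == "__main__":
    up = {"power", "chilled_water", "network"}
    print(reachable_level(up).name)
```

A check of this kind is one way to describe a partial recovery: it reports how far a system can be brought up with whatever services are currently available.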
Phoenix system
The committee developed the phoenix system concept as a
basic tool to be used in the recovery of the machine room.
The phoenix machines are the starting point for recovery
once power, cooling, and the network infrastructure have
been certified to be functional. The most critical systems,
those requiring continuous availability, are defined as the
computers upon which all other systems in the NCAR environment
depend. The phoenix system is the set of computers, defined by
the Machine Dependencies Committee, that needs to be organized
in a high availability configuration.
The concept of hot spare backup systems, which was already
in place for user authentication services, was extended to
include one other critical service: the Domain Name System
(DNS), which is needed for machine address resolution. These
two critical services were moved onto two new systems (one
active and one hot spare) that were specifically purchased
for this purpose. The two independent machines in the phoenix
system run the most up-to-date hardware and software. Each
machine has mirrored disk drives and redundant network
connections. Their configuration prevents any single disk
drive, network interface, or network segment failure from
bringing down the phoenix system.
Procedures have been established that will bring the hot
spare into production quickly in the event of a failure of
the primary system. The phoenix systems are designed to be
100% independent of any other computing systems, as shown
in the "Boot time dependency diagram."
The Distributed Services Group plans to move other critical
services (e.g. software license servers) onto the phoenix
systems. They are also in the process of making other critical
services (e.g. NFS service and workstation boot service) more
reliable using multiple hot spare systems that can be brought
quickly into service if the primary system fails.
Computer Production Group (CPG) contributions
CPG constructed the "Boot time dependency diagram" and
updated the machine boot documentation supplied by the
Machine Dependencies Committee. CPG used this information during
a scheduled facility power down on October 12, 1996. The
Uninterruptible Power Supply allowed many of the machines
to remain in production, but because the air conditioning
was not available, parts of the machine room began to heat
up. This required CPG staff to power down machines in various
parts of the room. The diagram served as a decision-making
tool that helped the staff shut down machines throughout the
room without affecting critical services.
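The kind of decision described above can be sketched as a graph traversal: everything that a critical service depends on, directly or indirectly, must stay up, and anything else is a candidate for shutdown. The example below is a hedged illustration, not CPG's procedure. The DEPENDS_ON edges, the choice of "phoenix" as the critical service, and both helper functions are assumptions.

```python
# Hypothetical edges for illustration only: each key depends on every
# item in its set. These are not taken from the actual diagram.
DEPENDS_ON = {
    "facility": set(),
    "network": {"facility"},
    "phoenix": {"facility", "network"},
    "mass_storage": {"phoenix"},
    "dce_server": {"phoenix"},
    "file_servers": {"phoenix"},
    "supercomputers": {"phoenix", "mass_storage", "dce_server"},
    "special_purpose_servers": {"phoenix", "file_servers"},
}


def required_by(critical, depends_on):
    """Return the critical systems plus everything they transitively need."""
    needed = set()
    stack = list(critical)
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed.add(node)
        stack.extend(depends_on.get(node, ()))
    return needed


def safe_to_power_down(critical, depends_on):
    """List systems that no critical service depends on."""
    return sorted(set(depends_on) - required_by(critical, depends_on))


if __name__ == "__main__":
    print(safe_to_power_down({"phoenix"}, DEPENDS_ON))
```

With these assumed edges, keeping the phoenix systems alive requires only the facility and network layers, so every other category is reported as a shutdown candidate.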