SCD FY97 Annual Scientific Report

High availability of distributed systems

Users of NCAR's computers now demand greater system reliability than they did in the past. It is no longer acceptable for critical services such as electronic mail or Domain Name Service (DNS) to be unavailable for long periods of time. It is vital to protect such services with more robust system configurations that can sustain hardware and software failures without disrupting service to users. A high availability configuration ensures continued access to a critical service despite failures in either hardware or software.

SCD established a Machine Dependencies Committee, which reviewed the machine dependencies at the Mesa Lab and identified possible single points of failure. The two most important services requiring continuous availability are DNS and user authentication, since all other systems in the NCAR environment depend on them. To ensure continuous availability of DNS and authentication services, the committee recommended a high availability configuration, code-named the "Phoenix Project," to host these services. The Phoenix Project was placed into operation during FY97. The licensing service, previously provided by an obsolete Sun server, was moved to the Phoenix Project during the summer. The licensing service provides licenses for commercial software applications, such as FrameMaker and the Sun compilers.

Other single points of failure were identified by the SCD Executive Committee and the Machine Dependencies Committee. After DNS, the most critical service is Network File Service (NFS), which affects a majority of systems in the Mesa Lab machine room. The file service was migrated from the old Auspex 5500 to an SGI Challenge XL. The Challenge XL uses RAID storage devices exclusively, to ensure data reliability and continued user access to large volumes of data. To provide high availability for the file service, a second Challenge XL server was purchased with funding from the Director's Reserve fund, CGD, and SCD. The backup file server's primary function will be to provide a large storage space for data processing. Its secondary function will be to act as a hot spare for the primary file server.
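The hot-spare arrangement described above can be illustrated with a minimal sketch. It assumes some health check already exists and only shows the selection rule: use the primary while it is healthy, otherwise fall back to the spare. The server names and the `healthy` flag are hypothetical and are not taken from the actual NCAR configuration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str      # hypothetical label, not an actual NCAR host name
    healthy: bool  # outcome of a health check that is assumed to exist


def choose_active(primary: Server, spare: Server) -> Server:
    """Return the server that should handle file service requests."""
    if primary.healthy:
        return primary
    if spare.healthy:
        # The spare takes over only when the primary has failed.
        return spare
    raise RuntimeError("no healthy file server available")


primary = Server("primary-fs", healthy=True)
spare = Server("backup-fs", healthy=True)
print(choose_active(primary, spare).name)  # primary is healthy, so it is used

primary.healthy = False                    # simulate a primary failure
print(choose_active(primary, spare).name)  # the hot spare is selected
```

In this sketch the spare remains free for other work, such as data processing, until the primary fails.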

The Foothills Lab also relies on DNS service, and its users depend heavily on the server being in continuous operation. The Foothills Lab high availability system should be online by the end of calendar year 1997. Other measures have been taken to increase the availability of services within SCD, such as adding RAID storage to numerous servers, using dual power supplies on servers, and mirroring system disks.

The Distributed Systems Group intends to expand the use of the high availability configuration as funding becomes available. One planned step is to have MIGS and IRJE back each other up in case of a system failure, while each continues to perform its primary services. Other candidates for a high availability configuration are the internal and external WWW servers. The new gateway security server, which will monitor network activity for security violations, will also be set up in a high availability configuration.
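The mutual backup arrangement planned for MIGS and IRJE differs from a dedicated hot spare in that both hosts do useful work and each can absorb the other's services. The following is a rough sketch of that placement rule only. The service labels and the `assign_services` helper are hypothetical, and nothing here reflects the actual failover software.

```python
from typing import Dict, Optional, Set


def assign_services(
    home: Dict[str, str],     # service -> host that normally runs it
    partner: Dict[str, str],  # host -> host that backs it up
    up: Set[str],             # hosts currently running
) -> Dict[str, Optional[str]]:
    """Place each service on its home host, or on the partner if the home host is down."""
    placement: Dict[str, Optional[str]] = {}
    for service, host in home.items():
        if host in up:
            placement[service] = host
        elif partner[host] in up:
            placement[service] = partner[host]
        else:
            placement[service] = None  # both hosts are down; service unavailable
    return placement


# Hypothetical service labels; the host names follow the text above.
home = {"migs-service": "MIGS", "irje-service": "IRJE"}
partner = {"MIGS": "IRJE", "IRJE": "MIGS"}

print(assign_services(home, partner, up={"MIGS", "IRJE"}))  # normal operation
print(assign_services(home, partner, up={"IRJE"}))          # MIGS has failed
```

When one host fails, the surviving host carries both its own service and its partner's, so each host would need enough spare capacity for the combined load.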

The Office Systems Group's Wintel server configuration will also use a high availability configuration to ensure uninterrupted Wintel client service. The servers run Microsoft NT server software, which supports high availability in a production environment. Unlike UNIX high availability software, the Microsoft failover software has been widely deployed throughout the computer industry and is extremely reliable.

