Maintaining the existing production supercomputer environment
Even though significant changes have taken place with the integration of
Distributed Shared Memory (DSM) computer systems into both the Climate
Simulation Laboratory (CSL) and Community computational environments during
FY1998, SCD continued to maintain and enhance its existing production
parallel-vector supercomputer (a.k.a. PVP) environment. In FY1999 SCD will
continue to support the established PVP environment, but will focus more
resources on maintaining and enhancing the DSM systems and user environments
that were established as part of the NCAR computational environment in FY1998.
During FY1998 the major changes in the SCD production supercomputer
environment were:
- The installation of a new Cray J90se/24-1024 (chipeta) in the Community
- The installation of a new Silicon Graphics Cray Origin2000/128 (ute)
in the CSL
- The transition of the HP SPP-2000/64 from an HPCC resource to the
Community
During FY1998, SCD integrated new DSM systems into the existing
parallel-vector supercomputer production environment and worked to make the
computational environment as seamless as possible for CSL and Community
users. Although SCD introduced two DSM computer systems, the Silicon
Graphics Cray Origin2000 (ute) into the CSL and the HP SPP-2000 (sioux) into
the Community environment, it remains committed to supporting production
parallel-vector supercomputer systems at least through FY1999, and likely
well beyond. Historically, these parallel-vector systems have served the
computational needs of NCAR and the atmospheric/oceanic sciences, and SCD
will continue to support them to minimize the impact of introducing new
computing architectures and systems.
At the end of FY1998, the "production computational environment" managed by
SCD for NCAR includes:
- Five Cray supercomputers, known generically as "Parallel Vector
Processor" or PVP systems: a C90/16 (antero), a J90/20 (aztec), a J90/16
(paiute), and a pair of J90se/24 systems (ouray and chipeta)
- Two Distributed Shared Memory supercomputers: the Silicon Graphics Cray
Origin2000 (ute) and the HP SPP-2000 (sioux)
- The NCAR Mass Storage System
- The HiPPI data communications fabric and networking facilities
- The DataPark, a Silicon Graphics PowerChallenge XL (winterpark)
- A test system, a Silicon Graphics Cray Origin2000 (mouache)
- Fileservers
- The Visualization Lab
This environment's reliability is enhanced by the systems support,
operational monitoring, and user services activities provided by SCD staff.
A central part of maintaining the existing production supercomputer
environments is providing 7x24 operation and service. The results of this
attention are reflected in the following tables, which show average system
performance and utilization for FY1998:
Average Community supercomputer system performance and utilization
statistics for FY1998
| System  | GFLOPS | Utilz'n | User  | Idle  | System | WaitIO | IOfs | IOswp |
| chipeta | 1.605  | 92.7%   | 95.2% | 1.6%  | 3.2%   | 0.2%   | -    | -     |
| ouray   | 1.498  | 92.4%   | 93.9% | 2.3%  | 3.8%   | 0.2%   | -    | -     |
| paiute  | 0.871  | 87.0%   | 87.9% | 8.4%  | 3.7%   | 3.7%   | -    | -     |
| sioux   | ~2.0   | 51.5%   | 52.5% | 47.5% | -      | -      | -    | -     |
Where "GFLOPS" is the average number of floating point operations per second
(in billions) during the measuring period; "Utilz'n" is the average user
utilization of the system (system downtime counts against utilization);
"User" is the percent of uptime occupied in performing computation for user
processes; "Idle" is the percent of uptime spent idle; "System" is the
percent of uptime consumed in system overhead; "WaitIO" is the percent of
uptime spent awaiting I/O completion; "IOfs" is the percent of the WaitIO
time spent in performing user filesystem I/O; and "IOswp" is the percent of
the WaitIO time spent in performing process swapping/paging.
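The column definitions above can be made concrete with a little arithmetic. The sketch below is illustrative only and rests on an assumption that the report does not state outright: that "Utilz'n" equals the "User" share of uptime multiplied by the fraction of the period the system was up. Under that reading, dividing Utilz'n by User gives an implied availability figure. The function names are hypothetical; the input figures are copied from the tables in this report.

```python
def implied_availability(utilization_pct, user_pct):
    """Fraction of the period the system was up, inferred as Utilz'n / User.

    Assumption (not stated in the report): Utilz'n = User x availability.
    """
    if user_pct <= 0:
        raise ValueError("user_pct must be positive")
    return utilization_pct / user_pct


def waitio_share_of_uptime(waitio_pct, share_pct):
    """Convert an IOfs or IOswp figure (percent of WaitIO) to percent of uptime."""
    return waitio_pct * share_pct / 100.0


# (Utilz'n, User) pairs copied from the Community table above.
community = {
    "chipeta": (92.7, 95.2),
    "ouray": (92.4, 93.9),
    "paiute": (87.0, 87.9),
    "sioux": (51.5, 52.5),
}

availability = {
    name: implied_availability(util, user)
    for name, (util, user) in community.items()
}

# ute row of the CSL table: WaitIO is 1.3% of uptime, IOfs is 89.4% of WaitIO.
ute_fs_io_pct_of_uptime = waitio_share_of_uptime(1.3, 89.4)

if __name__ == "__main__":
    for name, frac in availability.items():
        print(f"{name}: implied availability {frac:.1%}")
    print(f"ute filesystem I/O wait: {ute_fs_io_pct_of_uptime:.2f}% of uptime")
```

Note that IOfs and IOswp are percentages of the WaitIO time, not of uptime, so they must be scaled by WaitIO before being compared with the User, Idle, and System columns.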
Average CSL supercomputer system performance and utilization
statistics for FY1998
| System | GFLOPS | Utilz'n | User  | Idle  | System | WaitIO | IOfs  | IOswp |
| antero | 4.730  | 90.7%   | 92.6% | 3.5%  | 3.8%   | 0.5%   | -     | -     |
| T3D    | ~1.1   | 73.1%   | 75.9% | 24.1% | -      | -      | -     | -     |
| aztec  | 1.209  | 93.7%   | 95.3% | 2.3%  | 2.4%   | 0.1%   | -     | -     |
| ute    | ~4.5   | 66.8%   | 68.1% | 28.2% | 2.2%   | 1.3%   | 89.4% | 0.1%  |
Average DataPark and test system performance and utilization
statistics for FY1998
| System     | GFLOPS | Utilz'n | User  | Idle  | System | WaitIO | IOfs  | IOswp |
| mouache    | ~0.2   | 1.8%    | 1.8%  | 96.3% | 1.3%   | 0.5%   | 79.7% | 19.4% |
| winterpark | ~0.4   | 18.7%   | 18.8% | 57.9% | 6.3%   | 16.5%  | 86.6% | 12.9% |
Key maintenance activities
During FY1998, SCD provided ongoing maintenance activities to ensure the
integrity and reliability of existing computational systems. Some of the key
areas were:
- Maintain supercomputer operating systems:
The Cray systems were
upgraded to the latest release of UNICOS (10.0.0) during the spring of
1998. SCD intends to stay apprised of major software releases from
Silicon Graphics/Cray Research and to schedule upgrades to production
system and product set software carefully, based on the judged stability
of those upgrades in the NCAR production environment.
- Maintain stability and reliability of systems:
One of the most
significant attributes of the NCAR computational environment is its
overall stability and reliability. For instance, the NCAR Mass Storage
System has a reputation for reliability, and SCD has in the last year
deployed a number of high-availability fileserver systems. This
reliability and stability do not come easily; they stem from a
combination of choosing reliable, stable vendor products and using
proven, fail-safe system administration and maintenance techniques. SCD
will continue to focus on ensuring, in whatever ways possible, highly
stable and reliable systems and systems operations.
- Maintain batch subsystems and job scheduling:
SCD continued to
maintain the existing NQE batch subsystem as the common batch interface
within the NCAR computational environment and enhanced its Batch Priority
Scheduler (BPS). In addition, SCD implemented
a new queue structure for the DSM systems in FY1998. More information
appears in the
Unified Queuing System report.
- Year 2000 compliance and testing:
During FY1998, SCD undertook
a significant effort to ensure that all mission-critical resources
maintained by SCD are Year-2000 compliant. The most significant
objective is to ensure that all production systems are unaffected by
the transition into the next century. SCD has been working with its
major systems vendors to upgrade production systems' operating systems
and product set software to Year-2000-compliant versions. In addition,
SCD plans to perform testing of Year-2000 compliance of those systems
and SCD-developed software and subsystems. More information appears in the
Year 2000 planning and testing report.
- System monitoring:
Over the years, SCD has developed a large number of
system monitoring procedures, techniques, and tools. During FY1998, SCD
drew on this collective experience to maintain the stability of the
existing production systems through proactive monitoring. SCD also
continued to enhance its monitoring tools, techniques, and procedures,
and automated a number of procedures for detecting system failure or
trouble. At the end of FY1998, this automation was being integrated
with commercial alphanumeric paging technology to alert SCD operations
and systems staff more rapidly and thus reduce the amount of time that
systems are unavailable to the NCAR user community when they do fail.
- High-speed communications:
SCD has deployed a high-speed HiPPI data
communications "fabric" within the NCAR computer center; in addition,
more traditional networking capabilities, including ATM, FDDI, and
Ethernet technologies, have become the mainstay for connectivity
between computational systems at NCAR and its divisions and remote
departments. All of these systems are routinely monitored and
maintained. SCD will continue to provide these stable, high-speed
communications interfaces and enhance them with new technologies as
those technologies prove their reliability and stability within our
environment. Furthermore, as SCD enhances its computational
capabilities, it will maintain and enhance its high-bandwidth
connections between the NCAR MSS and the supercomputers to ensure that the
systems are balanced and capable of satisfying the loads imposed by the
growing need for both computational and storage capacities. In
addition, during FY1998 SCD began evaluating new high-speed data and
communications technologies, such as Fibre Channel and Gigabit
Ethernet, as replacements for older, less-supported technologies. SCD
hopes that this evaluation can lead to the adoption of these new
technologies into our production environment -- but only after their
reliability is proven.
- DataPark system:
SCD maintained and enhanced the NCAR DataPark system.
We believe this system will provide a new paradigm for computational
data analysis and lower-latency access to large-capacity mass
storage. More information on the DataPark and SCD's
plans appears in the
NCAR DataPark report.
- Visualization Lab:
The SCD Visualization Lab capabilities are leading
NCAR and atmospheric sciences into a new, revolutionary period in
scientific computing. SCD maintained and enhanced, through new
high-performance hardware and innovative tools, its Visualization Lab
capabilities during FY1998. More information appears in the
Visualization Laboratory report.
- NCAR DataVision:
SCD maintained and enhanced NCAR Graphics during
FY1998, and plans to continue collaborative development of its
successor product, NCAR DataVision. More information on SCD's plans
appears in the DataVision report.
- Consultation and user services:
SCD maintained and improved
responsiveness to user needs through FY1998, including augmentation of
services provided by the SCD Consulting Office and Computer Production
staff. More information appears in the
Technical support services and the
Operational procedures and infrastructure
reports.
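As an illustration of the automated failure detection described under "System monitoring" above, the following is a minimal sketch, not SCD's actual tooling. It assumes each system periodically reports a heartbeat timestamp and that a host silent for longer than a threshold should trigger an alert. The function names, the 300-second threshold, and the notify callback (standing in for a paging gateway) are all hypothetical.

```python
import time


def find_stale_hosts(last_heartbeat, now, max_silence_s):
    """Return, in sorted order, hosts whose last heartbeat is older than max_silence_s."""
    return sorted(
        host
        for host, seen in last_heartbeat.items()
        if now - seen > max_silence_s
    )


def check_and_alert(last_heartbeat, notify, now=None, max_silence_s=300):
    """Call notify(host, seconds_silent) for each stale host and return the stale list.

    notify is a hypothetical callback; a real deployment would hand the
    message to whatever paging service is in use.
    """
    now = time.time() if now is None else now
    stale = find_stale_hosts(last_heartbeat, now, max_silence_s)
    for host in stale:
        notify(host, now - last_heartbeat[host])
    return stale


if __name__ == "__main__":
    # Timestamps are made-up values in seconds, used only for demonstration.
    beats = {"antero": 1000.0, "ouray": 1290.0, "chipeta": 600.0}
    pages = []
    check_and_alert(beats, lambda host, secs: pages.append((host, secs)), now=1300.0)
    print(pages)
```

Passing the notifier in as a callback keeps the detection logic independent of the alert channel, so the same check could drive a pager, a log entry, or an operator console message.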