Introduction to SCD's FY1998 Annual Scientific Report
The NCAR Scientific Computing Division (SCD) provides computing resources
and expertise to university scientists who are grantees of the Geoscience
Directorate of NSF, to NCAR scientists, and to special projects -- national
and international -- that are relevant to atmospheric and related science.
SCD works closely with its Advisory Panel, the NCAR Director's Committee,
and NSF to establish priorities and plans. Our
schematic of the NCAR computing facility illustrates the various computing
resources that are currently available. The figure below shows the
configuration at the end of FY1998.
Community resources
SCD provides high performance computing facilities to support the NCAR
scientific program and the supercomputing needs of the Community. The
facilities include a substantial infrastructure of compute servers,
networking, visualization, data communications, and user services to
support the development and execution of large, long-running models and
the manipulation, analysis, and archiving of extremely large datasets.
In FY1998, the equipment delineated as "Community Computing" in the
functional diagram above was used by approximately 550 university
scientists and approximately 500 NCAR researchers in support of the
following areas of research:
The total computing capacity available to the community was doubled in FY1998.
NCAR Climate Simulation Laboratory
NCAR, in collaboration with the NSF Atmospheric Sciences, Ocean Sciences,
and Mathematical and Physical Sciences Divisions, operates a national,
special-use, dedicated climate system modeling computing facility, the
Climate Simulation Laboratory (CSL). The CSL is a multi-agency, United States
Global Change Research Program (USGCRP) facility that provides high
performance computing, data storage, and data analysis systems to support
large, long-running simulations of the Earth's climate system.
The total computing capacity of the CSL was doubled in FY1998.
Mass Storage System (MSS)
A major SCD objective is to maintain an MSS that is commensurate with the
compute power available to users and is capable of handling the
increasing file sizes and data transfer rates required by users'
applications. Significant upgrades to the MSS have been made over the past
couple of years in anticipation of increased compute power and growing user
needs. Specifically, new storage media capability and additional robotic
tape handling capability have been integrated into the MSS. The amount of
data SCD can archive has increased by at least a factor of four.
In addition, data transfer rates have been significantly increased.
Major accomplishments during FY1998
Computing capacity was doubled
During FY1998, these changes in the SCD production supercomputer environment
doubled the overall computing capacity:
- The installation of a new Cray J90se/24-1024 (chipeta) in the
Community computing center
- The installation of a new Silicon Graphics Cray Origin2000/128 (ute)
in the CSL
- The decommissioning of the Cray T3D-128 at the end of FY1998
- The transition of the Hewlett-Packard SPP-2000/64 from an HPCC
resource to a Community computing resource
Supercomputing system advances
Significant enhancements to the NCAR user environment in FY1998 included:
- The Batch Priority Scheduler, and a subset Batch Dedicated Scheduler
for DSM architectures
- Utilization of the Network Queuing Environment and File Transfer Agent
- Supercomputer system monitoring and reporting
Progress in transitioning to and using highly parallel technology
In December 1997, after a lengthy evaluation of domestic supercomputer
technology by SCD and select representatives from other NCAR divisions,
it was decided to invest the FY1998 Climate Simulation Laboratory (CSL)
computational budget in a 128-processor Silicon Graphics Cray Origin2000.
The system was delivered to SCD on May 18. At the end of FY1998, the
Silicon Graphics Cray Origin2000 (ute) was being used extensively. System
utilization climbed from approximately 50% in July to more than 80% by the
end of the fiscal year. This high level of utilization is exceptional and
unexpected for a DSM architecture. (Other centers typically observe, and
have expressed concern, that DSM architectures cannot be expected to
exceed about two-thirds utilization.)
DataPark
In FY1998, SCD added 130 GB of local disk storage to the winterpark system.
This additional local disk enabled special projects to access dedicated
blocks of high-speed storage for long periods of time. The first two such
projects were SCD's Data Support Section's Reanalysis Project and the
joint Climate and Global Dynamics Division-SCD CSM Post-Processor Project.
Year 2000 compliance and testing
A preliminary UCAR Year 2000 Plan was drafted during the first quarter of
FY1998 and submitted to the NSF for review. The UCAR plan has five phases
corresponding to those identified by the GAO. This plan attempts to
ensure that all UCAR-owned and/or managed systems are "Year 2000
compliant" well before 1 January 2000.
Visualization Lab
The confluence of harnessed commodity computational power, capacious
storage, display technologies, and high-bandwidth networks constitutes
the critical mass for the next generation of computing: an era of
visual computing.
During FY1998, the Visualization Lab was upgraded in several important areas.
Most importantly, the 4-year-old flagship visual supercomputer was replaced
by an 8-processor R10000 Silicon Graphics Onyx-2 Infinite Reality Engine
with at least 2 GB of physical memory. This system and high performance
networking have enabled forays into new and more difficult scientific and
technical domains.
Networking infrastructure enhancements
- Jeffco Network Infrastructure Completion (JEFNIC) project
- Ethernet packet-switch re-engineering project
- Router backbone re-engineering project
Network and computer security
Based upon recommendations of the UCAR Computer Security Advisory Committee,
SCD led a UCAR-wide project that implemented significant new gateway route
filters to greatly improve network security for UCAR. The preparation and
installation of these filters were timely because hacker probes of UCAR
computer defenses had been increasing in the preceding weeks. These probes
caused serious problems and consumed a considerable amount of staff time
throughout UCAR. Most of the problems ceased after the installation of the
security filters.
End-of-life systems
In compliance with a recommendation from the most recent five-year review
by NSF, SCD established in FY1997 the End-of-Life Systems (EOLS) project to
develop definitions and procedures for reviewing all existing supported
hardware and software systems in SCD to determine whether support of each
system is still justified. SCD's ability to adopt and support new
technologies often depends not only on its resources, but on its ability
to retire older systems and services. Charles Dickens of Stanford University
appropriately stated: "The ability to advance the leading edge of
technology is constrained by the ability to prune the trailing edge."
A wide range of hardware and software was
retired in FY1998.
Trouble Ticket System
SCD is developing a Trouble Ticket system that will provide a homogeneous
method for different SCD sections to track and provide solutions for user
problems and requests. The system will provide a knowledge base and a
collective set of solutions to known problems, thereby increasing the
efficiency of the division and its users.
By the end of FY1998, we had:
- Completed and demonstrated the system interface to SCD staff
- Started friendly user testing for SCD staff
- Demonstrated a prototype web interface
U.S. atmospheric science modelers currently enjoy global leadership in
several areas of research that depend on high performance computers.
To maintain that leadership, they need computing capabilities that are
comparable to those of their international peers. For example, a 1-km
regional forecast using 4DVAR with full physics adjoint is feasible, but
using it for time-critical (less than one hour) forecasting will probably
require a machine that can sustain at least 50 GFLOPS. Another example
is a recently developed NCAR global chemistry model (MOZART). To
complete 100-year simulations of the climate within a reasonable
timeframe, this model needs a computer that can sustain 20 to 40 GFLOPS.
The situation is particularly acute in climate modeling and is exemplified
by the computational requirements of the NCAR Coupled System Model (CSM).
Currently, the primary computer for executing the CSM is the Cray C90. It
takes approximately 16 days to simulate 100 years of climate, with the
model running 24 hours a day at 5 GFLOPS. Scientists routinely need
to simulate several climate scenarios and perform at least four sensitivity
studies for each scenario. Thus, a single 100-year study may involve 20 or
more 100-year simulations, and each 100-year simulation produces hundreds
of gigabytes of data that must be archived and analyzed. Within two years,
the computational requirements of the CSM will quadruple due to a modest
increase in resolution, enhancements in the model's semi-Lagrangian
dynamics and prediction of cloud water, and the addition of a sulfate
aerosol model. At that time, at least one machine with approximately
40-GFLOPS capability will be required to execute a 100-year CSM simulation
in a reasonable amount of time (approximately one calendar week). Within
five years, a machine capable of sustaining a trillion floating-point
operations per second (TeraFLOPS) or more will be needed to provide similar
turnaround time due to significant increases in resolution requirements and
the inclusion of a global atmospheric chemistry model such as MOZART, which
is being developed by the Atmospheric Chemistry Division of NCAR.
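The arithmetic behind these figures can be checked with a short sketch. The input numbers are taken from the text above. The function and variable names are illustrative, and the assumption that turnaround time scales inversely with sustained speed is a simplification that is not stated in the report.

```python
def days_to_run(work_gflops_days, sustained_gflops):
    """Wall-clock days for a run, ASSUMING time scales inversely with sustained speed."""
    return work_gflops_days / sustained_gflops


# Figures quoted in the text for the current CSM on the Cray C90.
current_days = 16.0      # days per 100-year simulation
current_gflops = 5.0     # sustained GFLOPS

# Total work per 100-year simulation, in GFLOPS-days.
current_work = current_days * current_gflops

# The text expects requirements to quadruple within two years.
future_work = 4 * current_work

# Turnaround for the future model at today's speed and at 40 GFLOPS.
days_at_current_speed = days_to_run(future_work, current_gflops)  # 64 days
days_at_40_gflops = days_to_run(future_work, 40.0)                # 8 days

# A study of 20 simulations run back to back at today's speed.
study_days = 20 * current_days

# Sustained speed that would bring the future run down to exactly 7 days.
gflops_for_one_week = future_work / 7.0

print(days_at_current_speed, days_at_40_gflops, study_days, gflops_for_one_week)
```

Under this simple scaling, a 40-GFLOPS machine completes the quadrupled workload in 8 days, which agrees with the "approximately one calendar week" quoted above.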
A brief history of supercomputer architecture -- An exercise in speed
versus economics
The desire of scientists to expand the set of solvable problems produces
a constant need for more powerful supercomputers. This led to the
development of vector processing capability in the mid-70s. Then in the
80s, several vector processors were incorporated into a single system to
create Parallel Vector Processor (PVP) supercomputers. Processors in a
PVP typically share a common memory that each processor can access
uniformly in time. Thus, PVPs are members of a larger class of
architecture known as Symmetric Multi-Processor (SMP) systems diagrammed
in Figure 1 below. PVPs continue to be among the most powerful
supercomputers available. For example, 1999-vintage vector processors
will sustain up to 2-3 billion arithmetic operations per second (GFLOPS),
and parallel execution of them in a 1999-vintage PVP may sustain 50 to
250 GFLOPS.
By the mid-80s, microprocessors offered attractive performance per
unit-of-cost. This made possible Massively Parallel Processor (MPP)
systems containing hundreds, even thousands, of microprocessors. Typically,
the processors in an MPP do not share a common memory. Rather, each has its
own memory, and an inter-processor communication system (typically, some
form of message passing) moves data between processors.
By the mid-90s, several large computer manufacturers had developed
parallel systems that
incorporate tens of low-cost microprocessors wherein each microprocessor
does not have uniform access to the shared memory. Further, each
microprocessor has one or more levels of cache. In order to manage these
processors with a single operating system, coherence of data in the caches
must be maintained; i.e., if a datum is changed in the cache of one
processor, then it must be changed in every other cache in which it resides.
Such systems are called cache-coherent Non-Uniform Memory Access (ccNUMA)
systems (see Figure 2). These are widely used as UNIX servers -- a
multi-billion dollar market. ccNUMA systems are a subset of Distributed
Shared Memory (DSM) systems, in which data consistency is maintained
either by cache coherence in the hardware or by software implemented on
top of message passing. Either way, DSM systems preserve the programming
ease of SMPs and portability with them.
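The coherence requirement described above can be illustrated with a toy model. This is only a sketch: the class names (SharedMemory, Cache) are hypothetical, and the write-update policy shown is the simplest reading of the description in the text. Real ccNUMA hardware uses more elaborate protocols, for example invalidating stale copies rather than updating them.

```python
class SharedMemory:
    """Toy main memory that also tracks every cache attached to it."""

    def __init__(self):
        self.data = {}
        self.caches = []


class Cache:
    """Toy per-processor cache using a write-update coherence policy."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}
        memory.caches.append(self)

    def read(self, addr):
        # On a miss, fetch the value from main memory (0 if never written).
        if addr not in self.lines:
            self.lines[addr] = self.memory.data.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        # Update the local copy and main memory ...
        self.lines[addr] = value
        self.memory.data[addr] = value
        # ... then update every other cache that holds the same datum.
        for other in self.memory.caches:
            if other is not self and addr in other.lines:
                other.lines[addr] = value


memory = SharedMemory()
cpu0 = Cache(memory)
cpu1 = Cache(memory)

cpu0.write(0x10, 1)
first_read = cpu1.read(0x10)       # cpu1 now caches the datum
cpu0.write(0x10, 2)                # the update must reach cpu1's copy
second_read = cpu1.read(0x10)
```

After the second write, both caches hold the same value for the datum, which is the property a single operating system image relies on.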
Today, the dominant trend in high speed computing architecture is to
cluster dozens of shared memory systems of various types. For example,
PVPs such as the Silicon Graphics SV1 and NEC SX can be clustered;
also, DSM systems such as the HP SPP and Silicon Graphics Cray
Origin2000 can be clustered. Each shared memory
system in a cluster is called a "node" and may have from tens to
hundreds of processors (Figure 3). Well-established multitasking
techniques are used for programming within a node, and message passing
is used for communication among nodes. Clusters make it theoretically
possible to apply thousands of vector processors and/or tens of thousands
of microprocessors in parallel to a single application. Such systems
have the potential to sustain trillions of arithmetic operations per
second (TeraFLOPS).
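The two-level programming model described above (shared-memory multitasking within a node, message passing among nodes) can be sketched in a few lines. This is an illustration only: Python threads and a queue stand in for real nodes and a real message-passing library, and the function names and the choice of a global sum as the workload are assumptions made for the example.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def node_partial_sum(chunk, threads_per_node):
    """Within a node: worker threads share the node's memory (the list `chunk`)."""
    step = max(1, len(chunk) // threads_per_node)
    slices = [chunk[i:i + step] for i in range(0, len(chunk), step)]
    with ThreadPoolExecutor(max_workers=threads_per_node) as pool:
        return sum(pool.map(sum, slices))


def run_node(chunk, threads_per_node, outbox):
    """A node computes its partial result and sends it as a message."""
    outbox.put(node_partial_sum(chunk, threads_per_node))


def cluster_sum(values, n_nodes=4, threads_per_node=2):
    """Among nodes: no shared data, only messages passed through `outbox`."""
    outbox = queue.Queue()
    # Each simulated node receives its own private copy of part of the data.
    chunks = [values[i::n_nodes] for i in range(n_nodes)]
    nodes = [
        threading.Thread(target=run_node, args=(chunk, threads_per_node, outbox))
        for chunk in chunks
    ]
    for node in nodes:
        node.start()
    for node in nodes:
        node.join()
    # Combine one message per node into the final result.
    return sum(outbox.get() for _ in range(n_nodes))


total = cluster_sum(list(range(1000)))
print(total)
```

On a real cluster each node would be a separate machine, and the queue would be replaced by a message-passing library running over the interconnect.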
The options
During FY1997-1998, NCAR staff conducted a review of the plans of U.S.
computer manufacturers. From that review it became evident that to
achieve the high levels of performance needed (40 GFLOPS or more),
two or more parallel systems must be clustered. The review also revealed
that several U.S. manufacturers will offer clusters of DSMs by FY1999-2000,
and one will offer a cluster of PVPs.
In FY1998, the DSM that performed best on the NCAR CCM provided
substantially better performance per unit-of-cost than U.S.-manufactured
PVPs. So, in FY1998, NCAR purchased a 128-processor DSM (Silicon Graphics
Cray Origin2000) for
the CSL. The machine became operational in July 1998, and is being
heavily used by the NCAR CSM and the DOE Parallel Coupled Model (PCM).
It provides 2-3 GFLOPS of sustained performance per million dollars
of cost and its capacity exceeds that of the Cray C90.
The DOE ASCI project is developing technology for clustering Silicon
Graphics Cray Origins,
and the CSL will have the option to take advantage of that development if
it proves successful. However, the next generation of U.S.-manufactured
PVPs may provide a substantial increase in both performance and performance
per unit-of-cost, and the manufacturer will support clusters of them. SCD
is positioned to go in either direction. Demonstrated performance,
performance per unit-of-cost, and ease of use will be key factors in
choosing between the two options.
Organization of this review
SCD's FY1998 activities are reviewed in the following sections of this
Annual Scientific Report:
High performance computing
NCAR Mass Storage System
High speed network and data communications
Computing environment and support services
Research data
Computational science
Operational procedures and infrastructure
Publications
Community service and educational activities
Staff, visitors, and collaborators