SCD FY97 Annual Scientific Report

Modeling 2000

This is a status report on the progress made during year one of a three-year experimental project. The purpose of the Modeling 2000 project has expanded since the initial plans were devised. Initially the focus was to develop a general programming paradigm that is portable, scalable, and efficient on the next generation of high performance computers -- Distributed Shared Memory (DSM) systems. As the fiscal year progressed, it became increasingly likely that we would need a contingency plan in case the planned procurement of NEC supercomputers did not come to fruition. Therefore, the scope was broadened to include evaluation of the production computing capabilities of DSM systems. This additional evaluation covered system reliability, I/O bandwidths, the programming environment, system administration requirements, and raw HIPPI connectivity to the mass store system.

For a portable, scalable, and efficient programming model on DSMs, the most natural paradigm seems to be a hybrid of message passing and multi-threading, where message passing is used between nodes and multi-threading is used within a multiprocessor node. This model reflects the levels of the memory hierarchy: memory is "flat" within a node, while communication between nodes incurs additional latency and potentially lower bandwidth. The Computational Science Section is comparing and contrasting the performance of the hybrid programming paradigm with two other possible approaches:

The vehicles that we are using for this experimental comparison of programming approaches are a full general circulation model and a "modular 3-D dynamical core testbed." Working with a complete global climate model, O(10^5) lines of code, complicates the task because of the size of the code, but it provides feedback on how these comparisons play out in full applications. A simplified dynamical core of a general circulation model, small enough to be manageable but still capturing the necessary algorithmic features of climate model dynamics, is better suited to simple experiments. The testbed will be constructed in a modular way to allow both the spherical harmonic transform (spectral) method and local grid point dynamical schemes to be evaluated. This modular testbed will provide a framework for the ultimate transfer of the programming expertise into real general circulation models for climate simulation and weather forecasting.

The basis for the fundamental testbed design choices will be established through a series of kernels that will include a parallel spectral transform module with parallel FFT and Legendre transform components, a parallel stencil module for the local grid point methods, and a semi-Lagrangian advection module. Also, an atmospheric column physics kernel, extracted from the NCAR Community Climate Model, will provide the basis for assessing the efficacy of various load balancing strategies. Additionally, efficient communication algorithms are being studied in the context of dynamic data redistribution between different decompositions in the dynamical core testbed. This is necessary because the optimal data distribution may differ between portions of the model physics and dynamics.
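
The redistribution between decompositions mentioned above can be illustrated with a small sketch. It is not taken from the testbed: the function names are hypothetical, the "ranks" are simply list entries in a single Python process, and the explicit send and receive phases stand in for what would be an all-to-all exchange in a message passing code. It moves a 2-D field from a decomposition by rows (for example, latitude bands) to a decomposition by columns (for example, longitude blocks).

```python
# Illustrative sketch only: every name here is hypothetical, and the
# "ranks" are list entries in one process, not real message passing tasks.

def block_bounds(n, nparts, part):
    """Half-open range [lo, hi) owned by `part` when n items are split
    into nparts contiguous blocks of near-equal size."""
    base, extra = divmod(n, nparts)
    lo = part * base + min(part, extra)
    hi = lo + base + (1 if part < extra else 0)
    return lo, hi


def scatter_rows(field, nranks):
    """Row decomposition: rank r holds a band of complete rows."""
    nrows = len(field)
    blocks = []
    for r in range(nranks):
        lo, hi = block_bounds(nrows, nranks, r)
        blocks.append([list(row) for row in field[lo:hi]])
    return blocks


def rows_to_columns(row_blocks, ncols):
    """Redistribute so that rank r holds every row, restricted to its
    own block of columns."""
    nranks = len(row_blocks)

    # Send phase: each source rank cuts its band into one piece per
    # destination rank, selected by the destination's column range.
    outbox = []
    for band in row_blocks:
        pieces = []
        for dst in range(nranks):
            lo, hi = block_bounds(ncols, nranks, dst)
            pieces.append([row[lo:hi] for row in band])
        outbox.append(pieces)

    # Receive phase: each destination stacks the pieces it was sent in
    # source-rank order, which restores the global row order.
    col_blocks = []
    for dst in range(nranks):
        rows = []
        for src in range(nranks):
            rows.extend(outbox[src][dst])
        col_blocks.append(rows)
    return col_blocks


# A 5 x 6 field spread over 3 ranks, first by rows and then by columns.
field = [[10 * i + j for j in range(6)] for i in range(5)]
row_blocks = scatter_rows(field, 3)
col_blocks = rows_to_columns(row_blocks, 6)
```

In this sketch every rank exchanges a piece with every other rank, which is why the cost of such a redistribution depends heavily on the efficiency of the underlying communication algorithm.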

Broadly applicable results anticipated

Although the Modeling 2000 project might seem to be targeted very specifically at climate modeling, the programming paradigm issues and the production computing capability issues being studied have broad applicability. The codes used in the experiments capture features generic to many classes of algorithms used in modeling fluids and other dynamical systems. The FFTs, Legendre transforms, and stencil operations are examples of a broad class of algorithms using fixed, regular communication patterns. Efficient distributed FFTs are important to large-scale turbulence calculations in computational fluid dynamics. Semi-Lagrangian advection is representative of a class of difficult algorithms with dynamic, irregular communication patterns. The column physics probes generic single-processor optimization issues such as cache utilization, loop blocking, and loop fusion. In addition, the evaluation of these types of systems from the production computing perspective and the work we do with vendors in improving their products will benefit the entire user community.
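
The contrast between fixed and dynamic communication patterns can be made concrete with a toy example. The sketch below is an assumption-laden illustration, not code from the project: it uses a 1-D periodic grid with an equal block distribution, a three-point stencil, and a semi-Lagrangian step with linear interpolation, and all function names are hypothetical. It computes which other ranks a given rank must receive data from in each case.

```python
import math

# Illustrative sketch only: hypothetical names, a 1-D periodic grid, and
# an equal block distribution (n is assumed divisible by nranks).


def owner(i, n, nranks):
    """Rank that owns global grid index i (wrapped periodically)."""
    return (i % n) // (n // nranks)


def stencil_partners(rank, nranks):
    """A three-point stencil always needs the two neighboring ranks,
    whatever the data values are: a fixed, regular pattern."""
    return {(rank - 1) % nranks, (rank + 1) % nranks} - {rank}


def semi_lagrangian_partners(rank, n, nranks, courant):
    """Ranks owning the cells needed to interpolate at departure points.

    courant[i] is the displacement u*dt/dx at grid point i, so the
    departure point of arrival point i lies at i - courant[i]. Linear
    interpolation needs the grid points on either side of it. The
    partner set therefore depends on the wind field."""
    block = n // nranks
    partners = set()
    for i in range(rank * block, (rank + 1) * block):
        left = math.floor(i - courant[i])
        for j in (left, left + 1):
            partners.add(owner(j, n, nranks))
    partners.discard(rank)
    return partners


n, nranks = 16, 4
weak_wind = [0.5] * n    # departure points stay within one cell
strong_wind = [5.5] * n  # departure points lie more than a block upstream

fixed = stencil_partners(1, nranks)
light = semi_lagrangian_partners(1, n, nranks, weak_wind)
heavy = semi_lagrangian_partners(1, n, nranks, strong_wind)
```

The stencil partner set can be computed once and reused for every time step, whereas the semi-Lagrangian partner set must be recomputed whenever the wind changes, which is one reason such algorithms are harder to parallelize efficiently.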

Initial results indicate that, within a single global shared memory, performance is nearly identical for a code developed strictly with message passing (MPI) and for the same code using shared memory multi-tasking directives. Additionally, our experience to date on the HP SPP2000 has been that this 64-processor system has sustained over 95% availability since mid-July. With the exception of some immaturity in the F90 compiler and its integration with debugging and performance analysis tools, the program development environment on these systems is very useful for developing, debugging, and tuning parallel codes. HP has made rapid progress on developing raw HIPPI for connecting this system to the NCAR mass store.

