AHPCRC Projects
Project 4-3: Specifying Computer Systems for Field-Deployable and On-Board Systems

Principal Investigators: Patricia Teller (University of Texas at El Paso), Jeanine Cook (New Mexico State University, 2007-2009), Sarala Arunagiri (University of Texas at El Paso, 2010-present)
[Figure: Computing resource allocation in a Niagara 2 single-core processor. (CPI is cycles per instruction.)]

[Figure: Monte Carlo modeling results for the Niagara 2 processor.]

Graphics this page courtesy Patricia Teller (University of Texas at El Paso).
Simulation & Modeling to Enhance the Performance of Systems of Multicore Processors (Project 4-3 work from 2007-2009)

Resource-intensive applications, including large-scale simulations, can take weeks to execute, even on the most powerful computing systems. Thus, it is critical to design and tune software to use computing resources efficiently, and to incorporate effective mechanisms for error recovery. On the hardware side, computing systems incorporate an ever-increasing variety of processors, memory devices, and I/O (input-output) subsystems. The challenge is to build software architectures that can function on a variety of configurations without sacrificing performance or accuracy.

Patricia Teller and Sarala Arunagiri (University of Texas at El Paso), Jeanine Cook (New Mexico State University), and their co-workers and students are using a three-pronged approach to optimizing and tuning application performance on heterogeneous computer nodes: measurement, acceleration, and modeling. They are testing their concepts on Chimera, a research computing cluster installed at UTEP in 2008 that combines Opteron, Niagara 2, and Cell/PS3 processor architectures with hardware accelerators. (See "Terms and abbreviations," page 5.) Initial efforts to enable accurate application-to-architecture mapping have shown good progress.

Optimizing the performance of an application requires knowledge of the characteristics of the hardware system on which the application runs, as well as the characteristic resource needs of the application itself. Dynamic profiling and monitoring tools, analytical models, and simulation are used to analyze application behavior in terms of resource needs, such as CPU and memory hierarchy characteristics. This permits identification of poorly performing or frequently executed parts of the code and, when possible, modification of the code or system software to decrease overall execution time.

Measurement

In the past few years, many programming models, languages, and platforms have been developed to aid programmers in porting legacy codes to new multicore, multithreaded architectures. Cilk, OpenCL, and Sequoia (article on page 2) are among the many languages and platforms proposed for this purpose. To facilitate performance analysis studies on Chimera, CUDA and OpenCL have been installed, along with the CUBLAS library (Compute Unified Basic Linear Algebra Subprograms), the OpenCL profiler, pyCUDA, and the PAPI patch, which facilitates use of hardware performance counters. User support documents are being developed for all of the new software installed on Chimera and are available via a user-only wiki. These include a comparison of OpenCL and CUDA written from a programmer's perspective.

Acceleration

Hardware accelerators: GPUs and FPGAs are generally used to accelerate portions of an application code that can be executed in parallel with other code tasks and are amenable to implementation on these architectures. Programming languages and development environments such as NVIDIA's CUDA make it easier to port applications so that they execute efficiently on GPUs. Much effort has been focused on porting kernels to Chimera's NVIDIA Tesla GPUs. Some simple matrix manipulation codes have been ported, as well as an image processing code that extracts dozens of features from an image.
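To give a flavor of the kind of kernel porting described above, here is a minimal pyCUDA sketch that offloads a trivial matrix operation to a GPU. It is not one of the actual Chimera codes; the kernel, array sizes, and launch configuration are illustrative assumptions.

```python
import numpy as np
import pycuda.autoinit          # creates a CUDA context on the default GPU
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# A trivial CUDA kernel: scale each element of a flattened matrix.
mod = SourceModule("""
__global__ void scale(float *a, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] *= s;
}
""")
scale = mod.get_function("scale")

a = np.random.randn(1024, 1024).astype(np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)      # device buffer
cuda.memcpy_htod(a_gpu, a)            # host -> device

n = a.size
threads = 256
scale(a_gpu, np.float32(2.0), np.int32(n),
      block=(threads, 1, 1), grid=((n + threads - 1) // threads, 1))

out = np.empty_like(a)
cuda.memcpy_dtoh(out, a_gpu)          # device -> host
assert np.allclose(out, 2.0 * a)
```

Real ports, such as the image-feature extraction code mentioned above, follow the same pattern but must also restructure data layout and memory traffic to use the GPU efficiently.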
The UTEP/NMSU group is studying Army applications and benchmarks related to science and surveillance, with the intent of choosing two of them for GPU implementation. They are identifying candidate applications in collaboration with Army researchers.

Simultaneous multithreading (SMT) is a technique for improving the overall resource utilization and throughput of superscalar CPUs through hardware multithreading. Replication of key CPU hardware (e.g., the register file and instruction buffer, one per hardware thread) permits multiple independent tasks (threads of execution) to execute concurrently on the hardware threads while sharing the other SMT CPU resources. SMT CPU throughput depends on the amount of interference among the concurrently executing tasks that compete for these shared resources; tasks with different shared-resource needs do not interfere with one another and, thus, do not inhibit each other's execution.

Recent work at UTEP showed that the aggregate performance experienced by a given pair of applications scheduled to execute concurrently on a POWER5 SMT core with two hardware threads can vary significantly depending on the hardware thread priority settings. The research group developed a methodology, based on application signature sets, that, given a co-schedule, predicts the priorities that will minimize application interference and deliver the best throughput (IPC, instructions per cycle). The default priority settings assign equal opportunity to use the shared SMT CPU resources. An initial implementation of the methodology for an IBM POWER5 processor produced throughput gains over the default priorities ranging from 0% (11 co-schedules) to 16.42% (9 co-schedules). For the eight co-schedules in which floating-point unit usage exceeds fixed-point unit usage by 10% or more, the methodology yields a throughput improvement of 3.56-16.42%. This research also showed that 17 of the 10,000 possible POWER5 application signatures (application characteristics related to use of shared SMT CPU resources) sufficed to characterize 95.6% of the execution time of the applications studied: 20 SPEC CPU2006 benchmarks, 3 NAS serial benchmarks, and 10 PETSc KSP solvers. The group is currently investigating anomalies in POWER5 hardware thread scheduling.

The group has documented and catalogued the set of scripts used for developing application signature sets and for SMT execution of application pairs, and they have documented the design of the scripts themselves. This work has produced a report that serves as a user manual for this research.

I/O subsystems of high-performance computer systems generally include RAID (redundant array of independent disks) storage as a building block at levels of the memory hierarchy that experience high I/O contention. Under such conditions, I/O schedulers must provide performance isolation and differentiated service to concurrently active clients. A performance isolation strategy is successful when each workload's I/O performance is similar to that achieved with a dedicated storage utility of a certain fixed capacity; that is, it guarantees each competing workload a share of storage performance. When shares are proportional to workload priorities, the storage system is said to provide differentiated service as well. Existing scheduling algorithms that isolate I/O performance and provide differentiated service are limited to single-disk systems.
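To make the notion of proportional disk-time sharing concrete before turning to FAIRIO, here is a toy dispatcher that, at each step, serves the client workload whose consumed disk time lags furthest behind its weighted entitlement. It is a sketch of the general idea only, not the FAIRIO algorithm; the client names, weights, and service times are invented for illustration.

```python
# Toy proportional-share dispatcher: at each step, serve the client whose
# consumed disk time lags furthest behind its weighted entitlement.
# Weights, clients, and per-request service times are illustrative only.
weights = {"surveillance": 3, "simulation": 2, "backup": 1}
used = {c: 0.0 for c in weights}        # disk time consumed so far

def next_client():
    total_weight = sum(weights.values())
    total_used = sum(used.values()) or 1e-9   # avoid divide-by-zero at start
    # Normalized deficit: entitled share minus share actually received.
    def deficit(c):
        return weights[c] / total_weight - used[c] / total_used
    return max(weights, key=deficit)

def dispatch(service_time_by_client):
    """Serve one request from the most-deficient client and charge it."""
    c = next_client()
    used[c] += service_time_by_client[c]
    return c

# Simulate 600 dispatches with equal per-request service times.
for _ in range(600):
    dispatch({c: 1.0 for c in weights})
print(used)   # disk-time totals converge toward the 3:2:1 weights
```

A real RAID scheduler must additionally contend with requests that span multiple disks and with workloads whose per-request service times vary widely, which is what makes the multi-disk problem described above hard.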
Recent I/O scheduling work at UTEP has produced a new algorithm, called FAIRIO, which enables RAID storage systems to provide both performance isolation and differentiated service. Through detailed simulation, FAIRIO has been shown to provide isolated and differentiated service for both idealized and real I/O workloads. When performance is tuned, the experienced disk-time utilization is within 4% (for idealized workloads) and 11% (for real workloads) of being perfectly proportional. Throughput is not degraded; in fact, it is marginally improved. Future work aims to demonstrate that FAIRIO can be adapted to provide proportional sharing, i.e., differentiated service, for a variety of resources.

Modeling

Monte Carlo Modeling: In an effort to model, in a time-efficient manner, the performance of Army applications executed on next-generation systems, the group has adopted a Monte Carlo methodology to model, predict, and analyze the performance of contemporary multicore architectures: the Sun Niagara, the IBM Cell Broadband Engine, the Intel Itanium 2, and the Opteron processors. During 2009, the existing Monte Carlo methodology was enhanced at NMSU with a technique to model out-of-order instruction execution, and work began to extend the methodology with power models. A modeling framework was implemented that enables users to develop Monte Carlo models of contemporary and future multicore architectures. Performance characteristics predicted by the models are validated against the performance of their real-world counterparts.

Validation of the Niagara 2 single-core model has been completed; all model predictions are now within 7% of measured values. After the validation results were analyzed, the Niagara 2 single-core model was adapted to include the latency penalty for load-load instruction sequences and data forwarding within the memory and floating-point pipelines. The Niagara 2 multi-core model has been completed and validation data have been collected, as have data for the initial multithreaded Niagara 2 model.

The initial Monte Carlo Opteron model, including the methodology to implement out-of-order execution, has also been completed. The model predicts very accurately, and full validation is in progress. The initial design of the methodology for the Opteron multi-core model has also been completed. The researchers are actively integrating the existing Opteron and Niagara 2 models into SST (the Structural Simulation Toolkit), the exascale system simulator developed at Sandia National Laboratories and released under a GNU license in 2009.
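To convey the flavor of Monte Carlo processor modeling (the actual NMSU/UTEP framework is far more detailed), the sketch below estimates CPI for a simple blocking, in-order core by sampling an assumed instruction mix and latency distribution. Every number in it is an illustrative assumption, not a Niagara 2 parameter.

```python
import random

# Illustrative instruction mix (fractions sum to 1.0) and latency
# distributions for a simple in-order, single-issue core.
MIX = [("alu", 0.60), ("load", 0.25), ("store", 0.10), ("fp", 0.05)]
LATENCY = {
    "alu":   lambda: 1,
    "load":  lambda: 3 if random.random() < 0.9 else 40,  # cache hit vs. miss
    "store": lambda: 1,
    "fp":    lambda: 6,
}

def monte_carlo_cpi(n_instructions=200_000):
    """Estimate CPI by sampling instruction types and their latencies."""
    cycles = 0
    for _ in range(n_instructions):
        r, acc = random.random(), 0.0
        for kind, frac in MIX:      # pick an instruction type by its mix
            acc += frac
            if r < acc:
                cycles += LATENCY[kind]()
                break
    return cycles / n_instructions

print(f"predicted CPI ~ {monte_carlo_cpi():.2f}")
```

Because the model samples distributions instead of simulating every microarchitectural event, it runs in seconds; the validation effort described above is what ties such statistical predictions back to measured hardware behavior.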
Power modeling tools and techniques available for emerging architectures are another area of study, with a special focus on the architectures in Chimera. Methods for indirectly measuring CPU power using performance counters, as well as contemporary methods for measuring GPU power, are being investigated. At present, power for GPUs and FPGAs is often estimated, but for many applications these estimates are known to be inaccurate. Direct power measurements are being studied, along with methods for validating them. A report is available on the user wiki.

Modeling for Fault Tolerance: Checkpoint/restart is a common technique for providing fault tolerance to applications executing on massively parallel processing systems. Checkpointing reduces the amount of time and effort wasted when a long-running process is interrupted by a hardware or software failure: checkpoints store data to persistent media, such as a file system, so that a process can be restarted from the latest checkpoint rather than from the beginning. The time interval between checkpoints must balance two competing priorities: frequent checkpoints minimize computational losses in the event of a failure, but too many checkpoints can significantly slow the execution of the program.

Existing models determine the checkpoint interval that minimizes the wall-clock execution time of an application. UTEP researchers have developed a complementary model that identifies a checkpoint interval minimizing the aggregate number of checkpoint I/O operations. The UTEP group illustrated the existence of such propitious checkpoint intervals using the parameters of four massively parallel processing systems: Red Storm, Jaguar, Blue Gene/L, and a theoretical petaflops system. Using both models gives application programmers a basis for finding a checkpoint interval that balances application execution time against the frequency at which the application performs checkpoint operations. Future work will investigate the use of these models to schedule the checkpoint I/O (called defensive I/O) and the productive I/O of multiple concurrently executing applications.
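For context, the wall-clock-minimizing interval mentioned above is often approximated by Young's classic first-order formula, the square root of twice the checkpoint cost times the mean time between failures. The sketch below computes it for made-up system parameters; the UTEP model that minimizes checkpoint I/O operations is not reproduced here.

```python
import math

def young_interval(checkpoint_seconds, mtbf_seconds):
    """Classic first-order optimum (Young, 1974): the compute interval
    between checkpoints that roughly minimizes expected wall-clock time."""
    return math.sqrt(2.0 * checkpoint_seconds * mtbf_seconds)

# Illustrative numbers only: a 5-minute checkpoint write on a system
# with an 8-hour mean time between failures.
tau = young_interval(300.0, 8 * 3600.0)
print(f"checkpoint every {tau / 60:.1f} minutes")
```

An interval chosen this way trades recomputation losses against checkpoint overhead; the UTEP work adds the I/O-operation count as a second objective to balance.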
Source: AHPCRC Bulletin Vol. 2 No. 1 (2010)



