AHPCRC Projects
Project 4–1: Stream Programming for High Performance Computing
Principal Investigators: Alex Aiken, William Dally, Patrick Hanrahan (Stanford University)
[Figure: Schematic of Sequoia’s hierarchical memory design.]
[Figure: Virtual levels in Sequoia represent an abstract memory hierarchy without specifying data transfer mechanisms, giving the programmer the ability to adapt data flow to a variety of machine architectures.]
Graphics this page courtesy Alex Aiken (Stanford University).
Parallel programming is an intrinsic part of high performance computing. Whether a programmer is adapting existing software or implementing new functionality, codes must be designed to run accurately, reliably, and efficiently on systems with tens to thousands of processors working cooperatively. Alex Aiken, Professor of Computer Science at Stanford University, says, “Programmers tend to think of parallel programming as a problem of dividing up computation, but in fact the most difficult part of parallelism is often communication, simply moving the data to where it needs to be.” Aiken, along with Stanford computer science professors William Dally and Patrick Hanrahan, is developing the Sequoia programming language. Their efforts, supported in part by AHPCRC, will give Army researchers the ability to port parallel programs to many types of computing systems and architectures without sacrificing performance.

To achieve the best performance on a specific system, a programmer must understand that system and design the code to fit its particular characteristics. Code that works especially well on one architecture may not achieve anywhere near the same level of performance on a system of a different size or structure. Conversely, programs written to be highly portable may not perform optimally on any system. The Sequoia language addresses this problem by allowing programmers to write code that is functionally correct on any system, then tune its performance to the characteristics of a specific system using the underlying Sequoia interface.

Memory Hierarchies

High-performance parallel architectures, including IBM’s high-throughput Cell processor and the graphics processing units (GPUs; see “Terms and Abbreviations,” page 5) made by NVIDIA and ATI, increase performance and efficiency by allowing software to manage a hierarchy of memories. (One example is shown above.) Such systems are highly parallel: they consist of many processing elements (PEs) operating in isolation, each drawing data only from its own small, fast local memory, the “leaves” of the memory “tree.” Individual PEs do not have access to the entire memory hierarchy, and there are no data caches. A conventional high-latency, low-bandwidth external memory device serves as the “root.” Between the root and the leaves are various memory structures such as on-die storage, local DRAM (dynamic random access memory), or remote memory reached over high-speed interconnects. Data and code move between levels in the hierarchy as asynchronous block transfers explicitly orchestrated by the software.

This “exposed-communication architecture” requires programmers to build into the software the directives that move data between nodes at adjacent levels of the memory hierarchy. Explicit management of the memory hierarchy gives the programmer direct control over locality, making it possible to write locality-aware programs and thus improve performance. The node-level orchestration aspect bears emphasizing: parallel codes have typically addressed the horizontal communication of data among the nodes of a machine, but newer architectures also require managing data as it moves vertically between levels of a memory hierarchy. The Sequoia language places data movement and placement explicitly under the control of the programmer, and machine architecture is represented in the language as abstracted memory hierarchy trees (schematic, above).
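The staging idea can be illustrated outside of Sequoia itself. The following C++ sketch (illustrative only, not Sequoia code or part of the Sequoia system) mimics a software-managed local store: the kernel never touches the large “root” array directly, but works on fixed-size blocks explicitly copied into a small “local” buffer first, much as a Cell SPE or GPU kernel pulls data into its scratchpad. The transfers here are synchronous copies for simplicity; a real exposed-communication machine would issue them as asynchronous block transfers.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <iostream>
#include <vector>

// Size of the simulated fast local memory (a "leaf" of the hierarchy).
constexpr std::size_t kLocalElems = 256;

// Kernel that may only read and write the small local buffer, never the
// large root array: this models a PE with no view of the full hierarchy.
void scale_local(float* local, std::size_t n, float alpha) {
    for (std::size_t i = 0; i < n; ++i) local[i] *= alpha;
}

// "Parent" code: orchestrates block transfers between the root array
// (slow, large memory) and the local buffer (fast, small memory).
void scale_root(std::vector<float>& root, float alpha) {
    float local[kLocalElems];  // simulated scratchpad
    for (std::size_t off = 0; off < root.size(); off += kLocalElems) {
        std::size_t n = std::min(kLocalElems, root.size() - off);
        std::memcpy(local, root.data() + off, n * sizeof(float));  // copy in
        scale_local(local, n, alpha);                              // compute locally
        std::memcpy(root.data() + off, local, n * sizeof(float));  // copy out
    }
}

int main() {
    std::vector<float> data(1000, 2.0f);
    scale_root(data, 3.0f);
    std::cout << data.front() << " " << data.back() << "\n";  // prints 6 6
}
```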
Self-contained computations called “tasks” are the basic units of computation. Tasks are functions and are free of side effects. When a task is invoked, it normally runs at the next lower level of the memory hierarchy; a parallel loop with a task as its body launches the task calls in parallel on “children” of the current node. Task parameter passing is copy-in-copy-out: a task copies its argument data from the “parent” memory, runs locally one level lower in the memory hierarchy, and then copies its results back to the parent. Tasks provide for the expression of explicit communication and locality, isolation and parallelism, algorithmic variants, and parameterization. These properties allow Sequoia programs to be ported across machines without sacrificing the ability to tune for performance.
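A rough analogue of this calling convention can be written in ordinary C++ (again an illustrative sketch, not Sequoia syntax). The task below receives a copy of its input block, computes entirely on that private data, and hands its result back to the parent; the parent’s parallel loop launches one such task per child, mirroring how a Sequoia parallel loop maps task calls onto the children of the current memory node.

```cpp
#include <cstddef>
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

// A "task": a side-effect-free function. Its argument arrives by value
// (copy-in), it works only on that private copy, and it returns its
// result to the caller (copy-out).
double sum_task(std::vector<double> block) {
    return std::accumulate(block.begin(), block.end(), 0.0);
}

int main() {
    const std::size_t n = 1 << 16, children = 4, chunk = n / children;
    std::vector<double> root(n, 1.0);  // data living in "parent" memory

    // Parallel loop over children: each iteration copies one block of the
    // parent's data into a subtask and runs it concurrently.
    std::vector<std::future<double>> results;
    for (std::size_t c = 0; c < children; ++c) {
        std::vector<double> block(root.begin() + c * chunk,
                                  root.begin() + (c + 1) * chunk);
        results.push_back(std::async(std::launch::async, sum_task, std::move(block)));
    }

    // Copy-out phase: the parent gathers each child's result.
    double total = 0.0;
    for (auto& r : results) total += r.get();
    std::cout << "sum = " << total << "\n";  // prints 65536
}
```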
Sequoia programs may have multiple mappings, one for each target architecture. These mappings are, of course, different; the best block size for one machine, for instance, may be very different from the best block size for another. The portions of the program that deal with the higher-level code common to all machines are kept separate from the portions that handle machine-specific mapping and optimization. Programmers can control all details of mapping an algorithm to a specific machine, including defining and selecting values for tunable parameters. “Tunables” allow the programmer to specify the initial sizes of arrays, blocks, and data structures. Tunables help keep programs machine-independent, but suitable tunable values must be found for each architecture, and because tunables interact, the space of possible tunable values can be very large and complex. The Stanford team has completed the design and implementation of a Sequoia “autotuner” that automatically searches the space of tunable parameters for the highest-performance combinations, relieving the programmer of specifying all but the most critical tunables (a simple search of this kind is sketched after “The Road Ahead,” below). To date, the team has yet to find a program for which values chosen by a programmer are superior to the tunable values set by the autotuner. In at least one program, programmers were never able to find suitable tunable settings by hand, but the autotuner was able to find a high-performance combination of tunable values.

Language for Locality

Dynamic Data

Subdividing graph structures requires balancing parallelism against locality. Connected nodes of the graph are grouped to enhance reuse and to ensure that processors that produce information are located near those that use it; at the same time, nodes along an activity front should be distributed to enable parallelism. One way to strike the right balance between these two goals is to optimize the program’s execution using run-time libraries in Sequoia that perform graph partitioning and distribution. Another approach is to invoke a portion of the compiler at run time to reoptimize the partitioning periodically as the data structures evolve.

Irregular Access

Putting Sequoia to the Test

A major set of tests of Sequoia programs was completed on Cerrillos, the open version of the Roadrunner supercomputer at Los Alamos National Laboratory (see photo). Roadrunner, the world’s first petaflop computer and until recently the top-ranked supercomputer in the world, has a very deep memory hierarchy, making it an ideal testbed for Sequoia applications. Several scaling issues were identified as a result of this implementation, and these have been corrected. The work has also pointed to a new research direction in data layout: understanding how, in a language with explicit locality, a programmer may specify sophisticated mappings of data to processors, such as the partially replicated data that arises in programs with “ghost” cells.

The Road Ahead

One of the largest remaining obstacles to programmer productivity is writing high-performance “node code,” the sequential kernels targeting the complex instruction sets of contemporary processors. The group plans to provide more semi-automatic support for the performance-oriented kernel programmer, including taking advantage of vector hardware (which, in the Sequoia view of the world, is just another level of the memory hierarchy).
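The autotuning idea described under “Tunables” above can be conveyed with a minimal, self-contained sketch (ordinary C++, not the actual Sequoia autotuner, whose search strategy is more sophisticated than the exhaustive sweep shown here). A single tunable, the block size of a blocked reduction, is swept over a set of candidate values; each candidate is timed on the real kernel, and the fastest setting wins. With several interacting tunables, the same idea applies to a much larger combinatorial space.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// Kernel whose performance depends on a tunable block size: a blocked
// sum of squares over a large array.
double blocked_sum_sq(const std::vector<double>& a, std::size_t block) {
    double total = 0.0;
    for (std::size_t off = 0; off < a.size(); off += block) {
        std::size_t end = std::min(off + block, a.size());
        double partial = 0.0;
        for (std::size_t i = off; i < end; ++i) partial += a[i] * a[i];
        total += partial;
    }
    return total;
}

int main() {
    std::vector<double> data(1 << 22, 1.5);
    std::vector<std::size_t> candidates = {256, 1024, 4096, 16384, 65536};

    // Exhaustive "autotuner": time the kernel at each candidate value of
    // the tunable and keep the fastest one.
    std::size_t best = candidates.front();
    double best_time = 1e300;
    for (std::size_t block : candidates) {
        auto t0 = std::chrono::steady_clock::now();
        volatile double sink = blocked_sum_sq(data, block);  // keep the call live
        (void)sink;
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        if (secs < best_time) { best_time = secs; best = block; }
    }
    std::cout << "best block size: " << best << " (" << best_time << " s)\n";
}
```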
The Stanford group is initiating work with Center collaborators on additional, larger applications in early 2010. In addition, release of the existing compiler to Army and other users is projected for early 2010. [Editor's update: this has been released.]

Reference: Fatahalian, K., Knight, T. J., Houston, M., Erez, M., Horn, D. R., Leem, L., Park, J. Y., Ren, M., Aiken, A., Dally, W. J., and Hanrahan, P. 2006. “Sequoia: Programming the Memory Hierarchy.” In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing.

Source: AHPCRC Bulletin, Vol. 2, No. 1 (2010)