AHPCRC Projects

Project 4-2: Flexible Architecture Research Machine (FARM)

Principal Investigators: Kunle Olukotun, Christos Kozyrakis (Stanford University)

[Figure: FARM components]
[Figure: Current configuration of FARM]
Graphics this page courtesy Kunle Olukotun, Christos Kozyrakis (Stanford University).

As heterogeneous systems that combine CPUs, GPUs, and FPGAs (central processing units, graphics processing units, and field-programmable gate arrays; see “Terms and Abbreviations,” page 5) become more common, software and hardware must be developed and customized in tandem to ensure that both achieve optimum performance. A more accurate picture of parallel software performance emerges when the software can be tested at full scale and full speed, but the ability to perform such tests is limited by the availability of large-scale computing resources. A readily available, reconfigurable testbed could facilitate algorithm and software development and provide a means of testing new architectures.

Stanford University Electrical Engineering and Computer Science professors Kunle Olukotun and Christos Kozyrakis are developing the Flexible Architecture Research Machine (FARM), a vehicle for hardware/software codesign intended to accelerate architecture and algorithm research on novel parallel models. FARM provides a realistic application development environment for tightly coupled heterogeneous systems, combining commercial server technology with FPGAs in a flexible and scalable high-performance parallel machine that can run full-sized applications at full hardware speeds.

Like the Cray XD1 supercomputer, FARM integrates CPUs and FPGAs, but FARM goes further by including GPUs. Moreover, FARM connects the FPGAs directly to the CPUs through cache-coherent links, which maintain the consistency of shared data held in local caches (illustration above). This coupling provides faster and finer-grained FPGA–CPU communication and allows researchers to use the FPGAs to enhance the memory system with transactions or streams.

For algorithm development using existing architectures, the FARM can be used as a high-density, high-bandwidth supercomputer. For architecture and software research on novel architectures, the FPGAs can be programmed to introduce new functionality into the memory system. Unlike commercial CPU–FPGA systems, the FARM CPU and FPGA communicate using cache-coherent HyperTransport links (bidirectional, high-bandwidth, low-latency point-to-point links). The application hardware block is defined by the application. Coherency support makes it possible for the CPU and FPGA to communicate in a fine-grained manner with very low latency (the delay between the instruction commanding an action and the hardware performing that action), and it allows the FPGA to “cache” shared data inside the configurable coherent cache. This capability makes it possible to implement protocols that interact directly with the memory system.

The Stanford group has a fully operational FARM system (diagram, previous page) consisting of 16 AMD Opteron CPU cores and one Altera FPGA. The group has used this prototype to build a hybrid hardware–software transactional memory (hybrid-TM) system that can run full-sized TM applications, an example of the good performance achieved through the careful interplay and codesign of the TM software and hardware. This codesign was possible only because the FARM environment allowed the group to change hardware and software at the same time and still experiment with realistic full-size applications. Transactional memory promises to substantially reduce the difficulty of writing correct, efficient, and scalable concurrent programs.

The Stanford group has implemented two versions of the hybrid-TM system, one optimized for large transactions and one for small transactions. Both versions achieve substantial performance improvements over a software TM system for their target transaction sizes.

In the course of developing the hybrid-TM system, the Stanford group created a generic cache-coherent interface inside the FPGA that makes it much simpler to prototype other application accelerators. This working high-speed (200 MHz) interface between the multi-core CPUs and FPGA chips uses coherent HyperTransport. FARM is one of only a few systems in the world with this capability. The base prototype was purchased from A&D Technology. Considerable engineering effort was expended in developing and improving the FPGA design to get the system working reliably and at high speed. Drivers have been developed for FARM under both the OpenSolaris and Linux operating systems.

Two techniques have been developed for tolerating the latency of fine-grained asynchronous communication with an out-of-core accelerator. These techniques are applicable to any accelerator, but only work with a cache-coherent coupling between the FPGA and the CPU. A system for Transactional Memory Acceleration using Commodity Cores (TMACC) has been designed that uses general-purpose out-of-core Bloom filters to accelerate the detection of conflicts between transactions. A complete hardware implementation of TMACC using the FARM is the only hardware implementation that the Stanford group is aware of that handles large transactions. The potential of TMACC has been demonstrated by evaluating the implementation using a custom micro-benchmark and the full STAMP benchmark suite. For all but short transactions, it is not necessary to modify the processor to obtain a substantial improvement in TM performance. For medium to large transactions, TMACC outperforms a software-only TM system by 2–5 times, showing maximum speedup within 8% of an upper bound on TM acceleration.
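
The general idea of using Bloom filters to detect conflicts between transactions can be sketched in software. The example below is only an illustration under stated assumptions, not the TMACC design: the filter size, the number of hash functions, the use of SHA-256, and the names BloomFilter, Transaction, and may_conflict are all hypothetical and do not come from the source. Each transaction records the addresses it touches in compact filters, and a committing transaction's writes are checked against those filters.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over memory addresses (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are arbitrary assumptions, not TMACC parameters.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit vector

    def _positions(self, address):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{address}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, address):
        for pos in self._positions(address):
            self.bits |= 1 << pos

    def might_contain(self, address):
        # True may be a false positive; False is always exact.
        return all((self.bits >> pos) & 1 for pos in self._positions(address))


class Transaction:
    """Tracks the read and write sets of one transaction as Bloom filters."""

    def __init__(self, name):
        self.name = name
        self.read_set = BloomFilter()
        self.write_set = BloomFilter()

    def read(self, address):
        self.read_set.add(address)

    def write(self, address):
        self.write_set.add(address)


def may_conflict(committing_writes, other):
    """Return True if any committed write may overlap what `other` accessed."""
    return any(
        other.read_set.might_contain(addr) or other.write_set.might_contain(addr)
        for addr in committing_writes
    )


if __name__ == "__main__":
    reader = Transaction("reader")
    reader.read(0x1000)
    # A commit that writes 0x1000 must be flagged against the reader.
    print(may_conflict([0x1000], reader))
```

A Bloom filter never misses an address that was added, so a real conflict is always reported. It can report a conflict that did not happen, which in a TM system costs an unnecessary abort but does not affect correctness.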

Eventually, the FARM system will be scaled beyond a single node and the software infrastructure will be developed to make heterogeneous systems easier to program. Ideally, the system will include enough flexibility to satisfy programmers, without sacrificing an excessive amount of speed or introducing undue complexity into the system. In addition, the system must be amenable to adaptation as newer technologies and capabilities evolve. Discussions are in progress with ARL/CISD about how FARM might be used to accelerate applications of interest to the Army, including work in the machine learning area.

FARM Specs

FARM combines commercial server technology with FPGAs to provide a flexible high-performance parallel machine. The basis for FARM is a conventional blade server that accommodates multiple 64-bit Opteron blades, each with a multi-core chip, DRAM DIMMs, and a PCI Express connection for a high-end GPU board.

FPGAs are introduced by removing one of the Opteron blades and installing in its place a commercially available blade with a high-density FPGA chip. The FPGA blade is directly connected to the Opteron blades through a cache-coherent HyperTransport link.

The FPGA blade can access the DRAM, GPU, and network resources available in other blades without interrupting the CPUs. The high-speed network interfaces (e.g., InfiniBand or 10G Ethernet) and appropriate logic in the FPGAs make it possible to extend communication protocols and memory models across multiple blade chassis in a standard server rack.

Overall, a single FARM rack will include up to 126 Opteron chips (504 cores, 32 TFLOPS double precision), 72 GPUs (144 TFLOPS single precision), 72 FPGAs (~21 million LUTs), and 1 Tbyte of DRAM. The exact balance depends on the mix of boards and components used in the specific machine configuration.
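
The rack-level totals above imply approximate per-component figures, which the short calculation below derives. It is a convenience sketch only: the dictionary keys and the function name per_component are hypothetical, the inputs are the maximum values quoted in the text, and the results are simple averages that assume identical components, not measured or vendor-specified numbers.

```python
# Maximum per-rack figures as quoted in the text.
RACK = {
    "opteron_chips": 126,
    "cpu_cores": 504,
    "cpu_tflops_dp": 32,       # double precision, all CPUs combined
    "gpus": 72,
    "gpu_tflops_sp": 144,      # single precision, all GPUs combined
    "fpgas": 72,
    "fpga_luts": 21_000_000,   # quoted as "~21 million"
}


def per_component(rack):
    """Average the rack totals over the number of components of each type."""
    return {
        "cores_per_chip": rack["cpu_cores"] / rack["opteron_chips"],
        "gflops_dp_per_chip": rack["cpu_tflops_dp"] * 1000 / rack["opteron_chips"],
        "tflops_sp_per_gpu": rack["gpu_tflops_sp"] / rack["gpus"],
        "luts_per_fpga": rack["fpga_luts"] / rack["fpgas"],
    }


if __name__ == "__main__":
    for name, value in per_component(RACK).items():
        print(f"{name}: {value:,.2f}")
    # cores_per_chip: 4.00
    # gflops_dp_per_chip: 253.97
    # tflops_sp_per_gpu: 2.00
    # luts_per_fpga: 291,666.67
```

These averages correspond to four cores per Opteron chip, about 2 TFLOPS of single-precision throughput per GPU, and roughly 290,000 LUTs per FPGA.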

FARM runs on OpenSolaris, an open-source, Unix-based operating system based on Sun Microsystems’ Solaris.

Source: AHPCRC Bulletin Vol. 2 No. 1 (2010)