CSAR Seminar

SPEAKER: Gengbin Zheng, UIUC/CSAR

TITLE: A Parallel Runtime System for Achieving High Performance on Large Parallel Machines

DATE: Wednesday, April 2, 2008
TIME: 12:00 Noon
PLACE: 2240 DCL
1304 W. Springfield Ave., Urbana, IL

ABSTRACT

Parallel machines with extremely large numbers of processors are now in operation. For example, the IBM BlueGene/L machine has 104K dual-processor nodes and delivers 478 teraFLOPS of sustained performance. Writing new parallel programs for such machines, or porting legacy codes to them, in a way that exploits the enormous compute power and scales well is a significant challenge for application developers, who must also deal with issues such as load balance and fault tolerance to sustain performance.

In this talk we describe several techniques used in the parallel runtime systems Charm++ and AMPI that allow complex, irregular, and dynamic applications to be developed quickly and to scale on large parallel machines. One of our core techniques is based on the idea of processor virtualization: the programmer divides the computation into a large number of entities, which are mapped to the available processors by an intelligent runtime system. This separation of concerns frees programmers from thinking about the number of processors when writing applications, while allowing the runtime system to optimize the application's load balance and provide fault tolerance in an application-independent way.

This talk will mainly focus on automatic dynamic load balancing for AMPI applications. We will describe the techniques used to migrate MPI processes across processors to achieve global load balance. We will also discuss various application-independent load balancing strategies that calculate optimized work-to-processor mappings to distribute the workload evenly across processors. The same techniques are also used in our fault tolerance schemes, which are based on automatic checkpoint/restart.
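
As a rough illustration of what an application-independent work-to-processor mapping can look like, the sketch below uses a simple greedy heuristic: work units are taken heaviest first and each is assigned to the processor with the smallest load so far. This is only an assumed example for intuition, not the strategy presented in the talk. The names `greedy_map` and `proc_loads` and the sample loads are hypothetical, and a real runtime would measure loads and migrate work rather than take a fixed list.

```python
import heapq


def greedy_map(loads, num_procs):
    """Map work units to processors, heaviest unit first, always choosing
    the processor with the smallest accumulated load (illustrative only)."""
    # Min-heap of (accumulated load, processor id).
    heap = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    mapping = {}
    # Visit unit indices in order of decreasing load.
    for unit in sorted(range(len(loads)), key=lambda u: loads[u], reverse=True):
        total, proc = heapq.heappop(heap)
        mapping[unit] = proc
        heapq.heappush(heap, (total + loads[unit], proc))
    return mapping


def proc_loads(loads, mapping, num_procs):
    """Return the total load placed on each processor by a mapping."""
    totals = [0.0] * num_procs
    for unit, proc in mapping.items():
        totals[proc] += loads[unit]
    return totals


if __name__ == "__main__":
    loads = [5, 3, 3, 2, 2, 1]  # hypothetical per-unit loads
    mapping = greedy_map(loads, num_procs=2)
    print(mapping)
    print(proc_loads(loads, mapping, num_procs=2))
```

The mapping depends only on the load values, not on what the work units compute, which is the sense in which such a strategy is application independent.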