ER01: Fault tolerance for high performance computing

George Bosilca and Thomas Hérault,

University of Tennessee Knoxville

December 10-14, 2012, ENS Lyon.

During this course, we will discuss fault tolerance techniques for high performance computing systems.

Motivation.

In June 2008, LANL's Road Runner computing system was the first to cross the one-Petaflop mark on the Linpack benchmark (reaching $1.026 \cdot 10^{15}$ double precision floating point operations per second). Today, only 4 years later, the fastest computing system, the Sequoia BlueGene/Q at LLNL, sustains 16.324 Petaflops. Continuing on the same path, computing systems are expected to reach the 1 Exaflop milestone ($10^{18}$ flops) as early as 2022. After the multicore revolution of 2000, such performance became achievable only by multiplying the number of computing components (cores) in the parallel machine. While the 2008 Road Runner machine had 122,400 cores distributed over 20,000 heterogeneous nodes, today's Sequoia system has no fewer than 1,572,864 cores distributed over 100,000 nodes.

The drawback of such an exponential increase in the number of components is a disruptive decrease in the reliability of the entire system. Indeed, the Mean Time To Interruption (MTTI), defined as the average duration for which all the components of the platform behave as specified, is inversely proportional to the number of components in the platform. While the MTTI of the largest supercomputers was estimated at close to 4 days in 2009, it dropped sharply to below a single day (19 h 04 min) in 2011, and it is projected to keep falling in the following years (less than 4 h by 2015, less than 2 h by 2018). While technical breakthroughs will hopefully keep the MTTI within a tolerable range, long-running applications need a more flexible execution environment, one that allows them to survive for longer periods of time, either algorithmically or by taking advantage of specialized capabilities of the programming model. Unfortunately, current programming paradigms do not account for such hardware instability; the failure of a single component (be it a memory or computing unit, a hard drive or a network card) has a drastic and lasting effect on the computation. Even without accounting for the implicit energy costs, given the projected MTTI of next-generation computing platforms, such approaches cannot reasonably lead to sustainable software solutions.
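
The inverse proportionality between the MTTI and the number of components can be illustrated with a minimal sketch. It assumes that all components are identical and fail independently with exponentially distributed lifetimes, which is a common simplification and not something stated above. The function name and the 25-year per-node figure are hypothetical, chosen only for illustration; only the node counts are taken from the text.

```python
def platform_mtti_hours(component_mtbf_hours: float, n_components: int) -> float:
    """Mean time to the first failure among n identical, independent components.

    Assumption: each component has an exponentially distributed lifetime with
    mean component_mtbf_hours. The first failure among n such components is
    then exponential with mean component_mtbf_hours / n.
    """
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return component_mtbf_hours / n_components


# Hypothetical per-node mean time between failures: 25 years, in hours.
NODE_MTBF_HOURS = 25 * 365 * 24

# Node counts quoted in the text for Road Runner and Sequoia.
for name, nodes in [("Road Runner", 20_000), ("Sequoia", 100_000)]:
    mtti = platform_mtti_hours(NODE_MTBF_HOURS, nodes)
    print(f"{name}: {nodes} nodes, estimated MTTI {mtti:.2f} h")
```

Under these assumptions, multiplying the number of nodes by five divides the platform MTTI by five, regardless of how reliable each individual node is.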

Overview.

This course will highlight the existing solutions, and present parallel algorithms designed to provide correct solutions despite the presence of failures. We will show how the programming environment can be modified to increase the system reliability and sustainability. The course will start with an introduction to High Performance Programming, its challenges and typical algorithms.

We will then present theoretical as well as practical studies of the many fault-tolerant approaches that have been proposed over the years.

Prerequisite.

The course assumes a general knowledge of algorithms and operating systems, as well as a basic understanding of machine architecture and network systems.

Update: bring your laptop! Some hands-on exercises will be offered during the course. Attendees are expected to bring a laptop that can connect to a remote machine (typically via ssh). Please write to the local contact (Yves Robert) if this is not possible for you; we should be able to lend a few laptops to participants.

Local contact: Yves Robert.

Registration

Registration is free, but the number of participants is limited, and registration includes neither housing nor meals (though attendees will be granted access to the student cafeteria for lunch). Registration should be completed before Monday, December 3, by clicking on this link, then filling in and sending the e-mail form. You will receive a confirmation as soon as possible.

Alternatively, you can copy/paste and fill the following form and send it by e-mail to research.school.1@gmail.com, with subject “Registration form — research school 1”

First Name:

Last Name:

Institution:

Position (MSc student, PhD student, researcher, etc.):

E-mail address:

wishes to attend the research school ‘Fault tolerance for high performance computing’, taking place at ENS Lyon, from Dec. 10 to Dec. 14