Due to the increasing number of nodes in supercomputers, scientific applications are frequently interrupted by failures. Ensuring correct and uninterrupted execution requires fault-tolerance mechanisms, the most common being Checkpoint/Rollback. A checkpoint of the application is taken periodically; that is, the state of the application is written to reliable storage. Whenever one of the computing resources experiences a failure, the application pauses and restarts from the last valid checkpoint. The optimal checkpointing period can be computed by the well-known Young/Daly formula, which applies to applications where a checkpoint can be taken at any time during the computation. However, many scientific applications exhibit more complicated behavior. For example, for iterative applications, checkpointing at the end of an iteration is efficient because the volume of data to checkpoint is dramatically reduced at that point. In addition, the batch scheduler, a key component of the supercomputing infrastructure, is also affected by failures.
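As a concrete illustration, the first-order Young/Daly formula gives the optimal checkpointing period as W = sqrt(2 * MTBF * C), where MTBF is the platform's mean time between failures and C is the cost of taking one checkpoint. The sketch below computes this period; the numerical values are hypothetical, chosen only for illustration.

```python
from math import sqrt

def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """First-order Young/Daly period W = sqrt(2 * mu * C), where mu is the
    platform MTBF and C the time to write one checkpoint (same time unit)."""
    return sqrt(2.0 * mtbf * checkpoint_cost)

# Hypothetical platform: MTBF of 24 hours (86400 s), checkpoint cost of 60 s.
period = young_daly_period(86400.0, 60.0)  # roughly 3220 s, i.e. ~54 minutes
```

With these (made-up) parameters, the application would checkpoint roughly every 54 minutes; a shorter MTBF or a cheaper checkpoint both shrink the period, as the formula suggests.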
This thesis designs fault-tolerant algorithms that minimize the makespan of iterative applications, and scheduling heuristics that improve the performance of batch-scheduled platforms.
The first two works focus on iterative applications, but with different application models. In the first work, we model stochastic iterative applications as a linear chain of tasks whose execution times are not constant but follow probability distributions. We propose a static and a dynamic checkpointing strategy to minimize the makespan, and show that the Young/Daly formula can be safely applied to stochastic iterative applications. The second work models deterministic iterative applications as a chain of cyclic tasks. We propose an optimal checkpointing strategy that can be computed in polynomial time, and show that the strategy extended from the Young/Daly formula is suboptimal.
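When checkpoints may only be taken at iteration boundaries, the natural Young/Daly-based heuristic is to round the period to a whole number of iterations. The sketch below shows this baseline strategy (the one the second work proves suboptimal); the function name and parameter values are hypothetical, for illustration only.

```python
from math import sqrt

def checkpoint_every_k_iterations(mtbf: float, checkpoint_cost: float,
                                  iter_time: float) -> int:
    """Round the Young/Daly period to a whole number of iterations,
    since checkpoints can only be taken at the end of an iteration."""
    w = sqrt(2.0 * mtbf * checkpoint_cost)   # continuous Young/Daly period
    return max(1, round(w / iter_time))      # at least one iteration apart

# Hypothetical values: MTBF 86400 s, checkpoint cost 60 s, iteration 500 s.
k = checkpoint_every_k_iterations(86400.0, 60.0, 500.0)  # checkpoint every k-th iteration
```

With these made-up parameters, the continuous period of about 3220 s corresponds to checkpointing every 6 iterations. The rounding is exactly where this heuristic can lose optimality, which motivates the polynomial-time strategy proposed in the second work.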
The final work concerns batch schedulers. With traditional scheduling heuristics, a failed job must wait until enough nodes become available for its re-execution. We propose a novel scheduling heuristic that schedules failed jobs efficiently. Simulations using traces from the Mira supercomputer at Argonne National Laboratory show that the new approach improves the utilization of the platform and dramatically reduces the flow of large jobs, at the price of a slight increase in the flow of small jobs.