Associated-team ALOHA

Team composition

The ALOHA associated-team is between the ROMA team and researchers from the Information and Computer Sciences Department of the University of Hawai'i at Manoa. The researchers involved are:

ROMA team:
- Permanent staff members: Anne Benoit, Loris Marchal, Yves Robert, Bora Uçar, and Frédéric Vivien
- PhD students: Guillaume Aupy and Dounia Zaidouni
University of Hawai'i:
- Permanent staff members: Henri Casanova and Lipyeow Lim
- PhD students: Benjamin Karsin and Tianli Mo

Databases are pervasive in our digital world, providing a foundational technology for countless applications in industry and academia. Database research explores the many remaining questions that must be answered to improve current usage and enable new use cases for database systems. These questions span the whole spectrum from theory to systems, essentially bearing strong connections to most areas of computer science. Indeed, beyond the traditional context of centralized relational databases that support a query language, database researchers are facing the challenges posed by large-scale distributed database systems, multi-media databases, mobile (querying) devices, the need for on-line query streaming, new database formats and new query needs, data security and persistence, with a common concern for achieving high performance, high availability, and low cost. Addressing these new challenges requires novel database problem formulations, as well as novel algorithm and systems approaches.

In this proposed work, our overall objective is to capitalize on fundamental research results achieved over the years by some of the project participants in the area of parallel and distributed computing, and more particularly application scheduling. Building on a vast array of theoretical techniques and expertise developed in this field, we will tackle database questions from a fresh perspective. To this end, this proposal includes:

a group that specializes in database systems research and who has both industrial and academic experience;
a group that specializes in practical aspects of scheduling problems and in simulation for emerging platforms and applications, and who has a long experience of multidisciplinary research;
a group that specializes in the theoretical aspects of scheduling problems and resource management.

Given the breadth of database research as a whole, our work will focus on three distinct thrusts:

On-line, multi-criteria query optimization
Fault-Tolerance for distributed databases
Query scheduling for distributed databases

Our hope, however, is that a large fraction of the obtained research results will be generalizable and applicable to other areas within the context of databases.

Recent activities

Meetings and visits

Workshop organization (July 2015)

Henri Casanova from the University of Hawai'i at Manoa, Ewa Deelman from the University of Southern California - Marina del Rey (USA), , Yves Robert from the ROMA team, and Uwe Schwiegelshohn from TU Dortmund University (Germany) will co-organize the Dagstuhl Seminar 15281 on "Algorithms and Scheduling Techniques to Manage Resilience and Power Consumption in Distributed Systems" which will take place in Wadern, Germany, from July 5 to July 10, 2015.

Fall 2014 visit to the American partner

Frédéric Vivien visited the American partner from October 19 to October 29, 2014.

Workshop organization (July 2014)

Henri Casanova from the University of Hawai'i at Manoa, Yves Robert and Frédéric Vivien from the ROMA team co-organized the 9th Scheduling for Large Scale Systems workshop which took place in Lyon, France, from July 1 to July 4, 2014, and gathered 54 researchers (including all PhD students of the ROMA team).

Winter 2014 visit to the American partner

Yves Robert visited the American partner for one week from March 1 to March8, 2014.

Fall 2013 visit to the American partner

Guillaume Aupy (PhD student) and Anne Benoit (his PhD advisor) will visit the American partner during the second week of December, 2013, Anne Benoit from November 27 to December 2 and Guillaume Aupy from December 5 through December 15.

Workshop organization (September 2013)

Henri Casanova from the University of Hawai'i at Manoa, Yves Robert from the ROMA team, and Uwe Schwiegelshohn from TU Dortmund University (Germany) co-organized the Dagstuhl Seminar 13381 on "Algorithms and Scheduling Techniques for Exascale Systems" which took place in Wadern, Germany, from September 15 to September 20, 2013, and gathered 41 researchers, including two PhD students of the ROMA team and a future PhD student, along with 5 permanent researchers from the ROMA team.

Summer 2013 visit to the American partner

Yves Robert and Frédéric Vivien spent the week from Thursday, June 27 to Friday July 5 visiting the American partner.

Fall 2012 visit to the American partner

Yves Robert and Frédéric Vivien spent the week from November 4 to November 11 visiting the American partner.

Student internship (March-July 2012)

Thomas Lambert, a first-year master student at École normale supérieure de Lyon did an internship in the scope of the associated-team. Thomas spent three months in the ROMA team working under the supervision of Loris Marchal and Bora Uçar, and two months in the CoRG group working under the supervision of Henri Casanova. He worked on Allocation of Memory for Different Classes of DAGs.

Workshop organization (June 2012)

Henri Casanova from the University of Hawai'i at Manoa, Yves Robert and Frédéric Vivien from the ROMA team, and Rami Melhem and Taieb Znati from the University of Pittsburgh co-organized the 7th Scheduling for Large Scale Systems Workshop which took place in Pittsburgh, from June 28 to June 30, 2012, and gathered 25 researchers, including two PhD students of the ROMA team.

Scientific achievements

Scheduling computational workflows on failure-prone platforms

In this work, we study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery is used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and whether to checkpoint or not checkpoint a task after it completes. We give a polynomial-time algorithm for fork graphs and show that the problem is NP-complete with join graphs. Our main result is a polynomial-time algorithm to compute the execution time of a workflow with specified to-be-checkpointed tasks. Using this algorithm as a basis, we propose efficient heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations.

This work was submitted to the IPDPS 2015 conference.

Cost-Optimal Execution of Trees of Boolean Operators with Shared Streams

The processing of queries expressed as trees of boolean operators applied to predicates on sensor data streams has several applications in mobile computing. Sensor data must be retrieved from the sensors to a query processing device, such as a smartphone, over one or more network interfaces. Retrieving a data item incurs a cost, e.g., an energy expense that depletes the smartphone's battery. Since the query tree contains boolean operators, part of the tree can be shortcircuited depending on the retrieved sensor data. An interesting problem is to determine the order in which predicates should be evaluated so as to minimize the expected query processing cost.

This problem has been studied in previous work assuming that each data stream occurs in a single predicate. In this work we remove this assumption since it does not necessarily hold for real-world queries. Our main results are a novel optimal algorithm for single-level trees and a proof of NP-completeness for DNF trees. For DNF trees, however, we show that there is an optimal predicate evaluation order that corresponds to a depth-first traversal. This result provides inspiration for a class of heuristics. We show that one of these heuristics largely outperforms other sensible heuristics, including the one heuristic proposed in previous work for our general version of the query processing problem.

This work was accepted for publication by the IPDPS 2014 conference.

We extended this work to the case where a single predicate can acquire data from several streams, the so-called multi-stream case. In this context, even finding the best schedule for a single-level tree in NP-complete. We proposed a scheduling heuristic for single-level trees that extend the optimal algorithm for the single-stream case. For DNF trees, however, we showed that there is still an optimal predicate evaluation order that corresponds to a depth-first traversal. Once again, we designed heuristics based on that property.

This work was submitted to the journal Artificial Intelligence.

Allocation of Memory for Different Classes of DAGs

We study the complexity of traversing workflows whose tasks require large I/O files. Such workflows arise in many scientific fields, such as image processing, genomics or geophysical simulations. They usually exhibit some regularity, and most of them can be modeled as Series-Parallel Graph. We target a classical two-level memory system, where the main memory is faster but smaller than the secondary memory. A task in the workflow can be processed if all its predecessors have been processed, and if its input and output files fit in the currently available main memory. The amount of available memory at a given time depends upon the ordering in which the tasks are executed. We focus on the problem of minimizing the amount of main memory needed to process the whole DAG.

We first concentrate on the parallel composition of task chains, or fork-join graphs. We adapt an algorithm designed for trees by J. Liu: we prove that an optimal schedule for fork-join can be split in two optimal tree schedules, which are obtained using Liu's algorithm. We then move to Series-Parallel graphs and propose a recursive adaptation of the previous algorithm, which consists in serializing every parallel compositions, starting from the innermost, using the fork-join algorithm. Simulations show that this algorithm always reach the optimal performance, and we provide a sketch of the optimality proof. We also study compositions of complete bipartite graphs, which are another important class of DAGs arising in scientific workflows. We propose an optimal algorithm for a class of compositions which we name tower of complete bipartite graphs.

Combining Process Replication and Checkpointing for Resilience on Exascale Systems

Processor failures in post-petascale settings are common occurrences. The traditional fault-tolerance solution, checkpoint-rollback, severely limits parallel efficiency. One solution is to replicate application processes so that a processor failure does not necessarily imply an application failure. Process replication, combined with checkpoint-rollback, has been recently advocated by Ferreira et al. We first identify an incorrect analogy made in their work between process replication and the birthday problem, and derive correct values for the Mean Number of Failures To Interruption and Mean Time To Interruption for exponentially distributed failures. We then extend these results to arbitrary failure distributions, including closed-form solutions for Weibull distributions. Finally, we evaluate process replication using both synthetic and real-world failure traces. Our main findings are: (i) replication is less beneficial than claimed by Ferreira et al; (ii) although the choice of the checkpointing period can have a high impact on application execution in the no-replication case, with process replication this choice is no longer critical.

This work was submitted to the International Journal of High Performance Computing Applications.

Mapping Tightly-Coupled Applications on Volatile Resources

Platforms that comprise volatile processors, such as desktop grids, have been traditionally used for executing independent-task applications. In this work we study the scheduling of tightly-coupled iterative master-worker applications onto volatile processors. The main challenge is that workers must be simultaneously available for the application to make progress. We consider two additional complications: one should take into account that workers can become temporarily reclaimed and, for data-intensive applications, one should account for the limited bandwidth between the master and the workers. In this context, our first contribution is a theoretical study of the scheduling problem in its off-line version, i.e., when processor availability is known in advance. Even in this case the problem is NP-hard. Our second contribution is an analytical approximation of the expectation of the time needed by a set of workers to complete a set of tasks and of the probability of success of this computation. This approximation relies on a Markovian assumption for the temporal availability of processors. Our third contribution is a set of heuristics, some of which use the above approximation to favor reliable processors in a sensible manner. We evaluate these heuristics in simulation. We identify some heuristics that significantly outperform their competitors and derive heuristic design guidelines.

Part of this work was accepted for publication by the HCW 2013 workshop.

Part of this work was accepted for publication by the PDP 2013 conference.

This work, combined with a previous work targeting independent tasks, was accepted for publication by the International Journal of High Performance Computing Applications.

Using group replication for resilience on exascale systems

High performance computing applications must be resilient to faults, which are common occurrences especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-recovery, by which the application saves its state to secondary storage throughout execution and recovers from the latest saved state in case of a failure. An oft studied research question is that of the optimal checkpointing strategy: when should state be saved? Unfortunately, even using an optimal checkpointing strategy, the checkpointing frequency must increase as platform scale increases, leading to higher checkpointing overhead. This overhead precludes high parallel efficiency for large-scale platforms, thus mandating other more scalable fault-tolerance mechanisms. One such mechanism is replication, which can be used in addition to checkpoint-recovery. Using replication, multiple processors perform the same computation so that a processor failure does not necessarily imply application failure. While at first glance replication may seem wasteful, it may be significantly more efficient than using solely checkpoint-recovery at large scale. In this work we investigate a simple approach where entire application instances are replicated. We provide a theoretical study of checkpoint-recovery with replication in terms of expected application execution time, under an exponential distribution of failures. We design dynamic-programming based algorithms to define checkpointing dates that work under any failure distribution. We also conduct simulation experiments assuming that failures follow Exponential or Weibull distributions, the latter being more representative of real-world systems, and using failure logs from production clusters. Our results show that replication is useful in a variety of realistic application and checkpointing cost scenarios for future exascale platforms.

This work was accepted for publication by the International Journal of High Performance Computing Applications.

Publications

Book chapter

Henri Casanova, Frédéric Vivien, and Dounia Zaidouni. Using replication for resilience on exascale systems, in Fault-tolerant techniques for High-Performance Computing, Springer, 2015, to appear.

Journal articles

Marin Bougeret, Henri Casanova, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. Using group replication for resilience on exascale systems, International Journal of High Performance Computing Applications, 28(2):210-224, May 2014. [DOI] [WWW] [PDF]
Henri Casanova, Fanny Dufossé, Yves Robert, and Frédéric Vivien. Mapping Applications on Volatile Resources, International Journal of High Performance Computing Applications, 2014. To appear. [DOI] [WWW] [PDF]

Conferences

Henri Casanova, Fanny Dufossé, Yves Robert, and Frédéric Vivien. Scheduling Tightly-Coupled Applications on Heterogeneous Desktop Grids. HCW 2013 - 22nd International Heterogeneity in Computing Workshop, Boston, MA, USA, May 20, 2013.

WWW

PDF

Henri Casanova, Fanny Dufossé, Yves Robert, and Frédéric Vivien. Mapping Tightly-Coupled Applications on Volatile Resources. PDP 2013 (21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing), Belfast, Northern Ireland, 27th February - 1st March 2013. [WWW] [PDF]
Henri Casanova, Lipyeow Lim, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. Cost-Optimal Execution of Boolean Query Trees with Shared Streams. IPDPS 2014 - the 28th IEEE International Parallel &ersand; Distributed Processing Symposium, May 2014, Phoenix, United States. [DOI] [WWW] [PDF]

Research reports

Henri Casanova, Lipyeow Lim, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. Cost-Optimal Execution of Boolean DNF Trees with Shared Streams. Research report RR-8616, INRIA, November 2014. [WWW] [PDF]
Guillaume Aupy , Anne Benoit, Henri Casanova, and Yves Robert. Scheduling computational workflows on failure-prone platforms . Research report RR-8609, INRIA, October 2014. [WWW] [PDF]
Henri Casanova, Lipyeow Lim, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. Cost-Optimal Execution of Trees of Boolean Operators with Shared Streams. Research report RR-8373, INRIA, October 2013. [WWW] [PDF]
Marin Bougeret, Henri Casanova, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. Using group replication for resilience on exascale systems. Research report RR-7876, INRIA, February 2012. [WWW] [PDF]
Henri Casanova, Fanny Dufossé, Yves Robert, and Frédéric Vivien. Mapping Tightly-Coupled Applications on Volatile Resources. Research report, May 2012. [WWW] [PDF]
Henri Casanova, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. Combining Process Replication and Checkpointing for Resilience on Exascale Systems. Research report RR-7951, INRIA, May 2012. [WWW] [PDF]

ROMA