ROMA

ROMA working groups — year 2020/2021

How to join:
The working groups take place online using the video-conference tool provided by ENS-Lyon: Next working groups:

April 29th, 2021, 3pm.

Anthony Dugois: Taming Tail Latency in Key-Value Stores: a Scheduling Perspective.

Distributed key-value stores employ replication for high availability. Yet, they do not always efficiently take advantage of the availability of multiple replicas for each value, and read operations often exhibit high tail latencies. Various replica selection strategies have been proposed to address this problem, together with local request scheduling policies. It is difficult, however, to determine what is the absolute performance gain each of these strategies can achieve. We present a formal framework allowing the systematic study of request scheduling strategies in key-value stores. We contribute a definition of the optimization problem related to reducing tail latency in a replicated key-value store as a minimization problem with respect to the maximum weighted flow criterion. By using scheduling theory, we show the difficulty of this problem, and therefore the need to develop performance guarantees. We also study the behavior of heuristic methods using simulations, which highlight which properties are useful for limiting tail latency: for instance, the EFT strategy — which uses the earliest available time of servers — exhibits a tail latency that is less than half that of state-of-the-art strategies, often matching the lower bound. Our study also emphasizes the importance of metrics such as the stretch to properly evaluate replica selection and local execution policies.


Previous working groups:

April 1st, 2021, 3pm.

Jules Bertrand: Algorithms and data structures for hyperedge queries.

We consider the problem of querying the existence of hyperedges in hypergraphs. More formally, we are given a hypergraph, and we need to answer queries of the form “does the following set of vertices form a hyperedge in the given hypergraph?”. Our aim is to set up data structures based on hashing to answer these queries as fast as possible. We propose an adaptation of a well-known perfect hashing approach for the problem at hand. We analyze the space and run time complexity of the proposed approach, and experimentally compare it with the state of the art hashing-based solutions. Experiments demonstrate that the proposed approach has shorter query response time than the other considered alternatives, while having the shortest or the second shortest construction time.


March 18, 2021, 3pm.

Yves Robert: Distributed-memory multi-GPU block-sparse tensor contraction for electronic structure.

Many domains of scientific simulation (chemistry, condensed matter physics, data science) increasingly eschew dense tensors for block-sparse tensors, sometimes with additional structure (recursive hierarchy, rank sparsity, etc.). Distributed-memory parallel computation with block-sparse tensorial data is paramount to minimize the time-tosolution (e.g., to study dynamical problems or for real-time analysis) and to accommodate problems of realistic size that are too large to fit into the host/device memory of a single node equipped with accelerators. Unfortunately, computation with such irregular data structures is a poor match to the dominant imperative, bulk-synchronous parallel programming model. In this paper, we focus on the critical element of block-sparse tensor algebra, namely binary tensor contraction, and report on an efficient and scalable implementation using the task-focused PaRSEC runtime. High performance of the block-sparse tensor contraction on the Summit supercomputer is demonstrated for synthetic data as well as for real data involved in electronic structure simulations of unprecedented size.


March 11, 2021, 3pm.

Maxime Gonthier: Locality-Aware Scheduling of Independant Tasks for Runtime Systems.

A now-classical way of meeting the increasing demand for computing speed by HPC applications is the use of GPUs and/or other accelerators. Such accelerators have their own memory, which is usually quite limited, and are connected to the main memory through a bus with bounded bandwidth. Thus, a particular care should be devoted to data locality in order to avoid unnecessary data movements. Task-based runtime schedulers have emerged as a convenient and efficient way to use such heterogeneous platforms. When processing an application, the scheduler has the knowledge of all tasks available for processing on a GPU, as well as their input data dependencies. Hence, it is able to order tasks and prefetch their input data in the GPU memory (after possibly evicting some previously-loaded data), while aiming at minimizing data movements, so as to reduce the total processing time. In this paper, we focus on how to schedule tasks that share some of their input data (but are otherwise independent) on a GPU. We provide a formal model of the problem, exhibit an optimal eviction strategy, and show that ordering tasks to minimize data movement is NP-complete. We review and adapt existing ordering strategies to this problem, and propose a new one based on task aggregation. These strategies have been implemented in the StarPU runtime system, which allows to test them on a variety of linear algebra problems. Our experiments demonstrate that using our new strategy together with the optimal eviction policy reduces the amount of data movement as well as the total processing time.

March 4, 2021, 3pm.

Redouane Elghazi: Max-stretch minimization on an edge-cloud platform.

We consider the problem of scheduling independent jobs that are generated by processing units at the edge of the network. These jobs can either be executed locally, or sent to a centralized cloud platform that can execute them at greater speed. Such edge-generated jobs may come from various applications, such as e-health, disaster recovery, autonomous vehicles or flying drones. The problem is to decide where and when to schedule each job, with the objective to minimize the maximum stretch incurred by any job. The stretch of a job is the ratio of the time spent by that job in the system, divided by the minimum time it could have taken if the job was alone in the system. We formalize the problem and explain the differences with other models that can be found in the literature. We prove that minimizing the max-stretch is NP-complete, even in the simpler instance with no release dates (all jobs are known in advance). This result comes from the proof that minimizing the max-stretch with homogeneous processors and without release dates is NP-complete, a complexity problem that was left open before this work. We design several algorithms to propose efficient solutions to the general problem, and we conduct simulations based on real platform parameters to evaluate the performance of these algorithms.


February 4, 2021, 3pm.

Valentin Honoré: Scheduling stochastic jobs on reservation-based platforms.

With the expected convergence between HPC, BigData and AI, new applications with different profiles are coming to HPC. Among them, stochastic jobs are jobs whose execution time cannot be determined easily. They arise from the heterogeneous, dynamic and data-intensive requirements of new emerging fields such as neuroscience.
However, the uncertainty of their execution time remains a strong limitation when using supercomputers. Indeed, the user needs to estimate how long his job will have to be executed by the machine, and enters this estimation as his/her first reservation value. But if the job does not complete successfully within this first reservation, the user will have to resubmit the job, this time requiring a longer reservation. In the end, the total cost for the user will be the overall cost of all the reservations that were necessary to achieve the successful completion of the job.
In this talk, I propose an overview of different contributions for scheduling stochastic jobs on reservation-based platforms.
I will present scheduling contributions for stochastic jobs under the form of reservation strategies. A reservation strategy determines a sequence of increasing- length reservations, which are paid for until one of them allows the job to successfully complete. The goal is to minimize the total expected cost of the strategy. We derived strategies including checkpointing at the end of some (well-chosen) reservations, to avoid wasting the benefits of failed reservations.
I will then discuss about the applicability of the strategies presented above. By performing an in-depth profiling of a representative stochastic application, we will show the limitations of the previous approaches and describe adapted strategies that better fit the properties of applications.
Joint work with Guillaume Pallez (Aupy), Brice Goglin, Yves Robert and Ana Gainaru.


January 28, 2021, 3pm.

Suraj Kumar: Parallel Tensor Train through Hierarchical Decomposition.

We consider the problem of developing parallel decomposition and approximation algorithms for high dimensional tensors. We focus on a tensor representation named Tensor Train (TT). It stores a d-dimensional tensor in O(ndr^2), much less than the O(n^d) entries in the original tensor, where 'r' is usually a very small number and depends on the application. Sequential algorithms to compute TT decomposition and TT approximation of a tensor have been proposed in the literature. We propose a parallel algorithm to compute TT decomposition of a tensor. We prove that the ranks of TT-representation produced by our algorithm are bounded by the ranks of unfolding matrices of the tensor. We also propose a parallel algorithm to compute approximation of a tensor in TT-representation. Our algorithm relies on a hierarchical partitioning of the dimensions of the tensor in a balanced binary tree shape and transmission of leading singular values of associated unfolding matrix from the parent to its children. We consider several approaches on the basis of how leading singular values are transmitted in the tree. We present an in-depth experimental analysis of our approaches for different low rank tensors. Our results show that the approach which transmits leading singular values to both of its children performs better in practice.


January 21, 2021, 3pm.

Frédéric Vivien: Resource-Constrained Scheduling of Stochastic Tasks With Unknown Probability Distribution.

This work introduces scheduling strategies to maximize the expected numberof independent tasks that can be executed on a cloud platform within a given budgetand under a deadline constraint. Task execution times are not known before execution;instead, the only information available to the scheduler is that they obey some (unknown) probability distribution. The scheduler needs to acquire some information before decidingfor a cutting threshold: instead of allowing all tasks to run until completion, one maywant to interrupt long-running tasks at some point. In addition, the cutting thresholdmay be reevaluated as new information is acquired when the execution progresses further.This works presents several strategies to determine a good cutting threshold, and to decidewhen to re-evaluate it. In particular, we use the Kaplan-Meier estimator to account fortasks that are still running when making a decision. The efficiency of our strategies isassessed through an extensive set of simulations with various budget and deadline values,and ranging over 14 probability distributions.


January 7, 2021, 3pm.

Lucas Pérotin: Resilient Scheduling of Moldable Jobs on Failure-Prone Platforms.

This paper focuses on the resilient scheduling of moldable parallel jobs on high-performance computing (HPC) platforms. Moldable jobs allow for choosing a processor allocation before execution, and their execution time obeys various speedup models. The objective is to minimize the overall completion time of the jobs, or makespan, assuming that jobs are subject to arbitrary failure scenarios, and hence need to be re-executed each time they fail until successful completion. This work generalizes the classical framework where jobs are known offline and do not fail. We introduce a list-based algorithm, and prove new approximation ratios for three prominent speedup models (roofline, communication, Amdahl). We also introduce a batch-based algorithm, where each job is allowed a restricted number of failures per batch, and prove a new approximation ratio for the arbitrary speedup model. We conduct an extensive set of simulations to evaluate and compare different variants of the two algorithms. The results show that they consistently outperform some baseline heuristics. In particular, the list algorithm performs better for the roofline and communication models, while the batch algorithm has better performance for the Amdahl's model. Overall, our best algorithm is within a factor of 1.47 of a lower bound on average over the whole set of experiments, and within a factor of 1.8 in the worst case.


December 17, 2020, 3pm.

Yishu Du: Robustness of the Young/Daly formula for stochastic iterative applications

The Young/Daly formula for periodic checkpointing is known to hold fora divisible load application where one can checkpoint at any time-step. In an nutshell, the optimal period isPYD=√2μfC where μf is the Mean Time Between Failures (MTBF) and C is the checkpoint time. This paper assesses the accuracy of the formulafor applications decomposed into computational iterations where: (i) the duration of aniteration is stochastic, i.e., obeys a probability distribution law D of mean μD ; and (ii) onecan checkpoint only at the end of an iteration. We first consider static strategies where checkpoints are taken after a given number of iterationskand provide a closed-form,asymptotically optimal, formula fork, valid for any distribution D. We then show that using the Young/Daly formula to compute k (ask·μD=PYD) is a first order approximationof this formula. We also consider dynamic strategies where one decides to checkpoint at the end of an iteration only if the total amount of work since the last checkpoint exceeds a threshold Wth, and otherwise proceed to the next iteration. Similarly, we provide aclosed-form formula for this threshold and show that PYD is a first-order approximation of Wth. Finally, we provide an extensive set of simulations where D is either Uniform,Gamma or truncated Normal, which shows the global accuracy of the Young/Daly formula,even when the distribution D had a large standard deviation (and when one cannot use afirst-order approximation). Hence we establish that the relevance of the formula goes well beyond its original framework

December 10, 2020, 3pm.

Yikin Gao: Energy-aware strategies for reliability-oriented real-time task allocation on heterogeneous platforms

Low energy consumption and high reliability are widely identified as increasingly relevant issues in real-time systems on heterogeneous platforms. In this paper, we propose a multi-criteria optimization strategy to minimize the expected energy consumption while enforcing the reliability threshold and meeting all task deadlines. The tasks are replicated to ensure a prescribed reliability threshold. The platforms are composed of processors with different (and possibly unrelated) characteristics, including speed profile, energy cost and failure rate. We provide several mapping and scheduling heuristics towards this challenging optimization problem. Specifically, a novel approach is designed to control (i) how many replicas to use for each task, (ii) on which processor to map each replica and (iii) when to schedule each replica on its assigned processor. Different mappings achieve different levels of reliability and consume different amounts of energy. Scheduling matters because once a task replica is successful, the other replicas of that task are cancelled, which calls for minimizing the amount of temporal overlap between any replica pair. The experiments are conducted for a comprehensive set of execution scenarios, with a wide range of processor speed profiles and failure rates. The comparison results reveal that our strategies perform better than the random baseline, with a gain of 40% in energy consumption, for nearly all cases. The absolute performance of the heuristics is assessed by a comparison with a lower bound; the best heuristics achieve an excellent performance, with an average value only 4% higher than the lower bound.

November 19, 2020, 3pm.

Grégoire Pichon: Trading Performance for Memory in Sparse Direct Solvers using Low-rank Compression

Sparse direct solvers using Block Low-Rank compression have been proven efficient to solve problems arising in many real-life applications. Improving those solvers is crucial for being able to 1) solve larger problems and 2) speed up computations. A main characteristic of a sparse direct solver using low-rank compression is when compression is performed. There are two distinct approaches: (1) all blocks are compressed before starting the factorization, which reduces the memory as much as possible, or (2) each block is compressed as late as possible, which usually leads to better speedup. The objective of this paper is to design a composite approach, to speedup computations while staying under a given memory limit. This should allow to solve large problems that cannot be solved with Approach 2 while reducing the execution time compared to Approach 1. We propose a memory-aware strategy where each block can be compressed either at the beginning or as late as possible. We first consider the problem of choosing when to compress each block, under the assumption that all information on blocks is perfectly known, i.e., memory requirement and execution time of a block when compressed or not. We show that this problem is a variant of the NP-complete Knapsack problem, and adapt an existing 2-approximation algorithm for our problem. Unfortunately, the required information on blocks depends on numerical properties and in practice cannot be known in advance. We thus introduce models to estimate those values. Experiments on the PaStiX solver demonstrate that our new approach can achieve an excellent trade-off between memory consumption and computational cost. For instance on matrix Geo1438, Approach 2 uses three times as much memory as Approach 1 while being three times faster. Our new approach leads to an execution time only 30% larger than Approach 2 when given a memory 30% larger than the one needed by Approach 1.