The working groups take place online using the video-conference tool provided by ENS-Lyon:

- Connect to https://ent-services.ens-lyon.fr/entVisio
- Select the meeting room: ROMA-WG
- To get the password, drop a mail to suraj.kumar at ens-lyon.fr

Mar 28, 2023, 10:30 AM.

Timothée David--Cléris: SHAMROCK — a general framework for astrophysical hydro simulations targeting multi-GPU architectures using SYCL.

We will present the ongoing development in SHAMROCK, a framework/code designed to handle various hydrodynamical schemes using both Smoothed Particle Hydrodynamics and Adaptive Mesh Refinement. We will first present the tree algorithm: a significant speedup is achieved by using binary arithmetics and morton codes, allowing an almost free recomputation. We will then present our MPI strategy for multi-GPU scheduling.

Previous working groups:

Jan 24, 2023, 10:30 AM.

Theo Mary: Combining mixed precision and low-rank approximations for the solution of sparse linear systems.

Low-rank approximations are a powerful tool that has been shown quite effective to solve large sparse systems of linear equations. At the same time, the emergence of low precision floating-point arithmetics in modern hardware motivates the development of mixed precision algorithms, which combine the speed of low precision arithmetics with the accuracy of high precision ones. In this talk, I will review two different approaches to combine mixed precision and low-rank approximations: a coarse-grain approach based on iterative refinement, and a fine-grain approach based on adaptive precision. If time allows, I will briefly comment on ongoing work combining the two types of approaches and extending these techniques to tensors.

March 24, 2022, 3pm.

Somesh Singh: Investigations on the use of Hashing for Parallel Graph and Hypergraph Processing.

Graphs and hypergraphs are widely used to model a multitude of data science and scientific computing applications. This is so as these two discrete mathematical objects correspond naturally to sparse matrices and tensors. We investigate the use of hashing for implementing operations on graphs and hypergraphs in shared memory systems, with an eye towards linear and multi-linear algebraic operations on matrices and tensors.

March 17, 2022, 3pm.

Bertrand Simon: Learning-Augmented Online Algorithms.

Guaranteed classical online algorithms are often designed in order to minimize the competitive ratio, i.e., the performance in the worst case, so usually do not perform well on easy inputs. An orthogonal approach has been developed in the last decade thanks to the progress in machine learning, which allows to predict key parameters of the instance. Its downside is that no guarantee can be given on the prediction quality, for instance if the training set does not represent the current instance. This statement is at the origin of the recently emerging field of learning-augmented algorithms. In this framework, an online algorithm has access to a predictor, which can give relevant information about the future instance, though without any guarantee on the quality. The objective is then to design an algorithm which performs well when the predictor turns out to be accurate, while being robust to imprecise predictions. This talk will introduce this new domain, present some recent works and discuss current research questions.

March 10, 2022, 3pm.

Frédéric Vivien: Scheduling Strategies for Overloaded Real-Time Systems.

This work introduces and assesses novel strategies to schedule firm real-time jobs on an overloaded server. The jobs are released periodically and have the same relative deadline. Job execution times obey an arbitrary probability distribution and can take unbounded values (no WCET). We introduce three control parameters to decide when to start or interrupt a job. We couple this dynamic scheduling with several admission policies and investigate several optimization criteria, the most prominent being the Deadline Miss Ratio (DMR). Then we derive a Markov model and use its stationary distribution to determine the best value of each control parameter. Finally we conduct an extensive simulation campaign with 14 different probability distributions; the results nicely demonstrate how the new control parameters help improve system performance compared with traditional approaches. In particular, we show that (i) the best admission policy is to admit all jobs; (ii) the key control parameter is to upper bound the start time of each job; (iii) the best scheduling strategy decreases the DMR by up to 0.35 over traditional competitors.

Dec. 7, 2021, 10:30am.

Suraj Kumar: Communication Optimal Algorithms for Multiple Tensor-Times-Matrix Computation.

Multiple Tensor Times Matrix (Multi-TTM) is a key computation in algorithms for computing the Tucker tensor decomposition, which is frequently used in multidimensional data analysis. We establish communication lower bounds that determine how much data movement is required to perform 3-dimensional Multi-TTM computation in parallel. We solve optimization problems with nonlinear constraints to establish the bounds. We also present a parallel algorithm to perform this computation that organizes the processors into a 6-dimensional logical grid. We show that with correct choices of grid dimensions, the communication cost of the algorithm attains the lower bounds and is therefore communication optimal. We also show that, for some instances, our algorithm reduces communication significantly compared to the straightforward approach of expressing the computation as a sequence of matrix multiplication operations.

Nov. 23, 2021, 10am.

Valentin Honoré: SIM-SITU : A Framework for the Faithful Simulation of in-situ Workflows.

The amount of data generated by numerical simulations in various scientific domains such as molecular dynamics, climate modeling, biology, or astrophysics, led to a fundamental redesign of application workflows. The throughput and the capacity of storage subsystems have not evolved as fast as the computing power in extreme-scale supercomputers. As a result, the classical post-hoc analysis of simulation outputs became highly inefficient. In-situ workflows have then emerged as a solution in which simulation and data analytics are intertwined through shared computing resources, thus lower latencies. Determining the best allocation, i.e., how many resources to allocate to each component of an in-situ workflow; and mapping, i.e., where and at which frequency to run the data analytics component, is a complex task whose performance assessment is crucial to the efficient execution of in-situ workflows. However, such a performance evaluation of different allocation and mapping strategies usually relies either on directly running them on the targeted execution environments, which can rapidly become extremely time- and resource-consuming, or on resorting to the simulation of simplified models of the components of an in-situ workflow, which can lack of realism. In both cases, the validity of the performance evaluation is limited. To address this issue, we introduce SIM-SITU , a framework for the faithful simulation of in-situ workflows. This framework builds on the SimGrid toolkit and benefits of several important features of this versatile simulation tool. We designed SIM-SITU to reflect the typical structure of in-situ workflows and thanks to its modular design, SIM-SITU has the necessary flexibility to easily and faithfully evaluate the behavior and performance of various allocation and mapping strategies for in-situ workflows. We illustrate the simulation capabilities of SIM-SITU on a Molecular Dynamics use case. We study the impact different allocation and mapping strategies on performance and show how users can leverage to determine interesting tradeoffs when designing their in-situ workflow.

Oct. 22, 2021, 11am.

Redouane Elghazi: Shelf schedules for independent moldable tasks to minimize the energy consumption.

Scheduling independent tasks on a parallel platform is a widely studied problem, in particular when the goal is to minimize the total execution time, or makespan (P||Cmax problem in Graham's notations). Also, many applications do not consist of sequential tasks, but rather parallel moldable tasks that can decide their degree of parallelism at execution (i.e., on how many processors they are executed). Furthermore, since the energy consumption of data centers is a growing concern, both from an environmental and economical point of view, minimizing the energy consumption of a schedule is a main challenge to be addressed. One can then decide, for each task, on how many processors it is executed, and at which speed the processors are operated, with the goal to minimize the total energy consumption. We further focus on co-schedules, where tasks are partitioned into shelves, and we prove that the problem of minimizing the energy consumption remains NP-complete when static energy is consumed during the whole duration of the application. We are however able to provide an optimal algorithm for the schedule within one shelf, i.e., for a set of tasks that start at the same time. Several approximation results are derived, and simulations are performed to show the performance of the proposed algorithms.

July 1st, 2021, 3pm.

Redouane Elghazi: Update on the Asymptotic Optimality of LPT.

When independent tasks are to be scheduled onto identical processors, the typical goal is to minimize the makespan. A simple and efficient heuristic consists in scheduling the tasks by descending order of processing time, starting the next task by this order every time a processor finishes a task (LPT heuristic). While the performance of LPT has already been largely studied, in particular its asymptotic performance, we revisit results and propose a novel analysis for the case of tasks generated through uniform integer compositions. Also, we perform extensive simulations to empirically assess the asymptotic performance of LPT. Results demonstrate that the absolute error rapidly tends to zero for several distributions of task costs, including ones studied by theoretical models, and realistic distributions coming from benchmarks.

May 3rd, 2021, 3pm.

Kamer Kaya: A Gentle introduction to Blockchain.

April 29th, 2021, 3pm.

Anthony Dugois: Taming Tail Latency in Key-Value Stores: a Scheduling Perspective.

Distributed key-value stores employ replication for high availability. Yet, they do not always efficiently take advantage of the availability of multiple replicas for each value, and read operations often exhibit high tail latencies. Various replica selection strategies have been proposed to address this problem, together with local request scheduling policies. It is difficult, however, to determine what is the absolute performance gain each of these strategies can achieve. We present a formal framework allowing the systematic study of request scheduling strategies in key-value stores. We contribute a definition of the optimization problem related to reducing tail latency in a replicated key-value store as a minimization problem with respect to the maximum weighted flow criterion. By using scheduling theory, we show the difficulty of this problem, and therefore the need to develop performance guarantees. We also study the behavior of heuristic methods using simulations, which highlight which properties are useful for limiting tail latency: for instance, the EFT strategy — which uses the earliest available time of servers — exhibits a tail latency that is less than half that of state-of-the-art strategies, often matching the lower bound. Our study also emphasizes the importance of metrics such as the stretch to properly evaluate replica selection and local execution policies.

April 1st, 2021, 3pm.

Jules Bertrand: Algorithms and data structures for hyperedge queries.

We consider the problem of querying the existence of hyperedges in hypergraphs. More formally, we are given a hypergraph, and we need to answer queries of the form “does the following set of vertices form a hyperedge in the given hypergraph?”. Our aim is to set up data structures based on hashing to answer these queries as fast as possible. We propose an adaptation of a well-known perfect hashing approach for the problem at hand. We analyze the space and run time complexity of the proposed approach, and experimentally compare it with the state of the art hashing-based solutions. Experiments demonstrate that the proposed approach has shorter query response time than the other considered alternatives, while having the shortest or the second shortest construction time.

March 18, 2021, 3pm.

Yves Robert: Distributed-memory multi-GPU block-sparse tensor contraction for electronic structure.

Many domains of scientific simulation (chemistry, condensed matter physics, data science) increasingly eschew dense tensors for block-sparse tensors, sometimes with additional structure (recursive hierarchy, rank sparsity, etc.). Distributed-memory parallel computation with block-sparse tensorial data is paramount to minimize the time-tosolution (e.g., to study dynamical problems or for real-time analysis) and to accommodate problems of realistic size that are too large to fit into the host/device memory of a single node equipped with accelerators. Unfortunately, computation with such irregular data structures is a poor match to the dominant imperative, bulk-synchronous parallel programming model. In this paper, we focus on the critical element of block-sparse tensor algebra, namely binary tensor contraction, and report on an efficient and scalable implementation using the task-focused PaRSEC runtime. High performance of the block-sparse tensor contraction on the Summit supercomputer is demonstrated for synthetic data as well as for real data involved in electronic structure simulations of unprecedented size.

March 11, 2021, 3pm.

Maxime Gonthier: Locality-Aware Scheduling of Independant Tasks for Runtime Systems.

A now-classical way of meeting the increasing demand for computing speed by HPC applications is the use of GPUs and/or other accelerators. Such accelerators have their own memory, which is usually quite limited, and are connected to the main memory through a bus with bounded bandwidth. Thus, a particular care should be devoted to data locality in order to avoid unnecessary data movements. Task-based runtime schedulers have emerged as a convenient and efficient way to use such heterogeneous platforms. When processing an application, the scheduler has the knowledge of all tasks available for processing on a GPU, as well as their input data dependencies. Hence, it is able to order tasks and prefetch their input data in the GPU memory (after possibly evicting some previously-loaded data), while aiming at minimizing data movements, so as to reduce the total processing time. In this paper, we focus on how to schedule tasks that share some of their input data (but are otherwise independent) on a GPU. We provide a formal model of the problem, exhibit an optimal eviction strategy, and show that ordering tasks to minimize data movement is NP-complete. We review and adapt existing ordering strategies to this problem, and propose a new one based on task aggregation. These strategies have been implemented in the StarPU runtime system, which allows to test them on a variety of linear algebra problems. Our experiments demonstrate that using our new strategy together with the optimal eviction policy reduces the amount of data movement as well as the total processing time.

March 4, 2021, 3pm.

Redouane Elghazi: Max-stretch minimization on an edge-cloud platform.

We consider the problem of scheduling independent jobs that are generated by processing units at the edge of the network. These jobs can either be executed locally, or sent to a centralized cloud platform that can execute them at greater speed. Such edge-generated jobs may come from various applications, such as e-health, disaster recovery, autonomous vehicles or flying drones. The problem is to decide where and when to schedule each job, with the objective to minimize the maximum stretch incurred by any job. The stretch of a job is the ratio of the time spent by that job in the system, divided by the minimum time it could have taken if the job was alone in the system. We formalize the problem and explain the differences with other models that can be found in the literature. We prove that minimizing the max-stretch is NP-complete, even in the simpler instance with no release dates (all jobs are known in advance). This result comes from the proof that minimizing the max-stretch with homogeneous processors and without release dates is NP-complete, a complexity problem that was left open before this work. We design several algorithms to propose efficient solutions to the general problem, and we conduct simulations based on real platform parameters to evaluate the performance of these algorithms.

February 4, 2021, 3pm.

Valentin Honoré: Scheduling stochastic jobs on reservation-based platforms.

With the expected convergence between HPC, BigData and AI, new applications with
different profiles are coming to HPC. Among them, stochastic jobs are jobs
whose execution time cannot be determined easily. They arise from the
heterogeneous, dynamic and data-intensive requirements of new emerging fields
such as neuroscience.

However, the uncertainty of their execution time remains a strong limitation
when using supercomputers. Indeed, the user needs to estimate how long his job
will have to be executed by the machine, and enters this estimation as his/her
first reservation value. But if the job does not complete successfully within
this first reservation, the user will have to resubmit the job, this time
requiring a longer reservation. In the end, the total cost for the user will be
the overall cost of all the reservations that were necessary to achieve the
successful completion of the job.

In this talk, I propose an overview of different contributions for scheduling
stochastic jobs on reservation-based platforms.

I will present scheduling contributions for stochastic jobs
under the form of reservation strategies. A reservation strategy determines a
sequence of increasing- length reservations, which are paid for until one of
them allows the job to successfully complete. The goal is to minimize the total
expected cost of the strategy. We derived strategies including checkpointing at
the end of some (well-chosen) reservations, to avoid wasting the benefits of
failed reservations.

I will then discuss about the applicability of the strategies
presented above. By performing an in-depth profiling of a representative
stochastic application, we will show the limitations of the previous approaches
and describe adapted strategies that better fit the properties of applications.

Joint work with Guillaume Pallez (Aupy), Brice Goglin, Yves Robert and Ana Gainaru.

January 28, 2021, 3pm.

Suraj Kumar: Parallel Tensor Train through Hierarchical Decomposition.

We consider the problem of developing parallel decomposition and approximation algorithms for high dimensional tensors. We focus on a tensor representation named Tensor Train (TT). It stores a d-dimensional tensor in O(ndr^2), much less than the O(n^d) entries in the original tensor, where 'r' is usually a very small number and depends on the application. Sequential algorithms to compute TT decomposition and TT approximation of a tensor have been proposed in the literature. We propose a parallel algorithm to compute TT decomposition of a tensor. We prove that the ranks of TT-representation produced by our algorithm are bounded by the ranks of unfolding matrices of the tensor. We also propose a parallel algorithm to compute approximation of a tensor in TT-representation. Our algorithm relies on a hierarchical partitioning of the dimensions of the tensor in a balanced binary tree shape and transmission of leading singular values of associated unfolding matrix from the parent to its children. We consider several approaches on the basis of how leading singular values are transmitted in the tree. We present an in-depth experimental analysis of our approaches for different low rank tensors. Our results show that the approach which transmits leading singular values to both of its children performs better in practice.

January 21, 2021, 3pm.

Frédéric Vivien: Resource-Constrained Scheduling of Stochastic Tasks With Unknown Probability Distribution.

This work introduces scheduling strategies to maximize the expected numberof independent tasks that can be executed on a cloud platform within a given budgetand under a deadline constraint. Task execution times are not known before execution;instead, the only information available to the scheduler is that they obey some (unknown) probability distribution. The scheduler needs to acquire some information before decidingfor a cutting threshold: instead of allowing all tasks to run until completion, one maywant to interrupt long-running tasks at some point. In addition, the cutting thresholdmay be reevaluated as new information is acquired when the execution progresses further.This works presents several strategies to determine a good cutting threshold, and to decidewhen to re-evaluate it. In particular, we use the Kaplan-Meier estimator to account fortasks that are still running when making a decision. The efficiency of our strategies isassessed through an extensive set of simulations with various budget and deadline values,and ranging over 14 probability distributions.

January 7, 2021, 3pm.

Lucas Pérotin: Resilient Scheduling of Moldable Jobs on Failure-Prone Platforms.

This paper focuses on the resilient scheduling of moldable parallel jobs on high-performance computing (HPC) platforms. Moldable jobs allow for choosing a processor allocation before execution, and their execution time obeys various speedup models. The objective is to minimize the overall completion time of the jobs, or makespan, assuming that jobs are subject to arbitrary failure scenarios, and hence need to be re-executed each time they fail until successful completion. This work generalizes the classical framework where jobs are known offline and do not fail. We introduce a list-based algorithm, and prove new approximation ratios for three prominent speedup models (roofline, communication, Amdahl). We also introduce a batch-based algorithm, where each job is allowed a restricted number of failures per batch, and prove a new approximation ratio for the arbitrary speedup model. We conduct an extensive set of simulations to evaluate and compare different variants of the two algorithms. The results show that they consistently outperform some baseline heuristics. In particular, the list algorithm performs better for the roofline and communication models, while the batch algorithm has better performance for the Amdahl's model. Overall, our best algorithm is within a factor of 1.47 of a lower bound on average over the whole set of experiments, and within a factor of 1.8 in the worst case.

December 17, 2020, 3pm.

Yishu Du: Robustness of the Young/Daly formula for stochastic iterative applications

The Young/Daly formula for periodic checkpointing is known to hold fora divisible load application where one can checkpoint at any time-step. In an nutshell, the optimal period isPYD=√2μfC where μf is the Mean Time Between Failures (MTBF) and C is the checkpoint time. This paper assesses the accuracy of the formulafor applications decomposed into computational iterations where: (i) the duration of aniteration is stochastic, i.e., obeys a probability distribution law D of mean μD ; and (ii) onecan checkpoint only at the end of an iteration. We first consider static strategies where checkpoints are taken after a given number of iterationskand provide a closed-form,asymptotically optimal, formula fork, valid for any distribution D. We then show that using the Young/Daly formula to compute k (ask·μD=PYD) is a first order approximationof this formula. We also consider dynamic strategies where one decides to checkpoint at the end of an iteration only if the total amount of work since the last checkpoint exceeds a threshold Wth, and otherwise proceed to the next iteration. Similarly, we provide aclosed-form formula for this threshold and show that PYD is a first-order approximation of Wth. Finally, we provide an extensive set of simulations where D is either Uniform,Gamma or truncated Normal, which shows the global accuracy of the Young/Daly formula,even when the distribution D had a large standard deviation (and when one cannot use afirst-order approximation). Hence we establish that the relevance of the formula goes well beyond its original framework

December 10, 2020, 3pm.

Yikin Gao: Energy-aware strategies for reliability-oriented real-time task allocation on heterogeneous platforms

Low energy consumption and high reliability are widely identified as increasingly relevant issues in real-time systems on heterogeneous platforms. In this paper, we propose a multi-criteria optimization strategy to minimize the expected energy consumption while enforcing the reliability threshold and meeting all task deadlines. The tasks are replicated to ensure a prescribed reliability threshold. The platforms are composed of processors with different (and possibly unrelated) characteristics, including speed profile, energy cost and failure rate. We provide several mapping and scheduling heuristics towards this challenging optimization problem. Specifically, a novel approach is designed to control (i) how many replicas to use for each task, (ii) on which processor to map each replica and (iii) when to schedule each replica on its assigned processor. Different mappings achieve different levels of reliability and consume different amounts of energy. Scheduling matters because once a task replica is successful, the other replicas of that task are cancelled, which calls for minimizing the amount of temporal overlap between any replica pair. The experiments are conducted for a comprehensive set of execution scenarios, with a wide range of processor speed profiles and failure rates. The comparison results reveal that our strategies perform better than the random baseline, with a gain of 40% in energy consumption, for nearly all cases. The absolute performance of the heuristics is assessed by a comparison with a lower bound; the best heuristics achieve an excellent performance, with an average value only 4% higher than the lower bound.

November 19, 2020, 3pm.

Grégoire Pichon: Trading Performance for Memory in Sparse Direct Solvers using Low-rank Compression

Sparse direct solvers using Block Low-Rank compression have been proven efficient to solve problems arising in many real-life applications. Improving those solvers is crucial for being able to 1) solve larger problems and 2) speed up computations. A main characteristic of a sparse direct solver using low-rank compression is when compression is performed. There are two distinct approaches: (1) all blocks are compressed before starting the factorization, which reduces the memory as much as possible, or (2) each block is compressed as late as possible, which usually leads to better speedup. The objective of this paper is to design a composite approach, to speedup computations while staying under a given memory limit. This should allow to solve large problems that cannot be solved with Approach 2 while reducing the execution time compared to Approach 1. We propose a memory-aware strategy where each block can be compressed either at the beginning or as late as possible. We first consider the problem of choosing when to compress each block, under the assumption that all information on blocks is perfectly known, i.e., memory requirement and execution time of a block when compressed or not. We show that this problem is a variant of the NP-complete Knapsack problem, and adapt an existing 2-approximation algorithm for our problem. Unfortunately, the required information on blocks depends on numerical properties and in practice cannot be known in advance. We thus introduce models to estimate those values. Experiments on the PaStiX solver demonstrate that our new approach can achieve an excellent trade-off between memory consumption and computational cost. For instance on matrix Geo1438, Approach 2 uses three times as much memory as Approach 1 while being three times faster. Our new approach leads to an execution time only 30% larger than Approach 2 when given a memory 30% larger than the one needed by Approach 1.