Mapping applications onto parallel platforms is a challenging problem, that becomes even more difficult when platforms are heterogeneous --nowadays a standard assumption. A high-level approach to parallel programming not only eases the application developer's task, but it also provides additional information which can help realize an efficient mapping of the application. In this paper, we discuss the mapping of pipeline skeletons onto different types of platforms: fully homogeneous platforms with identical processors and interconnection links; communication homogeneous platforms, with identical links but different speed processors; and finally, heterogeneous platforms. We assume that a pipeline stage must be mapped on a single processor, and we establish new theoretical complexity results for two different mapping policies: a mapping can be either one-to-one (a processor is assigned at most one stage), or interval-based (a processor is assigned an interval of consecutive stages). We provide several efficient polynomial heuristics for the most important policy/platform combination, namely interval-based mappings on communication homogeneous platforms.
The memory usage of sparse direct solvers can be the bottleneck to solve large-scale problems involving sparse systems of linear equations of the form A x = b. This report describes a prototype implementation of an out-of-core extension to a parallel multifrontal solver (MUMPS), where disk is used to store data that cannot fit in memory. We show that, by storing the factors to disk, larger problems can be solved on limited-memory machines with reasonable performance. We illustrate the impact of low-level IO mechanisms on the behaviour of our parallel out-of-core factorization. Then we use simulations to analyze the gains that can be expected when also storing the so called active memory on disk. We discuss both the minimum memory requirements and the minimum volume of I/O in a limited memory environment. Finally we summarize the main points that we identified to be critical when designing parallel sparse direct solvers in an out-of-core environment.
Keywords:
Sparse direct solver; large matrices; multifrontal method; out-of-core; parallel factorization; IO volume; direct IO; performance study.
Comments on ``Design and performance evaluation of load distribution strategies for multiple loads on heterogeneous linear daisy chain networks''.
By: Matthieu Gallet, Yves Robert, Frédéric Vivien
Number: RR2007-07
Date: February 2007
Abstract:
Min, Veeravalli, and Barlas proposed strategies to minimize the overall execution time of one or several divisible loads on a heterogeneous linear network, using one or more installments. We show on a very simple example that the proposed approach does not always produce a solution and that, when it does, the solution is often suboptimal. We also show how to find an optimal scheduling for any instance, once the number of installments per load is given. Finally, we formally prove that under a linear cost model, as in the original paper, an optimal schedule has an infinite number of installments. Such a cost model can therefore not be used to design practical multi-installment strategies.
A First Step Towards Automatically Building Network Representations.
By: Lionel Eyraud-Dubois, Arnaud Legrand, Martin Quinson, and Frédéric Vivien
Number: RR2007-08
Date: February 2007
Abstract:
To fully harness Grids, users or middlewares must have some knowledge on the topology of the platform interconnection network. As such knowledge is usually not available, one must uses tools which automatically build a topological network model through some measurements. In this article, we define a methodology to assess the quality of these network model building tools, and we apply this methodology to representatives of the main classes of model builders and to two new algorithms. We show that none of the main existing techniques build models that enable to accurately predict the running time of simple application kernels for actual platforms. However some of the new algorithms we propose give excellent results in a wide range of situations.
Optimal Closest Policy with QoS and Bandwidth Constraints for Placing Replicas in Tree Networks.
By: Veronika Rehn
Number: RR2007-10
Date: March 2007
Abstract:
This paper deals with the replica placement problem on fully homogeneous tree networks known as the Replica Placement optimization problem. The client requests are known beforehand, while the number and location of the servers are to be determined. We investigate the latter problem using the Closest access policy when adding QoS and bandwidth constraints. We propose an optimal algorithm in two passes using dynamic programming.
Keywords:
Replica placement, tree networks, Closest policy, quality of service, bandwidth constraints.
Large problems ranging from numerical simulation can now be solved through the Internet using grid middleware. This report describes the different steps involved to make available a service in the \diet grid middleware. The cosmological \ramses application is taken as an example to detail the implementation. Furthermore, several results are given in order to show the benefits of using \diet, among which the transparent usage of numerous clusters and a low overhead (finding the right resource and submitting the computing task).
Complexity results for throughput and latency optimization of replicated and data-parallel workflows.
By: Anne Benoit, Yves Robert
Number: RR2007-12
Date: March 2007
Abstract:
Mapping applications onto parallel platforms is a challenging problem, even for simple application patterns such as pipeline or fork graphs. Several antagonist criteria should be optimized for workflow applications, such as throughput and latency (or a combination). In this paper, we consider a simplified model with no communication cost, and we provide an exhaustive list of complexity results for different problem instances. Pipeline or fork stages can be replicated in order to increase the throughput of the workflow, by sending consecutive data sets onto different processors. In some cases, stages can also be data-parallelized, i.e. the computation of one single data set is shared between several processors. This leads to a decrease of the latency and an increase of the throughput. Some instances of this simple model are shown to be NP-hard, thereby exposing the inherent complexity of the mapping problem. We provide polynomial algorithms for other problem instances. Altogether, we provide solid theoretical foundations for the study of mono-criterion or bi-criteria mapping optimization problems.
Arithmetic Operators for Pairing-Based Cryptography.
By: Jean-Luc Beuchat, Nicolas Brisebarre, Jérémie Detrey, and
Eiji Okamoto
Number: RR2007-13
Date: March 2007
Abstract:
Since their introduction in constructive
cryptographic applications, pairings over (hyper)elliptic curves are at
the heart of an ever increasing number of protocols. Software implementations
being rather slow, the study of hardware architectures became an active
research area. In this paper, we first study an accelerator for the
ηT pairing over
F3[x]/(x97 + x12 + 2).
Our architecture is based on a unified
arithmetic operator which performs addition, multiplication, and cubing over
F397. This design methodology allows us to
design a compact coprocessor (1888 slices on a Virtex-II Pro 4 FPGA) which
compares favorably with other solutions described in the open literature. We
then describe ways to extend our approach to any characteristic and any
extension field.
Keywords:
pairing, finite field
arithmetic, elliptic curve, hardware accelerator, FPGA.
Acyclicity and Finite Linear Extendability: a Formal and Constructive Equivalence.
By: Stéphane Le Roux
Number: RR2007-14
Date: March 2007
Abstract:
Linear extension of partial orders emerged in the late 1920's. Its computer-oriented version, \emph{i.e.}, topological sorting of finite partial orders, arose in the late 1950's. However, those issues have not yet been considered from a viewpoint that is both formal and constructive; this paper discusses a few related claims formally proved with the constructive proof assistant Coq. For instance, it states that a given decidable binary relation is acyclic and equality is decidable on its domain \emph{iff} an irreflexive linear extension can be computed uniformly for any of its finite restriction. A detailed introduction and proofs written in plain English shall help readers who are not familiar with constructive issues or Coq formalism.
Comparison and tuning of MPI implementations in a grid context.
By: Ludovic Hablot
Number: RR2007-21
Date: May 2007
Abstract:
Today, clusters are often interconnected by long distance networks
within grids to offer a huge number of available ressources to a range of users.
MPI, the standard communication library used to write parallel applications,
has been implemented for clusters. Two main features of grids: long distance
networks and technological heterogeneity, raise the question of MPI efficiency
in grids.
This report presents an evaluation of four recent MPI
implementations (MPICH2, MPICH-Madeleine, OpenMPI and GridMPI) in the french
research grid: Grid'5000. The comparison is based on the execution of pingpong,
NAS Parallel Benchmarks and a real application in geophysics. We show that this
implementations present performance differences. Executing MPI applications on
the grid can be beneficial if the parameters are well tuned. The paper details
the tuning required on each implementation to get the best performances.
High performance sparse direct solvers are often a method of choice in various simulation problems. However, they require a large amount of memory compared to iterative methods. In this context, out-of-core solvers must be employed, where disks are used when the storage requirements are too large with respect to the physical memory available. In this paper, we study how to minimize the I/O requirements in the multifrontal method, a particular direct method to solve large-scale problems efficiently. From a theoretical point of view, we show that minimizing the storage requirement can lead to a huge volume of I/O compared to directly minimizing the I/O volume. Then experiments on large real-life problems also show that the volume of I/O obtained when minimizing the storage requirement can be significantly reduced by applying algorithms designed to reduce the I/O volume. We finally propose efficient memory management algorithms that can be applied to all the variants proposed.
Keywords:
Sparse direct solver; out-of-core; large matrices; multifrontal method; IO volume minimization; memory management schemes.
In Grids, transfer jobs and network resources need to be managed in amore deterministic way than in the Internet. New approaches like flow scheduling are proposed and studied as alternatives to traditional QoS and reservation proposals. To enable such flow scheduling approaches, run-time mechanisms controlling flow sending time and rate have to be implemented in the data plane. This paper quantify and compares in a range of latency conditions, such end-host based mechanisms combined with transport protocols to instantiate different scheduling strategies. We show that, in high speed network, a single-rate scheduling strategy implemented by an AIMD-based protocol with packet pacing mechanism offers predictable performanceand is insensitive to latency. This paper also highlights the limits of other strategies and rate limitation mechanisms like token bucket which may present unpredictability and other drawbacks.
Keywords:
Grid networks, flow scheduling, bulk data transfers, rate limitation, pacing.
A certified infinite norm for the implementation of elementary functions.
By:Sylvain Chevillard, Christoph Quirin Lauter
Number: RR2007-26
Date: May 2007
Abstract:
The high-quality floating-point implementation of useful functions f : R -> R, such as exp, sin, erf requires bounding the error eps = (p-f)/f of an approximation p with regard to the function f. This involves bounding the infinite norm ||eps|| of the error function. Its value must not be underestimated when implementations must be safe.
Previous approaches for computing infinite norm are shown to be either unsafe, not sufficiently tight or too tedious in manual work.
We present a safe and self-validating algorithm for automatically upper- and lower-bounding infinite norms of error functions. The algorithm is based on enhanced interval arithmetic. It can overcome high cancellation and high condition number around points where the error function is defined only by continuous extension.
The given algorithm is implemented in a software tool. It can generate a proof of correctness for each instance on which it is run.
Array-OL is a high-level specification language dedicated to
the definition of intensive signal processing applications. Several tools
exist for implementing an Array-OL specification as a data parallel program.
While Array-OL can be used directly, it is often convenient to be able
to deduce part of the specification from a sequential version of the
application. This paper proposes such an analysis and examines its
feasibility and its limits.
Keywords:
Array-OL, multidimensional signal processing, program analysis.
Scheduling multiple divisible loads on a linear processor network.
By:Matthieu Gallet, Yves Robert, Frédéric Vivien
Number: RR2007-29
Date: June 2007
Abstract:
Min, Veeravalli, and Barlas have
recently proposed strategies to minimize the overall execution time of
one or several divisible loads on a heterogeneous linear network,
using one or more installments. We show on a very simple example that
their approach does not always produce a solution and that, when it
does, the solution is often suboptimal. We also show how to find an
optimal schedule for any instance, once the number of installments per
load is given. Then, we formally state that any optimal schedule has
an infinite number of installments under a linear cost model as the
one assumed in the original papers. Therefore, such a cost model
cannot be used to design practical multi-installment
strategies. Finally, through extensive simulations we confirmed that
the best solution is always produced by the linear programming
approach, while solutions of the original papers can be far away from
the optimal..
Mapping workflow applications onto parallel platforms is a challenging problem, even for simple application patterns such as pipeline graphs. Several antagonist criteria should be optimized, such as throughput and latency (or a combination). In this paper, we study the complexity of the bi-criteria mapping problem for pipeline graphs on communication homogeneous platforms. In particular, we assess the complexity of the well-known chains-to-chains problem for different-speed processors, which turns out to be NP-hard. We provide several efficient polynomial bi-criteria heuristics, and their relative performance is evaluated through extensive simulations.
Standard TCP faces some performance limitations in very high speed wide area networks, mainly due to a long end-to-end feedback loop and a conservative behaviour with respect to congestion. Many TCP variants have been proposed to overcome these limitations. However, TCP is a complex protocol with many user-configurable parameters and a range of different implementations. It is then important to define measurement methods so that the transport services and protocols can evolve guided by scientific principles and compared quantitatively. The goal of this report is to present some steps towards a user-oriented benchmark, called ITB, for high speed transport protocols comparison. We first present and analyse some results reported in the literature. From this study we identify classes of representative applications and useful metrics. We then isolate infrastructure parameters and traffic factors which influence the protocol behaviour. This enable us to define scenario capturing and synthesising comprehensive and useful properties. We finally illustrate this proposal by preliminary results obtained on our experimental environment, Grid'5000, we have built and are using for contributing in this benchmark design.
Keywords:
Protocol Benchmark, TCP, Performance evaluation, High Speed transport, High Speed networks.
In high performance distributed computing applications, data movements have demanding performance requirements such as reliable and predictable delivery. Predicting the throughput of large transfers is very difficult in paths that are heavily loaded with just a few big flows. This report explores how current high speed transport protocols behave and may improve transfer time predictability of gigabits of data among endpoints in a range of conditions. In a fully controlled long distance 10~Gbps network testbed, we compare several TCP variants behaviour in presence of diverse congestion level and reverse traffic situations. We show that these factors have a very strong impact on transfer time predictability of several transport protocols.
Keywords:
Bulk Data Transfers, Bandwidth Sharing, Transfer Delay Predictability, Transport Protocol Experimentation.