A simple test qualifying the accuracy of Horner's rule for polynomials.
By: Marc Daumas, Sylvie Boldo
Number: RR2003-01
Date: January 2003
Abstract:
Polynomials are used in many applications and hidden
in libraries such as libm. Whereas the accuracy of
the functions used by linear algebra have long been
studied, little is available to decide on one scheme
to evaluate a polynomial. Common knowledge solely
emphasizes that Horner's rule is a good scheme unless
the indeterminate is close to one of the polynomial's
roots. We propose here a criterion for one step of
Horner's scheme to be faithful. A result is defined
to be faithful when it was correctly rounded whereas
the rounding mode (up, down or to the nearest) cannot
be known by the user. Our criterion is checked
against the IEEE standard for floating point
arithmetic using the Coq automatic proof checker. We
then present three programs in Maple, Java and C that
check the criterion for a whole polynomial associated
with a domain for the indeterminate and a possible
truncation error. An example of use is given with
the approximation of elementary functions.
Keywords:
Floating Point, IEEE 754 Standard, Horner's Rule, Formal Proof, Coq.
We study the link between the complexity of polynomial matrix multiplication
and the complexity of solving other basic linear algebra problems on polynomial matrices.
By polynomial matrices we mean n x n matrices of degree d over K[x]
where K is a commutative field. Under the straight-line program model
we show that multiplication is reducible to the problem of computing
the coefficient of degree d of the determinant.
Conversely, we propose algorithms
for minimal approximant computation and
column reduction that are based on polynomial matrix multiplication;
for the determinant, the straight-line program we give also relies on matrix product over K[x]
and provides an alternative to Storjohann's determinant algorithm.
We further show that all these problems can be solved in particular in
O~(n^w d) operations in K.
Here the "soft Oh'' notation O~ indicates some missing
log(nd) factors and w is the exponent of matrix multiplication overK.
A Hierarchical Resource Reservation Algorithm for Network Enabled Servers.
By: Eddy Caron, Frédéric Desprez, Franck Petit and Vincent Villain
Number: RR2003-03
Date: January 2003
Abstract:
This paper presents the application of the PIF algorithm to a
Network Enabled Server environment. Hierarchical scheduling is
applied to improve the scalability of the overall architecture and
fault tolerance problems are addressed using timers. The
simulation shows that gains can be obtained using such a platform
over single scheduler approaches.
A lot of progress has been made in tiling
theory in the last ten years after
Thurston (\cite{Thu90}), building on previous work by Conway and Lagarias
(\cite{CL90}), introduced height functions as a tool to encode and study
tilings.
This allowed the authors of this paper, in previous work (\cite{Rem99},
\cite{Des01}), to prove that the set of lozenge (or domino) tilings of
a hole-free, general-shape domain in the plane can be endowed with a
distributive lattice structure.
In this paper, we see that this structure allows us in turn to construct an
algorithm that is optimal with respect to both space and execution time to
generate all the tilings of a domain~$D$. We first recall some results about
tilings and then we describe the algorithm.
Group communications (multicast) are foreseen to be one of
the most critical yet challenging technologies to meet the
exponentially growing demands for data distribution in a large variety
of applications of the Internet (grid computing, web applications,
distributed simulations\ldots). When reliability is required, there is
no straightforward solutions and meeting the objectives of reliable
multicast is not an easy task. Active networks open a new perspective
in providing more efficient solutions for the problem of
reliability. In this context, routers are able to perform customized
computations on the packets flowing through them. In this report, we
propose a receiver-based (replier) local recovery multicast protocol
with dynamic repliers elected on a per-packet basis. Designed to
provide an efficient reliable multicast service without any cache
facilities inside the network, our approach, uses low-overhead active
services in routers. The current report addresses the design,
evaluation and the implementation of an efficient and scalable
reliable multicast protocol noted DyRAM standin for Dynamic Replier
Active reliable Multicast.
Keywords:
Active networks, Reliable multicast, Congestion avoidance, Evaluation, Analysis, Simulation, Implementation.
An Analysis of a Router-based Loss Detection Service for Active Reliable Multicast Protocols.
By: Moufida Maimour, Cong-Duc Pham
Number: RR2003-06
Date: January 2003
Abstract:
Group communications (multicast) are foreseen to be one of the most
critical yet challenging technologies to meet the exponentially
growing demands for data distribution in a large variety of
applications of the Internet (grid computing, video-conferencing, web
applications, distributed simulations\ldots). When reliability is
required, there is no straightforward solutions and meeting the
objectives of reliable multicast is not an easy task. Active networks
open a new perspective in providing more efficient solutions for the
problem of reliability. In this context, routers are able to perform
customized computations (services) on the packets flowing through
them. In this paper, we propose a new service consisting in an early
loss detection service to be deployed into routers. We also show how
the loss detection service can improve the performances (in term of
the recovery delays) of an active reliable multicast protocol such as
DyRAM making it more suitable for applications requiring low
latencies.
Keywords:
Active networks, Reliable multicast, Analysis, Simulation.
AMCA: an Active-based Multicast Congestion Avoidance
Algorithm.
By: Moufida Maimour, Cong-Duc Pham
Number: RR2003-07
Date: January 2003
Abstract:
Many works have recently addressed the issue of congestion
control for multicast communications and the problem is known to be
highly complex. Scalability, responsiveness, stability and fairness
with TCP are some of the required properties. In this report, we
present a congestion avoidance scheme for bulk data distribution
called AMCA (Active-based Multicast Congestion Avoidance algorithm)
that tries to meet these properties. We use the active networking
technology to perform on a per-section dialogue to probe for available
bandwidth along a multicast tree. The solution uses the RTT variations
experienced by every branch to estimate the congestion situation in
the multicast tree. The physical multicast tree is also used to
appropriately aggregate the RTT variations at intermediate nodes
before they reach the source. Simulations show that AMCA converges,
makes use of the available bandwidth and reacts rapidly to dynamic
changes while being TCP-fair. The results also show that losses are
very few demonstrating the conservative property of AMCA which reacts
to congestion before a loss occurs.
Keywords:
Active Networks, Reliable Multicast, Congestion Avoidance.
Analysis and Improvments of the Memory Usage of a Multifrontal Solver.
By: Abdou Guermouche, Jean-Yves L'Excellent, and Gil Utard
Number: RR2003-08
Date: February 2003
Abstract:
We are concerned with the memory usage of sparse direct solvers.
We particularly focus on the influence of state-of-the-art sparse
matrix reordering techniques on the dynamic memory usage of a
multifrontal solver, MUMPS, and present algorithms to modify the
assembly tree traversal that aim at reducing the memory usage. A
large number of experiments show the interest of the approach for
sequential executions.
By: Arnaud Legrand, Frédéric Mazoit, Martin Quinson
Number: RR2003-09
Date: February 2003
Abstract:
This paper presents a tool to automatically discover the network topology. The goal is to evaluate the performance of concurrent transfers (for example to improve collective
communications) and not to discover the physical machines interconnection scheme (for administration purposes). The problems encountered, preliminary algorithms to solve them, as
well as theoretical proof of their validity (under some conditions) are presented.
Keywords:
Grid Computing, Simulation, Network Topology, Communication Performance Prediction.
We adapt the radix-$r$ digit-recurrence division algorithm to complex division. By prescaling the operands, we make the selection of quotient digits simple. This leads to a simple hardware implementation, and allows correct rounding of complex quotient. To reduce large prescaling tables required for radices greater than $4$, we adapt the bipartite-table method to multiple-operand functions.
Taylor models and floating-point arithmetic: proof that arithmetic operations are validated in COSY.
By: Nathalie Revol, Kyoko Makino, Martin Berz
Number: RR2003-11
Date: February 2003
Abstract:
The goal of this paper is to prove that the implementation of Taylor
models in COSY, based on floating-point arithmetic, computes results
satisfying the ``containment property'', i.e. guaranteed results.
First, Taylor models are defined and their implementation in the COSY
software by Makino and Berz is detailed.
Afterwards IEEE-754 floating-point arithmetic is introduced. Then the core of this
paper is given: the algorithms implemented in COSY for multiplying
a Taylor model by a scalar, for adding or multiplying two
Taylor models are given and are proven to return Taylor models
satisfying the containment property.
Static load-balancing techniques for iterative computations on heterogeneous
clusters.
By: Hélène Renard, Yves Robert, Frédéric Vivien
Number: RR2003-12
Date: February 2003
Abstract:
This paper is devoted to static load balancing techniques for
mapping iterative algorithms onto heterogeneous clusters. The
application data is partitioned over the processors. At each
iteration, independent calculations are carried out in parallel,
and some communications take place. The question is to determine
how to slice the application data into chunks, and to assign these
chunks to the processors, so that the total execution time is
minimized. We establish a complexity result that assesses the
difficulty of this problem, and we design practical heuristics
that provide efficient distribution schemes.
This document collects some
important results about the theory of Milner's pi-calculus and
related formalisms. We present the syntax and semantics of a monadic
calculus, and discuss type systems and the most commonly used
notions of behavioural equivalences. Pi-calculus dialects are also
briefly introduced, as well as several encodings of the
lambda-calculus.
Nota: These notes have been used for a course in the winter
school on theoretical computer science organised by the Franco-Italian
University in Aussois between the 29th of january and the 1st of
february, 2003.
Keywords:
Process Algebra, Mobility, Name
Passing, Pi-calculus, Lambda-calculus, Type System, Behavioural
Equivalence.
Distributed computing power : from local function to global
computing.
By: Laurent Bienvenu, Christophe Papazian
Number: RR2003-15
Date: March 2003
Abstract:
We show here a natural extension of finite graph
automata, by allowing each node of a network to store in its memory
some pieces of information that are only bounded by the size of the
underlying network (like a unique address). Depending of the power of
the new local transition function, we show results about the power of
the global function computed by the graph automata. The main result
is that the global power is always less than the local computing
power and even with very powerful local function (non recursive) we
can not compute all global functions (even some primitive recursive
ones) : Distributed computing is limited by its own structure.
A RTT-based Partitioning Algorithm for a Multi-rate Reliable
Multicast Protocol.
By: Moufida Maimour, Cong-Duc Pham
Number: RR2003-16
Date: March 2003
Abstract:
Various Internet applications involve multiple parties
and usually adopt a one-to-many communication paradigm
(multicast). The presence of multiple receivers in a multicast session
rises the problem of {\it inter-receiver fairness}. Transmitting with
a rate which matches the slowest receiver will limit the throughput of
other receivers and thus their {\it satisfaction}. A multi-rate
mechanism where the receivers are distributed into subgroups with
similar capacities, can improve the inter-receiver fairness for
multicast sessions. In this report, we deal with the problem of
receivers partitioning and propose a simple algorithm based on the
receivers RTT variations where an explicit estimation of the receivers
capacities is avoided. Our partitioning algorithm, although simple,
performs an on-the-fly partitioning depending on the receivers'
feedback. We show that our partitioning algorithm approximates and in
many cases, achieves the optimal solution with a minimum computation
effort.
Load-Balancing Scatter Operations for Grid Computing.
By: Stéphane Genaud, Arnaud Giersch, Frédéric Vivien
Number: RR2003-17
Date: March 2003
Abstract:
We present solutions to statically load-balance scatter operations in
parallel codes run on grids. Our load-balancing strategy is based on
the modification of the data distributions used in scatter
operations. We study the replacement of scatter operations with
parameterized scatters, allowing custom distributions of data. The
paper presents: 1) a general algorithm which finds an optimal
distribution of data across processors; 2) a quicker guaranteed
heuristic relying on hypotheses on communications and computations; 3)
a policy on the ordering of the processors. Experimental results with
an MPI scientific code illustrate the benefits obtained from our
load-balancing.
Performance et dynamicité dans les réseaux : l'approche Tamanoir.
By: Jean-Patrick Gelas, Laurent Lefèvre
Number: RR2003-18
Date: March 2003
Abstract:
Les réseaux actifs constituent un domaine en pleine
expansion. Bien que différentes approches aient déjà été explorées,
privilégiant les aspects sécurité et portabilité, rares sont les
projets s'attaquant aux problèmes des hautes performances dans les
environnements actifs. En combinant la haute performance et le
déploiement de services portables, nous explorons de nouvelles
approches capables d'ouvrir des pistes pour le développement des
réseaux actifs. Nous proposons ainsi une nouvelle modélisation des
réseaux actifs haute performance et sa mise en \oe uvre sous la forme
de l'environnement \emph{Tamanoir}. Cet environnement fournit aux
utilisateurs la possibilité de déployer et de maintenir dynamiquement
des routeurs actifs distribués sur un réseau à grande échelle (WAN) à
l'aide d'outils spécifiques. Les premières expérimentations démontrent
l'efficacité et la pertinence de notre approche.
Towards an IPv6-based Security Framework for Distributed Storage Resources.
By: Alessandro Bassi, Julien Laganier
Number: RR2003-19
Date: March 2003
Abstract:
Some security problems can be often solved through
authorization rather than authentication. Furthermore,
authorization approach avoids usual drawbacks of
centralized systems such as bottlenecks or single point
of failure. In this paper, we propose a solution that
could bring an appropriate security architecture to the
Internet Backplane Protocol (IBP), a distributed shared
storage protocol. The three basic building blocks are
IPsec, Simple Public Key Infrastructure (SPKI)
certificates and Crypto-Based Identifiers (CBID).
CBID allows entities to prove ownership of their
identifiers, SPKI allows entities to prove that they have
been authorized to performs specific actions while IPsec
provides data origin authentication and confidentiality.
We propose to use them to bring some level of
'opportunistic' security in the absence of any trusted
central authority. This is particularly tailored to
ad-hoc environments where collaborations might be very
short-termed.
Dealing with Heterogeneity in an Active-based Multicast Congestion Avoidance Protocol.
By: Moufida Maimour, Cong-Duc Pham
Number: RR2003-20
Date: March 2003
Abstract:
Many of the proposed multicast congestion avoidance algorithms are
single-rate where heterogeneity is accommodated by adjusting the
transmission rate as a response to the worst receiver in the
group. Due to the Internet heterogeneity, a single-rate congestion
control affects the overall satisfaction of a multicast session
receivers. In this report, we propose a multi-rate replicated scheme
where some receivers (instead of the source) are designated to perform
data replication for other receivers with lower capacity. To be more
scalable and to minimize the bandwidth consumption due to data
replication, the partitioning algorithm is performed on-the-fly by the
routers depending on the feedbacks they receive. Neither a prior
estimation of the receivers capacity is necessary nor a complex
computation is required to execute our partitioning algorithm.
Keywords:
Reliable Multicast, Congestion Control, Heterogeneity, Active Networks.
Finding the "truncated" polynomial that is closest to a function.
By: Nicolas Brisebarre, Jean-Michel Muller
Number: RR2003-21
Date: April 2003
Abstract:
When implementing regular enough functions (e.g., elementary or
special functions) on a computing system, we frequently use
polynomial approximations. In most cases, the polynomial that best
approximates (for a given distance and in a given interval) a
function has coefficients that are not exactly representable with
a finite number of bits. And yet, the polynomial approximations
that are actually implemented do have coefficients that are
represented with a finite - and sometimes small - number of bits:
this is due to the finiteness of the floating-point
representations (for software implementations), and to the need to
have small, hence fast and/or inexpensive, multipliers (for
hardware implementations). We then have to consider polynomial
approximations for which the degree $i$ coefficient has at most
$m_i$ fractional bits (in other words, it is a rational number
with denominator $2^{m_i}$).We provide a method for finding the
best polynomial approximation under this constraint.
Performance Analysis of Remote File System Access over High Bandwidth Local Network.
By: Brice Goglin, Loïc Prylli
Number: RR2003-22
Date: April 2003
Abstract:
We study the performance of file servers, comparing NFS
implementation in Linux to our experimental lightweight system
called ORFA. The aim is to find out NFS bottlenecks in the case of
high bandwidth local network.
Using a simple protocol without cache allow us to get best
performance from the underlying communication subsystem.
Our user-level implementation avoids several kernel-level
constraints and thus provides better performance than NFS.
Moreover we explore several optimization techniques to reduce the
server overhead which usually is the bottleneck.
Keywords:
Distributed File System, Remote Access,
High Bandwidth Local Network, Cluster, Linux, NFS.
Load-balancing iterative computations in heterogeneous clusters with shared
communication links.
By: Arnaud Legrand, Hélène Renard, Yves Robert, Frédéric Vivien
Number: RR2003-23
Date: April 2003
Abstract:
This paper is devoted to mapping iterative algorithms onto
heterogeneous clusters. The application data is partitioned over
the processors, which are arranged along a virtual ring. At each
iteration, independent calculations are carried out in parallel,
and some communications take place between consecutive processors
in the ring. The question is to determine how to slice the
application data into chunks, and to assign these chunks to the
processors, so that the total execution time is minimized. One
major difficulty is to embed a processor ring into a network that
typically is not fully connected, so that some communication links
have to be shared by several processor pairs. We establish a
complexity result that assesses the difficulty of this problem,
and we design a practical heuristic that provides efficient
mapping, routing, and data distribution schemes.
Keywords:
Heterogeneous Clusters, Load-balancing, Shared Communication
Links, Complexity.
Decidable and undecidable problems about quantum automata.
By: Vincent D. Blondel, Emmanuel Jeandel, Pascal Koiran, Natacha Portier
Number: RR2003-24
Date: April 2003
Abstract:
We study the following decision problem: is the language recognized by
a quantum finite automaton empty or non-empty?
We prove that this problem is decidable or undecidable depending
on whether recognition is
defined by strict or non-strict thresholds. This result is in
contrast with the corresponding situation for probabilistic finite
automata for which it is known that strict and non-strict
thresholds both lead to undecidable problems.
By: Nathalie Caspard, Michel Morvan, Eric Remila, Eric Thierry
Number: RR2003-25
Date: April 2003
Abstract:
Many tiling spaces such as domino tilings of fixed figures have
an underlying lattice structure. This lattice structure corresponds to the
dynamics induced by flips. In this paper, we further investigate the properties
of these lattices of tilings. In particular, we point out a stability property:
the set of all the shortest sequences of flips joining to fixed tilings
also have a lattice structure close to the lattice of all tilings. We also
show that some of these properties also apply to other discrete dynamical
systems and more generally may be satisfied by some partially ordered sets.
It gives a new perspective on the lattice structure of tiling spaces and
enables to deduce some of their properties only by means of partial order
theoretical tools.
Keywords:
Tilings, Lattices, Stability of the Tilting Operation, Partial
Order Theory.
Dynamics of the Picking transformation on integer partitions.
By: Duong Phan Ti, Eric Thierry
Number: RR2003-26
Date: April 2003
Abstract:
This paper studies a conservative transformation defined on families
of finite sets. It consists in removing one element from each set and adding
a new set composed of the removed elements. This transformation is conservative
in the sense that the union of all sets of the family always remains the
same. We study the dynamical process obtained when iterating this transformation
on a family of sets and we focus on the evolution of the cardinalities of
the sets of the family. This point of view allows to consider the transformation
as an application defined on the set of all partitions of a fixed integer
(which is the total number of elements in the sets). We show that iterating
this particular transformation always leads to a heterogeneous distribution
of the cardinalities, where almost all integers within an interval are represented.
We also tackle some issues concerning the structure of the transition graph
which sums up the whole dynamics of this process for all partitions of a
fixed integer.
By: Michel Habib, David Kelly, Emmanuelle Lebhar, Christophe Paul
Number: RR2003-27
Date: April 2003
Abstract:
A graph $G_s=(V,E_s)$ is a sandwich
for a pair of
graph $G_t=(V,E_t)$ and $G=(V,E)$ if $E_t\subseteq E_s\subseteq
E$. Any poset, or partially ordered set, admits a unique graph
representation which is directed and transitive. In this paper we
introduce the notion of sandwich poset problems inspired by former
sandwich problems on comparability graphs. In particular, we are
interested in series-parallel and interval posets which are
subclasses of 2-dimensional posets, we describe polynomial
algorithms for these two classes of poset sandwich problems and
then prove that the problem of deciding the existence of a
2-dimensional sandwich poset is NP-complete.
Keywords:
Analysis of Algorithms,
Partially Ordered Sets, Graph Sandwich Problems.
Scheduling tasks sharing files on heterogeneous clusters.
By: Arnaud Giersch, Yves Robert, Frédéric Vivien
Number: RR2003-28
Date: May 2003
Abstract:
This paper is devoted to scheduling
a large collection of independent tasks onto heterogeneous
clusters. The tasks depend upon (input) files which initially reside
on a master processor. A given file may well be shared by several
tasks. The role of the master is to distribute the files to the
processors, so that they can execute the tasks. The objective for the
master is to select which file to send to which slave, and in which
order, so as to minimize the total execution time. The contribution of
this paper is twofold. On the theoretical side, we establish
complexity results that assess the difficulty of the problem. On the
practical side, we design several new heuristics, which are shown to
perform as efficiently as the best heuristics designed by Casanova et
al., although their cost is an order of magnitude lower.
Steady-state scheduling of task graphs on heterogeneous computing platforms.
By: Olivier Beaumont, Arnaud Legrand, Loris Marchal, Yves Robert
Number: 2003-29
Date: May 2003
Abstract:
In this paper, we consider the execution of a complex
application on a heterogeneous "grid" computing platform. The
complex application consists of a suite of identical, independent
problems to be solved. In turn, each problem consists of a set of
tasks. There are dependences (precedence constraints) between these
tasks. A typical example is the repeated execution of the same
algorithm on several distinct data samples. We use a non-oriented
graph to model the grid platform, where resources have different
speeds of computation and communication. We show how to determine
the optimal steady-state scheduling strategy for each processor (the
fraction of time spent computing and the fraction of time spent
communicating with each neighbor) and how to build such as schedule.
This result holds for a quite general framework, allowing for cycles
and multiple paths in the platform graph.
We consider a non numerable family of colorations induced by discrete
rotations.
The symbolical dynamical system associated with the coloration is first
explained.
We introduce then a group that supports the dynamics of the system. The
periodical
cases are precised, they are induced by Pythagorean triples.
Finally, a proof of the quasi-periodicity of the colorations, and a
description of asymmetrical colorations concludes
this paper.
Multiplication-addition modulaire: algorithmes itératifs et implantations sur FPGA.
By: Jean-Luc Beuchat, Jean-Michel Muller
Number: RR2003-31
Date: Juin 2003
Abstract:
Cet article présente tout d'abord diverses améliorations
d'un algorithme itératif de multiplication modulaire proposé en 1997
par Jeong et Burleson. Une simple modification de la récurrence permet
l'implantation d'une multiplication-addition modulaire. Nous montrons
ensuite comment réduire d'un facteur deux la taille du circuit lorsque
l'opérateur offre le choix du modulo parmi un ensemble $m_1,
m_2,\ldots, m_q$. Nous proposons finalement un nouvel algorithme
facilitant la réalisation de l'exponentiation modulaire et présentons
quelques résultats de synthèse pour des circuits FPGA de la famille
Virtex de Xilinx. Pour des nombres de $16$ bits, les opérateurs
développés permettent par exemple d'effectuer 6 millions d'opérations
à la seconde en utilisant uniquement 17 tranches.
Keywords:
Arithmétique des Ordinateurs, Multiplication Modulaire, FPGA.
Multiple Precision Interval Packages: Comparing Different Approaches.
By: Markus Grimmer, Knut Petras, Nathalie Revol.
Number: RR2003-32
Date: June 2003
Abstract:
We give a survey on packages for multiple precision interval arithmetic,
with the main focus on three specific packages. One is within a Maple
environment, intpakX, and two are C/C++ libraries, GMP-XSC and MPFI.
We discuss their different
features, present timing results and show several applications from
various fields, where high precision intervals are fundamental.
Keywords:
Multiple Precision Interval Arithmetic,
Packages, Ease of use, Efficiency, Reliability.
Optimizing the steady-state throughput of scatter and reduce operations on heterogeneous platforms.
By: Arnaud Legrand, Loris Marchal, Yves Robert
Number: RR2003-33
Date: June 2003
Abstract:
In this paper, we consider the
communications involved by the execution of a complex application,
deployed on a heterogeneous ``grid'' platform. Such applications
intensively use collective macro-communication schemes, such as
scatters, personalized all-to-alls or gather/reduce
operations. Rather than aiming at minimizing the execution time of
a single macro-communication, we focus on the steady-state
operation. We assume that there is a large number of
macro-communication to perform in pipeline fashion, and we aim at
maximizing the throughput, i.e. the (rational) number of
macro-communications which can be initiated every time-step. We
target heterogeneous platforms, modeled by a graph where resources
have different communication and computation speeds. The situation
is simpler for series of scatters or personalized all-to-alls than
for series of reduces operations, because of the possibility of
combining various partial reductions of the local values, and of
interleaving computations with communications. In all cases, we
show how to determine the optimal throughput, and how to exhibit a
concrete periodic schedule that achieves this throughput.
Optimizing the steady-state throughput of Broadcasts on heterogeneous platforms.
By: Arnaud Legrand, Olivier Beaumont, Loris Marchal , Yves Robert
Number: RR2003-34
Date: June 2003
Abstract:
In this paper, we consider the communications involved by the execution of a complex application, deployed on a heterogeneous ``grid'' platform. Such applications extensively use macro-communication schemes, for example to broadcast data items. Rather than aiming at minimizing the execution time of a single broadcast, we focus on the steady-state operation. We assume that there is a large number of messages to be broadcast in pipeline fashion, and we aim at maximizing the throughput, i.e. the (rational) number of messages which can be broadcast every time-step. We target heterogeneous platforms, modeled by a graph where resources have different communication and computation speeds. Achieving the best throughput may well require that the target platform is used in totality: we show that neither spanning trees nor DAGs are as powerful as general graphs. We show how to compute the best throughput using linear programming, and how to exhibit a periodic schedule, first when restricting to a DAG, and then when using a general graph. The polynomial compactness of the description comes from the decomposition of the schedule into several broadcast trees that are used concurrently to reach the best throughput. It is important to point out that a concrete scheduling algorithm based upon the steady-state operation is asymptotically optimal, in the class of all possible schedules (not only periodic solutions).
Optimizing the translation out-of-SSA with renaming constraints.
By: F. Rastello, F. de Ferrière, C. Guillon
Number: RR2003-35
Date: June 2003
Abstract:
Static Single Assignment form is an intermediate representation,
that uses phi-functions to merge values at each confluent points of the control
flow graph. phi-functions are not machine instructions and should be renamed
back to move operations when translating out-of-SSA form. Without
a coalescing algorithm, out-of-SSA translation generates many move
instructions. In this paper we propose an extension of the algorithm of
Leung and George to minimize the phi-related copies during the out-of-SSA
translation. Leung et al. constructed SSA form for programs represented
as native machine instructions, including the use of machine dedicated registers.
For this purpose, the out-of-SSA translation contains renaming constraints
that are represented using a pinning principle. Pinning the phi-function
arguments and their corresponding definition to a common resource is a very
attractive technique for coalescing variables, even if this is not a true
minimization: this article presents a renaming-constraints aware and pinning-based
coalescing algorithm. Even without renaming constraints, the move
instructions minimization problem is still considered an open issue. This
article provides also a discussion about the formulation of this problem,
its complexity and its motivations. Finally, we implemented our algorithm
in the STMicroelectronics Linear Assembly Optimizer. This provides many interesting
results when comparing several possible approaches. We also explain, using
hand crafted examples, the limitations of Leung's, Sreedhar's and classical
register coalescing algorithms.
By combining Kaltofen's 1992 baby steps/giant
steps technique for Wiedemann's 1986 determinant algorithm with Coppersmith's
1994 projections by a block of vectors in the Wiedemann approach and Villard's
1997 analysis of the block technique, we obtain new algorithms for dense matrix
problems of asymptotically fast running time.
The first category of problems is for a dense n x n
matrix A with integer entries. We express the cost in terms of fixed
radix, i.e. bit operations on the exact integers and denote by ||A|| the largest
entry in absolute value. Our algorithms compute the determinant, characteristic
polynomial, Frobenius normal form and Smith normal form of A in softO(n^(3.2)
log ||A||) and softO(n^(2.69726)} log ||A||) bit operations, where the exponent adjustment
sofO(f)=f^(1+o(1)) by ``+o(1)'' captures additional factors C_1 (log n)^(C_2)
(log ||A||)^(C_3) for positive real constants C_1, C_2, C_3 and where the
first, slower asymptotic bit complexity does not require any of the sub-cubic
matrix multiplication algorithms. Our algorithms are randomized, and we can
certify the determinant of A in a Las
Vegas fashion.
The second category of problems deals when the matrix
A has elements from an abstract commutative ring, that is, when no divisions
in the domain of entries are possible. We present algorithms that deterministically
compute the determinant, characteristic polynomial and adjoint of A with softO(n^(3.2))
ring additions, subtractions and multiplications, that without utilizing
sub-cubic matrix multiplication algorithms.With the Coppersmith's and Winograd's
matrix multiplication algorithms our method finds the determinant and adjoint
in O(n^(2.697263)) ring operations and the characteristic polynomial in O(n^(2.806515))
ring operations. We achieve our results in part through new proofs for Villard's
1997 analysis of the block Wiedemann/\allowbreak Lanczos algorithm and a
generalization of the Knuth/Schönhage/Moenck Euclidean remainder sequence
algorithm to matrix polynomials.
Keywords:
Linear algebra, Determinant, Characteristic polynomial, Matrix normal forms, Fast algorithms, Randomized algorithms, Algorithms without divisions.
By: C. Daramy, D. Defour, F. de Dinechin, J.-M. Muller
Number: RR2003-37
Date: July 2003
Abstract:
We present a new elementary function library, called CR-LIBM. This library implements the various functions defined by the Ansi99 C standard. It provides correctly rounded functions. When writing this library, our primarily goal was to certify correct rounding, and make it reasonably fast, and with a low utilisation of memory. Hence, our library can be used without any problem on real-scale problems.
We are also giving the proof and the elements to understand the implementation of the exponential function of CR-LIBM.
AMCA: A linear algorithm for real-time scheduling with optimal
energy use.
By: Bruno Gaujal, Nicolas Navet, Cormac Walsh
Number: RR2003-38
Date: July 2003
Abstract:
We present an algorithm for scheduling
a set of
non-recurrent tasks (or jobs) with real-time constraints so as to
minimize the total energy consumption on a dynamically variable
voltage processor. Our algorithm runs in linear time and is thus an
improvement over the classical algorithm of Yao et
al. It was made possible by considering the
problem as a shortest path problem. We also propose an algorithm for
the case where the processor possesses only a limited number of
clock frequencies. We extend this algorithm to provide the minimum
number of speed changes, which is important when the speed switching
overhead cannot be neglected. All our algorithms are linear in the
number of tasks if the arrivals and deadlines are sorted and need
$O(N\log N)$ time otherwise and these complexities are shown
optimal. We extend the results to fluid tasks and non-convex cost
functions. An interesting by-product of this study is that it
furnishes a linear time feasibility test for a set of non-recurrent
tasks scheduled under EDF. This is an improvement over the classical
one.
Keywords:
Real-Time Systems, Low-Power,
Scheduling, Dynamic Voltage Scaling.
We show that several problems which are known to be
undecidable for probabilistic automata become decidable for quantum
finite automata.
Our main tool is an algebraic result of independent interest: we
give an algorithm which, given a finite number of invertible matrices,
computes the Zariski closure of the group generated by these matrices.
Opérateurs itératifs de multiplication-addition modulaire pour FPGA.
By: Jean-Luc Beuchat, Jean-Michel Muller
Number: RR2003-40
Date: August 2003
Abstract:
Cet article présente tout d'abord diverses améliorations
d'un algorithme itératif de multiplication modulaire proposé en 1997
par Jeong et Burleson. Une simple modification de la récurrence
permet l'implantation d'une multiplication-addition modulaire. Nous
montrons ensuite comment réduire d'un facteur deux la taille du
circuit lorsque l'opérateur offre le choix du modulo parmi un
ensemble $m_1, m_2,\ldots, m_q$. Nous proposons un nouvel algorithme
facilitant la réalisation de l'exponentiation modulaire et
présentons quelques résultats de synthèse pour des circuits FPGA de
la famille Virtex de Xilinx. Pour des nombres de $16$ bits, les
opérateurs développés permettent par exemple d'effectuer 6 millions
d'opérations à la seconde en utilisant uniquement 17 tranches. Ces
différents opérateurs nécessitent toutefois de petites tables
dépendant du modulo $m$ choisi au moment de l'écriture du code VHDL.
Afin de remédier à cet inconvénient, nous proposons une méthode
récursive du calcul des tables s'effectuant parallèlement aux
premières itérations de la multiplication modulaire.
Keywords:
Arithmétique des ordinateurs, multiplication modulaire, FPGA.
Scheduling Divisible Loads on Star and Tree Networks: Results and Open Problems.
By: Olivier Beaumount, Henri Casanova , Arnaud Legrand, Yves Robert, Yang Yang
Number: RR2003-41
Date: September 2003
Abstract:
Applications in many scientific and
engineering domains are structured in large numbers of independent tasks
with low granularity. These applications can thus be naturally
parallelized, typically in master-worker fashion, provided that efficient
scheduling strategies are available. Such applications have been called
divisible loads because a scheduler may divide the computation
among worker processes arbitrarily, both in terms of number of tasks and of
task sizes. Divisible load scheduling has been an active area of research
for the last twenty years. A vast literature offers results and scheduling
algorithms for various models for the underlying distributed computing
platform. Broad surveys are available that report on accomplishments in
the field. By contrast, in this paper we propose a unified theoretical
perspective that synthesizes previously published results, several novel
results, and open questions, in a view to foster novel divisible load
scheduling research. Specifically, we discuss both one-round and
multi-round algorithms, and we restrict our scope to the popular star and
tree network topologies, which we study with both linear and affine cost
models for communication and computation.
Automatic deployment of the Network Weather Service using the Effective Network View.
By: Arnaud Legrand, Martin Quinson
Number: RR2003-42
Date: September 2003
Abstract:
The monitoring infrastructure
constitutes a key component of any Grid middleware. The
Network Weather Service (NWS) is the most commonly used
tool to fulfill this need. Unfortunately, users have to
deploy the NWS manually, which can be very tedious and
error-prone.% This paper introduces a method based on the
Effective Network View (ENV) network mapper to
automatically deploy of NWS using the deployment on our
lab's LAN as lead.
By: Frederic Chavanon, Matthieu Latapy, Michel Morvan and Laurent Vuillon
Number: RR2003-43
Date: September 2003
Abstract:
2D-gons tilings with parallelograms are the main model used in physics
to study quasicrystals, and they are also important in combinatorics
for the study of aperiodic structures.
In this paper, we study the graph induced by the adjacency relation between tiles.
This relation can been used to encode simply and efficiently 2D-gon tilings for algorithmic
manipulation. We show for example how it can be used to sample random 2D-gon tilings.
Contractions of octagonal tilings with rhombic tiles.
By: Frederic Chavanon, Eric Remila
Number: RR2003-44
Date: September 2003
Abstract:
We prove that the space of rhombic tilings of a fixed octagon
can be given a canonical order structure. We make a first
study of this order, proving that it is a graded poset. As a
consequence, we obtain the diameter of the space and a lower bound
for the distance between tilings.
Let us denote by Q(N,\la) the number of solutions
of the diophantine equation A^2+B^2=C^2+C satisfying N\le A\le
B\le C \le \la N-\frac 12. We prove that, for \la fixed and
N\to \iy, there exists a constant \al(\la) such that
Q(N,\la)=\al(\la)N+O_\la\left(N^{7/8}\log N\right). When
\la=2, Q(2^{n-1},2) counts the number of solutions of
A^2+B^2=C^2+C with the same number, n, of binary digits; these
solutions are interesting in the problem of computing the function
(a,b) \to \sqrt{a^2+b^2} in radix-2 floating-point
arithmetic.
By elementary arguments, (N,\la) can be expressed in terms of four sums
of the type
S(u,v;f)=\sum_{\substack{u\le d\le v\\ d \text{ odd}}}
\left(\sum_{\substack{1\le A\le f(d)
\\ 4A^2\equiv -1 \hspace{-1mm}\pmod{d}}} 1\right)
where u and v are real numbers and f: [u,v]\longrightarrow
\R is a function. These sums are estimated by a classical, but
deep, method of number theory, using Fourier analysis and
Kloosterman sums. This method is effective, and, in the case
\la=2, a precise upper bound for |Q(N,\la)-\al(\la)N| is
given.
Mixin modules are a notion of modules that allows cross-module
recursion and late binding, two features missing in ML-style modules.
Unfortunately, they are easier to define in a call-by-name setting.
In a call-by-value setting, mixin modules tend to conflict with the
usual static restrictions on recursive definitions. Moreover, the
semantics of instantiation has to specify an order of evaluation,
which involves a difficult design choice. Previous proposals rely on
the dependencies between components to compute a valid order of
evaluation. In such systems, mixin module types must carry some
information about the dependencies between their components, which
makes them rather impractical. In this paper, we propose a new design
for mixin modules in a call-by-value setting, which avoids this
problem. The formalism we obtain is much simpler than previous notions
of mixin modules, although slightly less powerful.
A new bound on the 2-dimension of partially ordered sets.
By: Eric Thierry
Number: RR2003-47
Date: October 2003
Abstract:
This paper provides a new upper bound on the 2-dimension
of partially ordered sets. The 2-dimension of an ordered set P is the smallest
cardinality of a set S such that there exists an order-embedding of P into
the boolean lattice 2^S (all the subsets of S ordered by inclusion). The
proof is non-contructive and uses a probabilistic argument. We link the result
and the proof with two known theorems of the theory of ordered sets.
Study of a non intrusive and accurate method for measuring the
end-to-end useful bandwidth.
By: Mathieu Goutelle, Pascale Vicat-Blanc/Primet
Number: RR2003-48
Date: October 2003
Abstract:
Studies and tools development for applications
sensitive to data rates is a very active research field for distributed application
performance optimization. The research community works to propose tools for measuring the end-to-end
performance of a link between two hosts. Delay measurements provide a first
approximation but aren't sufficient enough because the delay isn't a relevant metric. A bandwidth
evaluation method would give a more realistic view of the raw capacity but also of the
dynamic behaviour of the interconnection, when we want to evaluate the transfer time
of an amount of data.
Among all the existing methods, there are some differences
according to the measurements strategies and the evaluated metric. We first describe the
available bandwidth measurements and then the total capacity measurements
approaches. Among all the presented methods, none of them can evaluate both metrics, while giving
an overview of the link topology. By using a hop-by-hop packet pair method, we show
that we can provide such informations with a fine analysis of the measurements.
In this report, we detail our proposition of a solution for
an hop-by-hop measurement of the capacity and available bandwidth. This method has been
validated in simulation, then implemented in Linux and validated experimentally. We compare
this method with others to define its limits and the future utilisations on the newly
developed tool.
Keywords:
IP Network Measurements, Capacity Evaluation,
Available Bandwidth, Packet Pair Method, Tracerate.
Scheduling tasks sharing files from distributed repositories.
By: Arnaud Giersch, Yves Robert, Frédéric Vivien
Number: RR2003-49
Date: October 2003
Abstract:
This paper is devoted to scheduling
a large collection of independent tasks onto a large distributed
heterogeneous platform, which is composed of a set of servers. Each
server is a processor cluster equipped with a file repository. The
tasks to be scheduled depend upon (input) files which initially reside
on the server repositories. A given file may well be shared by several
tasks. For each task, the problem is to decide on which server to
execute it, and to transfer the required files (those which the task
depends upon) to that server repository. The objective is to find a
task allocation, and to schedule the induced communications, so as to
minimize the total execution time. The contribution of this paper is
twofold. On the theoretical side, we establish complexity results
that assess the difficulty of the problem. On the practical side, we
design several new heuristics, including an extension of the
\texttt{min-min} heuristic to the decentralized framework, and several
lower cost heuristics, which we compare through extensive simulations.
Effective lower and upper bounds for the Fourier coefficients of powers of the modular invariant j.
By: Nicolas Brisebarre, Georges Philibert
Number: RR2003-50
Date: November 2003
Abstract:
Using an elementary approach based on
careful handlings of Cauchy integrals, we give
precise effective lower and upper bounds for the Fourier
coefficients of powers of the modular invariant j. Moreover,
we adapt an old result of Rademacher to get a convergent series
expansion of these Fourier coefficients and we show that this
expansion allows to find again these estimates.
Our results improve previous ones by K. Mahler and O. Herrmann. In
particular, we show that the Fourier coefficients of j are smaller
than their asymptotically equivalent given by Petersson and
Rademacher.
Keywords:
Modular Function, Modular Invariant,
Fourier Coefficients, Circle Method, Hankel Function.
This paper focus on the deployment
of grid infrastructures, more specifically Problem Solving Environments
(PSE) for numerical applications on the grid. Even if the deployment
of such an architecture is forced by physical constraints
(firewall, access permission, security, ...) its efficiency heavily depends on the quality of the mapping between its different components and the grid resources. This paper proposes a new model based on linear programming to estimate the performance of a deployment of a hierarchical PSE. The advantages of the modeling approach in this case are multiple: evaluate a virtual deployment before an actual deployment, provide a decision builder tool (i.e., designed to compare different architectures or buy new resource), take into account the platform scalability. Using this modeling, it is possible to determine the bottleneck of the platform and thus to know whether a given deployment can be improved or not. We illustrate this modeling by applying this results to an existing hierarchical PSE called DIET.
From Heterogeneous Task Scheduling to Heterogeneous Mixed Data and Task Parallel Scheduling.
By: Frédéric Suter, Henri Casanova, Frédéric Desprez, Vincent Boudet
Number: RR2003-52
Date: November 2003
Abstract:
Mixed-parallelism, the combination of data- and task-parallelism, is a powerful way of increasing the scalability of entire classes of parallel applications. Exploiting both types of parallelism simultaneously makes it possible to deploy these applications on platforms comprising multiple compute clusters, which have become increasingly popular in the last decade. However, high performance application executions are only possible if effective scheduling strategies are available. While multi-cluster platforms are predominantly heterogeneous, previous work on mixed-parallel application scheduling targets only homogeneous platforms. In this paper we develop a method for extending existing scheduling algorithms for task-parallel applications on heterogeneous platforms to the mixed-parallel case. After detailing the foundations of our method and our assumptions, we present a case study in which we generate a mixed-parallel version of the popular HEFT scheduling algorithm, which we evaluate with an extensive set of simulation experiments.
An algorithm for finding entire solutions of systems of difference equations.
By: Nicolas Brisebarre
Number: RR2003-53
Date: November 2003
Abstract:
We present an algorithm that computes the entire solutions of systems
of two difference equations and of systems of one differential
equation and one difference equation, all with complex polynomials
coefficients. The problem of the determination of such solutions arose in
the field of diophantine approximation. Our algorithm, which uses
previous works by Abramov and Petkovsek, allows also to determine, for
each of the systems considered, all the solutions of the form
R1(z) ed1z + ... + Rs(z)
edSz, with d1, ...,
ds in C and R1(z), ...,
Rs(z) in C(z).
A correctly rounded implementation of the exponential function on the Intel Itanium architecture.
By: Christoph Quirin Lauter
Number: RR2003-54
Date: November 2003
Abstract:
This article presents an efficient implementation of a correctly
rounded exponential function in double precision on the Intel
Itanium processor family. This work combines advanced processor
features (like the double-extended precision fused multiply-and-add
units of the Itanium processors) with recent research results giving
the worst-case precision needed for correctly rounding the
exponential function. We give and prove an algorithm which returns
a correctly rounded result (in any of the four IEEE-754 rounding
modes) within 172 machine cycles on the Intel Itanium 2 processor.
This is about four times slower than the less accurate function
present in the standard Intel mathematical library. The evaluation
is performed in one phase only and is therefore fast even in the
worst case, contrary to other implementations which use a multilevel
strategy used by Ziv and Defour: We show that the worst-case required
precision of 157 bits can always be stored in the sum of two
double-extended floating-point numbers. Another algorithm is given
with a 92 cycles execution time, but its proof has to be formally
completed.
Searching for optimal paths in long-range contact networks.
By: Emmanuelle Lebhar, Nicolas Schabanel
Number: RR2003-55
Date: November 2003
Abstract:
Since Milgram experiment in 1967, that
demonstrated the ability of people to find short paths efficiently in networks,
based only on their own local view of the network, different models have been
proposed to study the ``small world phenomenon''. In 2000, Kleinberg shows that
most of the previously known models for small world fail to capture an
important feature of the phenomenon: no local information based (i.e.,
decentralized) algorithm can find short paths in these graphs, when they
exists. He then introduces a model composed of a lattice (representing the
local acquaintances) augmented with directed links (symbolizing long-range
contacts) distributed harmonically. He proposes to model people behavior by the
greedy algorithm that forwards the message to the contact (local or long-range)
of the current holder, which is the closest to its destination. He shows that
this greedy algorithm computes paths of expected length $\Theta(\log^2n)$
between any pair of nodes.
The
present paper questions the choice of the greedy strategy to model social
behavior. We proposes a new strategy in which nodes consult their acquaintances
near by, before deciding where to forward the message. Our algorithm presents
the same computational characteristics as Kleinberg's original algorithm: uses
$\Theta(\log n)$ bits of memory and visits $O(\log^2n)$ nodes. However, it
computes much shorter paths, of expected length $O(\log n(\log \log n)^2)$,
between any pair of nodes. This algorithm demonstrates that consulting their
acquaintances near by, leads to considerable improvements in performances. It
also shows that this consultation is less and less useful as one gets closer to
the target. As far as we know, this is the first algorithm to ``break the
$\Theta(\log^2 n)$ barrier'' for the paths length in Kleinberg's network.
Keywords:
Algorithms on Random Structures, Routing, Small World and
Social Behavior Models.
Transparent Remote File Access through a Shared Library Client.
By: Brice Goglin, Loïc Prylli
Number: RR2003-56
Date: December 2003
Abstract:
This paper presents the implementation of the ORFA client.
ORFA aims at providing an efficient access to remote file systems
through high-speed local networks such as Myrinet.
The ORFA client is a lightweight shared library that may be
pre-loaded to override standard file access routines to allow remote
file access for any legacy application.
In ORFA, virtual file descriptors have been designed to support
POSIX behavior such as file sharing semantics so that remote files
may be accessed and manipulated as local files.
Local file access routines may still be used without any
incompatibility with other libraries that modify their standard
behavior.
Finally, a network abstraction layer has been implemented in ORFA
to efficiently use asynchronous interfaces such as GM without
suffering from memory registration requirements.