### Valiant's model and the cost of computing integers.

**By:**Pascal Koiran**Number:**RR2004-01**Date:**January 2004**Abstract:**

- Let tau(k) be the minimum number of arithmetic operations required to build the integer k from the constants 1 and 2. A sequence (x_k) is said to be ``easy to compute'' if tau(x_k) is bounded by a polynomial function of log(k). It is natural to conjecture that sequences such as $\lfloor 2^n \ln 2 \rfloor$ or n! are not easy to compute. In this paper we show that a proof of this conjecture for the first sequence would imply a superpolynomial lower bound for the arithmetic circuit size of the permanent polynomial. For the second sequence, a proof would imply a superpolynomial lower bound for the permanent or P different from PSPACE.

**Keywords:**

- Valiant's Model, Permanent, Factorials, Arithmetic Circuits.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+13p**Format:**Compressed postscript- Get it

### FFPACK: Finite field linear algebra package.

**By:**J-G.Dumas, P.Giorgi, C.Pernet**Number:**RR2004-02**Date:**January 2004**Abstract:**

- The FFLAS project has established that exact matrix multiplication over finite fields can be performed at the speed of the highly optimized numerical BLAS routines. Since many algorithms have been reduced to use matrix multiplication in order to be able to prove an optimal theoretical complexity, this paper shows that those optimal complexity algorithms, such as LSP factorization, rank determinant and inverse computation can also be the most efficient.

**Keywords:**

- Word size Finite fields, BLAS level 1-2-3, Linear Algebra Package, Matrix Multiplication, LSP Factorization.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+17p**Format:**pdf- Get it

### Transport haute-performance: problématique et solutions proposées.

**By:**Marc Herbert, Pascale Vicat-Blanc Primet**Number:**RR2004-03**Date:**January 2004**Abstract:**

- The high-performance computing field has discovered some limitations in the TCP protocol, preventing it to fully use the wide area high capacity links provisioned for computing grids. This article explain the issues, then describes and compares the solutions currently proposed by the networking community.

**Keywords:**

- Transport, High-performance, TCP/IP, Ethernet.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+19p**Format:**Compressed postscript- Get it

### Scheduling tasks sharing files from distributed repositories (revised version).

**By:**Arnaud Giersch, Yves Robert, Frédéric Vivien**Number:**RR2004-04**Date:**February 2004**Abstract:**

- This paper is devoted to scheduling a large collection of independent tasks onto a large distributed heterogeneous platform, which is composed of a set of servers. Each server is a processor cluster equipped with a file repository. The tasks to be scheduled depend upon (input) files which initially reside on the server repositories. A given file may well be shared by several tasks. For each task, the problem is to decide which server will execute it, and to transfer the required files (those which the task depends upon) to that server repository. The objective is to find a task allocation, and to schedule the induced communications, so as to minimize the total execution time. The contribution of this paper is twofold. On the theoretical side, we establish complexity results that assess the difficulty of the problem. On the practical side, we design several new heuristics, including an extension of the \texttt{min-min} heuristic to the decentralized framework, and several lower cost heuristics, which we compare through extensive simulations. This report is a revised version of the LIP research report no. 2003-49 / INRIA research report no. 4976, which it replaces.

**Keywords:**

- Scheduling, Heterogeneous Clusters, Grid, Independent Tasks, File-sharing, Heuristics.

**Availability:**Not available by ftp.**Citation:**Not published yet.**Size:**2+36p**Format:**Compressed postscript- Get it

### Listing all the minimal separators of a 3-connected planar graph.

**By:**Frédéric Mazoit**Number:**RR2004-05**Date:**February 2004**Abstract:**

- I present an efficient algorithm which lists the minimal separators of a 3-connected planar graph in $O(n)$ per separator.

**Keywords:**

- 3-Connected Planar Graphs, Minimal Separator Enumeration.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+8p**Format:**pdf- Get it

### Accommodating Heterogeneity in a Multicast Session Through a Receiver-based Data Replication Scheme.

**By:**Moufida Maimour**Number:**RR2004-06**Date:**February 2004**Abstract:**

- A multicast session can involve multiple receivers with different capacities. To accommodate this heterogeneity, we propose a new replication mechanism which allows for a fine-grained multi-rate congestion control. In our scheme, some receivers (replicators) are responsible for data replication to a subset of receivers with lower capacity. A replicator, in the same way as a single-rate multicast source, adapts its rate depending on feedback it receives from the members of its associated subgroup. A simple partitioning algorithm is proposed to split a set of receivers into subgroups of similar capacities. This algorithm does not rely on a prior knowledge of the receivers' capacities and is executed on-the-fly as soon as necessary feedback are collected. To be more scalable and fairer with other sessions while improving the receivers satisfaction, we suggest to execute the partitioning algorithm at the routers. Analysis and simulations are performed in order to evaluate our approach, mainly by comparing it to traditional (source-based) replication schemes. Using ns, preliminary simulation results show the rapid convergence of the partitioning algorithm. Fairness of our scheme toward other flows is also dealt with.

**Keywords:**

- Reliable Multicast, Congestion Avoidance, Heterogeneity.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+22p**Format:**Compressed postscript- Get it

### Complexity results and heuristics for pipelined multicast operations on heterogeneous platforms.

**By:**Olivier Beaumont, Arnaud Legrand, Loris Marchal, Yves Robert**Number:**RR2004-07**Date:**February 2004**Abstract:**

- In this paper, we consider the communications involved by the execution of a complex application deployed on a heterogeneous platform. Such applications extensively use macro-communication schemes, for example to broadcast data items to several targets, known as the multicast operation. Rather than seeking to minimize the execution time of a single multicast, we focus on steady-state performance. We target heterogeneous platforms, modeled by a graph where resources have different communication speeds. We show that the problem of computing the best throughput for a multicast operation is NP-hard, whereas the best throughput to broadcast a message to every node in a graph can be computed in polynomial time. Thus we introduce several heuristics to deal with this problem; most of them are based on linear programming. We prove that some of these heuristics are approximation algorithms. We perform simulations to test these heuristics and show that their results are close to a theoretical upper bound on the throughput that we obtain with the linear programming approach.

**Keywords:**

- Scheduling, Steady State, Heterogeneous Platforms, Complexity.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+32p**Format:**Compressed postscript- Get it

### Explicit Pure Type Systems for the Lambda-Cube.

**By:**Romain Kervarc, Pierre Lescanne**Number:**RR2004-08**Date:**February 2004**Abstract:**

- Pure type systems are a general formalism allowing to represent many type systems -- in particular, Barendregt's lambda-cube, including Girard's system F, dependent types, and the calculus of constructions. We built a variant of pure type systems by adding a cut rule associated to an explicit substitution in the syntax, according to the Curry-Howard-de Bruijn correspondence. The addition of the cut requires the addition of a new rule for substitutions, with which we are able to show type correctness and subject reduction for all explicit systems. Moreover, we proved that the explicit lambda-cube obtained this way is strongly normalizing.

**Keywords:**

- Lambda-Calculus, Explicit Substitutions, Pure Type Systems, Higher Order Types, Calculus of Constructions, Lambda-Cube, Strong Normalization.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+45p**Format:**Compressed postscript- Get it

### Abstract geometrical computation: Turing-computing ability and unpredictable accumulations (extended abstract).

**By:**J. Durand-Lose**Number:**RR2004-09**Date:**March 2004**Abstract:**

- In the Cellular Automata (CA) literature, discrete lines inside (discrete) space-time diagrams are often idealized as Euclidean lines in order to analyze a dynamics or to design CA for special purposes. In this article, we present an analog model of computation corresponding to this abstraction: dimensionless signals are moving on a continuous space in continuous time generating Euclidean lines on (continuous) space-time diagrams. Like CA, this model is parallel, synchronous, uniform in space and time, and uses local updating. The main difference is that space and time are continuous and not discrete (i.e. R instead of N). In this article, the model is restricted to Q in order to remain inside Turing-computation theory. We prove that our model can carry out any Turing-computation through two-counter automata simulation. Because of the nature of time, Zeno phenomena, i.e. accumulations, can happen. By reducing the NTOT problem (whether a recursive function is not total), we prove that forecasting any accumulation is $\Sigma^0_2$-complete in the arithmetical hierarchy, i.e. totally undecidable.

**Keywords:**

- Abstract Geometrical Computation, Analog Model of Computation, Arithmetic Hierarchy, Cellular Automata, Geometry, Turing Universality, Zeno Phenomena.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+11p**Format:**Compressed postscript- Get it

### Fast correct rounding of elementary functions in double precision using double-extended arithmetic.

**By:**Florent de Dinechin, David Defour, Christoph Lauter**Number:**RR2004-10**Date:**March 2004**Abstract:**

- This article shows that IEEE-754 double-precision correct rounding of the most common elementary functions (exp/log, trigonometric and hyperbolic) is achievable on current processors using only double-double-extended arithmetic. This allows to improve by several orders of magnitude the worst case performance of a correctly-rounded mathematical library, compared to the current state of the art. This article builds up on previous work by Lefèvre and Muller, who have shown that an intermediate accuracy of up to 158 bits is required for the evaluation of some functions. We show that the practical accuracy required can always be reduced to less than 119 bits, which is easy to obtain using well-known and well-proven techniques of double-double-extended arithmetic. As an example, a prototype implementation of the exponential function on the Itanium has a worst-case time about twice that of the standard, highly optimized : libm by Intel, which doesn't offer correct rounding. Such a small performance penalty should allow correct rounding of elementary functions to become the standard.

**Keywords:**

- Elementary Functions, Correct Rounding, IEEE-754, Double-extended Precision.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+12p**Format:**pdf- Get it

### Steady-State Scheduling on Heterogeneous Clusters: Why and How?.

**By:**Olivier Beaumont, Arnaud Legrand, Loris Marchal, Yves Robert**Number:**RR2004-11**Date:**March 2004**Abstract:**

- In this paper, we consider steady-state scheduling techniques for heterogeneous systems, such as clusters and grids. We advocate the use of steady-state scheduling to solve a variety of important problems, which would be too difficult to tackle with the objective of makespan minimization. We give a few successful examples before discussing the main limitations of the approach.

**Keywords:**

- Scheduling, Steady-state, Heterogeneous Platforms.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+11p**Format:**Compressed postscript- Get it

### Performance Evaluation of DyRAM using NS.

**By:**Moufida Maimour**Number:**RR2004-12**Date:**March 2004**Abstract:**

- Since it is required by many emerging Internet applications, reliable multicast has received significant attention in the networking research community. Scalability to a large number of receivers is one of the key challenges to be met by reliable multicast protocols. We previously proposed a reliable multicast protocol called DyRAM that benefits from the assistance of routers in order to implement its services. In this report, we consider a simulation-based performance evaluation of DyRAM. We mainly compare it to SRM and a native reliable multicast protocol. We show that DyRAM allows for less load in terms of consumed bandwidth on the different multicast nodes and that the different services allows for reducing the repair latency and implosion. We also consider the active routers density and show that increasing the density of active routers allows for better performances. Comparing DyRAM to SRM allows for showing the load balancing feature and scalability of our approach.

**Keywords:**

- Reliable Multicast, Simulation, Performance Evaluation, NS.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+16p**Format:**Compressed postscript- Get it

### Second Order Function Approximation with a Single Small Multiplication.

**By:**Jérémie Detrey, Florent de Dinechin**Number:**RR2004-13**Date:**March 2004**Abstract:**

- This paper presents a new scheme for the hardware evaluation of elementary functions, based on a piecewise second order minimax approximation. The novelty is that this evaluation requires only one small rectangular multiplication. Therefore the resulting architecture combines a small table size, thanks to second-order evaluation, with a short critical path: Consisting of one table lookup, the rectangular multiplication, and one addition, the critical path is shorter than that of a plain first-order evaluation. Synthesis results for several functions show that this method outperforms all the previously published methods in both area and speed for precisions ranging from 12 to 24 bits.

**Keywords:**

- Computer Arithmetic, Harware Elementary Functions Evaluation.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+8p**Format:**Compressed postscript- Get it

### FPGA Implementation of a Recently Published Signature Scheme.

**By:**Jean-Luc Beuchat, Nicolas Sendrier, Arnaud Tisserand, Gilles Villard**Number:**RR2004-14**Date:**March 2004**Abstract:**

- An algorithm producing cryptographic digital signatures less than 100 bits long with a security level matching nowadays standards has been recently proposed by Courtois, Finiasz, and Sendrier. This scheme is based on error correcting codes and consists in generating a large number of instances of a decoding problem until one of them is solved (about 9!=362880 attempts are needed). A careful software implementation requires more than one minute on a 2GHz Pentium 4 for signing. We propose a first hardware architecture which allows to sign a document in 0.86 second on an XCV300E-7 FPGA, hence making the algorithm practical.

**Keywords:**

- Cryptography, digital signature, code-based cryptosystems, FPGA implementation.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+8p**Format:**Compressed postscript- Get it

### Abstract geometrical computation for Black hole computation (extended abstract).

**By:**J. Durand-Lose**Number:**RR2004-15**Date:**April 2004**Abstract:**

- The Black hole model of computation provides a computing power that goes beyond the classical Turing computability since it offers the possibility to decide in finite time any recursively enumerable (\RE) problem. In this article, we provide a geometric model of computation, conservative abstract geometrical computation, that has the same property: it can simulate any Turing machine and can decide any \RE problem through the creation of an accumulation. Finitely many signals can leave any accumulation, and it can be known whether anything leaves. This corresponds to a black hole artifact.

**Keywords:**

- Abstract geometrical computation, Black hole model, Energy conservation, Malament-Hogarth space-times, Turing universality, Zeno phenomena.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+11p**Format:**pdf- Get it

### Procedure placement using temporal-ordering information: dealing with code size expansionin.

**By:**Thierry Bidault, Christophe Guillon, Florent Bouchez, Fabrice Rastello**Number:**RR2004-16**Date:**April 2004**Abstract:**

- Instruction cache performance is one of the bottle-necks of processor performance. In this paper, we study the effects of procedure placement in memory on a direct-mapped instruction cache. These caches differ from associative memory caches by the fact that each address in the memory is assigned to one and only one address in the cache. This means that two procedures with addresses that share the same place in the cache, and that are called alternatively will create a conflict-miss: one will overwrite the other in the cache. The goal of procedure placement is to minimize these cache-misses. Pettis and Hansen give in [PH] a greedy algorithm that doesn't increase the code size. The Gloy and Smith algorithm [TRG] greatly decreases the number of cache-misses but authorizes gaps between procedures, hence it increases the code size. The latter comprises of two main stages: in the ``cache-placement'' phase, the procedures are given the location they will occupy in the instruction cache; in the ``memory-placement'' phase, procedures are placed in memory in such a way that code expansion is minimized, with the constraints of their cache placement. In this article, we prove the NP-completeness of the first stage, and the polynomiality of the second stage of [TRG]. Indeed, we show that our algorithm provides the optimal solution in a time complexity of O(nL\log^*(L+n)) where n is the number of procedures, and L the cache size. Thus nearly linear for a fixed cache size. We also provide an algorithm which, given the cache-placement, quickly returns an approximation of the code expansion. This makes the cache-placement stage take into consideration the final program size. Our modifications to the Gloy and Smith algorithm give on average a code size expansion of 8 % over the original program size, while the initial algorithm gave an expansion of 177 %. The cache miss reduction is nearly the same as the Gloy and Smith solution with 35 % cache miss reduction.

**Keywords:**

- Instruction cache, code placement, code size, cache miss, min-matching, hamiltonian-path, profiling.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**3+25p**Format:**Compressed postscript- Get it

### Memory-based scheduling for a parallel multifrontal solver.

**By:**Abdou Guermouche, Jean-Yves L'Excellent**Number:**RR2004-17**Date:**April 2004**Abstract:**

- The memory usage of sparse direct solvers can be the bottleneck to solve large-scale problems. This paper describes dynamic scheduling strategies that aim at reducing the memory usage of a parallel direct solver. Combined to static modifications of the tasks dependency graph, experiments show that such techniques have a good potential to improve the memory usage of a parallel multifrontal solver, MUMPS.

**Keywords:**

- Sparse matrices, multifrontal method, scheduling, memory.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+14p**Format:**Compressed postscript- Get it

### Cycle Accurate Simulation Model Generation for SoC Prototyping.

**By:**Antoine Fraboulet, Tanguy Risset, Antoine Scherrer**Number:**RR2004-18**Date:**April 2004**Abstract:**

- We present new results concerning the integration of high level designed IPs into a complete System on Chip. We first introduce a new computation model that can be used for cycle accurate simulation of register transfer level synthesized hardware. Then we provide simulation of a SoC integrating a data-flow IP synthesized with MMAlpha and the SocLib cycle accurate simulation environment. This integration also validates an efficient generic interface mechanism for {data-flow} IPs.

**Keywords:**

- System on chip, High level synthesis, VLSI, Circuit simulation, Computer aided design tools.

**Availability:**Electronic copy only.**Citation:**Published in Samos IV (Systems, Architectures, MOdeling, and Simulation, July 2004).**Size:**2+22p**Format:**pdf- Get it

### Scalable and Modular Scheduling.

**By:**Paul Feautrier**Number:**RR2004-19**Date:**April 2004**Abstract:**

- Scheduling a program (i.e. constructing a timetable for the execution of its operations) is one of the most powerful methods for automatic parallelization. A schedule gives a blueprint for constructing a synchronous program, suitable for an ASIC or VLIW processor. However, constructing a schedule entails solving a large linear program. Even if one accept the (experimental) fact that the Simplex is almost always polynomial, the scheduling time is of the order of a large power of the program size. Hence, the method does not scale well. The present paper proposes two methods for improving the situation. Firstly, a big program can be divided in smaller units (processes) which can be scheduled separately. This is \em modular scheduling Second, one can use projection methods for solving linear programs incrementatly. This is specially efficient if the dependence graph is sparse.

**Keywords:**

- Scheduling, Modularity, Parallelization.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+17p**Format:**Compressed postscript- Get it

### Assessing the impact and limits of steady-state scheduling for mixed task and data parallelism on heterogeneous platforms.

**By:**Olivier Beaumont, Arnaud Legrand, Loris Marchal, Yves Robert**Number:**RR2004-20**Date:**April 2004**Abstract:**

- In this paper, we consider steady-state scheduling techniques for mapping a collection of application graphs onto heterogeneous systems, such as clusters and grids. We advocate the use of steady-state scheduling to solve this difficult problem. While the most difficult instances are shown to be NP-complete, most situations of practical interest are amenable to a periodic solution which can be described in compact form (polynomial size) and is asymptotically optimal.

**Keywords:**

- Mixed parallelism, Scheduling, steady-state, heterogeneous platforms.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+43p**Format:**Compressed postscript- Get it

### A realistic network/application model for scheduling divisible loads on large-scale platforms.

**By:**Loris Marchal, Yang Yang, Henri Casanova, Yves Robert**Number:**RR2004-21**Date:**April 2004**Abstract:**

- Divisible load applications consist of an amount of data and associated computation that can be divided arbitrarily into any number of independent pieces. This model is a good approximation of many real-world scientific applications, lends itself to a natural master-worker implementation, and has thus received a lot of attention. The critical issue of divisible load scheduling has been studied extensively in previous work. However, only a few authors have explored the simultaneous scheduling of multiple such applications on a distributed computing platform. We focus on this increasingly relevant scenario and make the following contributions. We use a novel and more realistic platform model that captures some of the fundamental network properties of Grid platforms. We formulate a steady-state multi-application scheduling problem as a linear program that expresses some notion of fairness between applications. This scheduling problem is NP-complete and we propose several heuristics that we evaluate and compare via extensive simulation experiments conducted over 250,000 platform configurations. Our main finding is that some of our heuristics can achieve performance close to the optimal and we quantify the trade-offs between achieved performance and heuristic complexity.

**Keywords:**

- parallel computing, scheduling, divisible load, bandwidth sharing, resource sharing, multiple applications.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+20p**Format:**Compressed postscript- Get it

### Independent and Divisible Task Scheduling on Heterogeneous Star-shaped Platforms with Limited Memory.

**By:**Olivier Beaumont, Arnaud Legrand, Loris Marchal, Yves Robert**Number:**RR2004-22**Date:**April 2004**Abstract:**

- In this paper, we consider the problem of allocating and scheduling a collection of independent, equal-sized tasks on heterogeneous star-shaped platforms. We also address the same problem for divisible tasks. For both cases, we take memory constraints into account. We prove strong NP-completeness results for different objective functions, namely makespan minimization and throughput maximization, on simple star-shaped platforms. We propose an approximation algorithm based on the unconstrained version (with unlimited memory) of the problem. We introduce several heuristics, which are evaluated and compared through extensive simulations. An unexpected conclusion drawn from these experiments is that classical scheduling heuristics that try to greedily minimize the completion time of each task are outperformed by the simple heuristic that consists in assigning the task to the available processor that has the smallest communication time, regardless of computation power (hence a "bandwidth-centric" distribution).

**Keywords:**

- Scheduling, makespan, steady-state, divisible load, memory constraints, bounded buffers, memory limitation .

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+22p**Format:**Compressed postscript- Get it

### Lattice-Based Memory Allocation.

**By:**Alain Darte, Rob Schreiber, Gilles Villard**Number:**RR2004-23**Date:**April 2004**Abstract:**

- We investigate the technique of storing multiple array elements in the same memory cell, with the goal of reducing the amount of memory used by an array variable. This reduction is both important and achievable during the synthesis of a dedicated processor or code generation for an architecture with a software-controlled scratchpad memory. In the former case, a smaller, less expensive circuit results; in the latter, scratchpad space is saved for other uses, other arrays most likely. The key idea is that once a schedule of operations has been determined, the schedule of references to a given location is known, and elements with disjoint lifetimes may share a single memory cell, in principle. The difficult problem is one of code generation: how does one generate memory addresses in a simple way, so as to achieve a nearly best possible reuse of memory? Previous approaches to memory reuse for arrays consider some particular affine (with modulo expressions) mapping of indices, representing the data to be stored, to memory addresses. We generalize the idea, and develop a mathematical framework based on critical integer lattices that subsumes all previous approaches and gives new insights into the problem.We place the problem in a broader mathematical context, showing its relation to real critical lattices, successive minima, and lattice basis reduction; finally, we propose and analyze various strategies for lattice-based memory allocation.

**Keywords:**

- Program transformation, memory allocation, memory size reduction, admissible lattice, critical determinant, successive minima.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+43p**Format:**Compressed postscript- Get it

### Optimizations of Client's side communications in a Distributed File System within a Myrinet Cluster.

**By:**Brice Goglin, Loïc Prylli, Olivier Glück**Number:**RR2004-24**Date:**April 2004**Abstract:**

- This paper presents a study of the interaction between high-speed interconnects and a distributed file system client. We use our ORFA remote file access protocol and Myrinet network with the GM software layer to show how the highly specific programming model of high-speed interconnects may be used in a high performance distributed file system client. We present both a user-level and a kernel-level client implementations with either buffered or non-buffered accesses. These implementations show high performances and may be transparently used by applications. Our improvements focus on pin-down cache techniques and memory registration issues. Our GM modifications show no impact on its performances while the network usage in the distributed file system client is improved.

**Keywords:**

- High-Speed Local Network, Distributed File System, Cluster, Memory Registration, Myrinet, Linux.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+15p**Format:**pdf- Get it

### Coherent load information mechanisms for distributed dynamic scheduling.

**By:**Abdou Guermouche (ENS Lyon), Jean-Yves L'Excellent (INRIA)**Number:**RR2004-25**Date:**April 2004**Abstract:**

- We consider a distributed system where processes can only communicate by message passing and need a coherent view of the load (e.g., workload, memory) of others to take dynamic decisions (scheduling). We present several mechanisms to obtain distributed estimates of such information and experiment them in the context of a real application, an asynchronous parallel solver for large sparse systems of linear equations.

**Keywords:**

- Snapshot, Distributed system, Dynamic scheduling, Load balancing, Message passing.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+12p**Format:**Compressed postscript- Get it

### Optimal memory minimization algorithms for the multifrontal method.

**By:**Abdou Guermouche (ENS Lyon), Jean-Yves L'Excellent (INRIA)**Number:**RR2004-26**Date:**April 2004**Abstract:**

- We are interested in the active and total memory usage of the multifrontal method. Starting from the algorithms proposed by Liu, we suggest a new scheme together with a tree traversal that give an optimal peak of active memory. Significant gains are obtained compared to Liu's approach. We also study the problem of minimizing the total memory and compare various new schemes. A number of experiments shows the interest of these approaches.

**Keywords:**

- Liu's algorithm, multifrontal method, sparse matrices, memory, tree traversal.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+22p**Format:**Compressed postscript- Get it

### Mechanizing epistemic logic with Coq.

**By:**Pierre Lescanne**Number:**RR2004-27**Date:**May 2004**Abstract:**

- I present a mechanization of epistemic logic, also called knowledge logic, I have done using Coq. This work includes a formalization in Coq of epistemic logic and two case studies. This report is an updating of LIP report RR2001-12.

**Keywords:**

- Epistemic Logic, Mechanizing Logic, Modal Logic.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+14p**Format:**pdf- Get it

### Data redistribution algorithms for heterogeneous processor rings.

**By:**Hélène Renard, Yves Robert, Frédéric Vivien**Number:**RR2004-28**Date:**May 2004**Abstract:**

- We consider the problem of redistributing data on homogeneous and heterogeneous ring of processors. The problem arises in several applications, each time after that a load-balancing mechanism is invoked (but we do not discuss the load-balancing mechanism itself). We provide algorithms that aim at optimizing the data redistribution, both for uni-directional and bi-directional rings, and we give complete proofs of correctness. One major contribution of the paper is that we are able to prove the optimality of the proposed algorithms in all cases except that of a bi-directional heterogeneous ring, for which the problem remains open.

**Keywords:**

- Heterogeneous Rings, Data Redistribution Algorithms, Load-balancing.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+26p**Format:**Compressed postscript- Get it

### Characterizing strong normalization in a language with control operators.

**By:**Dan Dougherty, Silvia Ghilezan, Pierre Lescanne**Number:**RR2004-29**Date:**May 2004**Abstract:**

- We investigate some fundamental properties of the reduction relation in the untyped term calculus derived from Curien and Herbelin's lambda-mu-mu~. The original lambda-mu-mu~ has a system of simple types, based on sequent calculus, embodying a Curry-Howard correspondence with classical logic; the significance of the untyped calculus of raw terms is that it is a Turing-complete language for computation with explicit representation of control as well as code. We define a type assignment system for the raw terms satisfying: a term is typable if and only if it is strongly normalizing. The intrinsic symmetry in the lambda-mu-mu~ calculus leads to an essential use of both intersection and union types; in contrast to other union-types systems in the literature, our system enjoys the Subject Reduction property.

**Keywords:**

- Classical Logic, Sequent Calculus, Semantic, Lambda Calculus, Continuations, Strong Normalization.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+26p**Format:**pdf- Get it

### Rhombus tilings:decomposition and space structure.

**By:**Frederic Chavanon, Eric Remila**Number:**RR2004-30**Date:**June 2004**Abstract:**

- We study the
spaces of rhombus tilings, i.e. the graphs whose vertices are
tilings of a fixed zonotope, and two tilings are linked if one can
pass from one to the other one by a local transformation, called
flip.

We first use a decomposition method to encode rhombus tilings and give a useful characterization for a sequence of bits to encode a tiling.

In codimension 2, we use the previous coding to get a canonical representation of tilings, and an order structure on the space of tilings, which is shown to be a graded poset, from which connectivity is deduced.

**Keywords:**

- Tilings, Order, Structure.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+21p**Format:**pdf- Get it

### A tool for unbiased comparison between logarithmic and floating-point arithmetic.

**By:**Jérémie Detrey, Florent de Dinechin**Number:**RR2004-31**Date:**June 2004**Abstract:**

- This paper describes two concurrent libraries of parameterized arithmetic operators for manipulating high-dynamic numbers on FPGAs. One of them uses a floating-point representation, the other one uses a logarithmic representation. Along with their direct interest, those two libraries allow application-specific comparisons of those two number representation systems. They are unbiased in the sense that they tend to reflect the state-of-the-art for both number arithmetic systems, and are freely available at http://www.ens-lyon.fr/LIP/Arenaire/.

**Keywords:**

- Arithmetic, Floating-Point, Logarithmic Number system, LNS, Hardware Operators, FPGA, VHDL.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+16p**Format:**Compressed postscript- Get it

### Pipelining Broadcasts on Heterogeneous Platforms under the One-Port Model.

**By:**Olivier Beaumont, Loris Marchal**Number:**RR2004-32**Date:**July 2004**Abstract:**

- In this paper, we consider the communications involved by the execution of a complex application, deployed on a heterogeneous platform. Such applications extensively use macro-communication schemes, for example to broadcast data items. Rather than aiming at minimizing the execution time of a single broadcast, we focus on the steady-state operation. We assume that there is a large number of messages to be broadcast in pipeline fashion, and we aim at maximizing the throughput, i.e. the (rational) number of messages which can be broadcast every time-step. We target heterogeneous platforms, modeled by a graph where resources have different communication speeds under the unidirectional one-port model (i.e. at a given time step, a processor can be involved in at most one (incoming or outgoing) communication with one of its neighbors). Achieving the best throughput may well require that the target platform is used in totality: we show that neither spanning trees nor DAGs are as powerful as general graphs. We propose a rather sophisticated polynomial algorithm for determining the optimal throughput that can be achieved using a platform, together with a (periodic) schedule achieving this throughput. The algorithm is based on the use of polynomial oracles and of the ellipsoid method for solving in linear programs in rational numbers. The polynomial compactness of the description comes from the decomposition of the schedule into several broadcast trees that are used concurrently to reach the best throughput. It is important to point out that a concrete scheduling algorithm based upon the steady-state operation is asymptotically optimal, in the class of all possible schedules (not only periodic solutions).

**Keywords:**

- Scheduling, steady-state, heterogeneous platforms, ellipsoid method.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+22p**Format:**Compressed postscript- Get it

### Deadline Scheduling with Priority for Client-Server Systems on the Grid.

**By:**Eddy Caron, Pushpinder Kaur Chouhan, Frederic Desprez**Number:**RR2004-33**Date:**July 2004**Abstract:**

- We present algorithms for the scheduling sequential tasks on a Network Enabled Server (NES) environment. We have implemented the non-preemptive scheduling, since at the user level we cannot interrupt a running process and block it in order to allow a new process to run. This article is an extension of the paper: ``A Study of Deadline Scheduling for Client-Server Systems on the Computational Grid'' by Takefusa et al. We mainly discuss a deadline scheduling with priority strategy that is more appropriate for multi-client, multi-server case. Importance is first given to the task's priority and then the task is allocated to the server that can meet the task's deadline. This may cause that some already allocated tasks on the server miss their deadline. We augment the benefits of scheduling algorithms with load measurements (which is done with the use of a forecasting tool called FAST) and fallback mechanisms. The experimental results shows that the deadline scheduling with priority along with fallback mechanism can increase the overall number of tasks executed by the NES.

**Keywords:**

- Scheduling, Problem Solving Environment.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+9p**Format:**Compressed postscript- Get it

### A general scheme for deciding the branchwidth.

**By:**Frédéric Mazoit**Number:**RR2004-34**Date:**June 2004**Abstract:**

- We adapt some decision theorems about treewidth to the branchwidth and use this theorems to prove that the branchwidth of circular-arc graphs can be computed in polynomial time.

**Keywords:**

- Graphs, Branchwidth, Circular-arc graphs.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+11p**Format:**pdf- Get it

### Listing all the minimal separators of a planar graph.

**By:**Frédéric Mazoit**Number:**RR2004-35**Date:**June 2004**Abstract:**

- I present an efficient algorithm which lists the minimal separators of a planar graph in O(n) per separator.

**Keywords:**

- Planar Graphs, Minimal Separator Enumeration.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+10p**Format:**pdf- Get it

### Generating formally certified bounds on values and round-off errors.

**By:**Marc Daumas, Guillaume Melquiond**Number:**RR2004-36**Date:**July 2004**Abstract:**

- We present a new tool that generates bounds on the values and the round-off errors of programs using floating point operations. The tool is based on forward error analysis and interval arithmetic. The novelty of our tool is that it produces a formal proof of the bounds that can be checked independently using an automatic proof checker such as Coq and a complete model of floating point arithmetic. For the first time ever, we can easily certify that simple numerical programs such as the ones usually found in real time applications do not overflow and that round-off errors are below acceptable thresholds. Such level of quality should be compulsory on safety critical applications. As our tool is easy to handle, it could also be used for many pieces of software.

**Keywords:**

- Round-off error, Overflow, Formal proof, Certification, Safety critical.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+24p**Format:**Compressed postscript- Get it

### A floating-point library for integer processors.

**By:**C. Bertin, N. Brisebarre, B. Dupont de Dinechin, C.-P. Jeannerod, C. Monat, J.-M. Muller, S. Raina, A. Tisserand**Number:**RR2004-37**Date:**July 2004**Abstract:**

- This paper presents a C library for the software support of single precision floating-point (FP) arithmetic on processors without FP hardware units such as VLIW or DSP processor cores for embedded applications. This library provides several levels of compliance to the IEEE 754 FP standard. The complete specifications of the standard can be used or just some relaxed characteristics such as restricted rounding modes or computations without denormal numbers. This library is evaluated on the ST200 VLIW processors from STMicroelectronics.

**Keywords:**

- Computer arithmetic, floating-point arithmetic, addition, multiplication, division, square-root, integer processor, VLIW, DSP.

**Availability:**Electronic copy only.**Citation:**Published in ADVANCED SIGNAL PROCESSING ALGORITHMS, ARCHITECTURES, AND IMPLEMENTATIONS XIV, SPIE 2004.**Size:**2+11p**Format:**Compressed postscript- Get it

### Self-assemblying (classes of) shapes with a constant number of tiles.

**By:**Ivan Rapaport, Eric Rémila**Number:**RR2004-38**Date:**September 2004**Abstract:**

- Squares are the most studied shapes in the tile assembly model. Adleman et al. proved that the program size complexity of an $n \times n$ square is $\Theta(\frac{\log n}{\log\log n})$. In other words, for each $n$ we need a different set of tiles. Also, the size of the set increases with $n$. Our approach is to fix {\it a priori} the set of tiles in such a way that it always self-assembles into an $N \times N$ square. If $n$ is an arbitrary positive integer, we show how to calculate the tile concentrations in order to ensure that $\E(N)=n$. To our knowlwdge, the tile concentrations, a parameter of the original model, has never been seriously considered before. We claim that it is therefore not necessary to construct the tiles (which are in fact tiny molecules) for each shape we are asked to assemble. These tiny molecules are, for us, fixed primitives. In order to obtain (in expectation) the required shape, we only need to ``play'' with the concentrations. This is, of course, a standard procedure in Chemistry and Biology. Let $d$ be a distance in ${\Z}^2$. In the present work we tackle the specific problem of constructing tile systems that assemble into shapes of the form $\{ x \in {\Z}^2 \> \vert \> d(x,0) \leq N\}$ and such that $\E(N)=n$. We solve the problem for the max distance $d_{\infty}$ and for the $d_1$ distance. In the first case the induced shapes are squares while in the second case the induced shapes are diamonds. Diamonds are much more difficult to produce than squares. We leave as an open problem the Euclidean distance $d_2$, which induces the class of ``discrete circles''.

**Keywords:**

- Wang tiles, self-assembly.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+12p**Format:**Compressed postscript- Get it

### A Family of Modulo $(2^n+1)$ Multipliers.

**By:**Jean-Luc Beuchat**Number:**RR2004-39**Date:**September 2004**Abstract:**

- In this paper, we first describe a novel modulo $(2^n+1)$ addition algorithm suited to FPGA and ASIC implementations, and discuss several architectures of multioperand modulo $(2^n+1)$ adders. Then, we propose three implementations of a modulo $(2^n+1)$ multiplication algorithm based on a paper by A. Wrzyszcz and D. Milford. The first operator is based on an $n\times n$ multiplication and a subsequent modulo $(2^n+1)$ correction, and takes advantage of the arithmetic logic embedded in Spartan or Virtex FPGAs. The second operator computes a sum of modulo-reduced partial products by means of a multioperand modulo $(2^n+1)$ adder. Then, radix-$4$ modified Booth recoding reduces the number of partial products, while making their generation more complex. Finally, we provide a comparison of this family of algorithms with existing solutions.

**Keywords:**

- Modular Arithmetic, Modulo $(2^n+1)$ Addition, Modulo $(2^n+1)$ Multiplication, FPGA Implementation.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+13p**Format:**Compressed postscript- Get it

### Characterization of Bijective Discretized Rotations.

**By:**Bertrand Nouvel, Eric Rémila**Number:**RR2004-40**Date:**September 2004**Abstract:**

- A discretized rotation is the composition of an Euclidean rotation with the rounding operation. For $0 < \alpha < \pi/4$, we prove that the discretized rotation $\round{r_\alpha}$ is bijective if and only if there exists a positive integer $k$ such as $$\{\cos{\alpha},\sin{\alpha} \} = \{\frac{2k+1}{2k2+2k+1},\frac{2k2+2k}{2k2+2k+1} \}$$ The proof uses a particular subgroup of the torus $(\RR/\ZZ)2$.

**Keywords:**

- Discrete Rotations, Pythagorean Triple, Integer Pythagorean Triple, Bijective rotations, Local description.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+9p**Format:**Compressed postscript- Get it

### Some Functions Computable with a Fused-mac.

**By:**Sylvie Boldo, Jean-Michel Muller**Number:**RR2004-41**Date:**September 2004**Abstract:**

- The fused multiply accumulate instruction (fused-mac) that is available on some current processors such as the Power PC or the Itanium eases some calculations. We give examples of some floating-point functions (such as $\ulp(x)$ or $\mathrm{Nextafter}(x,y)$), or some useful tests, that are easily computable using a fused-mac. Then, we show that, with rounding to the nearest, the error of a fused-mac instruction is exactly representable as the sum of two floating-point numbers. We give an algorithm that computes that error.

**Keywords:**

- Floating-point arithmetic, fused multiply accumulate, computer arithmetic.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+10p**Format:**pdf- Get it

### Complex Square Root with Operand Prescaling.

**By:**Milos Ercegovac, Jean-Michel Muller**Number:**RR2004-42**Date:**September 2004**Abstract:**

- We propose a radix-$r$ digit-recurrence algorithm for complex square-root. The operand is prescaled to allow the selection of square-root digits by rounding of the residual. This leads to a simple hardware implementation. Moreover, the use of digit recurrence approach allows correct rounding of the result. The algorithm, compatible with the complex division, and its design are described at a high-level. We also give rough comparisons of its latency and cost with respect to implementation based on standard floating-point instructions as used in software routines for complex square root.

**Keywords:**

- Computer arithmetic, complex square-root, digit-recurrence algorithm, operand prescaling.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+12p**Format:**pdf- Get it

### RN-coding of numbers: definition and some properties.

**By:**Peter Kornerup, Jean-Michel Muller**Number:**RR2004-43**Date:**September 2004**Abstract:**

- We define RN-codings as radix-$\beta$ signed representations of numbers for which rounding to the nearest is always identical to truncation. After giving characterizations of such representations, we investigate some of their properties, and we suggest algorithms for conversion to and from these codings.

**Keywords:**

- Computer arithmetic, number systems.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+9p**Format:**pdf- Get it

### Correctly rounded multiplication by arbitrary precision constants.

**By:**Nicolas Brisebarre, Jean-Michel Muller**Number:**RR2004-44**Date:**October 2004**Abstract:**

- We introduce an algorithm for multiplying a floating-point number $x$ by a constant $C$ that is not exactly representable in floating-point arithmetic. Our algorithm uses a multiplication and a fused multiply accumulate instruction. We give methods for checking whether, for a given value of $C$ and a given floating-point format, our algorithm returns a correctly rounded result for any $x$. When it does not, our methods give the values $x$ for which the multiplication is not correctly rounded.

**Keywords:**

- Computer Arithmetic, Floating-point Arithmetic, Fused-mac, Multiplication by a constant, Correct rounding.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+14p**Format:**pdf- Get it

### Division by Constant for the ST100 DSP Microprocessor.

**By:**J.-M. Muller, A. Tisserand, B. Dupont de Dinechin, C. Monat**Number:**RR2004-45**Date:**October 2004**Abstract:**

- Algorithms for Euclidean (i.e., integer) division by a constant operation are presented. They allow fast computation for some values of the divisor (known at compile time) or also when both quotient and modulus are required. These algorithms are based on the multiply-accumulate instruction and the 40-bit arithmetic available in many DSPs. The results are demonstrated on the ST100 DSP from STMicroelectronics in the case of standard speech coding applications.

**Keywords:**

- Computer Arithmetic, Software Division, Euclidean Division, Division by Constant, DSP, Compiler.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+10p**Format:**Compressed postscript- Get it

### Broadcast Trees for Heterogeneous Platforms.

**By:**Olivier Beaumont, Loris Marchal, Yves Robert**Number:**RR2004-46**Date:**November 2004**Abstract:**

- In this paper, we deal with broadcasting on heterogeneous platforms. Typically, the message to be broadcast is split into several slices, which are sent by the source processor in a pipeline fashion. A spanning tree is used to implement this operation, and the objective is to find the tree which maximizes the throughput, i.e. the average number of slices sent by the source processor every time-unit. We introduce several heuristics to solve this problem. The good news is that the best heuristics perform quite efficiently, reaching more than 70% of the absolute optimal throughput, thereby providing a simple yet efficient approach to achieve very good performance for broadcasting on heterogeneous platforms.

**Keywords:**

- Collective communications, steady-state, heterogeneous platforms, modeling.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+18p**Format:**Compressed postscript- Get it

### Towards the post-ultimate libm.

**By:**Florent de Dinechin, Nicolas Gast**Number:**RR2004-47**Date:**November 2004**Abstract:**

- This article presents advances in the subject of double-precision correctly rounded elementary functions since the publication of the libultim mathematical library developed by Ziv at IBM. This library demonstrated that the performance overhead of correct rounding could be made negligible in average. However, the worst case execution time was up to 1000 times the average time, and memory consumption was also a problem. To address these questions, a range of new techniques, from the more portable to the more efficient, are presented, and demonstrated on two typical functions, exponential and arctangent. The main result of this paper is to show that the worst-case execution time can be bounded within a factor of 2 to 10 of the average time, with memory consumption comparable to current libms. This has in turn implications on the techniques and tradeoffs for correctly rounded functions. This article also shows that these techniques make it much easier to prove the correct rounding property. Thus, this article lifts the last technical obstacles to a widespread use of (at least some) correctly rounded double precision elementary functions.

**Keywords:**

- Elementary Functions, Correct Rounding, IEEE-754.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+16p**Format:**pdf- Get it

### When double rounding is odd.

**By:**Sylvie Boldo, Guillaume Melquiond**Number:**RR2004-48**Date:**November 2004**Abstract:**

- Double rounding consists in a first rounding in an intermediate extended precision and then a second rounding in the working precision. The natural question is then of the precision and correctness of the final result. Unfortunately, the used double rounding algorithms do not obtain a correct rounding of the initial value. We prove an efficient algorithm for the double rounding to give the correct rounding to the nearest value assuming the first rounding is to odd. As this rounding is unusual and this property is surprising, we formally proved this property using the Coq automatic proof checker.

**Keywords:**

- Floating-point, double rounding, formal proof, Coq.

**Availability:**Electronic copy only..**Citation:**Not published yet.**Size:**2+7p**Format:**pdf- Get it

### GoDIET: Un outil pour le déploiement de DIET .

**By:**Eddy Caron, Holly Dail**Number:**RR2004-49**Date:**November 2004**Abstract:**

- In this article, we address the automatic configuration, launch, and management of a distributed middleware platform for the Grid of type Application Service Provider (ASP) called Distributed Interactive Engineering Toolbox (DIET). The difficulty of this task is due to the architecture of DIET, which is distributed and hierarchical. The need for an automated tool for these tasks is reinforced by the diversity and number of components for the DIET middleware. In this article, we present GoDIET, a new tool created to satisfy these needs, though the solutions presented are valid for any distributed, hierarchical environment. The principles underlying GoDIET will be detailed by example with the configuration and launch of DIET and LogService, an external service for the management of traces for distributed software components. We then present a series of experiments that permit the evaluation of the performance and efficacity of GoDIET. We also evaluate the robustness of the DIET platform for a large number of servers.

**Keywords:**

- Deployment, PSE, Grid computing.

**Availability:**Electronic copy only. This report is writting in french.**Citation:**Not published yet.**Size:**2+8p**Format:**Compressed postscript- Get it

### Understanding untyped $\overline{\lambda}\mu\widetilde{\mu}$ calculus.

**By:**Pierre Lescanne, Silvia Likavec**Number:**RR2004-50**Date:**November 2004**Abstract:**

- We prove the confluence of
$\overline{\lambda}\mu\widetilde{\mu}_T$ and
$\overline{\lambda}\mu\widetilde{\mu}_Q$,
two well-behaved subcalculi of the $\overline{\lambda}\mu\widetilde{\mu}$
calculus, closed under call-by-value and call-by-name reduction, respectively.
Moreover, we give the interpretation of $\overline{\lambda}\mu\widetilde{\mu}_T$
in the category of negated domains, and the interpretation of
$\overline{\lambda}\mu\widetilde{\mu}_Q$ in the Kleisli category. To the best
of our knowledge this is the first interpretation of
*untyped*$\overline{\lambda}\mu\widetilde{\mu}$ calculus.

**Keywords:**

- Continuation semantics, classical logic, categories, type theory.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+15p**Format:**pdf- Get it

### Off-line scheduling of divisible requests on an heterogeneous collection of databanks.

**By:**Arnaud Legrand, Alan Su, Frédéric Vivien**Number:**RR2004-51**Date:**November 2004**Abstract:**

- In this paper, we consider the problem of scheduling comparisons of motifs against biological databanks. We show that this problem lies in the divisible load framework. In this framework, we propose a polynomial-time algorithm to solve the maximum weighted flow off-line scheduling problem on unrelated machines. We also show how to solve the maximum weighted flow off-line scheduling problem with preemption on unrelated machines .

**Keywords:**

- Bioinformatics, heterogeneous computing, scheduling, divisible load, linear programming, stretch, max weighted flow.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+8p**Format:**pdf- Get it

### Table-based polynomials for fast hardware function evaluation.

**By:**Jérémie Detrey, Florent de Dinechin**Number:**RR2004-52**Date:**November 2004**Abstract:**

- Many general table-based methods for the evaluation in hardware of elementary functions have been published. The bipartite and multipartite methods implement a first-order approximation of the function using only table lookups and additions. Recently, a single-multiplier second-order method of similar inspiration has also been published. This paper presents a general framework extending such methods to approximations of arbitrary order, using adders, small multipliers, and very small ad-hoc powering units. We obtain implementations that are both smaller and faster than all previously published approaches. This paper also deals with the FPGA implementation of such methods. Previous work have consistently shown that the more complex methods were also faster: The reduction of the table size meant a reduction of its lookup time, which compensated for the addition and multiplication time. A second contribution is therefore to finally create a tradeoff between space and time among table-based methods.

**Keywords:**

- Function evaluation, polynomial approximation, table-based method, hardware operators, FPGA.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+11p**Format:**Compressed postscript- Get it

### Hybrid scheduling for the parallel solution of linear systems.

**By:**Patrick R. Amestoy, Abdou Guermouche, Jean-Yves L'Excellent, Stéphane Pralet**Number:**RR2004-53**Date:**December 2004**Abstract:**

- In this paper, we consider the problem of designing a dynamic scheduling strategy that takes into account both workload and memory information in the context of the parallel multifrontal factorization. The originality of our approach is that we base our estimations (work and memory) on a static optimistic scenario during the analysis phase. This scenario is then used during the factorization phase to constrain the dynamic decisions. The task scheduler has been redesigned to take into account these new features. Moreover performance have been improved because the new constraints allow the new scheduler to make optimal decisions that were forbidden or too dangerous in unconstrained formulations. Performance analysis show that the memory estimation becomes much closer to the memory effectively used and that even in a constrained memory environment we decrease the factorization time with respect to the initial approach.

**Keywords:**

- Sparse Matrices, Parallel Multifrontal Method, Dynamic Scheduling, Memory.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+19p**Format:**pdf- Get it
**Revisions: none**

### Proposal for a Standardization of Mathematical Function Implementation in Floating-Point Arithmetic.

**By:**David Defour, Guillaume Hanrot, Vincent Lefèvre, Jean-Michel Muller, Nathalie Revol, Paul Zimmermann**Number:**RR2004-54**Date:**December 2004**Abstract:**

- Some aspects of what a standard for the implementation of the mathematical functions could be are presented. Firstly, the need for such a standard is motivated. Then the proposed standard is given. The question of roundings constitutes an important part of this paper: three levels are proposed, ranging from a level relatively easy to attain (with fixed maximal relative error) up to the best quality one, with correct rounding on the whole domain of every function. We do not claim that we always suggest the right choices, or that we have thought about all relevant issues. The mere goal of this paper is to raise questions and to launch the discussion towards a standard.

**Keywords:**

- Floating-point Arithmetic, Mathematical Function, Standard, Implementation, Rounding, Exception.

**Availability:**Electronic copy only.**Citation:**Numerical Algorithms, vol. 37, no 1-4, pp 367-375, 2004.**Size:**2+40p**Format:**Compressed postscript- Get it
**Revisions: none**

### Resource Localization Using Peer-To-Peer Technology for Network Enabled Server.

**By:**Eddy Caron , Frédéric Desprez , Franck Petit , Cédric Tedeschi**Number:**RR2004-55**Date:**December 2004**Abstract:**

- DIET (Distributed Interactive Engineering Toolbox) is a set of hierarchical components to design Network Enabled Server systems. These systems are built upon servers managed through distributed scheduling agents for a better scalability. Clients ask to these scheduling components to find servers available (using some performance metrics and information about the location of data already on the network). Our target architecture is the grid which is highly heterogeneous and dynamic. Clients, servers, and schedulers are better connected in a dynamic (or peer-to-peer) fashion. One critical issue to be solved is the localization of resources on the grid. In this paper, we present the use of an asynchronous version of the Propagate Information with Feedback algorithm to discover computation resources in Network Enabled Servers with distributed schedulers and arbitrary networks. Resource discovery uses peer-to-peer connections between components. Our implementation is based on JXTA from Sun Microsystems and DIET developed in the GRAAL team from INRIA. The algorithm and its implementation are discussed and performance results to show the benefit of this approach are given from experiments over the VTHD network which connects several supercomputers in different research institutes through a high-speed network.

**Keywords:**

- Resource Localization, P2P, Grid computing.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+14p**Format:**pdf- Get it
**Revisions: none**

### Accès optimisés aux fichiers distants dans les grappes disposant d'un réseau rapide.

**By:**Brice Goglin, Olivier Glück, Pascale Vicat-Blanc Primet**Number:**RR2004-56**Date:**December 2004**Abstract:**

- L'exécution performante d'applications parallèles sur grappes de calcul interconnectées par des réseaux hautes performances nécessite des communications efficaces entre les processus de calcul mais aussi des accès rapides au système de stockage. Cependant, comme l'interface logicielle des réseaux rapides a été optimisée plus particulièrement pour les communications interprocessus, elle ne répond pas bien aux besoins particuliers des systèmes de fichiers distribués. Dans cet article, nous proposons et étudions plusieurs solutions pour améliorer l'accès distants aux fichiers via les réseaux hautes performances de grappe. Ces idées ont été intégrées à l'interface GM des réseaux Myrinet puis à la nouvelle interface MX. Nous réalisons les premières comparaisons entre les performances relatives de MX et GM puis de leur utilisation dans un système de fichiers distribué. Nos premiers résultats montrent qu'il est possible d'interfacer efficacement les systèmes de fichiers distribués avec le réseau rapide des grappes.

**Keywords:**

- Systèmes de fichiers distribués, réseaux hautes performances, grappes, enregistrement mémoire, Linux.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+12p**Format:**pdf- Get it
**Revisions: none**

### Nested Circular Intervals: A Model for Barrier Placement in Single-Program, Multiple-Data Codes with Nested Loops.

**By:**Alain Darte, Robert Schreiber**Number:**RR2004-57**Date:**December 2004**Abstract:**

- We want to perform compile-time analysis of an SPMD program and place barriers in it to synchronize it correctly, minimizing the runtime cost of the synchronization. This is the barrier minimization problem. No full solution to the problem has been given previously. Here we model the problem with a new combinatorial structure, a nested family of sets of circular intervals. We show that barrier minimization is equivalent to finding a hierarchy of minimum cardinality point sets that cut all intervals. For a single loop, modeled as a simple family of circular intervals, a linear-time algorithm is known. We extend this result, finding a linear-time solution for nested circular intervals families. This result solves the barrier minimization problem for general nested loops.

**Keywords:**

- Barrier synchronization, circular arc graph, nested circular interval graph, SPMD code, nested loops.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+27p**Format:**pdf- Get it
**Revisions: none**

### Overlapping Computations and Communications with I/O in Wavefront Algorithms.

**By:**Eddy Caron, Frédéric Desprez, Frédéric Suter**Number:**RR2004-58**Date:**December 2004**Abstract:**

- Several numerical computation algorithms exhibit dependences that lead to a wavefront in the computation. Depending on the data distribution chosen, pipelining communication and computation can be the only way to avoid a sequential execution of the parallel code. The computation grain has to be wisely chosen to obtain at the same time a maximum parallelism and a small communication overhead. On the other hand, when the size of data exceeds the memory capacity of the target platform, data have to be stored on disk. The concept of \ooc computation aims at minimizing the impact of the I/O needed to compute on such data. It has been applied successfully on several linear algebra applications. In this paper we apply \ooc techniques to wavefront algorithms. The originality of our approach is to overlap computation, communication, and I/O. An original strategy is proposed using several memory blocks accessed in a cyclic manner. The resulting pipeline algorithm achieves a saturation of the disk resource which is the bottleneck in \ooc algorithms. \vspace-0.5cm.

**Keywords:**

- Out-of-Core, pipeline, wavefront algorithm, overlap.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+20p**Format:**pdf- Get it
**Revisions: none**

### Emulation d'un nuage réseau de grilles de calcul: eWAN.

**By:**Pascale Vicat-Blanc, Olivier Glück, Cyril Otal, François Echantillac**Number:**RR2004-59**Date:**December 2004**Abstract:**

- The Grid aims at expanding the cluster based parallel computing paradigm toward large scale distributed systems based on IP networks. To validate grid algorithms, evaluate their performance and to study transport and coordination protocols and grid network services, a well controlled environment is required, allowing to precisely manage the experience conditions. One of the most important issue is to emulate the potentially very high speed wide area network interconnection. This paper presents a software and hardware tool for configuring and programming a large PC cluster in a wide area network emulation instrument. High performance, fine parameter tuning and a great utilization flexibility are the main proposed features of this experimental tool. This article discusses the eWAN design principles and the first experimentations that have been done on a prototype deployed over the Grid5000 cluster at the ENS Lyon. Some usage scenarii are also proposed.

**Keywords:**

- Network Emulation, Grid networking, eWAN.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+14p**Format:**pdf- Get it
**Revisions: none**

### RN-codes : algorithmes d'addition, de multiplication et d'élévation au carré.

**By:**Jean-Luc Beuchat, Jean-Michel Muller**Number:**RR2004-60**Date:**December 2004**Abstract:**

- Une propriété du recodage de Booth dans sa forme originale est que le premier chiffre non nul qui suit un 1 est forcément -1 et réciproquement. Il est facile de montrer que, dans un tel recodage, tronquer est équivalent à arrondir au plus près. P. Kornerup et J.-M. Muller ont récemment étudié divers systèmes de numération à chiffres signés partageant cette propriété et les ont appelés RN-codes. Dans cet article, nous nous intéressons à des algorithmes d'addition et de multiplication pour des RN-codes en base 2 (c'est-à-dire des recodages de Booth). Nous montrons qu'il est possible d'utiliser les unités arithmétiques des processeurs pour additionner ou multiplier les RN-codes considérés. Nous proposons également des algorithmes tirant parti des spécificités des RN-codes pour générer des opérateurs matériels.

**Keywords:**

- Arithmétique des ordinateurs, RN-codes, recodage de Booth, arrondi au plus près.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+15p**Format:**pdf- Get it
**Revisions: none**

### Could any graph be turned into a small-world?

**By:**Philippe Duchon, Nicolas Hanusse, Emmanuelle Lebhar, Nicolas Schabanel**Number:**RR2004-61**Date:**December 2004**Abstract:**

- In addition to statistical graph properties (diameter, degree, clustering, ...), Kleinberg showed in 2000 that a small-world can also be seen as a graph in which the routing task can be efficiently and easily done. More precisely, in a lattice network augmented by extra random edges (but not chosen uniformly), a short path of polylogarithmic expected length can be found using a greedy algorithm with a local knowledge of the nodes. We call such a graph a navigable small-world since short paths exist and can be followed with partial knowledge of the network. In this paper, we show that a wide class of graphs can be augmented into navigable small-worlds.

**Keywords:**

- Small-world, Random Graph Model, Routing Algorithm.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+7p**Format:**Compressed postscript- Get it
**Revisions: none**

### Evaluation de polynômes et de fractions rationnelles sur FPGA avec des opérateurs à additions et décalages en grande base.

**By:**Romain Michard, Arnaud Tisserand and Nicolas Veyrat-Charvillon**Number:**RR2004-62**Date:**December 2004**Abstract:**

- This work deals with FPGA arithmetic operators, based on shift-and-add algorithm, for polynomial and rational approximation of functions. These operators are high-radix iterations of the E-method proposed by M. Ercegovac. Our results show high performances by mixing the simple architecture of shift-and-add algorithms and the generic nature of polynomial and rational approximations.

**Keywords:**

- Computer Arithmetic, Hardware Arithmetic Operator, Polynomial Evaluation, Rational Fraction Evaluation, E-method.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+10p**Format:**Compressed postscript- Get it
**Revisions: none**

### An efficient abstract machine for Safe Ambients.

**By:**Daniel Hirschkoff, Damien Pous, Davide Sangiorgi**Number:**RR2004-63**Date:**December 2004**Abstract:**

- Safe Ambients (SA) are a variant of the Ambient Calculus (AC) in which
types can be used to avoid certain forms of interferences among
processes called
*grave interferences*.

An abstract machine, called GCPAN, for a distributed implementation of typed SA is presented and studied. Our machine improves over previous proposals for executing AC, or variants of it, mainly through a better management of special agents (the*forwarders*), created upon code migration to transmit messages to the target location of the migration. Well-known methods (such as reference counting and union-find) are applied in order to garbage collect forwarders, thus avoiding long -- possibly distributed -- chains of forwarders, as well as avoiding useless persistent forwarders.

The proof of correctness of GCPAN, and a description of a distributed implementation of the abstract machine in OCaml are given.

Correctness is established by proving a weak bisimilarity result between GCPAN and a previous abstract machine for SA, and then appealing to the correctness of the latter machine. This is simpler than comparing GCPAN directly with SA, but it involves reasoning modulo `administrative reduction steps' in both machines and therefore standard techniques for simplifying proofs of weak bisimilarity results are not applicable.

More broadly, this study is a contribution towards understanding issues of correctness and optimisations in implementations of distributed languages encompassing mobility.

**Keywords:**

- Mobile Ambients, abstract machine, forwarders, garbage collection, bisimulation.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+33p**Format:**Compressed postscript- Get it
**Revisions: none**