### Normal Forms for General Polynomial Matrices.

**By:**Bernd Beckermann, George Labahn, Gilles Villard**Number:**RR2002-01**Date:**January 2002**Abstract:**

- We present an algorithm for the computation of a shifted Popov Normal
Form of a rectangular polynomial matrix. For specific input shifts, we
obtain methods for computing the matrix greatest common divisor of
two matrix polynomials (in normal form) or such polynomial normal form
computation as the classical Popov form and the Hermite Normal Form. The
method is done by embedding the problem of computing shifted forms into
one of computing matrix rational approximants. This has the advantage of
allowing for fraction-free computations over integral domains such as Z[z]
or K [z_1,..., z_n][z].
In the case of rectangular matrix input, the corresponding multipliers for the shifted forms are not unique. We use the concept of minimal matrix approximants to introduce a notion of minimal mutipliers and show how such multipliers are computed by our methods.

**Keywords:**

- Popov Normal Form, Hermite Normal Form, Matrix Gcd, Exact Arithmetic, Fraction-Free Algorithm.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+31p**Format:**Compressed postscript- Get it

### Computing the sign or the value of the determinant.

**By:**E. Kaltofen, Gilles Villard**Number:**RR2002-02**Date:**January 2002**Abstract:**

- Computation of the sign of the determinant of a matrix and determinant itself is a challenge for both numerical and exact methods. We survey the complexity of existing methods to solve these problems when the input is an n x n matrix A with integer entries. We study the bit-complexities of the algorithms asymptotically in n and the norm of A. Existing approaches rely on numerical approximate computations, on exact computations, or on both types of arithmetic in combination.

**Keywords:**

- Determinant, Bit-Complexity, Integer Matrix, Approximate Computation, Randomized Algorithms.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+14p**Format:**Compressed postscript- Get it

### Hyperbolic Recognition by Cellular Automata.

**By:**Christophe Papazian, Eric Rémila**Number:**RR2002-03**Date:**January 2002**Abstract:**

- Graph automata were first introduced by P. Rosenstiehl. A. Wu and A. Rosenfeld showed later how a graph automaton can study its own structure, by building a system of signals that explore the underlying graph, giving in this way algorithms in linear time allowing to know if the graph is a regular grid. Then, E. Rémila extended this result to other geometrical structures. We show here a very general method that allows to recognize all finite subsets of some Cayley graphs, without using some particular Euclidean information, like orientation, but some more general properties of automatic groups. Depending on the class of graphs we want to recognize, we can finally do different processing on the border of the detected geometrical structure, by using the small cancellations theorem. We study such graphs due to their good properties in network computing.

**Keywords:**

- Finite Automata, Hyperbolic Graphes.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+14p**Format:**Compressed postscript- Get it

### Parallel out-of-core matrix inversion.

**By:**Eddy Caron, Gil Utard**Number:**RR2002-04**Date:**January 2002**Abstract:**

- This paper presents a parallel out-of-core algorithm to invert huge matrices, that is when size of matrices is larger than the available physical memory by one or more orders of magnitude. Preliminary performance results are shown for a commodity cluster. An accurate prediction performance model of the algorithm is given. Thanks to the prediction model, optimizations which avoid the overhead of the out-of-core algorithm are derived. Performances of the optimized algorithm using a $O(N)$ memory size are similar to the performances of the best known parallel in-core algorithm using a $O(N^2)$ memory size (where $N$ is the matrix order). There is no memory restriction for inversion of huge matrices.

**Keywords:**

- Data-Parallelism, Out-of-Core, Matrix Factorization.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+17p**Format:**Compressed postscript- Get it

### Heterogeneous Multi-Cluster Networking with the Madeleine III Communication Library.

**By:**Olivier Aumage**Number:**RR2002-05**Date:**January 2002**Abstract:**

- This paper introduces the new version of the Madeleine portable multi-protocol communication library. Madeleine version III now includes full, flexible multi-cluster support associated to a redesigned version of the transparent multi-network message forwarding mechanism. Madeleine III works together with a new configuration management module to handle a wide panel of network-heterogeneous multi-cluster configurations. The integration of a new topology information system allows programmers of parallel computing applications to build highly optimized distributed algorithms on top of the transparent multi-network communication system provided by Madeleine III's virtual networks. The preliminary experiments we conducted regarding the new virtual network capabilities of Madeleine III showed interesting results with an asymptotic bandwidth of 43 MB/s over a virtual link made of a SISCI/SCI and a BIP/Myrinet physical link.

**Keywords:**

- Cluster of Cluster, High-Speed Networks, Multi-Protocol Communication, Heterogeneity.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+14p**Format:**Compressed postscript- Get it

### Impact of the implementation of MPI Point-to-Point Communications on the Performance of Two General Sparse Solvers.

**By:**Patrick R. Amestoy, Iain S. Duff, Jean-Yves L'Excellent, Xiaoye S. Li**Number:**RR2002-06**Date:**January 2002**Abstract:**

- We examine the send and receive mechanisms of MPI and how to implement message passing robustly so that performance is not significantly affected by changes to the MPI system. We discuss this within the context of two different parallel algorithms for sparse Gaussian elimination: a multifrontal solver (MUMPS), and a supernodal one (SuperLU). The performance of our initial strategies based on simple MPI point-to-point communication primitives is very sensitive to the MPI system, particularly the way MPI buffers are used. Using more sophisticated nonblocking communication primitives improves the performance robustness and scalability, but at the cost of increased code complexity.

**Keywords:**

- MPI, Message Passing, Immediate Communication Primitives, Direct Solvers, Sparse Matrices.

**Availability:**Electronic copy only.**Citation:**Submitted to Parallel Computing.**Size:**2+21p**Format:**Compressed postscript- Get it

### A polynomial-time algorithm for allocating independent tasks on heterogeneous fork-graphs.

**By:**Olivier Beaumont, Araud Legrand, Yves Robert**Number:**RR2002-07**Date:**February 2002**Abstract:**

- In this paper, we consider the problem of allocating a large number of independent, equal-sized tasks to a heterogeneous processor farm. The master processor and the p slaves have different computation and communication capabilities. We assume communication-computation overlap for each slave (and for the master), but the communication medium is exclusive: the master can only communicate with a single slave at each time-step. We give a polynomial-time algorithm to solve the following scheduling problem: given a time-bound T, what is the maximal number of tasks that can be processed by the master and the p slaves within T time-units?.

**Keywords:**

- Heterogeneous Processors, Scheduling, Mapping, Master-Slave, Complexity.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+10p**Format:**Compressed postscript- Get it

### Software Carry-Save for Fast Multiple-Precision Algorithms.

**By:**David Defour, Florent de Dinechin**Number:**RR2002-08**Date:**February 2002**Abstract:**

- This paper introduces a new machine representations of multiple-precision (MP) numbers, geared toward simple and fast implementations. We observe, in the usual high-radix representations, that carry management accounts for a lot of the complexity of the core multiple-precision algorithms (addition and multiplication). Our representation therefore trades space for simplicity : the digits of the multiple-precision numbers are coded on less bits than the machine numbers can offer. The reserved bits are used during MP addition and multiplication to guarantee that no overflow or rounding can occur in any internal computations. This leads to simple and therefore fast algorithms. In other words, all the carries generated in these internal computations are saved in these reserved bits, to be managed as late as possible. In addition to simplicity, this \emph{software carry-save representation} allows to expose parallelism just like its hardware counterpart. An initial implementation of this representation is shown to compare favorably to other multiple-precision libraries.

**Keywords:**

- Multiple Precision, Representation, Addition, Multiplication.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+10p**Format:**Compressed postscript- Get it

### The Equivalent Differentiated Services Model.

**By:**Benjamin Gaidioz, Pascale Primet**Number:**RR2002-09**Date:**February 2002**Abstract:**

- It is now quite agreed that the IP layer should provide more sophisticated services than the simple best-effort service to meet the application's quality of service requirements. Different proposals for improving the IP stack have been proposed but their deployment on experimental network shows different scaling problems. In this paper we present a new differentiated service scheme called ``Equivalent Differentiated Services''. The main features of the EDS scheme is that it allows to differentiate an arbitrary number of traffic classes without giving an absolute better service to any. EDS is simple, robust and scalable and provides no absolute guarantees in delay or loss rate to individual flows or aggregates. EDS acts as a network layer protocol and needs some adaptation done by an end-to-end transport layer. We present the EDS model, then we examine the implementation, simulation and validation issues.

**Keywords:**

- IP QoS, Service Differentiation, Different But Equal Models, Non-Elevated Models.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+9p**Format:**Compressed postscript- Get it

### Génération automatique d'architectures de calcul pour des opérations linéaires : application à l'IDCT sur FPGA.

**By:**Nicolas Boullis, Arnaud Tisserand**Number:**RR2002-10**Date:**March 2002**Abstract:**

- This article presents an automatic generation method for hardware operators based on multiplications by constants and additions. Using number recoding and decated common sub-expressions factorisation algorithms, one can significantly reduce the area of the generated operators. This method was implemented into a VHDL generator and tested on an IDCT application on Virtex FPGA from Xilinx. For this specific application, a 5.8 improvement factor was obtained for the area x delay product.

**Keywords:**

- Automatic Generation of Hardware Arithmetic Operators, Multiplication by Constants, Linear Operators, DCT, FPGA.

**Availability:**Electronic copy only.**Citation:**Proceedings of "Symposium en Architecture Nouvelles de Machines (SYMPA'8)", Hammamet, TUNISIA, April 10-13, 2002 (in French).**Size:**2+8p**Format:**Compressed postscript- Get it

### Improving Cluster IO Performance with READ^2.

**By:**Olivier Cozette, Cyril Randriamaro, Gil Utard**Number:**RR2002-11**Date:**March 2002**Abstract:**

- Grand challenge applications often need to process large amounts of data so high performance IO systems are needed. Cluster computing is a good approach to build cost effective IO intensive platform: sort benchmark records (MinuteSort, Datamation) are held by cluster architecture! New advances in IO technologies (disk, controller and network) let expected significant performance improvements for data intensive applications on cluster architectures. The counterpart of this evolution is that more stress is put on the different shared buses of each node. In this paper we investigate a solution to reduce the bus pressures to improve IO performance we called READ^2 (Remote Efficient Access to Distant Device). With READ^2, a cluster node is able to directly access to a remote disk without interfering with the remote processor and remote memory. With READ^2, a cluster can be considered as a \emph{shared disk} architecture instead of a \emph{shared nothing} one, and inherits works from the SAN community. This papers presents what are the architectural benefits of READ^2, i.e. a better use of IO and memory buses which may improve performance of streaming application.

**Keywords:**

- High Performance IO, Cluster, Streaming Applications.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+11p**Format:**Compressed postscript- Get it

### Scheduling strategies for master-slave tasking on heterogeneous processor grids.

**By:**Cyril Banino, Olivier Beaumont, Arnaud Legrand, Yves Robert**Number:**RR2002-12**Date:**March 2002**Abstract:**

- In this paper, we consider the problem of allocating a
large number of independent, equal-sized tasks to a heterogeneous
"grid" computing platform. We use a non-oriented graph to model a
grid, where resources can have different speeds of computation and
communication, as well as different overlap capabilities. We show
how to determine the optimal steady-state scheduling strategy for
each processor (the fraction of time spent computing and the
fraction of time spent communicating with each neighbor). This
result holds for a quite general framework, allowing for cycles
and multiple paths in the interconnection graph, and allowing for
several masters.

Because spanning trees are easier to deal with in practice (there is a single path from the master to each node), a natural question arises: how to extract the best spanning tree, i.e. the one with optimal steady-state throughput, out of a general interconnection graph? We show that this problem is NP-hard. Even worse, we show that there exist heterogeneous networks for which the optimal spanning tree has a throughput which is arbitrarily bad in front of the throughput that can be achieved by the optimal (multiple-path) solution. Still, we introduce and compare several low-complexity heuristics to determine a sub-optimal spanning tree. Fortunately, we observe that the best heuristics do achieve an excellent performance in most experiments.

**Keywords:**

- Heterogeneous Processors, Master-Slave Tasking, Communication, Spanning Trees, Complexity.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+29p**Format:**Compressed postscript- Get it

### The theory of Liouville functions.

**By:**Pascal Koiran**Number:**RR2002-13**Date:**March 2002**Abstract:**

- A Liouville function is a complex analytic function H with a Taylor series \sum_{n=1}^{\infty} x^n/a_n such the a_n's form a ``very fast growing'' sequence of integers. In this paper we exhibit the complete first-order theory of the complex field expanded with H.

**Keywords:**

- Model Theory, Amalgamation, Complex Variable, Analytic Functions.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+14p**Format:**Compressed postscript- Get it

### Proof of correctness of the Mazoyer's solution of the firing squad problem in Coq.

**By:**Jean Duprat**Number:**RR2002-14**Date:**March 2002**Abstract:**

- The firing squad synchronization is a cellular automata problem introduced by Moore; in 1987, J.Mazoyer gave a six state solution to this problem. The proof of correctness of this solution uses discrete geometrical considerations, but is quite hard to verify due to the multiplication of cases and indexes. To be more confident in this proof, a proof assistant developed by the Inria, Coq has been used. This report exposes the development in Coq of this proof. A large use of inductive structures, a natural way in Coq, offers a clearer vision of the Mazoyer's solution. The full development of the proof is avalaible in the contributions of the Coq software.

**Keywords:**

- Cellular Automata, Firing Squad, Proof Assistant, Coq.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+36p**Format:**Compressed postscript- Get it

### LinBox: A Generic Library for Exact Linear Algebra.

**By:**J.G.Dumas, T.Gautier, M.Giesbrecht, P.Giorgi, B.Hovinen, E.Kaltofen, B.D.Saunders, W.J.Turner and G.Villard**Number:**RR2002-15**Date:**March 2002**Abstract:**

- LinBox is a high-performance generic software library for black box linear algebra over symbolic (exact) entry domains. The generic software methodology enables the user to instantiate the procedures in the library with a multitude of coefficient domains and black box matrices without sacrificing performance. At the top level, LinBox provides algorithms for many standard problems in linear algebra, such as equation solving and matrix normal forms, and includes toolboxes for Lanczos-Krylov approaches and for algebraic preconditioning.

**Keywords:**

- Linear Algebra, Randomized Algorithms, Black Box Matrix, Matrix Rank, Seminumeric Computation.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+12p**Format:**Compressed postscript- Get it

### Using Ambients to Control Resources.

**By:**D. Teller, P. Zimmer, D. Hirschkoff**Number:**RR2002-16**Date:**April 2002**Abstract:**

- Current software and hardware systems, being parallel and
reconfigurable, raise new safety and reliability problems, and the
resolution of these problems requires new methods. Numerous
proposals attempt at reducing the threat of bugs and preventing
several kinds of attacks. In this paper, we develop an extension of
the calculus of Mobile Ambients, named
*Controlled Ambients*, that is suited for expressing such issues, specifically Denial of Service attacks. We present a type system for Controlled Ambients, which makes resource control possible in our setting.

**Keywords:**

- Distributed and Mobile Systems, Resource Control, Process Algebra, Type System, Mobile Ambients.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+28p**Format:**Compressed postscript- Get it

### New Results on Array Contraction.

**By:**Alain Darte, Guillaume Huard**Number:**RR2002-17**Date:**April 2002**Abstract:**

- Array contraction is an optimization that transforms array variables into scalar variables within a loop. While the opposite transformation, scalar expansion, is used for enabling parallelism (with a penalty in memory size), array contraction is used to save memory by removing temporary arrays and to increase locality. Several heuristics have already been proposed to perform array contraction through loop fusion and/or loop shifting. But so far, the complexity of the problem was unknown, and no exact approach was available. In this \report, we prove two NP-complete results that characterize precisely the problem and we give a practical integer linear programming formulation to solve the problem exactly.

**Keywords:**

- Code Optimization, Array Contraction, Memory Reduction, NP-Completeness, Integer Linear Programming.

**Availability:**Electronic copy only.**Citation:**In the proceedings of ASAP'02.**Size:**2+15p**Format:**Compressed postscript- Get it

### Separability, Expressiveness, and Decidability in the Ambient Logic.

**By:**D. Hirschkoff, E. Lozes, D. Sangiorgi**Number:**RR2002-18**Date:**May 2002**Abstract:**

- The
*Ambient Logic*(AL) has been proposed for expressing properties of process mobility in the calculus of Mobile Ambients (MA), and as a basis for query languages on semistructured data. We study some basic questions concerning the descriptive and discriminating power of AL, focusing on the equivalence on processes induced by the logic (*=L*). We consider*MA*, and two Turing complete subsets of it,*MAIF*and*MAIFsyn*, respectively defined by imposing a semantic and a syntactic constraint on process prefixes. The main contributions include: coinductive and inductive operational characterisations of*=L*; an axiomatisation of*=L*on*MAIFsyn*; the construction of characteristic formulas for the processes in*MAIF*with respect to*=L*; the decidability of*=L*on*MAIF*and on*MAIFsyn*, and its undecidability on*MA*.

**Keywords:**

- Distributed and Mobile Systems, Modal Logics, Moble Ambients, Decidability, Expressiveness, Characteristic Formula.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+17p**Format:**Compressed postscript- Get it

### Parallel Extension of a Dynamic Performance Forecasting Tool.

**By:**Eddy Caron, Frédéric Suter**Number:**RR2002-19**Date:**May 2002**Abstract:**

- This article presents an extension of the FAST library to handle parallel routines. FAST is a dynamic performance forecasting tool in a metacomputing environment. Here we propose to combine estimations given by FAST about sequential computation routines and network availability to parallel routine models coming from analysis. This association will be defined and detailed on two examples and validated experimentally.

**Keywords:**

- Performance Forecasting and Modeling, Parallel Routines.

**Availability:**Electronic copy only.**Citation:**Proceedings of the International Symposium on Parallel and Distributed Computing (ISPDC), Iasi, Romania, July 2002.**Size:**2+12p**Format:**Compressed postscript- Get it

### Scheduling strategies for mixed data and task parallelism on heterogeneous processor grids.

**By:**Olivier Beaumont, Arnaud Legrand, Yves Robert**Number:**RR2002-20**Date:**May 2002**Abstract:**

- In this paper, we consider the execution of a complex application on a heterogeneous "grid" computing platform. The complex application consists of a suite of identical, independent problems to be solved. In turn, each problem consists of a set of tasks. There are dependences (precedence constraints) between these tasks. A typical example is the repeated execution of the same algorithm on several distinct data samples. We use a non-oriented graph to model the grid platform, where resources have different speeds of computation and communication. We show how to determine the optimal steady-state scheduling strategy for each processor (the fraction of time spent computing and the fraction of time spent communicating with each neighbor). This result holds for a quite general framework, allowing for cycles and multiple paths in the platform graph.

**Keywords:**

- Heterogeneous Processors, Scheduling, Mixed Data and Task Parallelism.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+13p**Format:**Compressed postscript- Get it

### A Scalable Approach to Network Enabled Servers.

**By:**Eddy Caron, Philippe Combes, Sylvain Contassot-Vivier, Frédéric Desprez, Frédéric Lombard, Jean-Marc Nicod, Laurent Philippe, Martin Quinson, Frédéric Suter**Number:**RR2002-21**Date:**May 2002**Abstract:**

- This report presents the architecture and the algorithms used in DIET (Distributed Interactive Engineering Toolbox), a hierarchical set of components to build Network Enabled Server applications in a Grid environment. This environment is built on top of different tools which are able to locate an appropriate server depending of the client's request, the data location (which can be anywhere on the system, because of previous computations) and the dynamic performance characteristics of the system. Some experiments are related at the end of this report, that exhibit the low cost of adding branches in the hierarchical tree of components and the performance increase induced.

**Keywords:**

- Metacomputing, Computational Servers, Agent Hierarchy.

**Availability:**Electronic copy only.**Citation:**Extended version of the proceedings of Europar 2002, Paderborn, Germany.**Size:**2+16p**Format:**Compressed postscript- Get it

### On edge tricolorations of triangulations of simply connected surfaces.

**By:**Olivier Bodini, Eric Rémila**Number:**RR2002-22**Date:**May 2002**Abstract:**

- This paper studies the tricolorations of edges of triangulations of simply connected orientable surfaces such that the degree of each interior vertex is even. Using previous results on lozenge tilings, we give a linear algorithm of coloration for triangulations of the sphere, or of planar regions with the constraint that the boundary is monochromatic. We define a flip as a shift of colors on a cycle of edges using only two colors. We prove flip connectivity of the set of solutions for the cases seen above, and prove that there is no flip accessibility in the general case where the boundary is not assumed to be monochromatic. Nevertheless, using flips, we obtain a tiling invariant, even in the general case. We finish relaxing the condition, allowing monochromatic triangles. With this hypothesis, there exists some local flips. We give a linear algorithm of coloration, and strong structural results on the set of solutions.

**Keywords:**

- Tiling, Height Function, Flip.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+10p**Format:**Compressed postscript- Get it

### Properties of the subtraction valid for any floating point system.

**By:**Marc Daumas, Sylvie Boldo**Number:**RR2002-23**Date:**June 2002**Abstract:**

- We start in this text with a very generic definition of floating point systems. We show that just a few very natural necessary conditions are sufficient to focus down to two classes of implemented floating point arithmetic. Later, we prove that, for all the existing implementations, high level properties such as Sterbenz's theorem are satisfied. We finish this text by focusing on the differences between an IEEE-754 compatible unit and Texas Instrument TMS/SMJ 320C3x digital signal processing circuit that is recommended for avionics and military applications. The results presented in this text have been validated by the Coq automatic proof checker to build confidence for later implementations in critical systems such as an aircraft flight control primary or secondary computer.

**Keywords:**

- Digital Signal Processing, Floating Point, Avionics, Formal Proof.

**Availability:**Electronic copy only.**Citation:**ERCIM Seventh International Workshop on Formal Methods for Industrial Critical Systems (FMICS 02).**Size:**2+17p**Format:**Compressed postscript- Get it

### Impact of Mixed--Parallelism on Parallel Implementations of Strassen and Winograd Matrix Multiplication Algorithms.

**By:**F. Desprez, F. Suter**Number:**RR2002-24**Date:**June 2002**Abstract:**

- In this paper we study the impact of the simultaneous exploitation of data-- and task--parallelism on Strassen and Winograd matrix multiplication algorithms. We present two mixed--parallel implementations. The former follows the phases of the original algorithms while the latter has been designed as the result of a list scheduling algorithm. We give a theoretical comparison, in terms of memory usage and execution time, between our algorithms and classical data--parallel implementations. This analysis is corroborated by experiments. Finally we give some hints about an heterogeneous version of our algorithms.

**Keywords:**

- Mixed--Parallelism, Matrix Product, Strassen, Winograd.

**Availability:**Electronic copy only.**Citation:**ERCIM Seventh International Workshop on Formal Methods for Industrial Critical Systems (FMICS 02).**Size:**2+20p**Format:**Compressed postscript- Get it

### Small Multiplier-based Multiplication and Division Operators for Virtex-II Devices.

**By:**Jean-Luc Beuchat, Arnaud Tisserand**Number:**RR2002-25**Date:**July 2002**Abstract:**

- This paper presents integer multiplication and division operators dedicated to Virtex-II FPGAs from Xilinx. Those operators are based on small 18x18 multiplier blocks available in the Virtex-II device family. Various trade-offs are explored (computation decomposition, radix, digit sets, ...) using specific VHDL generators. The obtained operators lead to speed improvements up to 18% for multiplication and 40% for division compared to standard solutions only based on CLBs.

**Keywords:**

- Multiplication, Division, FPGA, Virtex-II family.

**Availability:**Electronic copy only.**Citation:**12th International Conference on Field Programmable Logic and Application (FPL 2002).**Size:**2+10p**Format:**Compressed postscript- Get it

### High Throughput Implementations of the RC6 Block Cipher Using Virtex-E and Virtex-II Devices.

**By:**Jean-Luc Beuchat**Number:**RR2002-26**Date:**July 2002**Abstract:**

- This short paper is devoted to the study of effective hardware architectures for the RC6 block cipher using Virtex-E and Virtex-II FPGA devices. The key point of the implementation is the design of an arithmetic operator computing f(X)=(X(2X+1))mod 2^w. Significant speed and area improvements are obtained by taking full advantage of the small multiplier blocks available in Virtex-II devices.

**Keywords:**

- FPGA, RC6 Block Cipher, Computer Arithmetic, Cryptography.

**Availability:**Electronic copy only.**Citation:**Submitted to IEEE Transactions and Circuits and Systems - Part II.**Size:**2+10p**Format:**Compressed postscript- Get it

### Motivations for an arbitrary precision interval arithmetic and the MPFI library.

**By:**Nathalie Revol, Fabrice Rouillier**Number:**RR2002-27**Date:**July 2002**Abstract:**

- This paper explains why an arbitrary precision interval arithmetic
is needed: to provide accurate results, an interval computation requires small input
intervals; this explains why bisection is so often employed in interval algorithms.

The MPFI library has been built in order to fulfill this need: indeed, no such library met the required specifications. Its main features are briefly given and a comparison with a fixed-precision interval arithmetic, on a specific problem, is presented: it shows that the overhead due to the management of the multiple precision is completely admissible.

Eventually, some applications based on MPFI are given: robotics, isolation of polynomial real roots (by an algorithm combining symbolic and numerical computations) and approximation of real roots with arbitrary accuracy.

**Keywords:**

- Arbitrary Precision Interval Arithmetic, Reliability and Accuracy.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+11p**Format:**Compressed postscript- Get it

### MetaSimGrid : Towards realistic scheduling simulation of distributed applications.

**By:**Arnaud Legrand, Julien Lerouge**Number:**RR2002-28**Date:**July 2002**Abstract:**

- Most scheduling problems are already hard on homogeneous platforms, they become quite intractable in an heterogeneous framework such as a metacomputing grid. In the best cases, a guaranteed heuristic can be found, but most of the time, it is not possible. Real experiments or simulations are often involved to test or to compare heuristics. However, on a distributed heterogeneous platform, such experiments are technically difficult to drive, because of the genuine instability of the platform. It is almost impossible to guarantee that a platform which is not dedicated to the experiment, will remain exactly the same between two tests, thereby forbidding any meaningful comparison. Simulations are then used to replace real experiments, so as to ensure the reproducibility of measured data. A key issue is the possibility to run the simulations against a realistic environment. The main idea of trace-based simulation is to record the platform parameters today, and to simulate the algorithms tomorrow, against the recorded data: even though it is not the current load of the platform, it is realistic, because it represents a fair summary of what happened previously. A good example of a trace-based simulation tool is \textscSimGrid, a toolkit providing a set of core abstractions and functionalities that can be used to easily build simulators for specific application domains and/or computing environment topologies. Nevertheless, \textscSimGrid lacks a number of convenient features to craft simulations of a distributed application where scheduling decisions are not taken by a single process. Furthermore, modeling a complex platform by hand is fastidious for a few hosts and is almost impossible for a real grid. This report is a survey on simulation for scheduling evaluation purposes and present \textscMetaSimGrid, a simulator built on top of \textscSimGrid.

**Keywords:**

- Simulation, Scheduling, Distributed Application.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+44p**Format:**Compressed postscript- Get it

### Static scheduling strategies for heterogeneous systems.

**By:**Olivier Beaumont, Arnaud Legrand, Yves Robert**Number:**RR2002-29**Date:**July 2002**Abstract:**

- In this paper, we consider static scheduling techniques for heterogeneous systems, such as clusters and grids. We successively deal with minimum makespan scheduling, divisible load scheduling and steady-state scheduling. Finally, we discuss the limitations of static scheduling approaches.

**Keywords:**

- Static Scheduling, Clusters, Grids, Makespan, Divisible Load, Steady-State.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+14p**Format:**Compressed postscript- Get it

### Accelerating Floating-Point Division When the Divisor is Known in Advance.

**By:**Jean-Michel MULLER**Number:**RR2002-30**Date:**August 2002**Abstract:**

- We present techniques for accelerating the floating-point computation of x/y when y is known before x. The goal is to get exactly the same result as with usual division with rounding to nearest. These techniques can be used by compilers to make some numerical programs run faster, without any loss in terms of accuracy..

**Keywords:**

- Computer arthmetic, Floating-point Arithmetic, Division, Compilation Optimization.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+14p**Format:**Compressed postscript- Get it

### Vandermonde Matrices, NP-Completeness, and Transversal Subspaces.

**By:**Alexander Chistov, Hervé Fournier, Leonid Gurvits, Pascal Koiran**Number:**RR2002-31**Date:**September 2002**Abstract:**

- Let E be a vector space of dimension n over an infinite field K. We give polynomial time constructions of families of r-dimensional subspaces of E with the following transversality property: any linear subspace of E of dimension n-r is transversal to at least one element of the family. We also give a new NP-completeness proof for the following problem: given two integers n and m (m bigger than n) and a n times m matrix A with integer entries, decide whether there exists a n times n sub-determinant of A which is equal to zero.

**Keywords:**

- Linear Algebra, Transversality, Derandomization, NP-completeness.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+6p**Format:**Compressed postscript- Get it

### Modular Multiplication for FPGA Implementation of the IDEA Block Cipher.

**By:**Jean-Luc Beuchat**Number:**RR2002-32**Date:**September 2002**Abstract:**

- The IDEA block cipher is a symmetric-key algorithm which encrypts 64-bit plaintext blocks to 64-bit ciphertext blocks, using a 128-bit secret key. The security of IDEA relies on combining operations from three algebraic groups: integer addition modulo 2^n, bitwise exclusive or of two $n$-bit words, and integer multiplication modulo (2^n+1) which is the critical arithmetic operation of the block cipher. In this paper, we investigate three algorithms based on a small multiplication with a subsequent modulo correction. They are particularly well suited for the latest FPGA devices embedding small multiplier blocks, like the Virtex-II family. We also consider a multiplier based on modulo (2^n+1) adders. Several architectures of the IDEA block cipher are then described and compared from different point of view: throughput to area ratio or adequation with feedback and non-feedback chaining modes. Our fastest circuit achieves a throughput of 8.5 Gb/s, which is, to our knowledge, the best rate reported in the literature.

**Keywords:**

- Computer Arithmetic, Modulo (2^n+1) Multiplication, IDEA Block Cipher, Cryptography, FPGA.

**Availability:**Electronic copy only.**Citation:**Submitted to FPGA2003.**Size:**2+10p**Format:**Compressed postscript- Get it

### Evaluation polynomiale en-ligne de fonctions élémentaires sur FPGA.

**By:**Jean-Luc Beuchat et Arnaud Tisserand**Number:**RR2002-33**Date:**September 2002**Abstract:**

- Cet article présente une architecture matérielle pour l'évaluation de fonctions élémentaires en arithmétique en-ligne sur circuits FPGA. Les points particuliers traités sont la détermination automatique de bons polynômes d'approximation, et la génération automatique de la description VHDL synthétisables des opérateurs correspondants. Cette approche a été implantée et validée sur des circuits FPGA Xilinx.

**Keywords:**

- Arithmétique des Ordinateurs, Evaluation des Fonctions Elémentaires, Approximation Polynomiale, Arithmétique en-ligne, Circuits FPGA.

**Availability:**Electronic copy only.**Citation:**A paraître dans TSI. (in French).**Size:**2+13p**Format:**Compressed postscript- Get it

### One-Step Algorithm for Mixed Data and Task Parallel Scheduling Without Data Replication.

**By:**V. Boudet, F. Desprez, F. Suter**Number:**RR2002-34**Date:**October 2002**Abstract:**

- In this paper we propose an original algorithm for mixed data and task parallel scheduling. The main specificities of this algorithm are to simultaneously perform the allocation and scheduling processes, and avoid the data replication. The idea is to base the scheduling on an accurate evaluation of each task of the application depending on the processor grid. Then no assumption is made with regard to the homogeneity of the execution platform. The complexity of our algorithm are given. Performance achieved by our schedules both in homogeneous and heterogeneous worlds, are compared to data-parallel executions for two applications: the complex matrix multiplication and the Strassen decomposition.

**Keywords:**

- Mixed--parallelism, Scheduling, Data Replication.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+18p**Format:**Compressed postscript- Get it

### Timed continuous Petri nets and optimization via linear programming.

**By:**Bruno Gaujal, Alessandr0 Giua**Number:**RR2002-35**Date:**October 2002**Abstract:**

- In this paper, we consider a deterministic timed continuous Petrinet model where conflicts at places are solved by usingstationary routing parameters. We show how to compute thestationary firing rate for all transitions via linearprogramming, so as to determine the optimal routing parametersthat maximize the firing rates. Finally, we discuss the relationswith discrete Petri nets.

**Keywords:**

- Continuous Petri nets, Optimal Routing, Linear Programming.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+25p**Format:**Compressed postscript- Get it

### Optimal algorithms for scheduling divisible workloads on heterogeneous systems.

**By:**Olivier Beaumont, Arnaud Legrand, Yves Robert**Number:**RR2002-36**Date:**October 2002**Abstract:**

- In this paper, we discuss several algorithms for scheduling divisible loads on heterogeneous systems. Our main contributions are (i) new optimality results for single-round algorithms and (ii) the design of an asymptotically optimal multi-round algorithm. This multi-round algorithm automatically performs resource selection, a difficult task that was previously left to the user. Because it is periodic, it is simpler to implement, and more robust to changes in the speeds of processors or communication links. On the theoretical side, to the best of our knowledge, this is the first published result assessing the absolute performance of a multi-round algorithm. On the practical side, extensive simulations reveal that our multi-round algorithm outperforms existing solutions on a large variety of platforms, especially when the communication-to-computation ratio is not very high (the difficult case).

**Keywords:**

- Scheduling, Divisible Tasks, Multi-Round Algorithms, Asymptotical Optimality.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+23p**Format:**Compressed postscript- Get it

### Some Modular Adders and Multipliers for Field Programmable Gate Arrays.

**By:**Jean-Luc Beuchat**Number:**RR2002-37**Date:**October 2002**Abstract:**

- This paper is devoted to the study of number representations and algorithms leading to efficient implementations of modular adders and multipliers on recent Field Programmable Arrays. Our hardware operators take advantage of the building blocks available in such devices: carry-propagate adders, memory blocks, and sometimes embedded multipliers. The first part of the paper describes three basic methodologies to carry out a modulo m addition and presents in more details the design of modulo $(2^n\pm 1)$ adders. The major result is a new modulo $(2^n+1)$ addition algorithm leading to an area-time efficient implementation of this arithmetic operation on FPGAs. The second part describes a modulo $m$ multiplication algorithm involving small multipliers and memory blocks, and modulo $(2^n+1)$ multipliers based on Ma's algorithm. We also suggest some improvements of this operator in order to perform a multiplication in the group $(\mathbb{Z}^*_{2^n+1}, \cdot)$.

**Keywords:**

- Computer Arithmetic, Modulo m Addition, Modulo m Multiplication, FPGA.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+10p**Format:**Compressed postscript- Get it

### Cache-Optimised Methods for the Evaluation of Elementary Functions.

**By:**David Defour**Number:**RR2002-38**Date:**October 2002**Abstract:**

- The ratio between processor speed and memory speed frequently makes efficient use of cache memory a very important element in performance of user's application. This is the case for many elementary function algorithms. They commonly use tables, so that caches have a deep impact on speed. This paper discusses and quantifies the impact of cache usage over both major criteria in the evaluation of elementary functions: speed and accuracy.

**Keywords:**

- Elementary Functions, Cache memory, Microprocessors, Application Performance.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+12p**Format:**Compressed postscript- Get it

### "Partially rounded" Small-Order Approximations for Accurate, Hardware-Oriented, Table-Based Methods.

**By:**Jean-Michel Muller**Number:**RR2002-39**Date:**October 2002**Abstract:**

- We aim at evaluating elementary and special functions using small tables and small, rectangular, multipliers. To do that, we show how accurate polynomial approximations whose order-$1$ coefficients are small in size (a few bits only) can be computed. We compare the obtained results with similar work in the recent literature.

**Keywords:**

- Computer arithmetic, elementary and special functions, table-based methods, polynomial approximations.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+10p**Format:**Compressed postscript- Get it

### A Network Model for Simulation of Grid Application.

**By:**Henri Casanova, Loris Marchal**Number:**RR2002-40**Date:**October 2002**Abstract:**

- In this work we investigate network models that can be potentiallyemployed in the simulation of scheduling algorithms for distributedcomputing applications. We seek to develop a model of TCP communicationwhich is both high-level and realistic. Previous research works show thataccurate and global modeling of wide-area networks, such as the Internet,faces a number of challenging issues. However, some global models offairness and bandwidth-sharing exist, and can be link withthe behavior of TCP.Using both previous results and simulation (with NS), weattempt to understand the macroscopic behavior of TCPcommunications. We then propose a global model of the network for the Grid platform. We perform partial validation of this model in simulation.The model leads to an algorithm for computing bandwidth-sharing. Thisalgorithm can then be implemented as part of Grid application simulations.We provide such an implementation for the SimGrid simulationtoolkit.

**Keywords:**

- Grid Application, Simulation, TCP Modeling, Bandwith-Sharing, Fairness, SimGrid.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+31p**Format:**Compressed postscript- Get it

### New Complexity Results on Array Contraction and Related Problems (Extension to Research Report 2002-17).

**By:**Alain Darte, Guillaume Huard**Number:**RR2002-41**Date:**October 2002**Abstract:**

- Array contraction is an optimization that transforms array variables into scalar variables within a loop. While the opposite transformation, scalar expansion, is used for enabling parallelism (with a penalty in memory size), array contraction is used to save memory by removing temporary arrays and to increase locality. Several heuristics have already been proposed to perform array contraction through loop fusion and/or loop shifting. But so far, the complexity of the problem was unknown, and no exact approach was available (and even more, only a sufficient condition for array contraction was used). In this report, we focus on the theoretical aspects of the problem. We prove several NP-complete results that characterize precisely its complexity and we provide an integer linear programming formulation to solve the problem exactly. Our study also proves the NP-completeness of similar problems whose complexity was not established so far.

**Keywords:**

- Code Optimization, Array Contraction, Memory Reduction, Complexity, NP-Completeness, Integer Linear Programming.

**Availability:**Electronic copy only.**Citation:**Extended version of the publication at ASAP'02.**Size:**2+27p**Format:**Compressed postscript- Get it

### On the Memory Usage of a Parallel Multifrontal Solver.

**By:**Abdou Guermouche, Jean-Yves L'Excellent, Gil Utard**Number:**RR2002-42**Date:**November 2002**Abstract:**

- We are interested in the memory usage of sparse direct solvers. We particularly focus on the parallel multifrontal scheme. In the multifrontal approach two kinds of memory can be distinguished: a static one which corresponds to the result of the factorization process (ie, the factors), and a dynamic or active one, usually handled by a stack mechanism, which corresponds to the working space of the factorization process. For some problems the stack size may be as large as and even greater than the final factors. The size of the stack depends on the assembly tree and on how the computation is distributed. We present an extensive study of the impact of state-of-the-art sparse matrix reordering techniques on the assembly tree and on the memory occupation of the \mumps{} solver in both sequential and parallel executions. The main observation of this study is that the stack of parallel multifrontal solvers does not scale well if a dynamic scheduling strategy based only on the balance of the workload is used.

**Keywords:**

- Sparse Matrices, Multifrontal Method, Assembly Tree, Reordering Techniques, Memory.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+14p**Format:**Compressed postscript- Get it

### A Kleene Theorem for Piecewise Constant Signals Automata (extended abstract).

**By:**J. Durand-Lose**Number:**RR2002-43**Date:**November 2002**Abstract:**

- In this paper, we consider timed automata for piecewise constant signals.In the model presented here, time elapses only during transitions; any constraint on clocks should be satisfied during all the duration of the transition. Signal automata are very different from un-timed and time-event automata because piecewise constant signals may be split (and spliced) in an infinite number of ways. We show that there exist signal regular expressions with renaming describing exactly the languages accepted by signal automata. The constructions show the similarities and differences from the time-event model.

**Keywords:**

- Timed Automata, Piecewise Constant Signals, Regular Expression.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+10p**Format:**Compressed postscript- Get it

### Necessary and sufficient conditions for exact floating point operations.

**By:**Sylvie Boldo, Marc Daumas**Number:**RR2002-44**Date:**November 2002**Abstract:**

- Studying floating point arithmetic, authors have shown that the implemented operations (addition, subtraction, multiplication, division and square root) can compute a result and an exact correcting term using the same format as the inputs. Following a path initiated in 1965, all the authors supposed that neither underflow nor overflow occurred in the process. Overflow is not critical as some kind of exception is triggered by such an event that creates remanent non numeric quantities. Underflow may be fatal to the process as it returns wrong numeric values with little warning. Our new necessary and sufficient conditions guarantee that the exact floating point operations are correct when the result is a number. We also present properties when precise rounding is not available in hardware and faithful rounding alone is performed such as using some digital signal processing circuit. We have validated our proofs against the Coq automatic proof checker. Our development has raised many questions, some of them were expected while other ones were very surprising.

**Keywords:**

- Virgule flottante, Norme IEEE 754, Preuve formelle, Coq.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+1p**Format:**Compressed postscript- Get it

### A new scheme for table-based evaluation of functions.

**By:**David Defour, Florent de Dinechin, Jean-Michel Muller**Number:**RR2002-45**Date:**November 2002**Abstract:**

- This paper presents a new scheme for the hardware evaluation of functions in fixed-point format, for precisions up to 30 bits. This scheme yields an architecture made of four look-up tables, a multi-operand adder, and two small multipliers. This new method is evaluated and compared with other published methods.

**Keywords:**

- Bipartite tables, table-and-addition methods, multipartite, Elementary functions.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+7p**Format:**Compressed postscript- Get it

### Tiling a polygon with two kinds of rectangles.

**By:**Olivier Bodini, Eric Remila**Number:**RR2002-46**Date:**November 2002**Abstract:**

- We fix two rectangles with integer dimensions. We give a quadratic time algorithm which, given a polygon $F$ as input, produces a tiling of $F$ with translated copies of our rectangles (or indicates that there is no tiling). Moreover, we prove that any pair of tilings can be linked by a sequence of local transformations of tilings, called flips. This study is based on the use of J. H. Conway's tiling groups and extends the results of C. Kenyon and R. Kenyon (limited to the case when each rectangle has a side of length $1$).

**Keywords:**

- Tiling, Tiling Group, Height Function, Flip.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+15p**Format:**Compressed postscript- Get it

### Straight-line computation of the polynomial matrix inverse.

**By:**Claude-Pierre Jeannerod, Gilles Villard**Number:**RR2002-47**Date:**December 2002**Abstract:**

- We present an inversion algorithm for nonsingular n x n matrices whose entries are degree d polynomials over a field. The algorithm is deterministic and requires O~(n^3d) field operations for a generic input. The ``soft Oh'' notation O~ indicates some missing log(nd) factors. The polynomial matrix inverse can thus be computed by a straight-line program whose size is - asymptotically and up to logarithmic factors - the size of its output.

**Keywords:**

- Matrix Polynomial, Matrix Inversion, Minimal Basis.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+10p**Format:**Compressed postscript- Get it

### Choosing Starting Values for Newton-Raphson Computation of Reciprocals, Square-Roots and Square-Root Reciprocals.

**By:**Peter Kornerup and Jean-Michel Muller**Number:**RR2002-48**Date:**December 2002**Abstract:**

- We aim at finding the best possible seed values when computing reciprocals, square-roots and square-root reciprocals in a given interval using Newton-Raphson iterations. A natural choice of the seed value would be the one that best approximates the expected result. It turns out that in most cases, the best seed value can be quite far from this natural choice. When we evaluate a monotone function f(a) in the interval [a_{min},a_{max}], by building the sequence x_n defined by the Newton-Raphson iteration, the natural choice consists in choosing x_0 equal to the arithmetic mean of the endpoint values. This minimizes the maximum possible distance between x_0 and f(a). And yet, if we perform n iterations, what matters is to minimize the maximum possible distance between x_n and f(a).

**Keywords:**

- Computer Arithmetic, Newton-Raphson Iteration, Division, Square-Root, Square-Root Reciprocal.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+10p**Format:**Compressed postscript- Get it

### Minimal enclosing parallelepiped in 3D.

**By:**Frédéric Vivien, Nicolas Wicker**Number:**RR2002-49**Date:**December 2002**Abstract:**

- We investigate the problem of finding a minimal volume parallelepiped enclosing a given set of n three-dimensional points. We give two mathematical properties of these parallelepipeds, from which we derive two algorithms of theoretical complexity O(n^6). Experiments show that in practice our quickest algorithm runs in O(n^2) (at least for n <= 10 000). We also present our application in structural biology.

**Keywords:**

- Algorithmic Geometry, Parallelepiped, Bioinformatic.

**Availability:**Electronic copy only.**Citation:**Not published yet.**Size:**2+10p**Format:**Compressed postscript- Get it