BIP bugs

This page gives the status of fixed and pending bugs in the BIP distribution.

Year 1999

*   November 9, BIP 0.99a did not compile with uniprocessor kernels
*   October 10, BIP 0.98 has a few major bugs, use version 0.98a instead
*   October 5, biproute seems to work but reports failure

With the new BIP releases since September, the previous errata are now obsolete.

Year 1998

*   November 2, LANAI 4.3 based cards from Myricom
*   October 10, the "fatal error detected on lanai" syndrome
*   June 1, bip_tprobe does not behave as documented
*   April 11, Message with size not multiple of 4 bytes
*   March 26, Error: "dio_mlock: out of memory"
*   March 2, Error: "bipcc not in standard arborescence"
*   January 8, BIP does not work with GLIBC2 (Redhat 5.0, Debian 2.0)

Year 1997

*   November 5, Some incompatibilities with Portland Group Compilers
*   October 26, BIP0.93a binary not working properly
*   August 7, BIP not working with SMP kernel
*   August 3, ipset putting suffix at wrong place
*   June 21, biproute failing with csh/tcsh shell
*   June 15, module for 2.0.30
*   June 15, IP driver without switch
*   June 10, host names longer than 10 characters
*   June 7, CRC errors ignored
*   June 5, creation of /dev/dio incorrectly documented

BIP 0.99a did not compile with uniprocessor kernels

Corrected in 0.99c

Compilation of bipker.c failed. The fix is to remove the line that includes asm/smplock.h, or to download bip-0.99c.

BIP 0.98 is broken, use version 0.98a instead

Corrected in 0.98a

A few important bugs escaped the testing done for the V0.98 release:
  • The IP network driver was not tested and actually did not work at all.
  • Compilation fails if you call configure with a relative path.
  • BIP may fail at runtime in some full-duplex communication patterns.

All these bugs are fixed in BIP 0.98a.

biproute seems to work but reports failure

It is recommended that biproute be run from the first machine in the list.

While biproute runs, files are normally created in $HOME/.bip on the first machine in the list (with bipmapper launched via rsh). Close to the end, biproute might report that $HOME/.bip/map/mapper.routes does not exist. This is caused by NFS propagation delays: the completion of rsh should imply that the files have been written, but the machine running biproute may still be unable to see them.
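The propagation delay described above can be worked around by polling for the file before giving up. The following is a minimal sketch and not part of BIP; the function name `wait_for_file` and the timeout values are assumptions chosen for illustration.

```python
import os
import time


def wait_for_file(path, timeout=30.0, interval=0.5):
    """Poll until `path` exists or `timeout` seconds have elapsed.

    Hypothetical helper, not part of the BIP distribution.
    Returns True if the file became visible, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For instance, `wait_for_file(os.path.expanduser("~/.bip/map/mapper.routes"))` would wait up to 30 seconds for the routes file to become visible on the local machine. NFS client caching may still delay visibility beyond the chosen timeout.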

The older errata below are obsolete and will generally be of no help when using BIP versions >= 0.97.

BIP driver does not load on LANAI4.3 based cards

Corrected in 0.95c

When trying to "insmod" the bip driver on recent Myricom cards, the load failed with "device or resource busy".

The cause was a conservative check in the code that only accepted the LANAI versions known to work.

fatal error detected on lanai

Corrected in 0.95a

Some BIP or MPI programs failed with the message:
 fatal error detected on lanai: error on LANAI

 ******************************
 DEBUGGING INFO for NODE 1:
 ...
In most cases, these messages were probably due to the failure of BIP and MPI to meet some real-time constraints of the Myrinet network. These problems are completely resolved in recent versions of MPI (please tell us if you still have any problems).

bip_tprobe does not behave as documented

Corrected by updating the documentation

There was a big discrepancy between the documentation and the actual behaviour of bip_tprobe (or bip_probe).

The documentation said that bip_probe returns the message size in bytes or in words (4 bytes); this is wrong. If no message is available, it returns 0. If a message is available, it returns 1.

Message with size not multiple of 4 bytes

Corrected in 0.94c

BIP-MPI often misbehaved when dealing with MPI messages whose size is not a multiple of 4 bytes, which led to program hangs. Errors were more likely with collective operations.

Be aware that plain BIP (as opposed to MPI-BIP) requires sizes that are a multiple of 4 bytes and proper alignment.
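The size constraint for plain BIP amounts to rounding each message length up to the next multiple of 4. The sketch below only illustrates the arithmetic; `padded_size` and `pad_payload` are hypothetical names, and padding with zero bytes is an assumption, not something BIP prescribes.

```python
WORD = 4  # plain BIP requires message sizes that are a multiple of 4 bytes


def padded_size(nbytes):
    """Round nbytes up to the next multiple of WORD."""
    if nbytes < 0:
        raise ValueError("size must be non-negative")
    return (nbytes + WORD - 1) // WORD * WORD


def pad_payload(payload):
    """Return payload extended with zero bytes to a multiple of WORD.

    Zero padding is an illustrative choice; the receiver must know the
    real length by some other means.
    """
    return payload + b"\x00" * (padded_size(len(payload)) - len(payload))
```

A 5-byte payload would be sent as 8 bytes, while a 4-byte payload is left unchanged.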

Error: "dio_mlock: out of memory"

Corrected in 0.94

When sending or receiving messages, BIP needs to lock the buffer into physical memory. The amount of memory that can be locked is limited, so if you use a different buffer each time, you can encounter a variant of the "out of memory" message (even though you have enough swap and physical memory).

Starting from 0.94, BIP tries to unlock memory when needed, so this problem should be resolved.

Error: "bipcc not in standard arborescence"

Corrected in 0.94

When using the bip utilities, if you get a message like:
 Error: "bipcc not in standard arborescence"
check your PATH variable: the .../bip/bin component should not end with a slash. A trailing slash is the cause of the error.
This bug does not occur in BIP versions after 0.94.
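A quick way to spot the offending PATH entry is to list every component that ends with a slash. This is a minimal sketch with a hypothetical helper name; it assumes the usual Unix `:` separator.

```python
def components_with_trailing_slash(path_value):
    """Return the PATH components that end with '/'.

    Hypothetical helper. The root directory "/" on its own is ignored.
    """
    return [
        component
        for component in path_value.split(":")
        if len(component) > 1 and component.endswith("/")
    ]
```

Calling it with `os.environ.get("PATH", "")` reports the entries to fix, for example a bip/bin directory written with a final slash.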

BIP does not work with glibc2 (Redhat 5.0, Debian 2.0)

Corrected in BIP-glibc2_0.94a

The Linux distributions are switching to a new C library: GLIBC2, also known as libc6 (the previous one was libc5). This new C library is a major change (comparable to the switch from the a.out to the ELF format) and is not binary compatible with the previous version. Generally speaking, this means that any ".a" library (like those provided by BIP) compiled on a libc5 system will not be usable on a GLIBC2 system (for instance Redhat 5.0 and Debian 2.0). The opposite is also true.

Trying to use the "libc5-BIP" libraries on a libc6 system (Redhat 5.0, Debian 2.0 and later releases) will generate an error at link time like:
undefined reference to __init_brk in hook.o

Using the new GLIBC2-BIP libraries on a libc5 system may provoke undefined behaviour (including corrupting the memory of your computer, as BIP does some low-level things like DMA).

Starting from the BIP0.94 version, we provide two different releases. The one for the old libc5 systems is named like:
bip_mpi-0.xx-bin.tar.gz
and the one for the GLIBC2 systems is named like:
bip_mpi_glibc2-0.xx-bin.tar.gz

incompatibilities with Portland Group compilers

The BIP distribution contains a bip_pgf77 script to call the pgf77 Fortran compiler instead of the Gnu g77 compiler.

We tested gcc as the C compiler together with pgf77 for Fortran; this combination seems to work.

Using pgcc with BIP does not appear to work. Do not use any combination that involves pgcc.

When using pgf77, be sure either to link with bip_pgf77, or to tell the compiler the full path to the bip/mpi_pg/{libfmpi,libmpi}.a libraries.

BIP 0.93a does not work correctly

Corrected in BIP-0.93b

The 0.93a release of BIP was made to add support for the Portland Group Fortran compiler. Unfortunately, the BIP libraries were not recompiled properly (some old objects were not rebuilt because of bad dependencies while preparing the release).

As a result, only very trivial programs worked with the 0.93a release; most programs failed at the first communication.

Note that this only affects the publicly available release, which is limited to 4 processors. The unlimited version was generated correctly.

Please upgrade to BIP0.93b. "0.93a" will definitely not work.

BIP does not work with SMP kernel

By default, BIP will not work with an SMP kernel (even if your computer has only one processor).
To find out whether your kernel is SMP, check for the line "SMP = 1" in the main Makefile of the Linux kernel source.
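The Makefile check can be scripted. The sketch below scans the Makefile text for an active "SMP = 1" line and treats a line starting with "#" as disabled; the function name is hypothetical and the handling of comments is an assumption about how the setting is switched off.

```python
import re

# Matches an active "SMP = 1" assignment, with optional spacing and trailing comment.
_SMP_LINE = re.compile(r"^\s*SMP\s*=\s*1\s*(#.*)?$")


def makefile_enables_smp(makefile_text):
    """Return True if the Makefile text contains an uncommented 'SMP = 1' line."""
    return any(_SMP_LINE.match(line) for line in makefile_text.splitlines())
```

Typical use would be to read the top-level Makefile of the kernel source tree into a string and pass it to `makefile_enables_smp`.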

A workaround (not guaranteed) for using plain BIP and MPI-BIP with an SMP kernel is to add the compilation flag -D__SMP__ to bip/kernel/Makefile and to recompile the dio module. Note that still only one process on each computer may be used as part of a BIP application.

Note that the bipip network driver will NOT work on an SMP kernel. Mail us if you want to try such a configuration.

ipset put suffix after domain name

Corrected in BIP-0.92b

When configuring ip-bip with ipset, if the hostname is a fully qualified name, ipset searches for the IP address of <fqdn><suffix> instead of looking for <localname><suffix><domain>.

A temporary workaround is to put the IP addresses into /etc/hosts under the name ipset looks for, for instance, with suffix -t:

192.168.0.2 lhpcd.univ-lyon1.fr-t

Alternatively, use the short name when configuring the hostname of each machine.
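To make the two name forms concrete, the sketch below builds both the name the buggy ipset looks up and the intended one. It is an illustration only, not ipset's code; the function name is hypothetical, and the domain is assumed to be everything after the first dot.

```python
def ipset_lookup_names(hostname, suffix):
    """Return (buggy_name, intended_name) for a hostname and a suffix.

    buggy_name    : <fqdn><suffix>, as searched before BIP-0.92b
    intended_name : <localname><suffix><domain>
    """
    local, dot, domain = hostname.partition(".")
    buggy = hostname + suffix
    intended = local + suffix + dot + domain
    return buggy, intended
```

With the hostname lhpcd.univ-lyon1.fr and suffix -t, the first name is the one used in the /etc/hosts workaround, and the second is lhpcd-t.univ-lyon1.fr. With a short hostname both forms coincide, which is why the short-name workaround works.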

biproute is failing for users with a csh/tcsh shell

Corrected in BIP-0.9b

biproute executes commands on remote machines via rsh. For users with a csh shell, it failed with a message like:

unexpected output:00:ff:...

Linux module for 2.0.30 was the same as 2.0.29

Corrected in BIP-0.9a

The BIP distribution contains four versions of the BIP-IP driver: for Linux 2.0.29 or 2.0.30, each as both "versioned" and "non-versioned" modules.

The 2.0.30 modules were in fact incorrectly generated and will refuse to be loaded.

BIP IP driver not working on links without a switch


Corrected in BIP-0.9a

When using a direct connection between two stations with no switch in between, a bug in the IP driver caused messages like:

No myrinet route for ...

All packets were then dropped.

Thanks to Valentin Puente for reporting the problem.


Host names longer than 10 characters

Corrected in BIP-0.9

BIP-0.7 and BIPIP-0.8 limit machine hostnames to a maximum of 10 characters. This is corrected in BIP-0.9 (hostnames of up to 255 characters). What counts here is the length of the hostname as returned by `uname -n`, not the fully qualified name (although the two can be the same).

The problem will only occur at execution with messages such as:

maragota.atc.unican.es: I am not in the config file
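Whether a machine is affected can be checked by measuring the name that `uname -n` reports. The following is a minimal sketch; the helper name is hypothetical, and `os.uname().nodename` is used as the Python equivalent of `uname -n`.

```python
import os

OLD_LIMIT = 10  # maximum hostname length accepted by BIP-0.7 and BIPIP-0.8


def hostname_too_long(nodename, limit=OLD_LIMIT):
    """Return True if nodename exceeds the hostname limit of the old releases."""
    return len(nodename) > limit


if __name__ == "__main__":
    name = os.uname().nodename  # same value as `uname -n`
    print(name, "exceeds the old limit" if hostname_too_long(name) else "fits the old limit")
```

The name in the error message above has 22 characters, so it is rejected by the old releases, while its short form maragota would be accepted.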

Thanks to Valentin Puente for reporting the problem.


CRC errors ignored

Corrected in BIP-0.9

BIP-0.7 and BIPIP-0.8 ignore most CRC errors caused by transmission errors on the Myrinet network, so applications can be fed corrupted data. On detecting a CRC error, the BIP MCP just increments an internal counter and fails to report the error to the application.

In BIP-0.9, any CRC error on the network stops the application with an error message. In BIPIP-0.9, such erroneous packets are silently dropped; the number of such errors can be seen with ifconfig in the rx errors field.

Thanks to Olivier Dalle for reporting the problem.


Creation of /dev/dio is incorrectly documented

Corrected in BIP-0.9

The README for BIP0.7 incorrectly said
3) as root, install the kernel module on each computer of the
configuration, first create a new device with:
 3.1)   mknod /dev/dio b 61 0 
With a device created this way, the application aborts when sending a message larger than 148 bytes, with the following message:

opening /dev/dio: No such device

In fact the b should have been c, so you must instead do:

 3.1)   mknod /dev/dio c 61 0 
                      ^^^         
/dev/dio should look like:

crw-r--r-- 1 root root 61, 0 Feb 24 21:35 /dev/dio
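The device node can also be verified programmatically by inspecting its type and its major and minor numbers. This is a minimal sketch with hypothetical helper names; the expected values (character device, major 61, minor 0) are taken from the corrected mknod command above.

```python
import os
import stat


def describe_device(st_mode, st_rdev):
    """Return (kind, major, minor), where kind is 'c', 'b' or '?'."""
    if stat.S_ISCHR(st_mode):
        kind = "c"
    elif stat.S_ISBLK(st_mode):
        kind = "b"
    else:
        kind = "?"
    return kind, os.major(st_rdev), os.minor(st_rdev)


def is_expected_dio(st_mode, st_rdev):
    """True for a character device with major 61 and minor 0."""
    return describe_device(st_mode, st_rdev) == ("c", 61, 0)
```

On a configured machine, `st = os.stat("/dev/dio")` followed by `is_expected_dio(st.st_mode, st.st_rdev)` would tell whether the node was created as a character device rather than a block device.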

Thanks to Olivier Dalle for pointing out the problem.

Last modified: Wed Nov 10 00:03:03 MET 1999  
© BIP team