Creating a NucleoMiner .db file

Most of the programs implemented in NucleoMiner use a binary version of the tiling array datasets. To convert a tiling array dataset, as obtained previously, we will use the program NMtdb. This program requires a small configuration file that describes the organization of the dataset and specifies whether some simple pre-processing is needed to create the final dataset (which will then be used in subsequent analyses with NucleoMiner).

Since the next step of the tutorial deals with the inference of nucleosome occupancy, we will now prepare the corresponding dataset. As we will see in the next section, the inference of nucleosome occupancy is performed by fitting a Hidden Markov Model (HMM) to the hybridization signal averaged across replicates (separately for each strain).

The program NMtdb can be used directly to compute the average hybridization intensities and to store the result in an appropriate binary file that will be used in the next step. To do so, we first need to create the configuration file. For simplicity, we have created this file for you: Data/BY_S288c/Array/NucOccupancy.dbconf. Let's have a look at its content:

$cat Data/BY_S288c/Array/NucOccupancy.dbconf
############################
# NMtdb configuration file #
############################
##
# The name of the dataset
##
name = BY_NucOccupancy
##
# Number of arrays to consider
##
narray = 3
##
# Membership of the arrays
##
groupId = 1 1 1
##
# Pre-processing algorithm
#    full   = do nothing
#    mean   = compute the mean across replicates (per group)
#    median = compute the median accross replicates (per group)
##
type = mean

As you can see, by using the mean tag in the pre-processing section of the configuration file, we ask NMtdb to compute, for each probe of the input dataset, the mean hybridization value across the three replicates and to keep it as the final data point. Let's run NMtdb:

$NMtdb -i Data/BY_S288c/Array/NucOccupancy_norm.txt \
       -c Data/BY_S288c/Array/NucOccupancy.dbconf \
       -o Data/BY_S288c/Array/NucOccupancy_norm.db
--
 tilingdb
--
 Read configuration file: [ OK ]
--
 Read data file: [ OK ].
--
 Save data: [ OK ]
--

Note that the Read data file: step can take some time, depending on the size of the input dataset. The program has created a new file, Data/BY_S288c/Array/NucOccupancy_norm.db, which is a binary file:

$file Data/BY_S288c/Array/NucOccupancy_norm.db
Data/BY_S288c/Array/NucOccupancy_norm.db: data

So do not try to open this file with a text editor or any other tool: it is very important not to modify this file by hand, otherwise subsequent analyses that require it will run into trouble. Besides, since this is a binary file, it is specific to your machine architecture and operating system, so it is highly recommended not to send it to other users unless they have exactly the same machine architecture and operating system. Instead, send the original input file together with the configuration file and the corresponding NMtdb arguments.
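
One way to follow this advice is to bundle the text input, the configuration file, and a record of the command line into a single archive. This is only a sketch: the share_bundle function, the README_NMtdb.txt note, and the archive name are hypothetical, and it relies solely on standard tar, not on any NucleoMiner feature.

```shell
# Hypothetical helper: archive the text input and the configuration file,
# together with a note recording how the .db file was built.
share_bundle() {
    input=$1; conf=$2; archive=$3
    note=$(dirname "$input")/README_NMtdb.txt
    # Record the command so the recipient can rebuild the .db file locally.
    printf 'NMtdb -i %s -c %s -o %s\n' "$input" "$conf" "${input%.txt}.db" > "$note"
    tar -czf "$archive" "$input" "$conf" "$note"
}

# Usage with the tutorial files (run from the tutorial root directory):
# share_bundle Data/BY_S288c/Array/NucOccupancy_norm.txt \
#              Data/BY_S288c/Array/NucOccupancy.dbconf \
#              BY_NucOccupancy_share.tar.gz
```

The recipient can then unpack the archive and rerun the recorded command to produce a .db file that matches their own machine.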

Finally, we can have a look at the content of Data/BY_S288c/Array/NucOccupancy_norm.db by using the --print option of NMtdb as follows:

$NMtdb -i Data/BY_S288c/Array/NucOccupancy_norm.db --print | head -10
chromosome position A1
chrIII 38 1.332212e+01
chrIII 42 1.333482e+01
chrIII 226 1.473298e+01
chrIII 230 1.263931e+01
chrIII 234 1.470365e+01
chrIII 238 1.290735e+01
chrIII 242 1.438298e+01
chrIII 278 1.357645e+01
chrIII 358 1.241799e+01

As you can see, there is only one column of hybridization intensity values, since we asked NMtdb to compute the mean of the three replicates.
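
To make the effect of type = mean concrete, the sketch below reproduces a per-probe average with awk on a small toy table. The column layout (chromosome, position, then one column per replicate, here named A1 to A3) and the intensity values are assumptions made for illustration; the actual layout of NucOccupancy_norm.txt may differ, and this is not meant to describe how NMtdb is implemented.

```shell
# Toy input: hypothetical layout with one intensity column per replicate.
cat > toy_norm.txt <<'EOF'
chromosome position A1 A2 A3
chrIII 38 13.0 13.5 14.0
chrIII 42 12.0 13.0 14.0
EOF

# Average columns 3..NF for each probe and print a single value per probe.
awk 'NR == 1 { print $1, $2, "mean"; next }
     { sum = 0
       for (i = 3; i <= NF; i++) sum += $i
       printf "%s %s %.6e\n", $1, $2, sum / (NF - 2) }' toy_norm.txt > toy_mean.txt

cat toy_mean.txt
```

On this toy table the first probe averages to 1.350000e+01 and the second to 1.300000e+01, so the three replicate columns collapse into one, as in the --print output above.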



Jean-Baptiste Veyrieras 2010-05-28