NMepm
Perl script to extract these probes from the files previously generated by NMmpa. Before running NMepm, we first need to generate a simple file that lists the names of the chromosomes for which we want to extract the PM probes. Here we will consider all the chromosome sequences. To get this file, we can simply extract the chromosome names from the FASTA file as follows:
$ grep '^>' ../../Data/BY_S288c/Sequence/Genome.fasta \
    | grep -vw chrMito \
    | sed 's/^>//' > ../../Data/BY_S288c/Sequence/Chromosomes.txt
Note that here we removed the mitochondrial chromosome (chrMito) from the list. We can then run NMepm as follows:
$ NMepm -i NMmpa/ -c ../../Data/BY_S288c/Sequence/Chromosomes.txt \
    -o NMepm -t 4 2>NMepm.log
where, as you may recall, NMmpa/ contains the per-strand probe annotation files previously generated with NMmpa, and -t 4 tells NMepm that we want probe genomic coordinates tiled with a unit step of 4 bp. As you can see, the program has created a directory called NMepm (-o NMepm) which contains as many .prb files as the number of chromosomes listed in ../../Data/BY_S288c/Sequence/Chromosomes.txt. Let's have a look at one of these files:
$ head -10 NMepm/chrI.prb
2147 833 2134628 chrI - 41 66 53 55 2
2199 1361 3486360 chrI + 45 70 57 59 2
953 1169 2993594 chrI - 49 74 61 63 2
1530 713 1826811 chrI + 53 78 65 67 2
1417 1001 2563978 chrI - 57 82 69 71 2
771 1239 3172612 chrI + 61 86 73 75 2
1745 1095 2804946 chrI - 65 90 77 79 2
991 1953 5000672 chrI + 69 94 81 83 2
295 1987 5087016 chrI - 73 98 85 87 2
540 1207 3090461 chrI + 77 102 89 91 2
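Before going further, it can be worth checking that a .prb file has the layout we expect. The sketch below is only an illustration: it works on a few rows copied from the listing above (written to a temporary file, not a real NMepm output), counts the lines that do not have exactly 10 fields, and counts the consecutive lines whose ninth field does not differ by a multiple of the tiling step. That the ninth field is the one affected by -t 4 is an assumption drawn from the listing, where it increases by 4 from one line to the next.

```shell
# Sample rows copied from the head output above (not a full .prb file).
prb=$(mktemp)
cat > "$prb" <<'EOF'
2147 833 2134628 chrI - 41 66 53 55 2
2199 1361 3486360 chrI + 45 70 57 59 2
953 1169 2993594 chrI - 49 74 61 63 2
1530 713 1826811 chrI + 53 78 65 67 2
EOF

# Number of lines that do not have exactly 10 whitespace-separated fields.
bad_fields=$(awk 'NF != 10 { n++ } END { print n + 0 }' "$prb")

# Number of consecutive line pairs whose 9th fields do not differ by a
# multiple of the step (assumption: field 9 is the tiled coordinate).
bad_steps=$(awk -v step=4 '
    NR > 1 && ($9 - prev) % step != 0 { n++ }
    { prev = $9 }
    END { print n + 0 }' "$prb")

echo "lines without 10 fields: $bad_fields"
echo "off-grid steps:          $bad_steps"
rm -f "$prb"
```

To run the same checks on a real file, replace "$prb" with, for instance, NMepm/chrI.prb and drop the mktemp, cat and rm lines.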
All these .prb files are 10-column tables whose fields are separated by a single whitespace. The columns are
NMepm implements a simple dynamic algorithm that first identifies chunks of consecutive probes that respect this constraint and then tries to optimize the positioning of these chunks with respect to each other, so that it obtains the optimal positioning without shifting the individual probe midpoints by more than 3 bp. The final position of the probe is then called the tiling point.
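To give an intuition of what tiling means, here is a deliberately simplified sketch. It is not the algorithm implemented in NMepm, since it handles a single chunk and ignores the positioning of chunks with respect to each other. It tries every possible grid offset for one chunk, snaps each midpoint to the nearest grid point, discards offsets that would move a midpoint by more than the allowed shift, and keeps the offset with the smallest total shift. The function name snap_chunk and the midpoints used below are hypothetical.

```shell
# snap_chunk STEP MAXSHIFT MIDPOINT...
# Prints "midpoint tiling_point" for each midpoint of a single chunk.
snap_chunk() {
    s=$1
    mx=$2
    shift 2
    printf '%s\n' "$@" | awk -v step="$s" -v max="$mx" '
        { m[NR] = $1 }
        END {
            best = -1
            # Try every grid offset (phase) between 0 and step-1.
            for (p = 0; p < step; p++) {
                cost = 0; ok = 1
                for (i = 1; i <= NR; i++) {
                    r = (m[i] - p) % step
                    if (r < 0) r += step
                    # Distance to the nearest grid point for this phase.
                    d = (2 * r <= step) ? r : step - r
                    if (d > max) ok = 0
                    cost += d
                }
                if (ok && (best < 0 || cost < bestcost)) {
                    best = p; bestcost = cost
                }
            }
            if (best < 0) { print "no valid phase"; exit 1 }
            # Snap every midpoint to the grid defined by the best phase.
            for (i = 1; i <= NR; i++) {
                r = (m[i] - best) % step
                if (r < 0) r += step
                g = (2 * r <= step) ? m[i] - r : m[i] + step - r
                printf "%d %d\n", m[i], g
            }
        }'
}

snap_chunk 4 3 53 58 61 66 70
```

With these made-up midpoints, the grid with offset 2 gives the smallest total shift, so the tiling points printed in the second column are 54, 58, 62, 66 and 70. Note that with a step of 4 bp the nearest grid point is never more than 2 bp away, so the 3 bp limit only matters in the real algorithm, where chunks are also positioned relative to each other.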
Finally, we will extract only 4 columns from these .prb files in order to create a single file which can be used by NMcl2tab to extract data from raw Affymetrix .CEL files. To do that, let's use a simple inline command:
$ awk '{print $4" "$9" "$1" "$2}' NMepm/*.prb > BY_S288c.prb
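To see what this command produces, we can try it on a couple of rows first. The sketch below applies the same awk expression to two lines copied from the chrI.prb listing shown earlier (stored in a temporary file used only for this illustration) and checks that every output line has exactly 4 fields.

```shell
# Two rows copied from the chrI.prb listing (illustration only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
2147 833 2134628 chrI - 41 66 53 55 2
2199 1361 3486360 chrI + 45 70 57 59 2
EOF

# Same field selection as the command above: fields 4, 9, 1 and 2.
extracted=$(awk '{print $4" "$9" "$1" "$2}' "$sample")
echo "$extracted"

# Number of output lines that do not have exactly 4 fields.
bad=$(echo "$extracted" | awk 'NF != 4 { n++ } END { print n + 0 }')
echo "lines without 4 fields: $bad"
rm -f "$sample"
```

For these two rows the command prints "chrI 55 2147 833" and "chrI 59 2199 1361", that is, fields 4, 9, 1 and 2 of each input line, in that order.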
Jean-Baptiste Veyrieras 2010-05-28