NMepm

Since we are interested here only by probes that have an unique perfect match (PM) on the genome, we will use the NMepm perl script to extract these probes from the files previously generated by NMmpa. Before running NMepm we first need to generate a simple file that lists the name of the chromosomes for which we want to extract the PM probes. Here we will consider all the chromosome sequences. To get this file we can simply extract the chromosome names from the fasta file as follows:

$grep '^>' ../../Data/BY_S288c/Sequence/Genome.fasta \
      | grep -vw chrMito \
      | sed 's/^>//' > ../../Data/BY_S288c/Sequence/Chromosomes.txt

Note that here we removed the mitochondrial chromosome (chrMito) from the list. We can then run NMepm as follows:

$NMepm -i NMmpa/ -c ../../Data/BY_S288c/Sequence/Chromosomes.txt \
       -o NMepm -t 4 2>NMepm.log

where recall NMmpa contains the probe per strand annotation files previously generated with NMmpa, and -t 4 indicates to NMepm that we want probe genomic coordinates tiled with unit step of 4bp. As you can see, the program as created a directory called NMepm (-o NMepm) which contains as many .prb files as the number or chromosomes listed in ../../Data/BY_S288c/Sequence/Chromosomes.txt. Let's have a look at one this file:

$head -10 NMepm/chrI.prb
2147 833 2134628 chrI - 41 66 53 55 2
2199 1361 3486360 chrI + 45 70 57 59 2
953 1169 2993594 chrI - 49 74 61 63 2
1530 713 1826811 chrI + 53 78 65 67 2
1417 1001 2563978 chrI - 57 82 69 71 2
771 1239 3172612 chrI + 61 86 73 75 2
1745 1095 2804946 chrI - 65 90 77 79 2
991 1953 5000672 chrI + 69 94 81 83 2
295 1987 5087016 chrI - 73 98 85 87 2
540 1207 3090461 chrI + 77 102 89 91 2

All these .prb files are 10 columns table separated by a single whitespace. The columns are

the coordinate of the probe onto the array
the coordinate of the probe onto the array
the unique probe id
the name of the sequence on which the probe is located
the strand of the probe match (here remember that this is a unique perfect match)
the starting position of the match on the sequence
the ending position of the match on the sequence
the midpoint of the probe on the sequence
the tiling point of the probe on the sequence
the shift between the midpoint and the tiling point ranging from -3 to 3.

Let's explain the meaning of the last two columns. Since we are dealing here with tiling array, it is convenient for further analyses to have a probe coverage which respects everywhere the tiling step (here 4bp). That is to say, the physical distance between any pair of probes located on the same chromosome has to be a multiple of the tiling step. To insure this constrain is fulfill everywhere in the genome, the program NMepm implements a simple dynamic algorithm that first identifies chunks of consecutive probes that respect this constrain and then tries to optimize the positionning of these chunks with respect to each other so that it obtains the optimal positionning without shifting the individual probe midpoints by more than 3bp. The final position of the probe is then called the tiling point.

Finally, we will extract only 4 columns from these .prb files in order to create a unique file which could be used by NMcl2tab to extract data from raw Affymetrix .CEL. files. To do that, let's use a simple inline command:

$awk '{print $4" "$9" "$1" "$2}' NMepm/*.prb > BY_S288c.prb

Jean-Baptiste Veyrieras 2010-05-28