Creating a general script

Since we are now familiar with the different stages of the PM probe annotation pipeline, we can now try to create a shell script that will help us to quiclky reproduce these steps, in particular for a new strain (provided that we have the genome sequence of this strain, of course). The script could then look like that:
#! /bin/bash

##
# Arguments
##
strain=$1         # The name of the strain
genome_fasta=$2   # The genome sequence in fasta format
probe_fasta=$3    # The probe sequences in fasta format
array_size=$4     # The dimension of the array (yeast tiling array = 2560)
output_dir=$5     # The output directory

output_dir=${output_dir}/${strain}
##
# Create the output directory
##
mkdir -p $output_dir
##
# Map the probes on the genome using MUMmer
##
echo -n "Map probes on the genome: ..."
NMmp2g $genome_fasta $probe_fasta $output_dir/NMmp2g.out 2>$output_dir/NMmp2g.log
echo " [ OK ]"
# Extract the probes that have matche(s) on the genome and output the result
# by chromosome and by strand (+/-) into folder NMmpa
echo -n "Extract and format probe matches: ..."
NMmpa -i $output_dir/NMmp2g.out -o $output_dir/NMmpa -a $array_size
echo " [ OK ]"
# Now, extract only probes that have a unique perfect match
# on the genome (both strands) and remobed probe which do not match at an
# expected tiling step (4bp).
echo -n "Extract and format PM probes: ..."
grep '^>' $genome_fasta | sed 's/^>//' > $output_dir/seqid.txt
NMepm -i $output_dir/NMmpa -c $output_dir/seqid.txt -o $output_dir/NMepm -t 4 2>$output_dir/NMepm.log
awk '{print $4" "$9" "$1" "$2}' ${output_dir}/NMepm/*.prb > ${output_dir}/${strain}.prb
echo " [ OK ]"

This bash script has been implemented in the program NMstrain2array that we can use now to get the list of PM probes on RM_11-1a (RM) genome.

$NMstrain2array RM_11-1a ./Data/RM_11-1a/Sequence/Genome.fasta \ 
                ./Data/Affymetrix/S.cerevisiae_tiling.fasta 2560 S1
Map probes on the genome: ... [ OK ]
Extract and format probe matches: ... [ OK ]
Extract and format PM probes: ... [ OK ]

Finally, we have now two files, BY_S288c.prb and RM_11-1a.prb that give us the probes that have a unique PM on each genome. We can then get an idea of the number of probes that are shared between the two strains as follows:

$cat S1/BY_S288c/BY_S288c.prb S1/RM_11-1a/RM_11-1a.prb \
     | cut -d ' ' -f 3,4 | sort | uniq -c \
     | awk '{print $1}' | sort | uniq -c
 368639 1
2501942 2
So,BY and RM share 2,501,942 probes and there are 368,639 probes that are specific to either BY and RM. We will now use these two files for the next section.

Jean-Baptiste Veyrieras 2010-05-28