Unexpected observations after mapping LongSAGE tags to the human genome.

Celine Keime, Marie Semon, Dominique Mouchiroud, Laurent Duret, and Olivier Gandrillon (2007)

BMC Bioinformatics, 8:154.

BACKGROUND: SAGE has been used widely to study the expression of known transcripts, but much less to annotate new transcribed regions. LongSAGE produces tags that are sufficiently long to be reliably mapped to a whole-genome sequence. Here we used this property to study the position of human LongSAGE tags obtained from all public libraries. We focused mainly on tags that do not map to known transcripts.

RESULTS: Using a published error rate in SAGE libraries, we first removed the tags likely to result from sequencing errors. We then observed that an unexpectedly large number of the remaining tags still did not match the genome sequence. Some of these correspond to parts of human mRNAs, such as polyA tails, junctions between two exons and polymorphic regions of transcripts. Another non-negligible proportion can be attributed to contamination by murine transcripts and to residual sequencing errors. After filtering our data with these screens to ensure that our dataset was highly reliable, we studied the tags that map once to the genome. Of these tags, 31% correspond to unannotated transcripts. The others map to known transcribed regions, but many of them (nearly half) are located either in antisense or in new variants of these known transcripts.

CONCLUSION: We performed a comprehensive study of all publicly available human LongSAGE tags, and carefully verified the reliability of these data. We found the potential origin of many tags that did not match the human genome sequence. The properties of the remaining tags imply that the level of sequencing error may have been underestimated. The frequency of tags that match the genome sequence once, but not in an annotated exon, suggests that the human transcriptome is much more complex than shown by the current human genome annotations, with many new splicing variants and antisense transcripts. SAGE data are appropriate for mapping new transcripts to the genome, as demonstrated by the high rate of cross-validation of the corresponding tags using other methods.
