A NOVEL GENERAL-PURPOSE RNA-SEQ PROTOCOL OPTIMIZING THE DETECTION OF TRANSCRIPTOME EXPRESSION COMPLEXITY

Caterina Manzari; Mariano Francesco Caratozzolo; Flaviana Marzano; Domenica D'Elia; Flavio Licciulli; Sabino Liuni; Graziano Pesole; Apollonia Tullo
2012

Abstract

Recent studies have demonstrated an unexpected complexity of transcription in eukaryotes. The majority of the genome is transcribed, and only a small fraction of these transcripts is annotated as protein-coding genes and their splice variants. High-throughput transcriptome sequencing therefore continues to identify novel RNAs and novel classes of RNAs arising from antisense, overlapping and non-coding RNA expression, demonstrating that the transcriptome captures a level of complexity that the genome sequence alone may not (1). Among next-generation sequencing (NGS) platforms, the latest series of the Roche 454 GS sequencer, the GS FLX Titanium FLX+, yields over a million reads per run, each up to 700 bases long. Reads of this length provide connectivity information among splice sites and, in addition to enabling accurate mapping and relative quantification of mRNAs, are particularly suitable for characterizing full-length splicing variants that may be differentially expressed in physiopathological conditions (2). The higher throughput of the Illumina HiSeq 1000 (150 bp) and ABI SOLiD (75 bp) platforms, on the other hand, makes them particularly suitable for transcript-level quantification and for small RNA sequencing. Irrespective of the NGS platform used, the first step in transcriptome sequencing is the construction of a cDNA library. Several protocols have been developed for this purpose, but each is suitable for sequencing on one specific platform only. Here we describe a new, fast and simple method (patent pending RM2010A000293 - PCT/IB2011/052369) to prepare and amplify a representative, strand-specific cDNA library from low-input total RNA (500 ng) for RNA-Seq applications, which can be implemented on all major platforms currently available (Roche 454, Illumina, ABI SOLiD).
Our method includes the following steps: a) rRNA removal from total RNA; b) reverse transcription of the rRNA-depleted RNA into cDNA with custom-designed, 5'-phosphorylated Tag-random-octamers that preserve strand information; c) purification of the single-stranded cDNAs; d) ligation and amplification of the purified cDNAs, giving a high yield of concatamers around 20 kb long. These DNA molecules can be sequenced equally well on Illumina and Roche 454 platforms, allowing both quantitative and qualitative assessment of transcriptome complexity. We also developed a bioinformatic pipeline for the analysis of the sequences produced with this protocol. An in-house Python script, named Tag_Find (available upon request), recognizes the position and type of each tag found within a read sequence. The program returns two files: one listing the types of tags found and their positions in the reads, and one FASTQ file of untagged reads, i.e. reads cleaned of tags. Tag_Find efficiency was tested on an artificial dataset of 454 reads, constructed to mimic the specific structure of the cDNA libraries used in this experiment. All reads obtained after tag removal were mapped onto the hg18/NCBI36 release of the human genome using the NGS-Trex system (http://www.ngstrex.org/) with a user-defined preset of parameters. The edgeR (3) and goseq (4) Bioconductor packages were used for differential gene expression analysis on reads mapped to genes. For validation, we tested the efficiency of this strategy by analyzing the transcriptome of two xenograft tumor masses derived from the injection into nude mice of an osteosarcoma cell line (OSC) carrying a nearly homoplasmic, disruptive mitochondrial Complex I mutation (m.3571insC) in the MT-ND1 gene.
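The tag-recognition step can be illustrated with a minimal Python sketch. It is not the Tag_Find script itself, which is only available on request. The tag names and sequences, the function names and the exact-match search are all assumptions made for illustration. The sketch also assumes that each read is split at the tags into tag-free fragments; the abstract does not say whether Tag_Find splits reads or only trims them.

```python
import re
from typing import Dict, Iterator, List, Tuple

# Hypothetical tags: the real Tag-random-octamer sequences are not given in the abstract.
TAGS: Dict[str, str] = {
    "tagA": "ACGTTGCA",
    "tagB": "GGATCCAA",
}


def parse_fastq(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from 4-line FASTQ records."""
    lines = [ln for ln in text.splitlines() if ln]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i][1:], lines[i + 1], lines[i + 3]


def find_tags(seq: str, tags: Dict[str, str]) -> List[Tuple[str, int, int]]:
    """Return (tag_name, start, end) for every exact tag match, sorted by start."""
    hits = []
    for name, tag in tags.items():
        for m in re.finditer(re.escape(tag), seq):
            hits.append((name, m.start(), m.end()))
    return sorted(hits, key=lambda h: h[1])


def tag_find(fastq_text: str, tags: Dict[str, str] = TAGS, min_len: int = 1):
    """Return (tag_report_rows, cleaned_fastq_text) for the given FASTQ text."""
    report, cleaned = [], []
    for read_id, seq, qual in parse_fastq(fastq_text):
        hits = find_tags(seq, tags)
        # First output: which tag was found in which read, and where (0-based).
        report.extend((read_id, name, start) for name, start, _ in hits)
        cursor, part = 0, 0
        # Second output: the tag-free intervals between consecutive hits, plus the tail.
        for _, start, end in hits + [("", len(seq), len(seq))]:
            if start - cursor >= min_len:
                part += 1
                cleaned.append(
                    f"@{read_id}_{part}\n{seq[cursor:start]}\n+\n{qual[cursor:start]}\n"
                )
            cursor = max(cursor, end)
    return report, "".join(cleaned)


# Toy input: one read carrying two tags, one read carrying none.
example = (
    "@r1\nAAAAACGTTGCACCCCCGGATCCAATT\n+\n" + "I" * 27 + "\n"
    "@r2\nTTTTGGGG\n+\n" + "I" * 8 + "\n"
)
tag_report, cleaned_fastq = tag_find(example)
```

In this sketch `tag_report` plays the role of the tag-position file and `cleaned_fastq` that of the cleaned FASTQ file; writing each to disk is a single `open(...).write(...)` call. Tolerating sequencing errors inside a tag would need approximate matching, which is left out here.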
The xenografts shared the same nuclear genome but carried different m.3571insC mutant loads, a difference previously shown to determine the non-proliferating versus proliferating and aggressive tumor phenotype (5). The 454 RNA-Seq experiments produced an average of 500,000 reads per sample, with a mean length of 320 nt for both samples (very close to the values the manufacturer indicates for the highest performance of the 454 GS FLX pyrosequencer). A total of 2,546 differentially expressed (DE) genes were found among all the genes expressed in both samples, using a p-value threshold of <= 0.01 and with a maximum False Discovery Rate (FDR) of 4.6%. A group of these differentially expressed genes was used for real-time qPCR validation, which confirmed the RNA-Seq results. Altogether, these results demonstrate that our method for constructing a representative cDNA library, combined with a specific downstream bioinformatics pipeline, provides a powerful tool for whole-transcriptome interrogation on single or multiple NGS platforms, from which an accurate quantitative and qualitative portrait of a complex transcriptome can be generated.

References
1. Forrest AR, Carninci P (2009). Whole genome transcriptome analysis. RNA Biol, 6(2):107-12.
2. Valletti A, Anselmo A, Mangiulli M, Boria I, Mignone F, Merla G, D'Angelo V, Tullo A, Sbisà E, D'Erchia AM, Pesole G (2010). Identification of tumor-associated cassette exons in human cancer through EST-based computational prediction and experimental validation. Mol Cancer, Sep 2; 9:23.
3. Robinson MD, McCarthy DJ, Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139-140.
4. Young MD, Wakefield MJ, Smyth GK, Oshlack A (2010). Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol, 11(2):R14.
5. Gasparre G, Kurelac I, Capristo M, Iommarini L, Ghelli A, Ceccarelli C, Nicoletti G, Nanni P, De Giovanni C, Scotlandi K et al (2011). A mutation threshold distinguishes the antitumorigenic effects of the mitochondrial gene MTND1, an oncojanus function. Cancer Res, 71(19):6220-6229.
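
The abstract reports a p-value threshold together with a maximum FDR for the selected genes. One way to relate the two is sketched below in Python. It assumes the FDR values are Benjamini-Hochberg adjusted p-values, which is the default adjustment in edgeR's `topTags`; the abstract does not state which method was used. The function names and the p-values are hypothetical toy values, not data from the experiment.

```python
from typing import List, Optional, Tuple


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_de(
    pvalues: List[float], p_threshold: float = 0.01
) -> Tuple[List[int], Optional[float]]:
    """Select genes with p <= p_threshold and report their largest adjusted p-value."""
    adjusted = benjamini_hochberg(pvalues)
    selected = [i for i, p in enumerate(pvalues) if p <= p_threshold]
    max_fdr = max((adjusted[i] for i in selected), default=None)
    return selected, max_fdr


# Toy p-values for five hypothetical genes (illustrative only).
toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.6]
selected_genes, max_fdr = call_de(toy_pvalues)
```

Under this assumption, the "maximum FDR" is the adjusted p-value of the least significant gene that still passes the raw p-value threshold.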
Istituto di Tecnologie Biomediche - ITB
NGS
454 ROCHE
Genomics
Epigenomics
Transcriptomics
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/310843