nc-aReNA: an integrated bioinformatics platform for non-coding RNA-seq data classification and annotation

De Caro, G; Tulipano, A; Consiglio, A; D'Elia, D; Grillo, G; Marinaro, M; Liuni, S; Licciulli, F; Gisel, A (equal contributors)

High-throughput (HT) technologies, such as microarrays and especially Next-Generation Sequencing (NGS), have provided tremendous potential for profiling protein-coding and non-protein-coding RNAs (ncRNAs). Recent reports from the ENCODE project underline that, while 80% of the human genome is transcribed, only 2% is protein-coding, suggesting that the vast majority of the genome is transcribed as non-protein-coding RNA. We present the development of a web-based bioinformatics platform, nc-aReNA, for the mapping, classification and annotation of human and mouse ncRNAs from HT-NGS data. The platform is based on a data-warehouse approach and a workflow environment that includes data quality control, genome and nc-RNAome sequence alignment, differential expression profiling analysis and statistics of classified data.

Methods
The nc-aReNA architecture is based on a modular analysis pipeline, flanked by a data-warehouse, for the classification and annotation of small RNA-seq data. The pipeline takes as input the sequenced reads in FASTQ format. After the initial steps of adaptor removal and quality check, the input reads are mapped to an in-house non-redundant ncRNA reference database (http://ncRNAdb.ba.itb.cnr.it), which collects and integrates ncRNA gene lists from MGI (Mouse Genome Informatics) and HGNC (HUGO Gene Nomenclature Committee) with sequences and biotype annotations from VEGA (Vertebrate Genome Annotation), ENSEMBL, RefSeq, RFam (for tRNA sequences) and miRBase (for miRNAs). NGS reads mapped in this step are classified using the Sequence Ontology (SO) (Eilbeck K. et al., 2005). Unmapped reads are aligned to the reference genome and tagged with the corresponding genomic locus. Integrated statistics compute RPM (Reads Per Million), fold changes and False Discovery Rate (FDR)-corrected p-values, and support differential expression analysis of all (or user-chosen) ncRNA classes by comparing two or more experimental conditions or time-course data.
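The statistics named above (RPM, fold change and FDR correction) can be illustrated with a minimal Python sketch. It is not the platform's implementation: the function names are hypothetical, the pseudocount in the fold change is an assumption made here to avoid division by zero, and the Benjamini-Hochberg procedure is assumed as the FDR method because the abstract does not name one.

```python
import math


def reads_per_million(counts):
    """Scale raw read counts so that the sample total corresponds to one million reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no mapped reads")
    return {name: c * 1_000_000 / total for name, c in counts.items()}


def log2_fold_change(rpm_a, rpm_b, pseudocount=1.0):
    """log2 ratio of condition B over condition A.

    The pseudocount is an illustrative assumption, not taken from the abstract.
    """
    names = set(rpm_a) | set(rpm_b)
    return {
        n: math.log2((rpm_b.get(n, 0.0) + pseudocount) / (rpm_a.get(n, 0.0) + pseudocount))
        for n in names
    }


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In practice, the raw p-values passed to `benjamini_hochberg` would come from whatever differential expression test is applied per ncRNA; that test is not specified in the abstract and is left out of this sketch.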
An additional module, called "miRNA identification", analyses all unmapped miRNA-like reads by means of the miRDeep2 software. All analysis results and annotations are stored in a data-warehouse implemented with Infobright (http://www.infobright.org). A user-friendly web-based Graphical User Interface (GUI), developed on the Java platform, guides the user through the submission process and displays results in tables and graphs.

Results
The main features of nc-aReNA are:
- identification and classification of reads in known functional ncRNA categories in SO;
- identification and filtering of reads mapping to ribosomal RNAs and mtDNA transcripts;
- RPM calculation for each known ncRNA;
- export of user-selected classes of ncRNA for further specific investigation;
- quantification of ncRNA expression and differential expression analysis for all identified ncRNA classes;
- graphical visualization of sample expression profiles;
- additional annotation, such as target genes and pathways for miRNAs;
- prediction of unknown miRNAs;
- genome alignment and mapping of unknown ncRNAs, useful for subsequent prediction analysis of new ncRNAs.

Currently, the nc-aReNA reference database section contains a total of 62,439 (human) and 42,330 (mouse) ncRNA sequences classified into 26 biotypes associated with SO terms. As for functional annotation, the data-warehouse contains, among others, information about 410,581 biochemical pathways and 121,579 experimentally validated miRNA target interactions. As a test case, we used Illumina small RNA-seq data produced for the expression profiling of small RNAs in different tissues isolated from M. musculus. This experiment includes three time courses and three technical replicates from two different tissues.

Conclusions
nc-aReNA is a web-based tool which provides a user-friendly GUI for the analysis and classification of human and mouse small RNA-seq datasets.
It has been developed to support biologists and clinicians with no specific prior computer science knowledge in the biological interpretation of NGS data. The platform is designed so that it can be used for any organism, provided that basic information (i.e. a reference genome and ncRNAs) is available.
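The read classification and filtering steps described under Methods and Results can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the platform's code: the function name, the input shapes and the biotype labels in `FILTERED_BIOTYPES` are hypothetical, and the actual SO terms used by nc-aReNA may differ.

```python
from collections import Counter

# Hypothetical labels for ribosomal and mitochondrial transcripts;
# the SO terms used by the platform may be named differently.
FILTERED_BIOTYPES = {"rRNA", "Mt_rRNA", "Mt_tRNA"}


def classify_reads(alignments, biotype_of):
    """Split reads into per-biotype counts, filtered counts and unmapped read IDs.

    alignments: iterable of (read_id, ncrna_id or None); None means the read
                did not map to the ncRNA reference database.
    biotype_of: dict mapping an ncRNA identifier to its biotype label.
    """
    kept = Counter()
    filtered = Counter()
    unmapped = []
    for read_id, ncrna_id in alignments:
        if ncrna_id is None:
            # Candidates for the later genome alignment step.
            unmapped.append(read_id)
            continue
        biotype = biotype_of.get(ncrna_id, "unclassified")
        if biotype in FILTERED_BIOTYPES:
            filtered[biotype] += 1
        else:
            kept[biotype] += 1
    return kept, filtered, unmapped
```

The per-biotype counts in `kept` are the kind of input from which RPM values would be derived, while the `unmapped` list stands in for the reads passed on to genome alignment and miRNA prediction.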
