A deep learning approach for ncRNA sequences classification

Fiannaca, Antonino; La Rosa, Massimo; La Paglia, Laura; Rizzo, Riccardo; Urso, Alfonso

Introduction: Thanks to the development of Next Generation Sequencing (NGS) techniques and the resulting availability of huge amounts of genomic data, non-coding RNA (ncRNA) sequences have attracted growing interest from the scientific community. ncRNAs are small non-coding sequences, and recent studies have demonstrated that they are involved in different biological processes, including diseases. There are several kinds of ncRNAs, which differ from each other in length, structure (folding) and function. Among the most studied are transfer RNA (tRNA), involved in translation; micro RNA (miRNA), which acts as a post-transcriptional regulator by binding to specific target messenger RNAs (mRNAs); small nucleolar RNA (snoRNA), which is involved in the post-transcriptional modification of ribosomal RNA (rRNA); and riboswitches, which bind certain metabolites and thereby regulate gene expression. In this context, it is fundamental to develop computational methods that can identify and classify different kinds of ncRNA molecules. In this work we propose a pipeline for the classification of ncRNA sequences based on structural features extracted from the RNA secondary structure and on a deep learning architecture implementing a convolutional neural network.
Methods: Given a dataset of ncRNA sequences in FASTA format, the secondary structure of each sequence is predicted using the IPknot structure predictor. The secondary structure can offer more insight into the biological function and family of an ncRNA than the primary structure alone. The secondary structure of an ncRNA can be seen as an undirected graph, where nodes are nucleotides and edges represent the bonds among them. Our hypothesis is that ncRNA sequences belonging to the same class share similar sub-structures, i.e., sub-graphs. We consider those sub-graphs as local features for classification, in order to discriminate among different ncRNA families.
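As an illustration of the graph view described above, the following minimal sketch turns a predicted secondary structure into an undirected graph. It assumes the structure is available in dot-bracket notation, with additional bracket types marking pseudoknotted pairs; the abstract does not state which output format was used, and the function name `structure_to_graph` is hypothetical.

```python
from typing import List, Set, Tuple

# Closing -> opening bracket. Using several bracket types for pseudoknots
# is an assumption about the input format, not something stated in the text.
BRACKETS = {")": "(", "]": "[", "}": "{", ">": "<"}


def structure_to_graph(sequence: str, dot_bracket: str) -> Tuple[List[str], Set[Tuple[int, int]]]:
    """Return (node labels, undirected edges) for one predicted structure."""
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")

    nodes = list(sequence)  # node i is labelled with nucleotide i
    # Backbone edges link consecutive nucleotides.
    edges = {(i, i + 1) for i in range(len(sequence) - 1)}

    # One stack per bracket type, so crossing (pseudoknotted) pairs still match.
    stacks = {opener: [] for opener in BRACKETS.values()}
    for i, symbol in enumerate(dot_bracket):
        if symbol in stacks:
            stacks[symbol].append(i)
        elif symbol in BRACKETS:
            stack = stacks[BRACKETS[symbol]]
            if not stack:
                raise ValueError(f"unmatched '{symbol}' at position {i}")
            edges.add((stack.pop(), i))  # base-pair edge stored as (low, high)
        elif symbol != ".":
            raise ValueError(f"unexpected symbol '{symbol}' at position {i}")

    if any(stacks.values()):
        raise ValueError("unmatched opening bracket")
    return nodes, edges


# A toy hairpin: 8 backbone edges plus 3 base-pair edges.
nodes, edges = structure_to_graph("GGGAAACCC", "(((...)))")
print(len(nodes), len(edges))
```

Each edge is stored once as a (low, high) index pair, which is one simple way to represent an undirected graph before passing it to a sub-graph miner.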
The identification and extraction of frequent sub-graphs is performed by means of the Molecular Substructure Miner (MoSS) algorithm. Applying the MoSS algorithm to the set of secondary structures yields a binary data matrix. Each input ncRNA sequence is thus represented as a boolean feature vector, with 1s marking the presence of a given sub-graph and 0s its absence. This data representation can be used as the training set for a supervised machine learning classification algorithm. In this work, we adopted a deep learning architecture based on a convolutional neural network (CNN). A CNN is a multilayer neural network that alternates convolutional filters and pooling modules, with a multilayer perceptron as its last component.
Results: Our classification pipeline was tested on 6320 ncRNA sequences belonging to 13 different families. The MoSS algorithm allows the minimum (min) and maximum (max) size of the extracted sub-graphs to be set, and different values yield different numbers of input features. For this reason, we performed several experiments varying the number of available features, using a tenfold cross-validation procedure. The best results were obtained with min size = 4 and max size = 6, corresponding to 6443 features. As for the CNN parameters, after many trials we chose a configuration that represents a good trade-off between execution time and quality of results. We adopted a two-layer CNN with a convolutional kernel size of 5 for both layers, 10 kernels in the first layer and 20 kernels in the second. The pool size was set to 2 for both layers.
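To make the reported configuration concrete, the sketch below works out the size of the representation handed to the final multilayer perceptron. It assumes one-dimensional convolutions with stride 1 and no padding, and non-overlapping pooling that drops any trailing remainder; none of these details are given in the abstract, so the resulting number is only indicative.

```python
from typing import Sequence


def conv_output_length(length: int, kernel_size: int) -> int:
    """Length after a 1D convolution with stride 1 and no padding (assumed)."""
    return length - kernel_size + 1


def pool_output_length(length: int, pool_size: int) -> int:
    """Length after non-overlapping pooling; a trailing remainder is dropped (assumed)."""
    return length // pool_size


def flattened_size(n_features: int,
                   kernel_size: int = 5,
                   pool_size: int = 2,
                   kernels: Sequence[int] = (10, 20)) -> int:
    """Number of values fed to the perceptron after all conv + pool stages."""
    length = n_features
    for _ in kernels:
        length = conv_output_length(length, kernel_size)
        length = pool_output_length(length, pool_size)
    # Each kernel of the last layer produces one feature map of this length.
    return length * kernels[-1]


# 6443 -> conv 6439 -> pool 3219 -> conv 3215 -> pool 1607, times 20 kernels.
print(flattened_size(6443))  # 32140
```

Under these assumptions the 6443 boolean features are reduced to 20 feature maps of length 1607 before classification.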
Classification results were computed in terms of accuracy, sensitivity, specificity, precision and F-score, and were compared with the results obtained using four other state-of-the-art classifiers, namely random forest (RF), naive Bayes (NB), k-nearest neighbour (kNN) and support vector machine (SVM).
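The five scores listed above can be sketched for a multi-class problem as follows. The abstract does not say how the scores were aggregated across the 13 families, so this example assumes a one-vs-rest computation for a single class; the function name and the toy labels are hypothetical.

```python
from typing import Dict, Sequence


def one_vs_rest_metrics(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> Dict[str, float]:
    """Scores for one class, treating every other class as the negative class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    def ratio(numerator: float, denominator: float) -> float:
        # Return 0.0 for an empty denominator instead of raising.
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)   # also called recall
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    f_score = ratio(2 * precision * sensitivity, precision + sensitivity)
    accuracy = ratio(tp + tn, len(pairs))
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
    }


# Toy labels, invented for illustration only.
y_true = ["tRNA", "tRNA", "miRNA", "miRNA", "rRNA", "rRNA"]
y_pred = ["tRNA", "miRNA", "miRNA", "miRNA", "rRNA", "tRNA"]
print(one_vs_rest_metrics(y_true, y_pred, "tRNA"))
```

Averaging these per-class scores over all families (macro-averaging) would be one way to obtain a single figure per metric, though other aggregation schemes are equally plausible.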

CNR Institutional Research Information System