Multi-type clustering for the identification of lncRNA-disease relationships
Domenica D'Elia
2016
Abstract
High-throughput sequencing technology has been crucial for rapid advances in functional genomics. The most important result is the discovery of thousands of non-coding RNAs (ncRNAs), which are able to fine-tune the expression of many genes involved in cell development, differentiation, apoptosis and proliferation [1]. Among ncRNAs, the most investigated are the microRNAs (miRNAs), small molecules (20-22 nucleotides long) that act as post-transcriptional regulators [2]. Much less is known about the functional role of long non-coding RNAs (lncRNAs), RNA molecules longer than 200 nt that have recently been found to have a plethora of regulatory functions, spanning from epigenetics to post-transcriptional regulation [3]. However, the number of lncRNAs for which a functional characterisation is available is still quite small. Most existing approaches are based on expensive experimental evaluations or on computational methods that exploit known/verified relationships between lncRNAs and diseases [4]. Some recent works assume that all instances follow the same probability distribution and are independent of each other. In this setting, such an assumption is easily violated, since different lncRNAs can be involved in the development of the same disease, and different diseases can be related to each other through the involvement of common lncRNAs. To overcome these limitations, we propose a computational method that predicts possible unknown relationships between lncRNAs and diseases by exploiting different information about a heterogeneous set of (related) biological entities. In particular, we focus on lncRNAs, miRNAs, target genes and diseases, as well as on known relationships among these entities. The proposed method is based on a clustering algorithm that groups objects of multiple types and predicts possible unknown relationships on the basis of the extracted clusters.
Moreover, the proposed clustering algorithm is designed to identify highly cohesive, possibly overlapping and hierarchically organised clusters, since i) the same lncRNA/disease can be involved in multiple networks of relationships and ii) as shown in [5], clusters at different levels of the hierarchy can describe more specific or more general relationships and cooperation activities.
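The idea of predicting unknown relationships from multi-type, possibly overlapping clusters can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper, which the abstract does not specify in detail. It assumes that the clusters have already been extracted and that each carries a cohesion score in [0, 1]. It then scores a candidate lncRNA-disease pair by the highest cohesion among the clusters that contain both objects. The entity names, the cohesion values, the `predict_links` function and the max-based scoring rule are all hypothetical choices made for illustration.

```python
from itertools import product

# Hypothetical toy input: overlapping clusters of typed objects.
# Names and cohesion values are invented for illustration only.
clusters = [
    {"members": {("lncRNA", "L1"), ("miRNA", "M1"), ("gene", "G1"), ("disease", "D1")},
     "cohesion": 0.9},
    {"members": {("lncRNA", "L1"), ("lncRNA", "L2"), ("miRNA", "M2"), ("disease", "D2")},
     "cohesion": 0.6},
    {"members": {("lncRNA", "L2"), ("gene", "G2"), ("disease", "D1"), ("disease", "D2")},
     "cohesion": 0.4},
]

# Already known (verified) lncRNA-disease relationships, excluded from predictions.
known = {("L1", "D1")}


def predict_links(clusters, known):
    """Rank unknown lncRNA-disease pairs that co-occur in at least one cluster.

    Assumed scoring rule (not taken from the paper): a pair receives the
    maximum cohesion over all clusters that contain both objects.
    """
    scores = {}
    for cluster in clusters:
        lncrnas = sorted(name for kind, name in cluster["members"] if kind == "lncRNA")
        diseases = sorted(name for kind, name in cluster["members"] if kind == "disease")
        for pair in product(lncrnas, diseases):
            if pair in known:
                continue  # skip relationships that are already verified
            scores[pair] = max(scores.get(pair, 0.0), cluster["cohesion"])
    # Highest score first; ties broken alphabetically for a stable ranking.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for (lncrna, disease), score in predict_links(clusters, known):
        print(f"{lncrna} - {disease}: {score:.2f}")
```

Because the clusters overlap, the same lncRNA (here `L2`) can be linked to a disease through more than one cluster. In a hierarchical setting, clusters at a higher level would typically be larger and less cohesive, so a rule of this kind would give their pairs lower scores than pairs supported by a tight low-level cluster.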