Multi-type clustering for the identification of lncRNA-disease relationships
Domenica D'Elia
2016
Abstract
Introduction
High-throughput sequencing technology, alongside new or improved computational methods, has been crucial for rapid advances in functional genomics. Among the most important results of these new technologies is the discovery of thousands of non-coding RNAs (ncRNAs), whose function is pivotal for fine-tuning the expression of many genes that guide cell development, differentiation, apoptosis and proliferation [2]. Consequently, over the last decade, the number of papers reporting evidence of ncRNA involvement in complex human diseases, such as cancer, has grown at an exponential rate. Among the different classes of ncRNAs, the most investigated is that of microRNAs (miRNAs), small molecules (20-22 nt long) that regulate gene expression by modulating the translation of their transcripts [4]. Much less is known about the functional involvement of long non-coding RNAs (lncRNAs), RNA molecules longer than 200 nt that have recently been discovered to have a plethora of regulatory functions, ranging from chromatin modification to post-transcriptional regulation [8]. However, the number of lncRNAs for which a functional characterization is available is still small.

Assessing the role and, especially, the molecular mechanisms underlying the involvement of lncRNAs in human diseases is not a trivial task. Most existing approaches are based on expensive experimental evaluations or on computational methods that exploit known/verified relationships between lncRNAs and diseases [6]. However, because of the complex functional interactions that lncRNAs can establish with other regulatory RNAs (i.e., miRNAs) or proteins, considering only the evidence of a direct relationship between lncRNAs and diseases may be very limiting.
Some recent works have started to consider further related information; however, they analyze single relationships independently, without considering possible dependencies among them. This corresponds to assuming that all instances follow the same probability distribution and are independent of each other. Here, this assumption is easily violated, since different lncRNAs can be involved in the development of the same disease, and different diseases can be related to each other through the involvement of common lncRNAs or other regulatory entities such as miRNAs.

To overcome these limitations, we propose a computational method that predicts possibly unknown relationships between lncRNAs and diseases by exploiting different information about a heterogeneous set of (related) biological entities. In particular, we focus on lncRNAs, miRNAs, target genes and diseases, as well as on known relationships among these entities (see Figure 1). The proposed method is based on a clustering algorithm that groups objects of multiple types and predicts possibly unknown relationships on the basis of the extracted clusters. Moreover, the algorithm is designed to identify highly cohesive, possibly overlapping and hierarchically organized clusters, since i) the same lncRNA/disease can be involved in multiple networks of relationships and ii) as shown in [9], clusters at different levels of the hierarchy can describe more specific or more general relationships and cooperation activities.

Methods
In this section, we describe the proposed solution, which consists of three main steps: i) estimation of the strength of the relationships between lncRNAs and diseases; ii) identification of a hierarchy of possibly overlapping clusters of lncRNAs and diseases; iii) identification of possibly unknown lncRNA-disease relationships. In the following, we briefly describe each step.
Estimation of the strength of the relationships. The first step of the method consists in estimating the strength of the relationship for each lncRNA-disease pair. This strength is represented by a score $s(l_i, d_j)$, with $s(l_i, d_j) = 1$ if there exists a direct known relationship between the lncRNA $l_i$ and the disease $d_j$ in the data, i.e. if the relationship appears in the table disease_lcrna (see Figure 1). Otherwise, $s(l_i, d_j) = \mathit{scorePaths}(l_i, d_j)$, where $\mathit{scorePaths}(l_i, d_j)$ computes the score by considering all the alternative paths in the schema, namely:
• lncrna - lncrna_target - target - disease_target - disease
• lncrna - lncrna_mirna - mirna - mirna_target - target - disease
• lncrna - lncrna_mirna - mirna - disease_mirna - disease
In particular, $\mathit{scorePaths}(l_i, d_j)$ is the maximum score obtained over all the paths. For each path $P$, the score is the cosine similarity computed over all the attributes of the entities involved in the path connecting $l_i$ and $d_j$. This score already gives an indication of the likelihood that such a relationship exists. However, we use the scores to build the hierarchy of clusters, which estimates the score more reliably because it is based on the whole set of relationships.

Construction of the hierarchy of clusters. Once each pair has been associated with a score, we build a cluster for each pair having a score greater than a given threshold ?. The goal is to identify the first level of the hierarchy, consisting of clusters in the form of cliques, where each lncRNA is associated with each disease in the cluster with a score greater than ?. To this end, we apply an iterative process which i) sorts the clusters according to a quality criterion and ii) for each cluster, finds the first cluster that can be merged with it while preserving the clique constraint. The process stops when no further merging is possible, yielding the first level of the hierarchy.
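The construction of the first level can be illustrated with a minimal Python sketch. It is not the LP-MTRCLUS implementation: the function names, the representation of a cluster as a pair of frozensets, the dictionary of pair scores and the parameter name score_threshold are all assumptions made for illustration. The sketch uses the average pair score of a cluster as the sorting criterion and merges one pair of clusters per pass, which is one simple way to realize the iterative process described above.

```python
from itertools import product


def cohesiveness(cluster, score):
    """Average score over all lncRNA-disease pairs of the cluster."""
    lncrnas, diseases = cluster
    pairs = list(product(lncrnas, diseases))
    return sum(score.get(p, 0.0) for p in pairs) / len(pairs)


def is_clique(cluster, score, score_threshold):
    """True if every lncRNA-disease pair of the cluster exceeds the threshold."""
    lncrnas, diseases = cluster
    return all(score.get(p, 0.0) > score_threshold
               for p in product(lncrnas, diseases))


def find_merge(clusters, score, score_threshold):
    """Return (i, j, merged) for the first mergeable pair, or None."""
    for i, a in enumerate(clusters):
        for j in range(i + 1, len(clusters)):
            b = clusters[j]
            merged = (a[0] | b[0], a[1] | b[1])
            if is_clique(merged, score, score_threshold):
                return i, j, merged
    return None


def first_level(score, score_threshold):
    """Build first-level clusters (cliques) from a dict {(lncrna, disease): score}."""
    # One initial cluster per pair whose score exceeds the threshold.
    clusters = [(frozenset([l]), frozenset([d]))
                for (l, d), s in score.items() if s > score_threshold]
    while True:
        # Sort by quality (most cohesive first) before looking for a merge.
        clusters.sort(key=lambda c: cohesiveness(c, score), reverse=True)
        found = find_merge(clusters, score, score_threshold)
        if found is None:
            return clusters  # no merge preserves the clique constraint
        i, j, merged = found
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        if merged not in clusters:
            clusters.append(merged)


# Hypothetical toy scores: (l2, d2) is below the threshold, so no single
# clique can contain all four entities.
toy_scores = {("l1", "d1"): 1.0, ("l1", "d2"): 0.9,
              ("l2", "d1"): 0.8, ("l2", "d2"): 0.1}
toy_clusters = first_level(toy_scores, score_threshold=0.5)
```

On the toy scores, the two resulting clusters share the disease d1, which shows how overlapping clusters arise even at the first level.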
The quality criterion we use, inspired by [9], is the cluster cohesiveness, i.e. the average score of all the lncRNA-disease pairs that can be identified from the cluster. To build further hierarchical levels, we repeat the same process, relaxing the clique constraint and introducing a quality threshold ? on the cohesiveness of the cluster after merging. In detail, the iterative process stops (and returns a new hierarchical level) when no merge leading to a cluster with cohesiveness greater than ? can be performed.

Prediction of the relationships. Once we obtain the hierarchy of clusters, we identify possibly unknown relationships at each level of the hierarchy. In particular, the prediction is performed by assigning to each possible lncRNA-disease pair a score equal to the cohesiveness of the cluster in which it falls. When a lncRNA-disease pair appears in multiple clusters, we combine the cohesiveness values of these clusters to obtain the final score. Baseline combination strategies are the maximum, the minimum and the average. In this work, we propose a different combination function, which rewards the cases in which the pair appears in several highly cohesive clusters (indicating a higher probability of existence). Formally, given $C_{ij} = [C_1, C_2, \ldots, C_n]$, the list of the clusters in which the lncRNA $l_i$ and the disease $d_j$ appear, sorted in descending order of their cohesiveness values $w_1, w_2, \ldots, w_n$, the score of the pair $(l_i, d_j)$ is computed as $G(C_n)$, where:
$$G(C_1) = w_1$$
$$G(C_k) = G(C_{k-1}) + (1 - G(C_{k-1})) \cdot w_k$$
In the experiments, we call this combination function custom and evaluate its effectiveness against the baseline combination approaches.

Results
The proposed method has been implemented in the system LP-MTRCLUS (Link Prediction through Multi-Type CLUStering). The dataset considered in this preliminary experiment (shown in Figure 1 as a relational database) has been built by integrating several existing biological datasets: interactions between lncRNAs and diseases and between lncRNAs and their target genes from [3]; interactions between miRNAs and lncRNAs from [5]; interactions between diseases and genes from DisGeNET [1]; interactions between miRNAs and genes and between miRNAs and diseases from miR2Disease [7]. The resulting relational database consists of 7,050 diseases, 507 lncRNAs, 508 miRNAs, 94,527 genes, 953 interactions between diseases and lncRNAs, 2,877 interactions between diseases and miRNAs, 26,522 interactions between diseases and genes, 70 interactions between lncRNAs and miRNAs, 252 interactions between lncRNAs and genes, and 803 interactions between miRNAs and genes.

Figure 1. Relational schema of the database considered in the analysis and the results of the preliminary experiments, at three different levels of the hierarchy. A point in the graphs represents the percentage of true relationships discovered (Y-axis) when we take a given number of returned interactions (X-axis).

In this preliminary experiment, we compared LP-MTRCLUS with HOCCLUS2 [9], a biclustering algorithm that also builds a hierarchy of clusters. The evaluation has been performed by applying 10-fold cross-validation on the set of known lncRNA-disease relationships. We averaged the results according to the true positive rate, defined as $TPR = \frac{TP}{TP + FN}$, where TP is the number of lncRNA-disease relationships that have been discovered and that have also been validated in the literature, whereas FN is the number of known lncRNA-disease relationships that the considered system was not able to predict. Both TP and FN have been computed according to a threshold on the relationship score. By varying this threshold, we obtain a curve in which each point represents the percentage of true relationships discovered when we take a given number of returned interactions (see Figure 1).
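To make the evaluation concrete, the following Python sketch combines the cohesiveness values of the clusters containing a pair with the custom function $G$ from the Methods, and then computes the true positive rate as a function of the number of returned pairs. It is an illustrative sketch, not the code used in the experiments: the function names combine_custom and tpr_curve, the dictionary of pair scores and the set of held-out known relationships are assumptions.

```python
def combine_custom(cohesiveness_values):
    """Combine cluster cohesiveness values w_1..w_n with the custom function G.

    G(C_1) = w_1 and G(C_k) = G(C_{k-1}) + (1 - G(C_{k-1})) * w_k,
    with the values taken in descending order.
    """
    if not cohesiveness_values:
        raise ValueError("the pair must appear in at least one cluster")
    ws = sorted(cohesiveness_values, reverse=True)
    g = ws[0]
    for w in ws[1:]:
        g = g + (1.0 - g) * w
    return g


def tpr_curve(pair_scores, held_out_known):
    """Return [(k, TPR)] when the k best-scored pairs are returned.

    pair_scores:    dict {(lncrna, disease): predicted score}
    held_out_known: set of known relationships hidden during training;
                    those that are never returned count as false negatives,
                    so TP + FN is always len(held_out_known).
    """
    if not held_out_known:
        raise ValueError("no known relationships to evaluate against")
    ranked = sorted(pair_scores, key=pair_scores.get, reverse=True)
    curve = []
    tp = 0
    for k, pair in enumerate(ranked, start=1):
        if pair in held_out_known:
            tp += 1
        curve.append((k, tp / len(held_out_known)))
    return curve


# Hypothetical example: one pair falls in two clusters, another in one.
example_scores = {
    ("l1", "d1"): combine_custom([0.5, 0.5]),
    ("l2", "d1"): combine_custom([0.6]),
    ("l1", "d2"): combine_custom([0.4]),
}
example_curve = tpr_curve(example_scores, {("l1", "d1"), ("l1", "d2")})
```

For values in [0, 1], each step of the recurrence can only increase the score, so the custom combination is never lower than the max strategy and grows when a pair appears in several cohesive clusters.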
Following the results in [9], for both systems (LP-MTRCLUS and HOCCLUS2) we set the value of ? to 0.2 and ? = ? - 0.2. Due to space constraints, we limit the results to the first three levels of the hierarchy. Figure 1 shows that LP-MTRCLUS outperforms HOCCLUS2 at all hierarchical levels when the custom combination function is used. This combination function also outperforms the other combination strategies, whose performance is often worse than that of HOCCLUS2. As expected, the min strategy is the most conservative, whereas the max strategy shows a trend similar to that of HOCCLUS2. The avg strategy always lies between the min and max strategies. As a final remark, it is worth noting that HOCCLUS2 was able to obtain comparable performance because we provided it with the set of lncRNA-disease scores computed by our system since, in its original form, it is not able to analyze a complex relational database. For this reason, we are currently performing further experiments with other competitors to evaluate the effect of the integrated information on the results.

Conclusions
In this work, we focused on the recognized role of lncRNAs in human diseases. In particular, we proposed a computational method that predicts possibly unknown lncRNA-disease relationships by exploiting a clustering algorithm that works on multiple types of objects. Preliminary experiments showed that the proposed method, especially when adopting the proposed combination strategy, outperforms the algorithm HOCCLUS2. We are currently performing additional experiments with other competitor approaches to evaluate in depth the effectiveness of the clustering-based method for this purpose, as well as the effect of exploiting information about related biological entities, such as miRNAs, genes and their relationships with diseases and lncRNAs.

References
1. A. Bauer-Mehren, M. Rautschka, F. Sanz, and L. I. Furlong. DisGeNET: a Cytoscape plugin to visualize, integrate, search and analyze gene-disease networks. Bioinformatics, 26(22):2924-2926, 2010.
2. T. Cech and J. Steitz. The Noncoding RNA Revolution--Trashing Old Rules to Forge New Ones. Cell, 157(1):77-94, 2014.
3. G. Chen, Z. Wang, D. Wang, C. Qiu, M. Liu, X. Chen, Q. Zhang, G. Yan, and Q. Cui. LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Research, 41(D1):D983-D986, 2013.
4. J. Hayes, P. P. Peruzzi, and S. Lawler. MicroRNAs in cancer: biomarkers, functions and therapy. Trends in Molecular Medicine, 20(8):460-469, 2014.
5. A. Helwak, G. Kudla, T. Dudnakova, and D. Tollervey. Mapping the human miRNA interactome by CLASH reveals frequent noncanonical binding. Cell, 153(3):654-665, 2013.
6. S. Jalali, S. Kapoor, A. Sivadas, D. Bhartiya, and V. Scaria. Computational approaches towards understanding human long noncoding RNA biology. Bioinformatics, 2015.
7. Q. Jiang, Y. Wang, Y. Hao, L. Juan, M. Teng, X. Zhang, M. Li, G. Wang, and Y. Liu. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Research, 37(suppl 1):D98-D104, 2009.
8. M.-T. Melissari and P. Grote. Roles for long non-coding RNAs in physiology and disease. Pflügers Archiv - European Journal of Physiology, 468(6):945-958, 2016.
9. G. Pio, M. Ceci, D. D'Elia, C. Loglisci, and D. Malerba. A Novel Biclustering Algorithm for the Discovery of Meaningful Biological Correlations between microRNAs and their Target Genes. BMC Bioinformatics, 14(S-7):S8, 2013.


