Identification of disease modules through network diffusion and integration of genome-wide datasets.

Mosca, Ettore; Bersanelli, Matteo; Castellani, Gastone; Milanesi, Luciano

MOTIVATION. Integrative analyses of multi-layer datasets are required to obtain a more comprehensive view of biological processes, and the development of tools that integrate different layers of biological information is one of the major challenges for computational scientists. Molecular interaction data, although still incomplete, provide a useful framework for the integrative analysis of genome-wide data and valuable insights into molecular mechanisms that can be modulated by therapeutic strategies [Berger et al. 2013]. We describe a computational approach that takes advantage of diffusion processes on networks to highlight disease modules from the integrative analysis of multiple omics datasets.
METHODS. We use the network propagation (NP) algorithm [Vanunu et al. 2010] to smooth the genome-wide data of each sample. NP simulates a random walk with restarts on a network and can be seen as the discrete form of an open "source-sink" dynamical system in which particles of fluid diffuse throughout the network. After convergence, the stationary distribution of fluid remaining on the network depends mainly on the initial distribution and the network topology. We define pathway scores as the sum of the smoothed information detected in pathway components, and we scale these scores through a robust standardization that enables direct comparison among pathways of different sizes (from 5 to 500 elements). We calculate two-class statistics for genes and pathways using SAM (significance analysis of microarrays) [Tusher et al. 2001]. We use the Jaccard coefficient to define the similarity between pathways. We extract significant subnetworks from a graph (weighted with statistics) using search heuristics based on the calculation of the minimum spanning tree and on genetic algorithms [Mosca and Milanesi 2013].
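The smoothing step can be illustrated with a minimal sketch of network propagation in the iterative form given by Vanunu et al. (2010), F ← αW'F + (1 − α)Y, where W' is the adjacency matrix normalized by the square roots of the node degrees. The function name, the value α = 0.7, the convergence tolerance and the four-node toy network are assumptions made for illustration only; they are not taken from the analysis described here.

```python
import numpy as np

def network_propagation(adj, y0, alpha=0.7, tol=1e-9, max_iter=1000):
    """Smooth node values y0 over a network (illustrative sketch).

    adj   : symmetric (n x n) adjacency matrix
    y0    : initial values, shape (n,) or (n, samples)
    alpha : fraction of fluid passed to neighbours at each step (assumed value)
    """
    adj = np.asarray(adj, dtype=float)
    y0 = np.asarray(y0, dtype=float)
    deg = adj.sum(axis=1)
    deg[deg == 0] = 1.0                      # isolated nodes: avoid division by zero
    inv_sqrt = 1.0 / np.sqrt(deg)
    w = adj * inv_sqrt[:, None] * inv_sqrt[None, :]   # W'_ij = W_ij / sqrt(d_i d_j)
    f = y0.copy()
    for _ in range(max_iter):
        f_next = alpha * (w @ f) + (1.0 - alpha) * y0  # diffuse, then re-inject the source
        if np.abs(f_next - f).max() < tol:
            return f_next
        f = f_next
    return f

# Toy example (hypothetical): a path 0 - 1 - 2 - 3 with all signal on node 0.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
y0 = np.array([1.0, 0.0, 0.0, 0.0])
smoothed = network_propagation(adj, y0)
print(smoothed)  # values decrease with distance from node 0
```

Because the iteration is linear, the same function accepts a genes-by-samples matrix as `y0` and smooths every sample column at once. The steady state also has the closed form (1 − α)(I − αW')⁻¹Y, which is convenient for checking small examples.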
We use the collection of pathways provided by the NCBI BioSystems database and protein interaction data from STRING, NCBI Interactions, the human interactome project and predictions by FPCLASS.
RESULTS. We used network propagation to diffuse genome-wide information over the protein interaction network. This analysis redefines genome-wide statistics on the basis of the considered topology and identifies the regions of the network most influenced by the most significant variations. We used a simulated dataset to study which initial distributions of information on the nodes of a scale-free graph lead to network-smoothed statistics that are well separated from those expected from random initial distributions. Initial conditions enriched with hubs and neighbouring proteins lead to significant steady states that enable the identification of disease modules. We used network-smoothed scores, obtained by network smoothing of single nucleotide variations and differential expression, to calculate gene-wise and pathway-based differential scores between two classes of samples. We extracted the sub-network of genes and the sub-network of pathways with the highest-scoring statistics. The former summarizes the module of genes representing the region of the interactome that is highly affected by molecular variations, while the latter provides a "higher-level" view of the relations among the biological processes regulated by the gene module. Lastly, we defined an integrative disease module that combines the findings in each of the two omics layers by mapping each module onto the same protein interaction network. We repeated the network-based analysis using four protein interaction networks. We found genes and pathways shared by multiple interactomes, but also interactome-specific results, which reflect the significant differences among the available protein interaction networks.
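The pathway-level quantities used in this analysis (the sum of smoothed values over pathway members, a robust standardization and the Jaccard similarity between pathways) can be sketched as follows. The exact standardization is not specified in this abstract; the version below, which centres and scales each pathway score by the median and the median absolute deviation of random gene sets of the same size, is only an assumption chosen to show how scores could be made comparable across pathway sizes. All function names and the toy data are hypothetical.

```python
import numpy as np

def pathway_score(smoothed, members):
    """Sum of network-smoothed values over the genes of one pathway."""
    return float(np.sum(smoothed[members]))

def robust_z(smoothed, members, n_random=1000, seed=0):
    """Robust standardization against random gene sets of equal size.

    ASSUMPTION: median/MAD scaling against a size-matched random
    background; the abstract does not state the procedure actually used.
    """
    rng = np.random.default_rng(seed)
    size = len(members)
    null = np.array([
        smoothed[rng.choice(smoothed.size, size=size, replace=False)].sum()
        for _ in range(n_random)
    ])
    med = np.median(null)
    mad = 1.4826 * np.median(np.abs(null - med))  # consistent with SD for normal data
    if mad == 0:
        return 0.0                                # degenerate background: no spread
    return (pathway_score(smoothed, members) - med) / mad

def jaccard(a, b):
    """Jaccard coefficient between two pathways given as gene collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy data (hypothetical): 100 genes whose smoothed value equals their index.
smoothed = np.arange(100, dtype=float)
high = [95, 96, 97, 98, 99]   # pathway made of the highest-valued genes
low = [0, 1, 2, 3, 4]         # pathway made of the lowest-valued genes
z_high = robust_z(smoothed, high)
z_low = robust_z(smoothed, low)
similarity = jaccard({1, 2, 3}, {2, 3, 4})
```

In this toy setting the high-valued pathway receives a positive standardized score and the low-valued one a negative score, while the two three-gene sets share two of four distinct genes.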
Preliminary results show that our network-based strategy, when applied to genetic variations and gene expression differences between obese and normal individuals, identifies a module of genes involved in biological processes that regulate lipid metabolism.
REFERENCES.
Berger B, Peng J, Singh M (2013) Computational solutions for omics data. Nature Reviews Genetics 14(5), 333-346. doi:10.1038/nrg3433.
Mosca E, Milanesi L (2013) Network-based analysis of omics with multi-objective optimization. Molecular BioSystems 9, 2971-2980.
Tusher VG, Tibshirani R, Chu G (2001) Significance analysis of microarrays applied to the ionizing radiation response. PNAS 98(9), 5116-5121.
Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R (2010) Associating Genes and Protein Complexes with Disease via Network Propagation. PLoS Comput Biol 6(1), e1000641.