Methods for Integrating Multi-omics Data: Mathematical Aspects.

Bersanelli, Matteo; Mosca, Ettore; Castellani, Gastone; Milanesi, Luciano

Interest in methods for the integrated analysis of multi-omics data has grown in recent years. Deducing valuable biological or medical meaning from a multi-omics dataset is a non-trivial problem; current computational methods do not yet fully support the integrated analysis of these datasets together with other phenotypic or clinical data, especially when the analysis should yield not only a list of markers but also the interactions between the biological elements under investigation. We review the most advanced approaches for integrating multi-omics datasets, highlighting the most widely used mathematical techniques, the concepts that often lie in the background of each specific method, their assumptions, and open issues. Mathematically, the problem can be formulated as the joint analysis of multiple "biological components-by-sample" matrices of different sizes, possibly using other matrices that contain prior information, usually on the biological components. Quite a few methods proposed in recent years extend established single-matrix methods to the simultaneous analysis of multiple matrices, in order to find common clusters and relationships among variables. Methods such as iCluster and Multiple Data Integration have a strong and sophisticated Bayesian structure, which allows the introduction of known probability distributions in order to estimate meaningful parameters or variables from the data; Paradigm also relies on a Bayesian model, whose goal is to identify relevant biological pathways and assess their importance in different patients. Network-based approaches use currently known (e.g. protein-protein interactions) or predicted relationships between biological variables as a constraint for the analysis. Furthermore, graph measures (degree, connectivity, centrality) and subnetwork extraction strategies are used to identify valuable biological information, such as clusters (communities) of samples.
For example, SteinerNet establishes a framework for integrating omics datasets by solving the prize-collecting Steiner tree problem in order to reconstruct biologically meaningful signaling networks. Multi-objective optimization has been applied to the extraction of subnetworks enriched in multi-omics information. Network Based Stratification (NBS) and Similarity Network Fusion (SNF) find clusters by minimizing a function of the data that incorporates information from the network structure (e.g. using the suitably normalized adjacency or Laplacian matrix): NBS smooths mutation profiles and clusters patients using generalized non-negative matrix factorization; SNF integrates patient-patient networks in order to find clusters by means of spectral clustering. Multiplex networks (multiple networks defined on the same set of vertices) have been proposed to model and simultaneously analyze several datasets as a multi-layered network. In conclusion, considering the state of the art of the last few years, graphs and Bayesian statistics are two of the most promising frameworks for developing methods for multi-omics data analysis.
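
To make the idea of smoothing profiles over a normalized adjacency matrix more concrete, the following is a minimal sketch of network propagation on a toy example. It is not taken from the abstract and should not be read as the NBS implementation: the update rule F_{t+1} = alpha * F_t * A' + (1 - alpha) * F_0, the symmetric normalization A' = D^{-1/2} A D^{-1/2}, the value alpha = 0.7, the function names and the four-gene path network are all assumptions chosen for illustration.

```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetric degree normalization D^{-1/2} A D^{-1/2} (an assumed choice)."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(deg[nonzero])
    return adj * inv_sqrt[:, None] * inv_sqrt[None, :]


def smooth_profiles(profiles, adj, alpha=0.7, tol=1e-8, max_iter=1000):
    """Propagate sample-by-gene profiles over a gene-gene network.

    Iterates F <- alpha * F @ A' + (1 - alpha) * F0 until the change
    between iterations falls below `tol` (hypothetical helper).
    """
    a_norm = normalize_adjacency(adj)
    f0 = np.asarray(profiles, dtype=float)
    f = f0.copy()
    for _ in range(max_iter):
        f_next = alpha * (f @ a_norm) + (1.0 - alpha) * f0
        if np.linalg.norm(f_next - f) < tol:
            return f_next
        f = f_next
    return f


# Toy network: four genes connected in a path 0 - 1 - 2 - 3 (made-up data).
adjacency = np.array(
    [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=float,
)

# Two patients, each with a single mutated gene (rows: patients, columns: genes).
mutations = np.array(
    [[1, 0, 0, 0],
     [0, 0, 0, 1]],
    dtype=float,
)

smoothed = smooth_profiles(mutations, adjacency, alpha=0.7)
print(np.round(smoothed, 3))
```

After smoothing, each patient's signal is spread to genes that are close to the mutated gene in the network, so sparse binary profiles become dense and comparable. A clustering step (for example a matrix factorization or spectral method, as mentioned above) would then operate on the smoothed matrix.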