BioGraphDB: a graph database based on integration of publicly available bioinformatics resources

Antonino Fiannaca;Massimo La Rosa;Laura La Paglia;Antonio Messina;Riccardo Rizzo;Alfonso Urso
2016

Abstract

Motivation: In the era of Big Data, a huge amount of biological data on different entities, such as proteins, genes, non-coding RNAs, diseases and functional associations, has been made available. These data are typically stored in several bioinformatics databases, each implementing its own data model and user interface. In many bioinformatics scenarios, however, more than one resource is needed. For a bioinformatician this means extra effort in switching from one service to another, each with a different interface, and working time lost transferring data and intermediate results between resources, sometimes while disambiguating aliases and accession IDs. For these reasons, a single bioinformatics platform that integrates many biological resources and services is a fundamental need.

Methods: We present BioGraphDB, an integrated database that collects and links heterogeneous bioinformatics resources. BioGraphDB is implemented as a NoSQL graph database built upon the OrientDB platform. Compared with traditional SQL databases, graph databases offer greater scalability and query efficiency as the size of the data grows. At the moment, BioGraphDB integrates ten bioinformatics resources: miRBase, the most complete repository of microRNA (miRNA) sequences and annotations; UniProtKB, the largest public collection of proteins and their functional annotations; Gene Ontology (GO), which stores annotations about biological processes, cellular components and molecular functions; Reactome, which contains validated metabolic pathways; Entrez Gene, which collects a wide set of information about genes; RefSeq, a collection of manually curated genomic, proteomic and transcriptomic sequences; PubMed, the online resource for biomedical scientific publications; miRCancer, which provides associations between deregulated miRNAs and cancer diseases; HGNC, which hosts the official nomenclature for genes, including synonyms and ID conversion among biobanks; and miRNA-target interactions, which contains both predicted and validated interactions between miRNAs and target genes. Each component database was downloaded from its original site and processed using customized Extract-Transform-Load (ETL) blocks, written in Java, which convert the original data format of each resource into a graph structure and link the resources through their common entities. A detailed database schema of BioGraphDB is shown in Figure 1.

Results: BioGraphDB makes it possible to tackle complex bioinformatics scenarios on a single platform, thanks to the integration of different data sources. For example, it is possible to identify and analyze tumour-suppressor or oncogenic miRNAs. In this scenario, starting from a specific metabolic pathway, a group of proteins involved in that pathway is first selected using Reactome. This set of proteins can then be analyzed with the miRNA-target interaction data, which allow the identification of predicted or validated miRNAs that target that protein set. The resulting list of miRNAs is then passed to miRCancer to link one or more miRNAs to a specific disease. In particular, this resource reports the over- or under-expression status of the miRNAs under study, highlighting the relationship between specific miRNAs and the types of cancer in which they are up- or down-regulated. All the above steps can be accomplished within BioGraphDB alone. Its graph structure supports a dedicated graph query language, Gremlin, which lets the user replicate every processing step with simple commands, without concerns about interoperability among different resources.
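The scenario above (pathway, then proteins, then targeting miRNAs, then cancer associations) can be sketched as a chain of graph hops. The following is only an illustration in plain Python, not BioGraphDB's actual schema and not Gremlin: the `ToyGraph` class, the edge labels (`contains`, `coded_by`, `targets`, `deregulated_in`) and all identifiers such as `pathway-1` or `mirna-1` are hypothetical placeholders made up for this example. The `out` and `into` helpers loosely mirror the forward and backward edge steps of a graph traversal language.

```python
from collections import defaultdict


class ToyGraph:
    """Minimal labeled, directed graph; a stand-in for a real graph database."""

    def __init__(self):
        self._out = defaultdict(set)  # (source, label) -> set of targets
        self._in = defaultdict(set)   # (target, label) -> set of sources

    def add_edge(self, source, label, target):
        self._out[(source, label)].add(target)
        self._in[(target, label)].add(source)

    def out(self, nodes, label):
        """Follow `label` edges forward from every node in `nodes`."""
        return {t for n in nodes for t in self._out.get((n, label), ())}

    def into(self, nodes, label):
        """Follow `label` edges backward into every node in `nodes`."""
        return {s for n in nodes for s in self._in.get((n, label), ())}


def build_graph():
    """ETL-like step: load made-up records and link them via shared identifiers."""
    g = ToyGraph()
    # Pathway membership (hypothetical Reactome-like records).
    for pathway, protein in [("pathway-1", "protein-A"),
                             ("pathway-1", "protein-B"),
                             ("pathway-2", "protein-C")]:
        g.add_edge(pathway, "contains", protein)
    # Protein-to-gene links: the common entities that join resources.
    for protein, gene in [("protein-A", "gene-A"),
                          ("protein-B", "gene-B"),
                          ("protein-C", "gene-C")]:
        g.add_edge(protein, "coded_by", gene)
    # miRNA-target interactions (hypothetical).
    for mirna, gene in [("mirna-1", "gene-A"),
                        ("mirna-2", "gene-B"),
                        ("mirna-3", "gene-C")]:
        g.add_edge(mirna, "targets", gene)
    # miRNA-cancer associations with expression status (hypothetical).
    for mirna, cancer, status in [("mirna-1", "cancer-X", "down"),
                                  ("mirna-2", "cancer-Y", "up")]:
        g.add_edge(mirna, "deregulated_in", (cancer, status))
    return g


def mirna_cancer_links(g, pathway):
    """Pathway -> proteins -> genes -> targeting miRNAs -> (cancer, status)."""
    proteins = g.out({pathway}, "contains")
    genes = g.out(proteins, "coded_by")
    mirnas = g.into(genes, "targets")
    return sorted((m, cancer, status)
                  for m in mirnas
                  for cancer, status in g.out({m}, "deregulated_in"))


if __name__ == "__main__":
    for row in mirna_cancer_links(build_graph(), "pathway-1"):
        print(row)
```

Because every resource lives in the same graph, the whole analysis is one function that walks edges; no data has to be exported from one service and re-imported into another, which is the interoperability benefit the abstract describes.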
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
graph database
data integration
microRNA

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/321169