Using GMQL to Build Workflows for Querying, Downloading and Integrating Public with Private Genomic Datasets
2019
Abstract
Integrative analyses of data from public genomic datasets have allowed GWAS to investigate the function of genetic variants, providing more insight than single-platform approaches. There is a large body of literature on integrative genomics methodologies and much progress has been made; however, the potential of integrated genomic analyses remains largely unexploited because of the heterogeneous nature of the data. The aim of this work is to provide researchers with a workflow based on the GenoMetric Query Language (GMQL) system for querying and integrating various public and private genomic datasets. GMQL is a novel high-throughput computational software tool that makes it easy to express queries over genomic regions and their metadata, much as relational algebra and Structured Query Language (SQL) do over a relational database. Moreover, GMQL allows private cancer datasets (i.e. datasets created by a specific user as a result of their own experiments and studies) to be combined with datasets of genomic features and biological/clinical metadata sourced from public genomic datasets (some of which are already available in the GMQL Repository, such as ENCODE, Roadmap Epigenomics and TCGA, as well as annotations from GENCODE and RefSeq). The GMQL system is equipped with a web-based interface that provides a user-friendly environment for bioinformaticians and biologists who need to build GenoMetric Query Language queries more intuitively. This work presents the design and application of a data analysis workflow built using GMQL that integrates the publicly available datasets contained in the GMQL Repository with private SNP (single-nucleotide polymorphism) datasets generated using the Affymetrix DMET Plus platform, a multi-gene pharmacogenomic platform for drug metabolism. By detecting SNPs in genes related to drug metabolism, the DMET platform can identify the relationship between patients' genomic variations and drug metabolism.
In the case study, we present several queries that combine public and private data, and we retain only the SNPs that overlap highly expressed genes (i.e. genes whose expression level is above a given threshold).
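The selection described in the case study, keeping only SNPs that overlap genes whose expression exceeds a threshold, can be illustrated with a minimal sketch. It is written in plain Python rather than GMQL syntax, and it is not the workflow's actual implementation: the Region class, the function name, the half-open coordinate convention, the example regions and the threshold value are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical genomic region; coordinates are 0-based, half-open [start, stop)."""
    chrom: str
    start: int
    stop: int
    name: str
    value: float = 0.0  # expression level for genes; unused for SNPs


def snps_on_highly_expressed_genes(snps, genes, threshold):
    """Return names of SNPs overlapping at least one gene with expression > threshold."""
    # Step 1: keep only genes whose expression is above the threshold.
    expressed = [g for g in genes if g.value > threshold]
    hits = []
    for snp in snps:
        for gene in expressed:
            same_chrom = snp.chrom == gene.chrom
            # Two half-open intervals overlap when each starts before the other ends.
            overlaps = snp.start < gene.stop and gene.start < snp.stop
            if same_chrom and overlaps:
                hits.append(snp.name)
                break  # one overlapping gene is enough to keep the SNP
    return hits


# Made-up example data, not taken from any real dataset.
genes = [
    Region("chr1", 100, 500, "GENE_A", 12.5),
    Region("chr1", 800, 1200, "GENE_B", 0.3),
    Region("chr2", 100, 500, "GENE_C", 9.0),
]
snps = [
    Region("chr1", 150, 151, "snp_1"),
    Region("chr1", 900, 901, "snp_2"),
    Region("chr2", 499, 500, "snp_3"),
    Region("chr2", 500, 501, "snp_4"),
]

selected = snps_on_highly_expressed_genes(snps, genes, threshold=1.0)
print(selected)
```

With these made-up regions, snp_2 is dropped because its gene falls below the threshold, and snp_4 is dropped because it lies just past the end of GENE_C. The nested loop is quadratic and is only meant to make the logic explicit; a system working on real datasets would rely on indexed or sorted region joins instead.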