A standard-based HPC platform for genome annotation, analysis and visualization

Manconi, A; Moscatelli, M; Gnocchi, M; Milanesi, L

Introduction

Advances in next-generation sequencing technologies are facilitating the sequencing and de novo assembly of genomes from different eukaryotic and prokaryotic species. The growing number of fully assembled sequences is providing new opportunities to study the genomes of different species. In addition, the availability of genome sequences of related species makes it possible to compare them in order to study their evolution. In this context, researchers increasingly rely on comparative genomics to explore the genomic signals that control gene function across many species, with the aim of better understanding the structure and function of genes. This information helps researchers develop new approaches for treating diseases. To perform these analyses, researchers need a variety of tools to annotate, analyze, explore, and compare genome sequences. Specialized tools with different features have been made available to the scientific community. However, these tools may be difficult for biologists to use. In general, to facilitate the work of researchers, these tools should adopt common standards to ensure interoperability among them [1]. In practice, tools and analytical methods do not always use common standards, which makes interoperability difficult. Moreover, researchers may need to replicate scientific analyses presented in a paper, to modify them, and to update data according to their specific needs. This may involve the analysis of huge amounts of data, posing new computational challenges. Another problem is data sharing: typically, each tool generates and stores data in formats that are incompatible with those of the other tools. For these reasons, researchers involved in these analyses should have access to a single platform designed to ensure direct interoperability among the tools while supporting automated computation.
Based on these considerations, we implemented a standard-based HPC platform to support researchers in annotating, analyzing, and visualizing eukaryotic and prokaryotic genomes.

Methods

The platform has been conceived to support the storage, annotation, and analysis of both eukaryotic and prokaryotic genomes. To this end, it has been built on a hardware infrastructure intended for big-data applications, which consists of a massive storage platform of 1.6 PB, 2040 CPU cores, 16 NVIDIA K20 GPUs, and 2 big-memory nodes (one equipped with 1 TB and one with 512 GB of memory). Most of the tools integrated in the platform are components of the Generic Model Organism Database (GMOD) project [2]. GMOD is a collection of interconnected tools and databases, widely used by the scientific community, that includes several components for managing, annotating, and visualizing genomic data. Many of the GMOD components are mature tools with several years of development and testing driven by diverse groups of developers, scientists, and laboratories that use and/or improve these components every day. In particular, the platform supports the following GMOD components: Maker [3][4], Chado [5], WebApollo [6], GBrowse [7][8], SynView [9], and Galaxy [10]. The platform also supports the GA4GH API [11], which is not a GMOD component and is released by the Global Alliance for Genomics and Health [12] to share genomic data. The above tools are briefly described in the remainder of this section.

• Maker is an annotation pipeline that can be used for de novo annotation of genomes as well as for updating existing annotations to reflect new evidence. Maker is MPI-capable, allowing rapid parallelization across computer clusters, which also makes it suitable for annotating large genomes.
Its annotation pipeline identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, and automatically synthesizes these data into gene annotations with evidence-based quality values. Maker outputs annotations in the standard GFF3 format, which can be loaded directly into relational databases such as Chado and into genome browsers that adhere to the GMOD standards.
• Chado is a relational database schema that underlies many GMOD installations. It is used to represent different types of biological data and has been designed to handle biological knowledge related to sequences, sequence comparisons, phenotypes, genotypes, ontologies, publications, and phylogeny.
• WebApollo is a web-based manual annotation environment for distributed communities. It allows multiple users to annotate parts of the same sequence concurrently: any change made by other users is reported in the user's browser window. WebApollo keeps track of all changes, which can be approved or rejected by an administrator.
• GBrowse is a powerful and mature genome browser. It supports most genome browser features, including qualitative and quantitative tracks, track uploading, track sharing, track downloading, and interactive track configuration.
• SynView is an interactive and customizable comparative genomics visualization tool based on GBrowse. It can display both the genome comparison and the associated functional annotations in the same working environment.
• Galaxy is an open, web-based scientific workflow system for data-intensive biomedical research that is easily accessible to researchers without programming experience. It provides an environment that helps researchers run bioinformatics analysis tools as well as ad hoc workflows.
• The GA4GH API has been defined to allow the interoperable exchange of genomic information among multiple organizations and on multiple platforms.
It is a freely available open standard for interoperability that uses common web protocols to support serving and sharing data on DNA sequences, genomic annotations, and genomic variations. The API makes it possible to create a data source that can be easily integrated into genomic analysis pipelines as well as into specialized genomics platforms.

Results

The above tools have been installed and configured to work in the platform. The platform provides different access points. Authorized users can access Maker to annotate their sequences as well as to update existing annotations. Annotation data obtained with Maker are loaded into a Chado database and integrated into both WebApollo and GBrowse. Users can access WebApollo for manual annotation, and GBrowse to explore the annotated sequences as well as to compare genome sequences with SynView. GBrowse has been configured to show different tracks for exploring alignments, protein-coding genes, CDS, mRNA, and other regions, as well as the GC content of the analyzed sequences. Users can customize the tracks, download them, or upload other tracks for visualization. Moreover, GBrowse has been configured to send data to Galaxy: users can visualize selected tracks on a given part of a sequence and, with a single click, send these data to Galaxy for analysis. Galaxy, in turn, has been configured to run jobs on the hardware infrastructure described above and to support both CPU- and GPU-based tools [13]. Bioinformatics is exploring new computational approaches based on hardware accelerators such as GPUs, whose use over the last years has resulted in significant performance increases for certain applications. Although GPUs are increasingly used, most laboratories do not have access to a GPU cluster or server.
In this context, it is very important to provide services that make these tools usable, with the aim of ensuring the reproducibility of specific analyses. Galaxy supports different distributed resource managers, which allows it to work with different clusters. In our opinion, SLURM [14] is the most suitable workload manager to manage and control jobs. SLURM is a highly configurable workload and resource manager and is currently used on six of the ten most powerful computers in the world, including Piz Daint, which uses over 5000 NVIDIA Tesla K20 GPUs. Users can interactively perform and refine their analyses using Galaxy. In particular, users can send or upload data into Galaxy from the other access points of the platform, analyze them, and store the results without dealing with the implementation details of the underlying computing infrastructure. A GA4GH server has also been installed to share annotation data and to make them easy to integrate into external genomic analysis pipelines. The GA4GH API allows genomic data to be retrieved according to different criteria that can be combined to meet the needs of the researcher. To facilitate the work of researchers, specialized tools have also been added to Galaxy to retrieve annotation data from the GA4GH server. Currently, the platform is used to support some initiatives of the InterOmics Project [15] aimed at studying human cancer, the microbiome, and some plant genomes.

Conclusions

We presented a dedicated platform for genome annotation and analysis. The platform has been implemented to help researchers store, annotate, explore, and analyze genomic data. The main goal driving the design of the platform has been to facilitate analysis and computational reproducibility.
To this end, the platform integrates interconnected tools that use common standards while ensuring automated computation on a robust hardware infrastructure that supports both CPU- and GPU-based computation. The GA4GH open standard has been used to share genomic data. The platform is intended to be dynamic; it will therefore be extended with other specialized software solutions useful for genome analysis. Currently, we are working to integrate other specialized comparative genomics tools of the GMOD project, such as CMap [16] and SynBrowse [17].
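
The role of a common standard in linking these tools can be illustrated with GFF3, the format in which Maker writes annotations and which Chado and the GMOD genome browsers can load. The following is a minimal sketch, not the code used in the platform: it reads the nine tab-separated GFF3 columns and the key=value attribute column using only the Python standard library. The function name parse_gff3, the Feature record, and the two example records (sequence name, IDs, coordinates) are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Iterator, NamedTuple
from urllib.parse import unquote


class Feature(NamedTuple):
    """One GFF3 record (hypothetical container used only in this sketch)."""
    seqid: str
    source: str
    feature_type: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3(lines: Iterable[str]) -> Iterator[Feature]:
    """Yield a Feature for every record line of a GFF3 file."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an optional embedded sequence section follows; stop here
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 columns, found {len(cols)}: {line!r}")
        attributes: Dict[str, str] = {}
        if cols[8] != ".":
            for pair in cols[8].split(";"):
                if not pair:
                    continue
                key, _, value = pair.partition("=")
                # reserved characters in GFF3 values are percent-encoded
                attributes[unquote(key)] = unquote(value)
        yield Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attributes)


# Invented example records; not output produced by the platform.
EXAMPLE = (
    "##gff-version 3\n"
    "chr1\tmaker\tgene\t1000\t5000\t.\t+\t.\tID=gene0001;Name=hypothetical%20gene\n"
    "chr1\tmaker\tmRNA\t1000\t5000\t.\t+\t.\tID=mrna0001;Parent=gene0001\n"
)

features = list(parse_gff3(EXAMPLE.splitlines()))
genes = [f for f in features if f.feature_type == "gene"]
for gene in genes:
    length = gene.end - gene.start + 1
    print(gene.attributes["ID"], gene.attributes.get("Name"), length)
```

Because every tool reads the same nine columns and the same ID/Parent attributes, a gene model written by one component can be consumed by another without a conversion step, which is the kind of interoperability the platform relies on.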