Pre-processing of high-throughput sequencing data

Manconi, A; Moscatelli, M; Gnocchi, M; Milanesi, L

Motivation Next-generation sequencing (NGS) has revolutionized genomic research. Technological advances have reduced sequencing costs while notably increasing sequencing throughput. Typically, NGS data are affected by artifacts of different kinds that arise during the sequencing process. As these artifacts may influence downstream analyses, data quality control (QC) becomes mandatory. Different QC tools have been proposed. Without claiming to be exhaustive, we cite some of the most popular: FASTQC [1], FASTX-Toolkit [2], NGS QC Toolkit [3], and PRINSEQ [4]. FASTQC is a widely adopted tool for quality assessment. It provides a set of analyses that assess the raw data from multiple perspectives, together with a GUI that reports all analyses and highlights problems in the data. FASTX-Toolkit is a collection of command-line tools for FASTA/FASTQ file processing that includes some quality statistics as well as tools for quality filtering and trimming. NGS QC Toolkit is a standalone QC tool that includes modules for sequence trimming and statistics calculation, and it can also generate statistics in graphical format. PRINSEQ is a web-based and standalone tool that generates summary statistics of sequence and quality data and filters and trims sequences. The massive number of generated sequences makes QC computationally intensive. Despite that, only some tools implement a multicore processing strategy to deal with this computational challenge. In our opinion, even though these tools offer very useful features, their implementations do not allow large amounts of NGS data to be analyzed efficiently. We believe that QC tasks can be efficiently parallelized on manycore architectures such as GPUs. GPUs are devices equipped with hundreds of cores able to handle thousands of threads simultaneously, so that a very high level of parallelism can be reached.
In this work, we present G-FastQC (GPU Fast Quality Control), a GPU-based tool for quality assessment, filtering, and trimming of NGS data.
Methods G-FastQC has been devised to be massively parallelized on NVIDIA GPUs. It supports single- and paired-end libraries generated with Illumina platforms. G-FastQC implements a set of analyses that perform QC checks on raw data. These analyses assess the data with respect to the quality scores and the content of the sequences. In particular, G-FastQC computes: the average quality values across all bases at each position; the quality score distribution over all sequences; the GC content across the whole length of each sequence; the GC content across all bases; and the sequence content across all bases at each position. To help users analyze data quality, G-FastQC has been integrated with an interactive web-based interface, built using Shiny [5], that plots graphs of the performed analyses. G-FastQC also supports quality filtering and trimming. As for quality filtering, G-FastQC can filter sequences based on the number of low-quality nucleotides and of N nucleotides, as well as on the GC content. It can also mask the nucleotides with a quality score lower than a given threshold. As for trimming, G-FastQC implements an operator based on a sliding window approach. It counts the low-quality nucleotides in a sliding window. When this count is higher than a given threshold, G-FastQC trims all nucleotides from the start of the window to the 3' end. Trimmed sequences shorter than a given length threshold can be automatically discarded. The same sliding window approach has been used to implement an operator that masks these nucleotides rather than trimming them. Unlike the above-mentioned tools, G-FastQC has been designed to analyze several datasets in a single run.
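The sliding-window trimming operator described in the Methods can be illustrated with a minimal sequential sketch in Python. This is not the G-FastQC implementation, which runs on the GPU. The function and parameter names (`sliding_window_trim`, `window`, `min_qual`, `max_low`, `min_len`), their default values, the Phred+33 quality encoding, and the one-base window step are all assumptions made for illustration.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    The Phred+33 offset is an assumption; older Illumina data may use 64.
    """
    return [ord(ch) - offset for ch in quality_line]


def sliding_window_trim(seq, qual, window=4, min_qual=20, max_low=1, min_len=1):
    """Trim a read at the first window holding too many low-quality bases.

    A base is "low quality" when its Phred score is below min_qual.
    The window slides one base at a time from the 5' end. As soon as a
    window contains more than max_low low-quality bases, everything from
    the start of that window to the 3' end is removed.

    Returns (trimmed_seq, trimmed_qual), or None when the trimmed read is
    shorter than min_len and should be discarded.
    """
    scores = phred_scores(qual)
    low = [1 if s < min_qual else 0 for s in scores]

    cut = len(seq)  # default: keep the whole read
    if len(seq) >= window:
        count = sum(low[:window])  # low-quality bases in the first window
        for start in range(len(seq) - window + 1):
            if start > 0:
                # Slide by one base: add the entering base, drop the leaving one.
                count += low[start + window - 1] - low[start - 1]
            if count > max_low:
                cut = start
                break

    if cut < min_len:
        return None
    return seq[:cut], qual[:cut]


if __name__ == "__main__":
    # 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
    print(sliding_window_trim("ACGTACGTAC", "IIIIII##II"))
```

In this sketch each read is processed independently of the others, which is the property that makes the operator a natural fit for one-thread-per-read (or one-block-per-read) execution on a GPU.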
Results Experiments carried out on a 12-core Intel Xeon CPU E5-2667 (2.90 GHz) and an NVIDIA Tesla K20c show that G-FastQC notably outperforms the other tools in terms of computing time. For instance, when filtering low-quality sequences from a dataset of 50M 100 bp reads, it was 21.4x faster than FASTX-Toolkit and 28.3x faster than NGS QC Toolkit. Similar results were obtained when comparing G-FastQC with the other tools in generating the quality reports and in filtering and trimming the raw data.
References
1. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
2. http://hannonlab.cshl.edu/fastx_toolkit
3. Patel RK, Jain M (2012) NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoS ONE, 7(2), e30619.
4. Schmieder R, Edwards R (2011) Quality control and preprocessing of metagenomic datasets. Bioinformatics, 27(6), 863-864.
5. https://shiny.rstudio.com