Metagenomic analysis by bacterial short reads classification with deep network model

Fiannaca, A; La Rosa, M; La Paglia, L; Lo Bosco, G; Gaglio, S; Rizzo, R; Urso, A

Motivation: A metagenome represents the genetic composition of the many individual organisms found in a specific environmental sample. Metagenomic analysis makes it possible to characterize the composition of the bacterial community in the selected sample without isolating single bacterial species or using cell cultures. Over the last decades, many studies have reported that microbial communities play a central role in the development of particular pathological conditions in humans, depending on the kind and percentage of bacteria present in a particular district of the human body. Thus, the analysis of microbiome composition acquires diagnostic and prognostic power. The most widely used marker gene for metagenomic analysis is the 16S rRNA gene; it can be divided into 9 hypervariable (V1-V9) and 9 conserved regions. Two next-generation sequencing techniques are mainly used: whole genome shotgun (WGS) and amplicon (AMP) sequencing; the latter is based on the evidence that some hypervariable regions, or a combination of a few of them, are highly informative for phylogenetic studies. For each technique, specific bioinformatics tools have been developed to quickly analyze hundreds of different bacterial species with high sensitivity. Since the two sequencing techniques can be used alternatively, depending on the kind of experiment and the available budget, we propose a deep learning method for bacterial taxonomic classification (down to the genus level) of metagenomic data that can be applied to both types of technology.

Methods: The proposed pipeline classifies short reads coming from both WGS and AMP techniques. In accordance with the Illumina MiSeq v3 sequencer, we consider the V3-V4 sub-regions, which offer a deep assessment of population diversity. Starting from a set of taxonomically pre-classified 16S gene short reads belonging to a mixture of different bacteria, we propose a k-mer representation of these short reads.
This representation defines a coordinate space of 4^k dimensions in which distance measures among genomic sequences can be computed. Moreover, it offers a good trade-off between manageable computational complexity and information content. The k-mer representation is the input of a deep learning network. In this work we adopted a deep belief network (DBN) in order to validate an auto-encoder approach for DNA classification. A DBN is a stack of at least two restricted Boltzmann machines (RBMs), which are used for dimensionality reduction, classification, regression and feature learning. The training of a DBN is composed of two phases. In the first phase, called pre-training, the network performs dimensionality reduction and feature extraction; in the second phase, called fine-tuning, the network is unfolded and trained with back-propagation in a supervised way. The output of the training pipeline is a deep learning model for each taxon, for classification purposes.

Results: To test the proposed pipeline, we used the Grinder tool to simulate both shotgun and amplicon short reads from 1000 full-length 16S sequences belonging to 100 different genera. Then, we searched for the k-mer size that best represents the datasets. Finally, we tested the proposed methodology with a tenfold cross-validation scheme in order to find the best configuration of DBN parameters, and we compared our best results against the classification performance of the reference classifier for bacterial identification, the RDP classifier, which is based on a naive Bayes classifier. Figure 1 shows the results in terms of accuracy, precision, recall and F1. Using k = 7 and 256 hidden units in both RBM layers, we outperformed the RDP classifier at each taxonomic level. For instance, at genus level we reached 91.4% and
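The k-mer representation described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`kmer_index`, `kmer_vector`, `euclidean`) are hypothetical, and the choices of overlapping windows with stride 1, raw counts rather than normalized frequencies, and skipping windows that contain ambiguous bases are assumptions made only for illustration.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_index(k):
    """Map each of the 4**k possible k-mers to a fixed vector position."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def kmer_vector(read, k, index=None):
    """Count overlapping k-mers in a read (assumed: stride 1, raw counts).

    Windows containing symbols outside ACGT (e.g. N) are skipped,
    which is an assumption of this sketch.
    """
    if index is None:
        index = kmer_index(k)
    vec = [0] * len(index)
    read = read.upper()
    for start in range(len(read) - k + 1):
        pos = index.get(read[start:start + k])
        if pos is not None:
            vec[pos] += 1
    return vec


def euclidean(u, v):
    """Euclidean distance between two k-mer vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


# Two toy reads mapped into the same 4**2 = 16-dimensional space.
index = kmer_index(2)
v1 = kmer_vector("ACGTAC", 2, index)
v2 = kmer_vector("ACGTTT", 2, index)
distance = euclidean(v1, v2)
```

With k = 7, the value reported in the Results, each read would map to a vector of 4^7 = 16384 entries, which would then serve as the input layer of the network. Building the index once and reusing it across reads avoids regenerating all 4^k k-mers for every read.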

CNR Institutional Research Information System