Theoretical and Practical Analyses in Metagenomic Sequence Classification
Davide Verzotto
2019
Abstract
Metagenomics is the study of genomic sequences in a heterogeneous microbial sample taken from, e.g., soil, water, the human microbiome, or skin. One of the primary objectives of metagenomic studies is to assign a taxonomic identity to each read sequenced from a sample and then to estimate the abundance of the known clades. With ever-larger metagenomic datasets from high-throughput sequencing technologies now readily available, several fast and accurate methods have been developed that work with reasonable computing requirements. Here we provide an overview of the state-of-the-art methods for the classification of metagenomic sequences, highlighting theoretical factors that seem to correlate well with practical factors and could therefore be useful in the choice or development of a new method in experimental contexts. In particular, we emphasize that the information derived from the known genomes and eventually used in the learning and classification processes may create several experimental issues, mostly related to the amount of information used in the processes and its uniqueness, significance, and redundancy, and that some of these issues are intrinsic to both current alignment-based approaches and compositional ones. This entails the need to develop efficient alignment-free methods that overcome such problems by combining the learning and classification processes in a single framework.
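To make the two objectives named in the abstract concrete (assigning a clade to each read, then estimating clade abundance), the following is a minimal sketch of one possible alignment-free scheme. It is not the method of this paper or of any surveyed tool. It assumes that reference genomes are given as plain strings keyed by clade name, and that only k-mers occurring in exactly one clade are allowed to vote, which is one simple way to address the uniqueness and redundancy of reference information. All function names and the choice of k are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Learning step (assumed design): map each k-mer to the clades containing it."""
    index = defaultdict(set)
    for clade, genome in references.items():
        for kmer in kmers(genome, k):
            index[kmer].add(clade)
    return index


def classify_read(read, index, k):
    """Classification step: let only clade-unique k-mers vote.

    Returns the clade with the most votes, or None when no unique k-mer
    of the read is found in the index. Ties go to the clade seen first.
    """
    votes = Counter()
    for kmer in kmers(read, k):
        clades = index.get(kmer, ())
        if len(clades) == 1:  # skip k-mers shared by several clades
            votes[next(iter(clades))] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def estimate_abundance(reads, index, k):
    """Relative abundance of each clade among the classified reads."""
    labels = (classify_read(read, index, k) for read in reads)
    counts = Counter(label for label in labels if label is not None)
    total = sum(counts.values())
    return {clade: n / total for clade, n in counts.items()} if total else {}
```

On toy data, `build_index({"clade_A": "ACGTACGTTTGACCA", "clade_B": "GGCATTCGGATCCGA"}, 4)` yields an index with which `classify_read("ACGTTTGA", index, 4)` returns `"clade_A"`. Real classifiers work on far larger references and typically use compressed or hashed k-mer structures, so this sketch only illustrates the overall flow.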