Metagenomics is the study of genomic sequences in a heterogeneous microbial sample taken, e.g., from soil, water, the human microbiome, and skin. One of the primary objectives of metagenomic studies is to assign a taxonomic identity to each read sequenced from a sample and then to estimate the abundance of the known clades. With ever-increasing metagenomic datasets from high-throughput sequencing technologies now readily available, several fast and accurate methods have been developed that work with reasonable computing requirements. Here we provide an overview of the state-of-the-art methods for the classification of metagenomic sequences, highlighting theoretical factors that seem to correlate well with practical factors and could therefore be useful in the choice or development of a new method in experimental contexts. In particular, we emphasize that the information derived from the known genomes and eventually used in the learning and classification processes may create several experimental issues (mostly related to the amount of information used in the processes and its uniqueness, significance, and redundancy), and that some of these issues are intrinsic to both current alignment-based approaches and compositional ones. This entails the need to develop efficient alignment-free methods that overcome such problems by combining the learning and classification processes in a single framework.

Theoretical and Practical Analyses in Metagenomic Sequence Classification

Davide Verzotto
2019

Istituto di informatica e telematica - IIT
9783030276836
Metagenomic sequence classification
Alignment-free algorithms
Genome analysis
Combinatorics
Pattern discovery
Strings
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/388068