Taxonomic classification of genomic sequences is usually based on evolutionary distance obtained by alignment. In this work we introduce a novel alignment-free classification approach based on probabilistic topic modeling. Using a k-mer (small fragments of length k) decomposition of DNA sequences and the Latent Dirichlet Allocation algorithm, we built a classifier for 16S rRNA bacterial gene sequences. We tested our method with a tenfold cross validation procedure considering a bacteria dataset of 3000 elements belonging to the most numerous bacteria phyla: Actinobacteria, Firmicutes and Proteobacteria. Experiments were carried out using complete and 400 bp long 16S sequences, in order to test the robustness of the proposed methodology. Our results, in terms of precision scores and for different number of topics, ranges from 100 %, at class level, to 77 % at genus level, for both full and 400 bp length, considering k-mers of length 8. These results demonstrate the effectiveness of the proposed approach. © 2014 Springer International Publishing Switzerland.

Genomic sequence classification using probabilistic topic modeling

La Rosa Massimo;Fiannaca Antonino;Rizzo Riccardo;
2014

Abstract

Taxonomic classification of genomic sequences is usually based on evolutionary distance obtained by alignment. In this work we introduce a novel alignment-free classification approach based on probabilistic topic modeling. Using a k-mer (small fragments of length k) decomposition of DNA sequences and the Latent Dirichlet Allocation algorithm, we built a classifier for 16S rRNA bacterial gene sequences. We tested our method with a tenfold cross validation procedure considering a bacteria dataset of 3000 elements belonging to the most numerous bacteria phyla: Actinobacteria, Firmicutes and Proteobacteria. Experiments were carried out using complete and 400 bp long 16S sequences, in order to test the robustness of the proposed methodology. Our results, in terms of precision scores and for different number of topics, ranges from 100 %, at class level, to 77 % at genus level, for both full and 400 bp length, considering k-mers of length 8. These results demonstrate the effectiveness of the proposed approach. © 2014 Springer International Publishing Switzerland.
2014
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
9783319090412
16S rRNA
Alignment-free analysis
DNA k-mers
Genomic classification
LDA
Topic modeling
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/275669
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact