Genomic Sequence Classification using Probabilistic Topic Modeling

Massimo La Rosa;Antonino Fiannaca;Riccardo Rizzo;Alfonso Urso
2013

Abstract

In this work we introduce a novel alignment-free genomic classification approach based on probabilistic topic modeling. Using a k-mer decomposition of DNA sequences (k-mers are short fragments of length k) and the Latent Dirichlet Allocation (LDA) algorithm, we built a classifier for 16S rRNA bacterial gene sequences. We tested our method with a ten-fold cross-validation procedure on a bacterial dataset of 3000 sequences belonging to the most numerous bacterial phyla: Actinobacteria, Firmicutes and Proteobacteria. Our results, expressed as precision scores for different numbers of topics, range from 100% at class level to 77% at genus level, using k-mers of length 8. These results demonstrate the effectiveness of our approach. As future work, we will tune our methodology to improve classification results at genus level by implementing a consensus mechanism.
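
The k-mer decomposition mentioned in the abstract can be illustrated with a minimal sketch. It assumes overlapping k-mers extracted with a sliding window of step 1 and skips windows containing ambiguous bases; neither choice is stated in the abstract, and the function name `kmer_counts` and the toy sequence are hypothetical.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int = 8) -> Counter:
    """Count overlapping k-mers in a DNA sequence (sliding window, step 1).

    Assumption: windows with characters other than A, C, G, T are skipped.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


# Toy sequence for illustration only, not taken from the paper's dataset.
toy = "ACGTACGTAC"
counts = kmer_counts(toy, k=4)
```

Under this reading, each sequence becomes a bag of k-mer "words", which is the kind of word-count input that a topic model such as LDA typically expects.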
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
978-88-906437-2-9
Genomic classification
Alignment-free analysis
16S rRNA
DNA k-mers
Topic modeling
LDA

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/278030