
Genomic Sequence Classification using Probabilistic Topic Modeling

Massimo La Rosa;Antonino Fiannaca;Riccardo Rizzo;Alfonso Urso
2013

Abstract

In this work we introduce a novel alignment-free genomic classification approach based on probabilistic topic modeling. Using a k-mer decomposition of DNA sequences (k-mers are short fragments of length k) and the LDA algorithm, we built a classifier for 16S rRNA bacterial gene sequences. We tested our method with a ten-fold cross-validation procedure on a bacterial dataset of 3000 elements belonging to the most numerous bacterial phyla: Actinobacteria, Firmicutes and Proteobacteria. Our results, in terms of precision scores and for different numbers of topics, range from 100% at the class level to 77% at the genus level, considering k-mers of length 8. These results demonstrate the effectiveness of our approach; as future work, we plan to tune our methodology to improve classification results at the genus level by implementing a consensus mechanism.
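The k-mer decomposition mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `kmer_counts` and `to_document`, the use of overlapping windows with a stride of 1, and the skipping of windows that contain symbols other than A, C, G, T are all assumptions made for illustration. Only the k-mer length of 8 comes from the abstract.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def kmer_counts(sequence: str, k: int = 8) -> Counter:
    """Count overlapping k-mers in a DNA sequence.

    Assumptions (not stated in the abstract): windows overlap with
    stride 1, and windows containing non-ACGT symbols are skipped.
    """
    seq = sequence.upper()
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= VALID_BASES:
            counts[kmer] += 1
    return counts


def to_document(sequence: str, k: int = 8) -> str:
    """Render a sequence as a space-separated 'document' of k-mer 'words',
    the kind of input a topic model such as LDA typically consumes."""
    seq = sequence.upper()
    words = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return " ".join(w for w in words if set(w) <= VALID_BASES)
```

Under these assumptions each sequence becomes a bag of k-mer "words", so a collection of sequences can be handed to any LDA implementation as a document-term count matrix.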
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
English
Enrico Formenti, Roberto Tagliaferri, Ernst Wit
Computational Intelligence Methods for Bioinformatics and Biostatistics
CIBB 2013, Tenth International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics
49
61
12
978-88-906437-2-9
Yes, but type not specified
June 20-22, 2013
Valrose Castle, Nice, France.
Genomic classification
Alignment-free analysis
16S rRNA
DNA k-mers
Topic modeling
LDA
4
none
273
info:eu-repo/semantics/conferenceObject
04 Conference contribution::04.01 Contribution in conference proceedings


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/278030
Citations
  • PMC ND
  • Scopus 8
  • ISI 4