Genomic Sequence Classification using Probabilistic Topic Modeling
Massimo La Rosa;Antonino Fiannaca;Riccardo Rizzo;Alfonso Urso
2013
Abstract
In this work we introduce a novel alignment-free genomic classification approach based on probabilistic topic modeling. Using a k-mer decomposition of DNA sequences (k-mers are short fragments of length k) and the Latent Dirichlet Allocation (LDA) algorithm, we built a classifier for 16S rRNA bacterial gene sequences. We tested our method with a ten-fold cross-validation procedure on a bacterial dataset of 3000 elements belonging to the most numerous bacterial phyla: Actinobacteria, Firmicutes and Proteobacteria. Our results, in terms of precision scores and for different numbers of topics, range from 100% at class level to 77% at genus level, considering k-mers of length 8. These results demonstrate the effectiveness of our approach. As future work, we are going to tune our methodology to improve classification results at genus level by implementing a consensus mechanism.
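As an illustration of the k-mer decomposition mentioned in the abstract, the following minimal Python sketch turns a DNA sequence into a bag of k-mers, so that each sequence can be treated as a "document" whose "words" are k-mers. It is not the authors' implementation: the function name `kmer_counts`, the overlapping sliding window with step 1, and the skipping of windows that contain ambiguous bases are all assumptions made for the example, and the toy sequence is hypothetical.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int = 8) -> Counter:
    """Count overlapping k-mers in a DNA sequence (sliding window, step 1).

    Assumption: windows containing characters other than A, C, G, T
    (for example the ambiguity code N) are skipped.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


# Hypothetical toy fragment, not taken from the paper's dataset.
toy = "ACGTACGTAC"
doc = kmer_counts(toy, k=8)
```

The resulting per-sequence counts could then be stacked into a document-term matrix and passed to any LDA implementation. How the topic distributions are mapped to taxonomic labels is not specified in the abstract and is left out of this sketch.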