Alignment-free methods are routinely used in largescale, gene-independent phylogeny reconstruction. Such methods measure the similarity of two genomes by comparing the frequency of all their distinct substrings of length k. In this paper we apply logic data mining methods to discover a minimal subset of k-mers whose frequency information is sufficient to reliably classify bacterial genomes into the corresponding taxa. Specifically, we extract separating, disjunctive normal form logic formulas, predicated on the discretized relative frequencies of few selected k-mers in the genomes. Such formulas are derived using a combination of feature selection, integer programming and adhoc heuristics. Interestingly, we reliably classify strain genomes at multiple taxonomic levels using extremely compact formulas, each involving just few k-mers. Classification performance is promising, suggesting that the phylogenetic signal of each class is strong enough and that our discretization and feature selection approach is effective and robust in identifying it.

Classifying bacterial genomes with compact logic formulas on k-mer frequencies

Felici G
2014

Abstract

Alignment-free methods are routinely used in largescale, gene-independent phylogeny reconstruction. Such methods measure the similarity of two genomes by comparing the frequency of all their distinct substrings of length k. In this paper we apply logic data mining methods to discover a minimal subset of k-mers whose frequency information is sufficient to reliably classify bacterial genomes into the corresponding taxa. Specifically, we extract separating, disjunctive normal form logic formulas, predicated on the discretized relative frequencies of few selected k-mers in the genomes. Such formulas are derived using a combination of feature selection, integer programming and adhoc heuristics. Interestingly, we reliably classify strain genomes at multiple taxonomic levels using extremely compact formulas, each involving just few k-mers. Classification performance is promising, suggesting that the phylogenetic signal of each class is strong enough and that our discretization and feature selection approach is effective and robust in identifying it.
2014
Istituto di Analisi dei Sistemi ed Informatica ''Antonio Ruberti'' - IASI
bioinformatics data mining feature selection genomics integer programming microorganisms pattern classification
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/297169
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 5
  • ???jsp.display-item.citation.isi??? ND
social impact