OBJECTIVES: In this paper, an alignment-free method for DNA barcode classification that is based on both a spectral representation and a neural gas network for unsupervised clustering is proposed. METHODS: In the proposed methodology, distinctive words are identified from a spectral representation of DNA sequences. A taxonomic classification of the DNA sequence is then performed using the sequence signature, i.e., the smallest set of k-mers that can assign a DNA sequence to its proper taxonomic category. Experiments were then performed to compare our method with other supervised machine learning classification algorithms, such as support vector machine, random forest, ripper, naïve Bayes, ridor, and classification tree, which also consider short DNA sequence fragments of 200 and 300 base pairs (bp). The experimental tests were conducted over 10 real barcode datasets belonging to different animal species, which were provided by the on-line resource "Barcode of Life Database". RESULTS: The experimental results showed that our k-mer-based approach is directly comparable, in terms of accuracy, recall and precision metrics, with the other classifiers when considering full-length sequences. In addition, we demonstrate the robustness of our method when a classification is performed task with a set of short DNA sequences that were randomly extracted from the original data. For example, the proposed method can reach the accuracy of 64.8% at the species level with 200-bp fragments. Under the same conditions, the best other classifier (random forest) reaches the accuracy of 20.9%. CONCLUSIONS: Our results indicate that we obtained a clear improvement over the other classifiers for the study of short DNA barcode sequence fragments.

A k-mer-based barcode DNA classification methodology based on spectral representation and a neural gas network

Fiannaca Antonino;La Rosa Massimo;Rizzo Riccardo;Urso Alfonso
2015

Abstract

OBJECTIVES: In this paper, an alignment-free method for DNA barcode classification that is based on both a spectral representation and a neural gas network for unsupervised clustering is proposed. METHODS: In the proposed methodology, distinctive words are identified from a spectral representation of DNA sequences. A taxonomic classification of the DNA sequence is then performed using the sequence signature, i.e., the smallest set of k-mers that can assign a DNA sequence to its proper taxonomic category. Experiments were then performed to compare our method with other supervised machine learning classification algorithms, such as support vector machine, random forest, ripper, naïve Bayes, ridor, and classification tree, which also consider short DNA sequence fragments of 200 and 300 base pairs (bp). The experimental tests were conducted over 10 real barcode datasets belonging to different animal species, which were provided by the on-line resource "Barcode of Life Database". RESULTS: The experimental results showed that our k-mer-based approach is directly comparable, in terms of accuracy, recall and precision metrics, with the other classifiers when considering full-length sequences. In addition, we demonstrate the robustness of our method when a classification is performed task with a set of short DNA sequences that were randomly extracted from the original data. For example, the proposed method can reach the accuracy of 64.8% at the species level with 200-bp fragments. Under the same conditions, the best other classifier (random forest) reaches the accuracy of 20.9%. CONCLUSIONS: Our results indicate that we obtained a clear improvement over the other classifiers for the study of short DNA barcode sequence fragments.
2015
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
Alignment-free analysis
DNA barcode classification
k-Mer representation
Neural gas
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/300847
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? 20
social impact