LAF Barcoding: classifying DNA Barcode multi-locus sequences with feature vectors and supervised approaches
Emanuel Weitschek;Giulia Fiscon;Paola Bertolazzi;Giovanni Felici
2015
Abstract
DNA barcodes (one or more very short gene sequences) have proven effective for classifying a specimen to species level. In the plant and fungus kingdoms, this task requires multi-locus DNA barcode data together with suitable sequence analysis techniques, which poses new challenges. In this work, we describe LAF-BARCODING, a Logic Alignment Free (LAF) technique that counts the occurrences of fixed-length substrings (k-mers) in the input sequences, represents the counts as feature vectors, and classifies them with a rule-based approach in order to assign multi-locus DNA barcode sequences to their corresponding species. We use LAF to classify several sets of DNA barcode sequences from the plant and fungus kingdoms, obtaining compact and meaningful classification models (if-then rules) with high accuracy rates. In contrast to the widespread alignment-based methods (e.g., character, tree, and similarity methods), we highlight that LAF can be successfully applied to multi-locus DNA barcode sequences.
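The k-mer counting and rule-based steps described in the abstract can be illustrated with a minimal sketch. It is not the LAF-BARCODING implementation: the function names, the concatenation of per-locus vectors as a way to combine loci, and the single threshold rule are all assumptions made for illustration, and the rule is hand-written here rather than learned from data.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_vocabulary(k):
    """All 4**k possible k-mers over A, C, G, T in lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def kmer_count_vector(sequence, k):
    """Count overlapping k-mers; windows with non-ACGT symbols are skipped."""
    vocabulary = kmer_vocabulary(k)
    counts = dict.fromkeys(vocabulary, 0)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    return [counts[kmer] for kmer in vocabulary]


def multilocus_vector(loci, k):
    """Concatenate per-locus count vectors (assumed way to combine loci)."""
    vector = []
    for seq in loci:
        vector.extend(kmer_count_vector(seq, k))
    return vector


def apply_rule(vector, k, kmer, threshold, species, default):
    """Hypothetical if-then rule: if count(kmer) >= threshold then species."""
    index = kmer_vocabulary(k).index(kmer)
    return species if vector[index] >= threshold else default


if __name__ == "__main__":
    vec = kmer_count_vector("ACGTAC", 2)
    print(vec)
    print(apply_rule(vec, 2, "AC", 2, "species_A", "species_B"))
```

In this sketch each sequence becomes a fixed-length vector of 4^k counts regardless of its length, which is what lets a classifier operate without aligning the sequences. A rule of the form "if count(AC) >= 2 then species_A" mirrors the if-then shape of the models mentioned in the abstract, with the k-mer, threshold, and species labels chosen arbitrarily here.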