The output of a high throughput next generation sequencing (NGS) machine is a collection of short reads, which have to be properly assembled in order to reconstruct the original DNA sequence of the analyzed organism (Metzker, 2010; Earl, 2011). The DNA sequence assembly process is based on aligning and merging these reads for effectively reconstructing the real primary structure of the DNA sample sequence or reference genome. The use of NGS machines results in much larger sets of reads to be assembled, posing new problems for computer scientists and bioinformaticians. In particular, a relevant issue is related with the trade-off between the precision of the assembly process and its computational time, stating the need for faster methods that can keep pace with the speed and volume of reads that are generated with NGS. An important step in DNA assembly is the identification of a subset of read pairs that have a high probability of being aligned sequentially in the reconstruction. Such step is often referred to as filtering, and amounts in selecting a significantly smaller subset of the initial set of read pairs (whose dimension is quadratic in the number of initial reads) that can be then processed by an alignment algorithm, usually quite time consuming. The desired effect of filtering is then to quickly filter out from the candidate set of read pairs those that would not provide a good alignment in the following phase. The computation cost of filtering should then be balanced by the speed-up obtained when a smaller set of read pairs is considered for alignment. In this work we propose and test the use of alignment free distances to evaluate the similarity between two short reads as a technique for filtering good read pairs to be assembled. The method operates in constant time in the string length and is tested in its ability to emulate, with a proper level of precision, much more time consuming methods to evaluate the similarity between short DNA sequences, such as the established Needleman-Wunsch edit distance (Needleman, Wunsch, and Christian, 1970), often used in the final step of the assembly procedure. These preliminary experiments show the efficacy of this approach for filtering the promising read pairs - eligible candidates to successfully assemble the entire genome of a given organism. Therefore, the alignment free reads filtering may significantly accelerate the assembly process without a substantial loss in accuracy of the DNA sample sequence reconstruction.

Filtering with alignment free distances for high throughput DNA reads assembly

Giovanni Felici;Daniele Santoni;Emanuel Weitschek
2012

Abstract

The output of a high throughput next generation sequencing (NGS) machine is a collection of short reads, which have to be properly assembled in order to reconstruct the original DNA sequence of the analyzed organism (Metzker, 2010; Earl, 2011). The DNA sequence assembly process is based on aligning and merging these reads for effectively reconstructing the real primary structure of the DNA sample sequence or reference genome. The use of NGS machines results in much larger sets of reads to be assembled, posing new problems for computer scientists and bioinformaticians. In particular, a relevant issue is related with the trade-off between the precision of the assembly process and its computational time, stating the need for faster methods that can keep pace with the speed and volume of reads that are generated with NGS. An important step in DNA assembly is the identification of a subset of read pairs that have a high probability of being aligned sequentially in the reconstruction. Such step is often referred to as filtering, and amounts in selecting a significantly smaller subset of the initial set of read pairs (whose dimension is quadratic in the number of initial reads) that can be then processed by an alignment algorithm, usually quite time consuming. The desired effect of filtering is then to quickly filter out from the candidate set of read pairs those that would not provide a good alignment in the following phase. The computation cost of filtering should then be balanced by the speed-up obtained when a smaller set of read pairs is considered for alignment. In this work we propose and test the use of alignment free distances to evaluate the similarity between two short reads as a technique for filtering good read pairs to be assembled. The method operates in constant time in the string length and is tested in its ability to emulate, with a proper level of precision, much more time consuming methods to evaluate the similarity between short DNA sequences, such as the established Needleman-Wunsch edit distance (Needleman, Wunsch, and Christian, 1970), often used in the final step of the assembly procedure. These preliminary experiments show the efficacy of this approach for filtering the promising read pairs - eligible candidates to successfully assemble the entire genome of a given organism. Therefore, the alignment free reads filtering may significantly accelerate the assembly process without a substantial loss in accuracy of the DNA sample sequence reconstruction.
2012
Istituto di Analisi dei Sistemi ed Informatica ''Antonio Ruberti'' - IASI
alignment free
dna assembly
sequencing
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/18848
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact