MSA-PAD: Novel DNA Multiple Sequence Alignment Guided by PFAM Conserved Domains

Bachir Balech;Saverio Vicario;Graziano Pesole
2015

Abstract

Multiple sequence alignment (MSA) is one of the oldest problems in bioinformatics and represents a key step in sequence analysis applications such as phylogenetic inference and comparative genomics. Progressive alignment methods are among the most widely used approaches for DNA multiple alignment (e.g. Edgar, 2004; Katoh, 2002). Without the use of protein information, these methods simply construct a guide tree inferred from the DNA sequences and then build the alignment following the tree topology (Zhan, 2015). Algorithms such as translatorX (Abascal, et al., 2010) and tranalign (EMBOSS; Rice, et al., 2000) take advantage of the higher conservation of proteins relative to their coding sequences and use the protein alignment to guide the reconstruction of accurate MSAs at the nucleotide level. However, these methods do not make use of the information embedded in protein domains. Moreover, when applied to genomic sequences, such approaches account neither for intron occurrence nor for gene order variations (e.g. in mitochondrial genomes; D'Onorio de Meo, et al., 2012), which result in apparently truncated protein domains or in variations in their arrangement along genomic regions. In addition, a protein-guided DNA multiple alignment requires curated DNA sequences that are free of stop codons when translated into amino acids. Here we present a DNA multiple sequence alignment framework (MSA-PAD) that resolves the problem of premature stop codons occurring in conceptual translations of DNA sequences from next-generation sequencing platforms, which may contain errors.
MSA-PAD translates DNA sequences into amino acids (based on a user-defined genetic code and reading frame/s), uses information from conserved PFAM domains (Finn, et al., 2014) to assign the translated sequences to known protein domains, accounts for frameshifts when domain regions are split by introns, performs a domain-based protein alignment and then uses the protein alignment information to generate the corresponding nucleotide multiple alignment. MSA-PAD has two alignment strategies: (i) Gene and (ii) Genome. Gene mode respects the order of domains from 5' to 3' and resolves the alignment of repetitive domains even when they are repeated in tandem. Genome mode provides a supergene-like alignment that ignores domain order constraints, thereby accounting for genomic rearrangements.
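
The last step described above, using a protein alignment to place gaps in the nucleotide sequences, can be illustrated with a minimal Python sketch. This is not MSA-PAD's implementation. The function names `translate` and `thread_codons`, the use of the standard genetic code, and the toy sequences are assumptions made for illustration, and the PFAM domain assignment and frameshift handling are left out entirely.

```python
# Standard genetic code (NCBI table 1), assumed here for illustration;
# MSA-PAD lets the user choose the genetic code.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate DNA in the given forward frame (0, 1 or 2); '*' marks a stop."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Codons with ambiguous bases are reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def thread_codons(aligned_protein: str, dna: str, gap: str = "-") -> str:
    """Copy the gaps of an aligned protein onto its coding DNA.

    Each residue consumes one codon; each protein gap becomes three
    nucleotide gaps, so the codon structure is preserved.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    residues = sum(1 for char in aligned_protein if char != gap)
    if residues > len(codons):
        raise ValueError("protein has more residues than the DNA has codons")
    remaining = iter(codons)
    return "".join(
        gap * 3 if char == gap else next(remaining) for char in aligned_protein
    )


# Toy input: two coding sequences and a hand-made protein alignment.
dna_seqs = {"seq1": "ATGGCTAAA", "seq2": "ATGAAA"}
protein_alignment = {"seq1": "MAK", "seq2": "M-K"}

for name, aligned in protein_alignment.items():
    # The ungapped aligned protein must match the translation of the DNA.
    assert aligned.replace("-", "") == translate(dna_seqs[name])
    print(name, thread_codons(aligned, dna_seqs[name]))
```

Counting `'*'` characters in the output of `translate` for each frame is one simple way to spot the premature stop codons mentioned in the abstract; how MSA-PAD itself resolves them is not specified here.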
Multiple Sequence Alignment
Molecular Biodiversity
Protein Domains

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/379774