In this paper we address the Pure Parsimony Haplotyping problem: Given a set of genotypes G, find a minimum cardinality set of haplotypes which explains G. A genotype g is an n-dimensional vector in {0, 1, 2} and a haplotype h is an n-dimensional binary vector. A set of haplotypes H is said to explain G if for every g 2 G there are h1, h2 2 H such that h1 h2 = g. The position-wise sum h1 h2 indicates the genotype which has a 2 in the positions where h1 and h2 disagree, and the same value as h1 and h2 where they agree. In this paper we show the APX-hardness of the problem even when the number of "2" symbols is at most 3 for every g 2 G. We then give two Integer Linear Programming formulations. From the first one (appeared also in [9]) we derive a 2k-1-approximation algorithm when the number of symbols "2" is at most k for every g 2 G. This formulation has an exponential number of variables and constraints. The second formulation is new, and has a polynomial number of variables and constraints. Finally, we give approximation algorithms not based on Linear Programming, such as a trivial p|G|-approximation algorithm for the general case, and an effective probabilistic algorithm with a performance guarantee of 2k+1blog |G|c (1 + dln |G|e) when the number of symbols "2" is at most k for every g 2 G. The expected running time of the algorithm turns out to be almost linear in the input size. 1 Introduction A Single Nucleotide Polymorphism (SNP) is a site of the human genome (i.e., the posi- tion of a specific nucleotide), whose content shows a statistically significant variability within a population. A position is considered a SNP if for the minority of the pop- ulation (as long as it consists of at least 5% of the population) a certain nucleotide

Haplotyping populations by pure parsinomy: complexity, exact and approximation algorithms

2004

Abstract

In this paper we address the Pure Parsimony Haplotyping problem: Given a set of genotypes G, find a minimum cardinality set of haplotypes which explains G. A genotype g is an n-dimensional vector in {0, 1, 2} and a haplotype h is an n-dimensional binary vector. A set of haplotypes H is said to explain G if for every g 2 G there are h1, h2 2 H such that h1 h2 = g. The position-wise sum h1 h2 indicates the genotype which has a 2 in the positions where h1 and h2 disagree, and the same value as h1 and h2 where they agree. In this paper we show the APX-hardness of the problem even when the number of "2" symbols is at most 3 for every g 2 G. We then give two Integer Linear Programming formulations. From the first one (appeared also in [9]) we derive a 2k-1-approximation algorithm when the number of symbols "2" is at most k for every g 2 G. This formulation has an exponential number of variables and constraints. The second formulation is new, and has a polynomial number of variables and constraints. Finally, we give approximation algorithms not based on Linear Programming, such as a trivial p|G|-approximation algorithm for the general case, and an effective probabilistic algorithm with a performance guarantee of 2k+1blog |G|c (1 + dln |G|e) when the number of symbols "2" is at most k for every g 2 G. The expected running time of the algorithm turns out to be almost linear in the input size. 1 Introduction A Single Nucleotide Polymorphism (SNP) is a site of the human genome (i.e., the posi- tion of a specific nucleotide), whose content shows a statistically significant variability within a population. A position is considered a SNP if for the minority of the pop- ulation (as long as it consists of at least 5% of the population) a certain nucleotide
2004
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Distributed Systems
File in questo prodotto:
File Dimensione Formato  
prod_44124-doc_124914.pdf

solo utenti autorizzati

Descrizione: Haplotyping populations by pure parsinomy: complexity, exact and approximation algorithms
Tipologia: Versione Editoriale (PDF)
Dimensione 223.93 kB
Formato Adobe PDF
223.93 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/39987
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact