A wide class of approximate pattern matching algorithms are based on a filtration phase in which spaced seeds are used to discard regions where a match is not likely to occur. The problem of determining the "optimal" shape of a spaced seed in a particular setting is known to be a hard one: in practice spaced seeds are chosen using heuristics or considering a restricted family of seeds with "reasonably good" performances. In this paper we consider the family of spaced seeds with a periodic structure. Such seeds have been already proven valuable both as a theoretical tool and in bioinformatics applications. We show that known combinatorial objects, namely Difference Sets and Families and Steiner Systems, naturally lead to the design of lossless periodic seeds for approximate pattern matching with k=2 and k=3 mismatches. We analyze in depth the properties of the resulting seeds obtaining insights also on seeds without a periodic structure. The results of the analysis are then used to guide an experimental evaluation of the effectiveness of periodic seeds for pattern lengths of practical interest. Our results give a complete picture of strengths and limitations of periodic seeds, and can be used by practitioners for the design of effective approximate pattern matching algorithms.
Design and analysis of periodic multiple seeds
Manzini G
2014
Abstract
A wide class of approximate pattern matching algorithms are based on a filtration phase in which spaced seeds are used to discard regions where a match is not likely to occur. The problem of determining the "optimal" shape of a spaced seed in a particular setting is known to be a hard one: in practice spaced seeds are chosen using heuristics or considering a restricted family of seeds with "reasonably good" performances. In this paper we consider the family of spaced seeds with a periodic structure. Such seeds have been already proven valuable both as a theoretical tool and in bioinformatics applications. We show that known combinatorial objects, namely Difference Sets and Families and Steiner Systems, naturally lead to the design of lossless periodic seeds for approximate pattern matching with k=2 and k=3 mismatches. We analyze in depth the properties of the resulting seeds obtaining insights also on seeds without a periodic structure. The results of the analysis are then used to guide an experimental evaluation of the effectiveness of periodic seeds for pattern lengths of practical interest. Our results give a complete picture of strengths and limitations of periodic seeds, and can be used by practitioners for the design of effective approximate pattern matching algorithms.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.