Technology and species independent simulation of sequencing data and genomic variants

Geraci F
2019

Abstract

Highly accurate genotyping is essential both for genomic projects that aim to understand the etiology of diseases and for routine screening of patients. For this reason, genotyping software packages are subject to a strict validation process that requires a large amount of sequencing data annotated with accurate genotype information. In vitro assessment of genotyping is a long, complex and expensive activity that also depends on the specific variation and locus, so it is impractical for validating in silico genotyping algorithms. In this scenario, sequencing simulation has emerged as a practical alternative. Simulators must keep up with the continuous improvement of the different sequencing technologies, producing datasets that are as indistinguishable from real ones as possible. Moreover, they must be able to mimic as many types of genomic variants as possible. In this paper we describe OmniSim, a simulator whose ultimate goal is to be suitable for all possible application scenarios. To fulfill this goal, OmniSim uses an abstract model in which variations are read from a .vcf file and mapped into edit operations (insertion, deletion, substitution) on the reference genome. Technological parameters (e.g. error distributions, read length and per-base quality) are learned from real data. As a result of combining the abstract model with the parameter learning module, OmniSim is able to output data similar in all aspects to those produced in a real sequencing experiment. The source code of OmniSim is freely available at https://gitlab.com/geraci/omnisim.
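As an illustration only, the idea of mapping variants read from a .vcf file into edit operations on the reference might be sketched as follows. This is not OmniSim's code: the names EditOp, parse_vcf and apply_edits are hypothetical, and the sketch assumes one ALT allele per record, non-overlapping variants and a single reference sequence.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EditOp:
    """One edit on the reference (hypothetical structure, not OmniSim's)."""
    kind: str  # "substitution", "insertion" or "deletion"
    pos: int   # 0-based offset of the REF allele in the reference
    ref: str   # allele present in the reference
    alt: str   # allele that replaces it


def parse_vcf(text: str) -> List[EditOp]:
    """Turn the data lines of a VCF text into edit operations."""
    edits = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.split("\t")
        pos, ref, alt = int(fields[1]) - 1, fields[3], fields[4]
        # Classify by allele length (assumption: a single ALT allele).
        if len(ref) == len(alt):
            kind = "substitution"
        elif len(ref) < len(alt):
            kind = "insertion"
        else:
            kind = "deletion"
        edits.append(EditOp(kind, pos, ref, alt))
    return edits


def apply_edits(reference: str, edits: List[EditOp]) -> str:
    """Apply the edits to the reference, assuming they do not overlap."""
    seq = reference
    # Work from right to left so that earlier offsets stay valid.
    for op in sorted(edits, key=lambda e: e.pos, reverse=True):
        if seq[op.pos:op.pos + len(op.ref)] != op.ref:
            raise ValueError(f"REF allele mismatch at position {op.pos + 1}")
        seq = seq[:op.pos] + op.alt + seq[op.pos + len(op.ref):]
    return seq


# Toy input: one substitution, one insertion and one deletion.
toy_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t2\t.\tC\tT",
    "chr1\t4\t.\tT\tTGG",
    "chr1\t6\t.\tCG\tC",
])
toy_edits = parse_vcf(toy_vcf)
mutated = apply_edits("ACGTACGT", toy_edits)
```

In this toy case the three records are classified as a substitution, an insertion and a deletion, and the mutated sequence is the one from which simulated reads would then be drawn. Applying the edits from right to left is a design choice of the sketch that avoids recomputing offsets after each length-changing edit.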
Istituto di informatica e telematica - IIT
NGS sequencing
simulation
genomic variants

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/363464