The number of domains affected by the big data phenomenon is constantly increasing, both in science and industry, with high-throughput DNA sequencers being among the most massive data producers. Building analysis frameworks that can keep up with such a high production rate, however, is only part of the problem: current challenges include dealing with articulated data repositories where objects are connected by multiple relationships, managing complex processing pipelines where each step depends on a large number of configuration parameters and ensuring reproducibility, error control and usability by non-technical staff. Here we describe an automated infrastructure built to address the above issues in the context of the analysis of the data produced by the CRS4 next-generation sequencing facility. The system integrates open source tools, either written by us or publicly available, into a framework that can handle the whole data transformation process, from raw sequencer output to primary analysis results.

An automated infrastructure to support high-throughput bioinformatics

Angius A;
2014

Abstract

The number of domains affected by the big data phenomenon is constantly increasing, both in science and industry, with high-throughput DNA sequencers being among the most massive data producers. Building analysis frameworks that can keep up with such a high production rate, however, is only part of the problem: current challenges include dealing with articulated data repositories where objects are connected by multiple relationships, managing complex processing pipelines where each step depends on a large number of configuration parameters and ensuring reproducibility, error control and usability by non-technical staff. Here we describe an automated infrastructure built to address the above issues in the context of the analysis of the data produced by the CRS4 next-generation sequencing facility. The system integrates open source tools, either written by us or publicly available, into a framework that can handle the whole data transformation process, from raw sequencer output to primary analysis results.
2014
Istituto di Ricerca Genetica e Biomedica - IRGB
Bioinformatics
MapReduce
NGS
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/245211
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 7
  • ???jsp.display-item.citation.isi??? ND
social impact