CNR Institutional Research Information System

Parallelizing the execution of native data mining algorithms for computational biology

Coro G; Candela L; Pagano P; Italiano A; Liccardo L

2014

Abstract

Data mining is increasingly used in biology. Biologists are adopting prototyping languages, such as R and Matlab, to apply data mining algorithms to their data. As a result, their scripts are becoming increasingly complex and require frequent updates. Application to large datasets becomes impractical and the time-to-paper increases. Furthermore, although various systems can process large datasets efficiently, for example through Cloud and High Performance Computing, they usually require procedures to be translated into specific languages or adapted to a particular computing platform. Such modifications can speed up processing, but translation is not automatic, especially in complex cases, and can require substantial programming effort and careful validation. In this paper, we propose an approach to parallelize data mining procedures, in the form of compiled software or R scripts, developed by biology communities of practice. Our approach requires minimal alteration of the original code, and in many cases none at all. Furthermore, it allows for fast updating when a new version is ready. We clarify the constraints and benefits of our method and report a practical use case that demonstrates these benefits compared with a standard execution. Our approach relies on a distributed network of web services and ultimately exposes the algorithms as-a-Service, to be invoked by remote thin clients.
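The general idea of parallelizing a native procedure without modifying it can be illustrated with a minimal sketch. This is not the authors' system: the abstract describes a distributed network of web services, whereas the sketch below uses local worker threads. It assumes the input records can be processed independently of one another, and that the procedure reads records from standard input and writes results to standard output. The names `NATIVE_COMMAND`, `split_into_chunks`, `run_native`, and `parallel_run` are hypothetical, and a small Python one-liner stands in for what would in practice be a compiled program or an R script.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an unmodified native procedure (in practice this
# could be a compiled binary or an R script launched from the command line).
# It reads one number per line on stdin and prints its square.
NATIVE_COMMAND = [
    sys.executable, "-c",
    "import sys\nfor line in sys.stdin:\n    print(float(line) ** 2)",
]

def split_into_chunks(records, n_chunks):
    """Partition records into at most n_chunks contiguous, near-equal slices."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]

def run_native(chunk, command=NATIVE_COMMAND):
    """Run the untouched command on one chunk, passing data via stdin/stdout."""
    result = subprocess.run(
        command,
        input="\n".join(chunk) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the native procedure exits with an error
    )
    return result.stdout.splitlines()

def parallel_run(records, n_workers=4, command=NATIVE_COMMAND):
    """Process chunks concurrently and merge outputs in the original order."""
    chunks = split_into_chunks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map yields results in the order the chunks were submitted.
        outputs = pool.map(lambda c: run_native(c, command), chunks)
        return [line for out in outputs for line in out]

if __name__ == "__main__":
    print(parallel_run(["1", "2", "3", "4", "5"], n_workers=2))
```

Because the procedure is only ever invoked as an external command, deploying a new version of the script would amount to changing `command`; the splitting and merging logic stays the same. Procedures whose records depend on each other would not fit this pattern without further work.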

Year
2014

Organizational units
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI

Keywords
Data mining
Parallel processing
Cloud computing
Computational biology
Distributed systems
R prototyping language

Appears in types:
01.01 Journal article

Files in this record:

File	Description	Type	Access	Size	Format
prod_293732-doc_109394.pdf	Parallelizing the execution of native data mining algorithms for computational biology	Publisher's version (PDF)	Authorized users only	764.79 kB	Adobe PDF
prod_293732-doc_200359.pdf	Preprint - Parallelizing the execution of native data mining algorithms for computational biology	Publisher's version (PDF)	Open access	425.26 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/260468
