Clustering protein structures with Hadoop

G Paschina;L Roverelli;D D'Agostino;F Chiappori;I Merelli
2016

Abstract

Machine learning is widely used in structural biology, because large conformational ensembles originating from single protein structures (e.g. derived from NMR experiments or molecular dynamics simulations) can be analysed by partitioning the original dataset into meaningful subsets that reveal important structural and dynamic behaviours. Clustering is a well-suited unsupervised approach for these ensembles, as it identifies stable conformations and the driving characteristics shared by different structures. A common problem of applications that implement protein clustering is performance scalability, in particular when loading the data into memory. In this work we show how Hadoop can be used to improve the parallel performance of the GROMOS clustering algorithm. The preliminary results support the validity of this approach and suggest directions for future development in this field.
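For readers unfamiliar with the GROMOS clustering algorithm named in the abstract, the following is a minimal single-process sketch of the procedure as it is commonly described: count the neighbours of every structure within an RMSD cutoff, take the structure with the most neighbours as a cluster centre, remove it together with its neighbours, and repeat on what is left. It is not the authors' Hadoop implementation, which this record does not describe. The function name `gromos_cluster`, the toy matrix `rmsd`, and the cutoff value are illustrative assumptions, and the pairwise RMSD matrix is assumed to be precomputed.

```python
from typing import List, Sequence, Tuple


def gromos_cluster(
    dist: Sequence[Sequence[float]], cutoff: float
) -> List[Tuple[int, List[int]]]:
    """Greedy neighbour-count clustering on a precomputed distance matrix.

    Returns (centre_index, member_indices) pairs in order of extraction.
    Hypothetical helper for illustration only.
    """
    remaining = set(range(len(dist)))
    clusters: List[Tuple[int, List[int]]] = []
    while remaining:
        best_centre, best_members = -1, []
        for i in sorted(remaining):
            # Neighbours of i among the structures not yet assigned
            # (i itself is included, since dist[i][i] == 0).
            members = [j for j in sorted(remaining) if dist[i][j] <= cutoff]
            # Strict '>' means ties go to the lowest index.
            if len(members) > len(best_members):
                best_centre, best_members = i, members
        clusters.append((best_centre, best_members))
        remaining -= set(best_members)
    return clusters


# Toy symmetric RMSD matrix for five structures (made-up values).
rmsd = [
    [0.0, 0.1, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.3, 0.9, 0.9],
    [0.1, 0.3, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.1],
    [0.8, 0.9, 0.9, 0.1, 0.0],
]
result = gromos_cluster(rmsd, cutoff=0.2)
print(result)
```

In this sketch the full pairwise matrix must be held in memory and grows quadratically with the number of structures, which gives a sense of why data loading becomes the scalability concern mentioned in the abstract.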
Istituto di Matematica Applicata e Tecnologie Informatiche - IMATI
Istituto di Tecnologie Biomediche - ITB
English
Angelini C., Rancoita P., Rovetta S.
Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2015
Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB)
141
153
978-3-319-44332-4
http://link.springer.com/chapter/10.1007/978-3-319-44332-4_11
Springer International Publishing
Switzerland
Yes, but type not specified
10-12 September 2015
Naples, Italy
Hadoop Clustering
protein structures
Molecular dynamics
Data parallel
5
restricted
Paschina, G; Roverelli, L; D'Agostino, D; Chiappori, F; Merelli, I
273
info:eu-repo/semantics/conferenceObject
04 Conference contribution::04.01 Contribution in conference proceedings
   Methods for Integrated analysis of Multiple Omics datasets
   MIMOMICS
   FP7
   305280
Files in this record:
prod_357543-doc_130978.pdf (authorized users only)
Description: Clustering Protein Structures with Hadoop
Type: Publisher's version (PDF)
Size: 3.05 MB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/320436