CNR Institutional Research Information System

Sparse and skew hashing of K-mers

Pibiri GE

2022

Abstract

MOTIVATION: A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and supports membership queries. This data structure is at the heart of many important tasks in computational biology. High-throughput sequencing of DNA can produce very large k-mer sets, on the order of billions of strings; in such cases, the memory consumption and query efficiency of the data structure are a concrete challenge.

RESULTS: To tackle this problem, we describe a compressed and associative dictionary for k-mers, that is, a data structure where strings are represented in compact form and each of them is associated with a unique integer identifier in the range [0,n). We show that some statistical properties of k-mer minimizers can be exploited by minimal perfect hashing to substantially improve the space/time trade-off of the dictionary compared to the best-known solutions.

AVAILABILITY AND IMPLEMENTATION: https://github.com/jermp/sshash.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2022
Organizational units: Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Language(s): English
Journal: BIOINFORMATICS (OXF., ONLINE)
Web of Science ID: WOS:000817250400027
Volume: 38
Issue: 1
First page: i185
Last page: i194
DOI: https://dx.doi.org/10.1093/bioinformatics/btac245
Scopus ID: 2-s2.0-85132963462
URL: https://academic.oup.com/bioinformatics/article/38/Supplement_1/i185/6617506?login=true
Peer reviewed: Yes, but type not specified
Keywords: Bioinformatics; Time trade-off method
Other information: Supplementary material is available in open access here: https://github.com/jermp/sshash#datasets
Number of authors: 1
Type: info:eu-repo/semantics/article
MIUR login type: 262
All authors: Pibiri, Ge
Type: 01 Journal contribution::01.01 Journal article
Full text: open
Project identifier:
Project title: Labs for prototyping future Mobility Data sharing cloud solutions
Acronym: MobiDataLab
Funding: H2020
Contract no.: 101006879
Appears in types: 01.01 Journal article

Files in this item:

File	Description	Type	Access	Size	Format
prod_468874-doc_189656.pdf	Sparse and skew hashing of K-mers	Publisher's version (PDF)	Open access	527.75 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/414813
