CNR Institutional Research Information System

Fulgor: a fast and compact k-mer index for large-scale matching and color queries

Fan J; Khan J; Pratap Singh N; Pibiri GE; Patro R

2024

Abstract

The problem of sequence identification or matching--determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence--is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe an efficient colored de Bruijn graph index, arising as the combination of a k-mer dictionary with a compressed inverted index. The proposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph are monochromatic (i.e., all k-mers in a unitig have the same set of references of origin, or color). Specifically, the unitigs are kept in the dictionary in color order, thereby allowing for the encoding of the map from k-mers to their colors in as little as 1 + o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for integer lists, the index achieves very small space. We implement these methods in a tool called Fulgor, and conduct an extensive experimental analysis to demonstrate the improvement of our tool over previous solutions. For example, compared to Themisto--the strongest competitor in terms of index space vs. query time trade-off--Fulgor requires significantly less space (up to 43% less space for a collection of 150,000 Salmonella enterica genomes), is at least twice as fast for color queries, and is 2-6× faster to construct.
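The color-ordering idea in the abstract can be illustrated with a small toy sketch. It is not Fulgor's implementation: the function names, the input layout, and the example data are all hypothetical, the k-mer dictionary is not modeled (unitigs are addressed directly by position), and a plain prefix-sum array stands in for the succinct rank structure that would be needed to actually reach 1 + o(1) bits per unitig. The sketch only shows why keeping unitigs in color order lets a single bit per unitig recover the color of each unitig.

```python
from itertools import accumulate

def build_color_map(unitigs):
    """unitigs: list of (sequence, set of reference ids). Hypothetical input layout."""
    # Give each distinct color (set of references) a dense id, in order of first appearance.
    color_ids = {}
    for _, refs in unitigs:
        color_ids.setdefault(frozenset(refs), len(color_ids))
    # Reorder unitigs so that unitigs sharing a color are contiguous (color order).
    ordered = sorted(unitigs, key=lambda u: color_ids[frozenset(u[1])])
    ids = [color_ids[frozenset(refs)] for _, refs in ordered]
    # One bit per unitig: 1 marks the first unitig of a new color class.
    bits = [int(i > 0 and ids[i] != ids[i - 1]) for i in range(len(ids))]
    # Color id -> sorted list of reference ids (dicts preserve insertion order).
    colors = [sorted(c) for c in color_ids]
    sequences = [seq for seq, _ in ordered]
    return sequences, bits, colors

def make_rank(bits):
    # rank[i] = number of set bits in bits[0..i]; stands in for a succinct rank structure.
    return list(accumulate(bits))

def color_of_unitig(rank, colors, unitig_id):
    # The color id of a unitig is the number of class boundaries at or before it.
    return colors[rank[unitig_id]]

# Made-up example data: four unitigs over three references (ids 0, 1, 2).
toy_unitigs = [
    ("ACGTA", {0, 1}),
    ("TTGCA", {2}),
    ("GGATC", {0, 1}),
    ("CCATG", {1, 2}),
]
sequences, bits, colors = build_color_map(toy_unitigs)
rank = make_rank(bits)
for unitig_id, seq in enumerate(sequences):
    print(unitig_id, seq, color_of_unitig(rank, colors, unitig_id))
```

In this toy version each distinct color is stored once in `colors`, and the only per-unitig information is the bit in `bits`. In a real index the lists in `colors` would additionally be compressed, as the abstract notes for integer lists.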

Year: 2024

Organizational units: Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI

Keywords: k-mers; Colored compacted de Bruijn graph; Compression; Read-mapping

Publication type: 01.01 Journal article

Files in this record:

File	Size	Format
prod_492117-doc_205297.pdf (open access; description: Fulgor: a fast and compact k-mer index for large-scale matching and color queries; type: publisher's version (PDF); license: Creative Commons)	2.43 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/454752
