Evaluating the efficiency and effectiveness of learned sparse retrieval with the lsr_benchmark

Rulli Cosimo; Nardini Franco Maria
2026

Abstract

Learned sparse retrieval (LSR) models exhibit varying trade-offs between effectiveness and efficiency. While standard tools exist for evaluating LSR effectiveness, none exist for evaluating efficiency. Moreover, datasets with high-quality relevance judgments are too large for repeated efficiency experiments, e.g., on different hardware configurations. To promote the evaluation of LSR models in terms of both effectiveness and efficiency, we introduce the lsr_benchmark, which measures retrieval efficiency at each step of an LSR pipeline (document embedding, indexing, query embedding, and retrieval) as well as the pipeline’s overall effectiveness. To ensure tractability and extensibility, we apply current corpus subsampling methods to eleven TREC tasks, precompute embeddings with eleven LSR models per task, and evaluate eight retrieval engines as baselines. For the benchmark’s hosted version, a modular API, along with tools for evaluating effectiveness and efficiency, facilitates the submission of new approaches. Our experiments show that the chosen embedding model significantly affects the efficiency of a retrieval engine and that LSR is more effective but less efficient than BM25, an efficiency gap that our benchmark now tracks as new LSR models are published.
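
The four pipeline steps named in the abstract can be illustrated with a minimal sketch that times each stage separately. This is not the lsr_benchmark API: the functions `embed`, `build_index`, `retrieve`, `timed`, and `run_pipeline` are hypothetical names, and the "encoder" is a toy term-count stand-in rather than a real LSR model. It only shows the idea of per-stage wall-clock measurement, assuming documents and queries are represented as sparse term-to-weight maps scored by dot product.

```python
import time
from collections import defaultdict


def embed(text):
    """Toy stand-in for an LSR encoder (assumption): term -> raw term count."""
    weights = defaultdict(float)
    for term in text.lower().split():
        weights[term] += 1.0
    return dict(weights)


def build_index(doc_vectors):
    """Build an inverted index: term -> list of (doc_id, weight)."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def retrieve(index, query_vec, k=10):
    """Score documents by the dot product of query and document weights."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    # Sort by descending score, breaking ties by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


def timed(stage, timings, fn, *args):
    """Run fn(*args) and record its wall-clock time in seconds under `stage`."""
    start = time.perf_counter()
    result = fn(*args)
    timings[stage] = time.perf_counter() - start
    return result


def run_pipeline(docs, query, k=10):
    """Run all four stages and return the ranking plus per-stage timings."""
    timings = {}
    doc_vectors = timed("document_embedding", timings,
                        lambda: {d: embed(t) for d, t in docs.items()})
    index = timed("indexing", timings, build_index, doc_vectors)
    query_vec = timed("query_embedding", timings, embed, query)
    ranking = timed("retrieval", timings, retrieve, index, query_vec, k)
    return ranking, timings


# Hypothetical toy corpus and query, for illustration only.
docs = {
    "d1": "sparse retrieval with learned term weights",
    "d2": "dense retrieval with vector search",
    "d3": "cooking pasta at home",
}
ranking, timings = run_pipeline(docs, "learned sparse retrieval")
for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.6f} s")
```

Keeping the timings per stage, rather than one end-to-end number, is what lets a change in the embedding model be separated from a change in the retrieval engine.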
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
9783032213204
9783032213211
Green IR Evaluation
Learned Sparse Retrieval
Neural IR
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/583856