CNR Institutional Research Information System

When Lexicographic Quotations Become a Corpus: To Deduplicate or Not to Deduplicate?

Manuel Favaro; Elisa Guadagnini; Eva Sassolini; Marco Biffi; Simonetta Montemagni

2026

Abstract

Historical dictionaries are increasingly reused as sources for diachronic language corpora. In this context, lexicographic quotations represent a valuable yet challenging type of data, as they are both editorially curated and diachronically representative. A major issue in their computational reuse is the presence of duplicate and near-duplicate quotations. This paper addresses quotation deduplication in corpora derived from lexicographic resources. We introduce QRD (Quotation Reuse Detection), a multi-stage pipeline designed to identify, compare, and cluster quotations based on graded similarity rather than binary matching. The approach combines string-based similarity measures, iterative threshold analysis, and clustering, enabling both quantitative and qualitative investigation of quotation reuse. Our results show that deduplication in this context cannot be reduced to the automatic elimination of redundant data. The variability observed in the quotations, ranging from OCR-related noise to substantial editorial variation, reflects both technical and structural factors and calls for a more nuanced approach. QRD supports the identification of OCR-related errors and reveals patterns of textual reuse underlying the compilation of the dictionary. We argue that quotation deduplication should be conceived primarily as a task of identification and clustering. This perspective reframes deduplication from a data-cleaning operation into an analytical methodology for historically and editorially curated textual resources.
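The three ingredients named in the abstract (graded string similarity, threshold analysis, clustering) can be illustrated with a minimal sketch. This is not the QRD implementation: the abstract does not say which similarity measures or clustering algorithm are used, so the sketch assumes the `difflib` ratio from the Python standard library as a stand-in measure and single-link grouping via union-find. The function names and the toy quotations are invented for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Graded similarity in [0, 1]; difflib's ratio is an assumed stand-in."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(quotations: list[str], threshold: float) -> list[list[int]]:
    """Group quotation indices by single-link clustering (union-find).

    Two quotations share a cluster when a chain of pairs, each with
    similarity >= threshold, connects them.
    """
    parent = list(range(len(quotations)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(quotations)), 2):
        if similarity(quotations[i], quotations[j]) >= threshold:
            parent[find(j)] = find(i)

    groups: dict[int, list[int]] = {}
    for i in range(len(quotations)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def threshold_sweep(quotations: list[str], thresholds: list[float]) -> dict[float, int]:
    """For each threshold, count quotations that land in a non-singleton cluster."""
    return {
        t: sum(len(c) for c in cluster(quotations, t) if len(c) > 1)
        for t in thresholds
    }


# Invented toy quotations, not taken from the paper's data.
QUOTES = [
    "Nel mezzo del cammin di nostra vita",
    "Nel mezzo del cammin di nostra vita",  # exact duplicate
    "Nel mezzo del cammin di noftra vita",  # OCR-like noise
    "Nel mezzo del cammin",                 # truncated variant
    "Amor, ch'a nullo amato amar perdona",  # unrelated quotation
]

if __name__ == "__main__":
    for t, n in threshold_sweep(QUOTES, [1.0, 0.9, 0.7]).items():
        print(f"threshold {t}: {n} quotations in multi-member clusters")
```

Lowering the threshold in this toy example first admits the OCR-like variant and then the truncated one, which is the kind of graded behaviour that a binary exact-match filter would hide. Inspecting which quotations join a cluster at each step, instead of deleting them, matches the abstract's framing of deduplication as identification and clustering.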

Full record (DC)

DC field	Value	Language
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	en
dc.authority.people	Manuel Favaro	en
dc.authority.people	Elisa Guadagnini	en
dc.authority.people	Eva Sassolini	en
dc.authority.people	Marco Biffi	en
dc.authority.people	Simonetta Montemagni	en
dc.collection.id.s	71c7200a-7c5f-4e83-8d57-d3d2ba88f40d	*
dc.collection.name	04.01 Contributo in Atti di convegno	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.date.accessioned	2026/07/03 16:50:13	-
dc.date.available	2026/07/03 16:50:13	-
dc.date.firstsubmission	2026/05/11 14:41:38	*
dc.date.issued	2026	-
dc.date.submission	2026/06/22 12:26:01	*
dc.description.abstracteng	Historical dictionaries are increasingly reused as sources for diachronic language corpora. In this context, lexicographic quotations represent a valuable yet challenging type of data, as they are both editorially curated and diachronically representative. A major issue in their computational reuse is the presence of duplicate and near-duplicate quotations. This paper addresses quotation deduplication in corpora derived from lexicographic resources. We introduce QRD (Quotation Reuse Detection), a multi-stage pipeline designed to identify, compare, and cluster quotations based on graded similarity rather than binary matching. The approach combines string-based similarity measures, iterative threshold analysis, and clustering, enabling both quantitative and qualitative investigation of quotation reuse. Our results show that deduplication in this context cannot be reduced to the automatic elimination of redundant data. The variability observed in the quotations, ranging from OCR-related noise to substantial editorial variation, reflects both technical and structural factors and calls for a more nuanced approach. QRD supports the identification of OCR-related errors and reveals patterns of textual reuse underlying the compilation of the dictionary. We argue that quotation deduplication should be conceived primarily as a task of identification and clustering. This perspective reframes deduplication from a data-cleaning operation into an analytical methodology for historically and editorially curated textual resources.	-
dc.description.allpeople	Favaro, Manuel; Guadagnini, Elisa; Sassolini, Eva; Biffi, Marco; Montemagni, Simonetta	-
dc.description.allpeopleoriginal	Manuel Favaro, Elisa Guadagnini, Eva Sassolini, Marco Biffi, Simonetta Montemagni	en
dc.description.fulltext	open	en
dc.description.international	no	en
dc.description.numberofauthors	5	-
dc.identifier.isbn	9782493814586	en
dc.identifier.source	manual	*
dc.identifier.uri	https://hdl.handle.net/20.500.14243/580324	-
dc.language.iso	eng	en
dc.publisher.name	ELRA Language Resources Association	en
dc.relation.allauthors	Marco Passarotti, Rachele Sprugnoli	en
dc.relation.conferencedate	11 May 2026	en
dc.relation.conferencename	Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2026)	en
dc.relation.conferenceplace	Palma, Mallorca (Spain)	en
dc.relation.ispartofbook	Fourth Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2026) @ LREC 2026	en
dc.relation.medium	ELETTRONICO	en
dc.subject.keywordseng	Historical Corpora, Text Deduplication, Data Matching Process, Historical Lexicography	-
dc.subject.singlekeyword	Historical Corpora	*
dc.subject.singlekeyword	Text Deduplication	*
dc.subject.singlekeyword	Data Matching Process	*
dc.subject.singlekeyword	Historical Lexicography	*
dc.title	When Lexicographic Quotations Become a Corpus: To Deduplicate or Not to Deduplicate?	en
dc.type.circulation	Internazionale	en
dc.type.driver	info:eu-repo/semantics/conferenceObject	-
dc.type.full	04 Contributo in convegno::04.01 Contributo in Atti di convegno	it
dc.type.impactfactor	si	en
dc.type.miur	273	-
dc.type.referee	Esperti anonimi	en
iris.mediafilter.data	2026/07/04 02:29:02	*
iris.orcid.lastModifiedDate	2026/07/03 16:50:13	*
iris.orcid.lastModifiedMillisecond	1783090213500	*
iris.sitodocente.maxattempts	1	-
Appears in types:	04.01 Contributo in Atti di convegno

Files in this product:

File	Size	Format
2026 FavaroEtAlii - Deduplication (LT4HALA LREC2026).pdf (open access; type: publisher's version (PDF); license: Creative Commons)	689.03 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/580324
