CNR Institutional Research Information System

Ensemble-based short text similarity: An easy approach for multilingual datasets using transformers and WordNet in real-world scenarios

I Gagliardi; MT Artese

2023

Abstract

Integrating data from different sources raises problems of synonymy, differing languages, and concepts of differing granularity. This paper proposes a simple but effective approach to evaluating the semantic similarity of short texts, especially keywords. The method matches keywords from different sources and languages by combining transformers with WordNet-based methods. Key features of the approach include an unsupervised pipeline, mitigation of the lack of context in keywords, scalability to large archives, support for multiple languages, and adaptability to real-world scenarios. The work aims to provide a versatile tool for different cultural heritage archives without requiring complex customization. The objectives of the paper are to explore different approaches to identifying similarities in 1-gram or n-gram tags, to evaluate and compare different pre-trained language models, and to define integrated methods that overcome their limitations. The proposed pipeline was validated through tests on the QueryLab portal, a search engine for cultural heritage archives.

Year: 2023

Organizational units: Istituto di Matematica Applicata e Tecnologie Informatiche - IMATI - Sede Secondaria Milano

Keywords: semantic textual similarity; pretrained language models; transformers; WordNet; QueryLab; ensemble methods

Appears in types: 01.01 Journal article

Files in this item:

File	Size	Format
BDCC-07-00158.pdf (open access; type: publisher's version (PDF); license: Creative Commons)	16.45 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/461926
