Increasingly, the web produces massive volumes of texts, alone or associated with images, videos, photographs, together with some metadata, indispensable for their finding and retrieval. Keywords/keyphrases that characterize the semantic content of documents should be, automatically or manually, extracted, and/or associated with them. The paper presents a novel method to address the problem of the automatic unsupervised extraction of keywords/phrases from texts, expressed both in English and in Italian. The main feature of this approach is the integration of two methods that have given interesting results: word embedding models, such as Word2Vec or GloVe able to capture the semantics of words and their context, and clustering algorithms, able to identify the essence of the terms and choose the more significant one(s), to represent the contents of a text. In the paper, the datasets used are presented, together with the method implemented and the results obtained. These results will be discussed, commented, and compared with those obtained in previous experimentations, using TextRank, Rapid Automatic Keyword Extraction (RAKE), and TF-IDF.

Semantic unsupervised automatic keyphrases extraction by integrating word embedding with clustering methods

I. Gagliardi
;
M. T. Artese
2020

Abstract

Increasingly, the web produces massive volumes of texts, alone or associated with images, videos, photographs, together with some metadata, indispensable for their finding and retrieval. Keywords/keyphrases that characterize the semantic content of documents should be, automatically or manually, extracted, and/or associated with them. The paper presents a novel method to address the problem of the automatic unsupervised extraction of keywords/phrases from texts, expressed both in English and in Italian. The main feature of this approach is the integration of two methods that have given interesting results: word embedding models, such as Word2Vec or GloVe able to capture the semantics of words and their context, and clustering algorithms, able to identify the essence of the terms and choose the more significant one(s), to represent the contents of a text. In the paper, the datasets used are presented, together with the method implemented and the results obtained. These results will be discussed, commented, and compared with those obtained in previous experimentations, using TextRank, Rapid Automatic Keyword Extraction (RAKE), and TF-IDF.
2020
Istituto di Matematica Applicata e Tecnologie Informatiche - IMATI - Sede Secondaria Milano
unsupervised automatic keyword extraction
word embedding models
Italian datasets
clustering algorithms
glove
w2v
information retrieval
File in questo prodotto:
File Dimensione Formato  
prod_433565-doc_154813.pdf

accesso aperto

Descrizione: Semantic Unsupervised Automatic Keyphrases Extraction by Integrating Word Embedding with Clustering Methods
Tipologia: Versione Editoriale (PDF)
Licenza: Creative commons
Dimensione 3.07 MB
Formato Adobe PDF
3.07 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/382142
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 20
  • ???jsp.display-item.citation.isi??? 11
social impact