Galliz at GeoLingIt: enhancing BERT with vocabulary knowledge for predicting the region of language varieties of Italy

Gallo S
2023

Abstract

The linguistic diversity of the Italian peninsula and its islands, characterized by several language varieties, represents a linguistic condition and a cultural treasure unique in Europe. However, the oral nature of these varieties poses a challenge to their preservation in written form. While significant research effort has been dedicated to processing standard Italian, less attention has been given to the language varieties of Italy and to the development of supporting resources. This paper studies the peculiarities of the language varieties of Italy and aims to identify the region of origin of tweets written in non-Standard Italian varieties. To achieve this goal, we used two main techniques: fine-tuning a language model (BERT) and implementing an algorithm that uses dictionaries of regional varieties and word frequency. Our results show that integrating lexical analysis with BERT could be a promising approach for this task. We present an overview of the data, methodology, and evaluation results, then discuss the implications of our findings.
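As a rough illustration of what "integrating lexical analysis with BERT" could look like, the sketch below scores a tweet against per-region word lists and blends those scores with classifier probabilities. It is not the paper's algorithm: the toy lexicon, its weights, the function names `lexical_scores` and `combine`, the mixing weight `alpha`, and the linear blending rule are all assumptions made for this example, and the classifier probabilities are passed in as a plain dictionary rather than computed by a fine-tuned model.

```python
from collections import Counter

# Hypothetical toy lexicon: region -> {word: weight}. The weights are
# invented for illustration and are not taken from the paper's dictionaries.
REGION_LEXICON = {
    "Lazio": {"daje": 0.9, "mo": 0.4},
    "Campania": {"jamme": 0.9, "mo": 0.3},
    "Veneto": {"xe": 0.9},
}


def lexical_scores(text, lexicon=REGION_LEXICON):
    """Return a normalized per-region score from weighted word counts."""
    counts = Counter(text.lower().split())
    raw = {
        region: sum(weight * counts[word] for word, weight in words.items())
        for region, words in lexicon.items()
    }
    total = sum(raw.values())
    if total == 0:
        # No dictionary word matched: fall back to a uniform distribution.
        return {region: 1.0 / len(lexicon) for region in lexicon}
    return {region: score / total for region, score in raw.items()}


def combine(model_probs, lex_scores, alpha=0.5):
    """Blend classifier probabilities with lexical scores; return the top region.

    alpha is an assumed mixing weight (1.0 means classifier only).
    """
    mixed = {
        region: alpha * prob + (1 - alpha) * lex_scores.get(region, 0.0)
        for region, prob in model_probs.items()
    }
    return max(mixed, key=mixed.get)
```

For example, given hypothetical classifier probabilities `{"Lazio": 0.3, "Campania": 0.5, "Veneto": 0.2}` and the text `"daje roma daje"`, the lexical evidence moves the prediction from Campania to Lazio. When no dictionary word matches, the uniform fallback leaves the classifier's ranking unchanged.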
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Natural Language Processing
Language varieties
Tweets classification
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/451787