Multilingual evaluation of pre-processing for BERT-based sentiment analysis of tweets

Pota M; Esposito M
2021

Abstract

Social media offer a large amount of information that can be exploited in many fields of research. However, while Natural Language Processing methods achieve good results on well-formed datasets of written text with clear syntax, social media text is written in informal language, with unstructured syntax and peculiar symbols; text processing in this setting therefore requires dedicated approaches. This paper addresses sentiment analysis of tweets. In particular, it analyzes the pre-processing of tweets, with the aims of avoiding the noise introduced by web constructs such as URLs and mentions and by other text fragments, and of exploiting the information hidden in symbols such as emoticons, emojis, and hashtags. More specifically, a number of experiments, performed with a state-of-the-art classification model (BERT), are designed to evaluate many currently available tweet pre-processing operations in terms of the statistical significance of their influence on sentiment analysis performance. Moreover, available data in two languages, English and Italian, are considered in order to also evaluate the dependence on language. The results make it possible to identify the most convenient strategy for pre-processing tweets, and thus to improve the state of the art in both languages for the considered sentiment analysis task.
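To make the kind of pre-processing operations named in the abstract concrete, the following is a minimal sketch in Python. It is not the authors' pipeline: the placeholder tokens `<url>` and `<user>`, the tiny `EMOTICONS` lexicon, and the helper names `split_hashtag` and `preprocess` are all assumptions chosen for illustration, and emoji handling is omitted.

```python
import re

# Patterns for the web constructs mentioned in the abstract.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")

# Hypothetical, deliberately tiny emoticon lexicon (an assumption,
# not a resource used in the paper).
EMOTICONS = {":)": "happy", ":-)": "happy", ":(": "sad", ":-(": "sad"}


def split_hashtag(tag: str) -> str:
    """Insert a space at each lowercase-to-uppercase boundary (CamelCase)."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", tag)


def preprocess(tweet: str) -> str:
    """Apply a few illustrative pre-processing operations to one tweet."""
    text = URL_RE.sub("<url>", tweet)        # replace URLs with a placeholder
    text = MENTION_RE.sub("<user>", text)    # replace mentions with a placeholder
    # Drop the '#' and split CamelCase hashtags into words.
    text = HASHTAG_RE.sub(lambda m: split_hashtag(m.group(1)), text)
    # Map whitespace-separated emoticons to a word describing them.
    tokens = [EMOTICONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


if __name__ == "__main__":
    print(preprocess("@anna I love #BestDayEver :) https://t.co/abc"))
```

Each operation here is a separate step, so individual steps can be switched on or off to compare their effect on a downstream classifier, which is the kind of comparison the abstract describes.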
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
sentiment analysis
pre-processing
Twitter
English
Italian

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/429201
Citations
  • Scopus 64