Multilingual evaluation of pre-processing for BERT-based sentiment analysis of tweets
Pota M;Esposito M
2021
Abstract
Social media offer a large amount of information that can be exploited in many fields of research. However, while Natural Language Processing methods achieve good results on well-formed datasets of written text with clear syntax, social media text is written in informal language, with unstructured syntax and peculiar symbols; text processing in this setting therefore requires dedicated approaches. This paper addresses the task of sentiment analysis of tweets. In particular, it analyzes the pre-processing of tweets, with the aim of removing the noise introduced by web constructs such as URLs and mentions and by other text fragments, and of exploiting the information carried by symbols such as emoticons, emojis and hashtags. More specifically, a number of experiments using a state-of-the-art classification model (BERT) are designed to evaluate many currently available pre-processing operations for tweets, in terms of the statistical significance of their influence on sentiment analysis performance. Moreover, available data in two languages, English and Italian, are considered in order to evaluate dependence on the language. The results make it possible to identify the most convenient strategy for pre-processing tweets, and thus to improve the state of the art in both languages for the considered task of sentiment analysis.
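As an illustration of the kind of pre-processing operations the abstract refers to, the following is a minimal sketch in Python. It is not the authors' pipeline: the placeholder tokens `<url>` and `<user>`, the tiny emoticon lexicon and the function name `preprocess_tweet` are all assumptions made here for the example, and the abstract does not say which operations turned out to be most effective.

```python
import re

# Hypothetical placeholder tokens; the abstract does not specify any.
URL_TOKEN = "<url>"
USER_TOKEN = "<user>"

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")

# Tiny illustrative emoticon lexicon (an assumption, not taken from the paper).
EMOTICONS = {":)": "happy", ":-)": "happy", ":(": "sad", ":-(": "sad"}


def preprocess_tweet(text: str) -> str:
    """Apply a few common tweet normalization steps before tokenization."""
    # Replace URLs first, so that later patterns never match inside a link.
    text = URL_RE.sub(URL_TOKEN, text)
    # Replace user mentions with a generic token.
    text = MENTION_RE.sub(USER_TOKEN, text)
    # Keep the hashtag word but drop the leading '#'.
    text = HASHTAG_RE.sub(r"\1", text)
    # Map whitespace-separated emoticons to a descriptive word.
    tokens = [EMOTICONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


if __name__ == "__main__":
    print(preprocess_tweet("@anna I love this #sunnyday :) https://t.co/abc"))
```

Each step is independent, so in an experiment of the kind described above individual operations could be switched on or off and their effect on classification compared.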