Building a Machine Learning Classifier for Synonyms Validation in Moroccan Darija

Nahli O. (Supervision)
2025

Abstract

Building lexical resources for low-resource languages, such as Arabic dialects, remains a challenging yet essential endeavor. One major difficulty lies in the reliable identification of appropriate synonyms, which requires both rich lexical data and robust machine learning techniques. This study presents a synset classification framework for Darija (Moroccan Arabic), leveraging contextual embeddings derived from multiple Transformer-based language models to capture the semantic richness of the dialect. In addition to contextual similarity, we automatically extract lexical and ontological similarity features. These features are combined and used as input to supervised classification algorithms. Several classifiers were evaluated, including Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting. The models were trained to predict the most appropriate WordNet synset for each Darija word, with performance assessed through k-fold cross-validation. Experimental results confirm the effectiveness of the proposed approach, with the best-performing model achieving an accuracy of 73.28% and an F1-score of 84.81%, underscoring the potential of Transformer-based embeddings in advancing lexical resource development for under-resourced languages.
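The steps named in the abstract (a similarity feature, a classifier, k-fold evaluation, accuracy and F1) can be illustrated with a minimal standard-library sketch. Everything here is an assumption made for illustration and not the study's implementation: the function names `cosine_similarity`, `k_fold_indices`, and `accuracy_and_f1` are hypothetical, the labels are invented toy values, and no trained classifier is included.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and binary F1, treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels (invented): 1 = valid synonym candidate, 0 = invalid.
toy_true = [1, 1, 1, 0]
toy_pred = [1, 1, 1, 1]
toy_accuracy, toy_f1 = accuracy_and_f1(toy_true, toy_pred)
```

On these toy labels the accuracy is 0.75 and the F1-score is 6/7 (about 0.857). F1 can exceed accuracy in this way when positive examples dominate, because F1 does not count true negatives.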
Istituto di linguistica computazionale "Antonio Zampolli" - ILC
Princeton WordNet (PWN); Darija; Natural Language Processing (NLP); Logistic Regression (LR); SUMO; Semantic Similarity; Cosine Similarity; Machine Learning
File: Building_a_Machine_Learning_Classifier_for_Synonyms_Validation_in_Moroccan_Darija.pdf (Adobe PDF, 1.16 MB). License: not public, restricted access for authorized users only.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/566029