Building a Machine Learning Classifier for Synonyms Validation in Moroccan Darija

Belbachir, S.; Nahli, O.; El Mohajir, M.; Chahhou, M.

doi:10.1109/CiSt65886.2025.11224302

Building lexical resources for low-resource languages, such as Arabic dialects, remains a challenging yet essential endeavor. One major difficulty lies in reliably identifying appropriate synonyms, which requires both rich lexical data and robust machine learning techniques. This study presents a synset classification framework for Darija (Moroccan Arabic) that leverages contextual embeddings derived from multiple Transformer-based language models to capture the semantic richness of the dialect. In addition to contextual similarity, we automatically extract lexical and ontological similarity features. These features are combined and used as input to supervised classification algorithms. Several classifiers were evaluated, including Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting. The models were trained to predict the most appropriate WordNet synset for each Darija word, and performance was assessed through k-fold cross-validation. Experimental results confirm the effectiveness of the proposed approach: the best-performing model achieves an accuracy of 73.28% and an F1-score of 84.81%. These results underscore the potential of Transformer-based embeddings in advancing lexical resource development for under-resourced languages.
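The evaluation protocol described above (combined similarity features, a supervised classifier, k-fold cross-validation, accuracy and F1) can be illustrated with a minimal, dependency-free sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the task is framed as binary validation of (Darija word, candidate synset) pairs, the three feature values per pair are made-up numbers, and a simple threshold on the mean feature value stands in for the classifiers named in the abstract. The function names `k_fold_indices`, `accuracy_f1`, `fit_threshold`, and `cross_validate` are hypothetical.

```python
from statistics import mean
from typing import Iterator, List, Sequence, Tuple


def k_fold_indices(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def accuracy_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return acc, f1


def fit_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Stand-in 'classifier': pick the score cut-off with the best training accuracy."""
    best_t, best_acc = 0.0, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        acc = sum(1 for p, y in zip(preds, labels) if p == y) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def cross_validate(features: Sequence[Sequence[float]],
                   labels: Sequence[int], k: int) -> Tuple[float, float]:
    """Average accuracy and F1 over k folds, using the mean feature value as the score."""
    scores = [mean(row) for row in features]  # combine the features (assumed: simple mean)
    accs, f1s = [], []
    for train, test in k_fold_indices(len(labels), k):
        t = fit_threshold([scores[i] for i in train], [labels[i] for i in train])
        preds = [1 if scores[i] >= t else 0 for i in test]
        acc, f1 = accuracy_f1([labels[i] for i in test], preds)
        accs.append(acc)
        f1s.append(f1)
    return mean(accs), mean(f1s)


# Made-up (contextual, lexical, ontological) similarities for six candidate pairs.
toy_features = [
    (0.90, 0.80, 0.70), (0.20, 0.10, 0.30), (0.85, 0.60, 0.90),
    (0.30, 0.20, 0.10), (0.75, 0.70, 0.80), (0.10, 0.30, 0.20),
]
toy_labels = [1, 0, 1, 0, 1, 0]  # 1 = valid synset for the word (hypothetical labels)

avg_acc, avg_f1 = cross_validate(toy_features, toy_labels, k=3)
print(f"mean accuracy={avg_acc:.2f}, mean F1={avg_f1:.2f}")
```

In practice the threshold rule would be replaced by the learned models the abstract lists, and the folds would typically be shuffled or stratified. The sketch only shows how per-fold accuracy and F1 are computed and averaged.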