Building a Machine Learning Classifier for Synonyms Validation in Moroccan Darija

Nahli O. (Supervision); 2025

Abstract

Building lexical resources for low-resource languages, such as Arabic dialects, remains a challenging yet essential endeavor. One major difficulty lies in the reliable identification of appropriate synonyms, which requires both rich lexical data and robust machine learning techniques. This study presents a synset classification framework for Darija (Moroccan Arabic), leveraging contextual embeddings derived from multiple Transformer-based language models to capture the semantic richness of the dialect. In addition to contextual similarity, we automatically extract lexical and ontological similarity features. These features are combined and used as input to supervised classification algorithms. Several classifiers were evaluated, including Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting. The models were trained to predict the most appropriate WordNet synset for each Darija word, with performance assessed through k-fold cross-validation. Experimental results confirm the effectiveness of the proposed approach, with the best-performing model achieving an accuracy of 73.28% and an F1-score of 84.81%, underscoring the potential of Transformer-based embeddings in advancing lexical resource development for under-resourced languages.
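The abstract names two ingredients that a small worked example can make concrete: cosine similarity between embeddings as a contextual similarity feature, and the accuracy and F1 metrics used to report results. The sketch below is illustrative only and is not the authors' implementation. The toy vectors, the labels, and the function names `cosine_similarity` and `accuracy_and_f1` are all hypothetical. It also assumes a binary labelling (1 = candidate synset accepted, 0 = rejected) with F1 computed on the positive class, which the abstract does not state. Under that assumption, it shows how an F1-score can exceed accuracy when positive examples dominate.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors score 0
    return dot / (norm_u * norm_v)


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and positive-class F1 for binary labels (1 = accept, 0 = reject)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = (2 * tp / denom) if denom else 0.0
    return accuracy, f1


# Hypothetical 3-dimensional "embeddings" for a Darija word and a synset gloss.
word_vec = [1.0, 0.0, 1.0]
synset_vec = [1.0, 1.0, 0.0]
sim = cosine_similarity(word_vec, synset_vec)

# Hypothetical gold labels and predictions: four positives, two negatives,
# and a classifier that accepts every candidate.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)

print(f"similarity={sim:.3f} accuracy={acc:.3f} f1={f1:.3f}")
```

In this toy case the classifier gets four of six decisions right (accuracy about 0.667), while the positive-class F1 is 0.8, because F1 ignores true negatives and the two errors are both false positives.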
DC field Value Language
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Belbachir S. en
dc.authority.people Nahli O. en
dc.authority.people El Mohajir M. en
dc.authority.people Chahhou M. en
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Contributo in Atti di convegno *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.area Non assegn *
dc.date.accessioned 2026/03/03 17:01:47 -
dc.date.available 2026/03/03 17:01:47 -
dc.date.firstsubmission 2026/02/03 11:37:46 *
dc.date.issued 2025 -
dc.date.submission 2026/02/03 11:37:46 *
dc.description.allpeople Belbachir, S.; Nahli, O.; El Mohajir, M.; Chahhou, M. -
dc.description.allpeopleoriginal Belbachir S.; Nahli O.; El Mohajir M.; Chahhou M. en
dc.description.fulltext restricted en
dc.description.numberofauthors 4 -
dc.identifier.doi 10.1109/CiSt65886.2025.11224302 en
dc.identifier.scopus 2-s2.0-105024969055 en
dc.identifier.source scopus *
dc.identifier.uri https://hdl.handle.net/20.500.14243/566029 -
dc.language.iso eng en
dc.relation.firstpage 80 en
dc.relation.ispartofbook 8th IEEE Congress on Information Science and Technology en
dc.relation.lastpage 87 en
dc.relation.numberofpages 8 en
dc.subject.keywordseng Princeton WordNet (PWN); Darija; Natural Language Processing (NLP); Logistic Regression (LR); SUMO; Semantic Similarity; Cosine Similarity; Machine Learning -
dc.subject.singlekeyword Princeton WordNet (PWN) *
dc.subject.singlekeyword Darija *
dc.subject.singlekeyword Natural Language Processing (NLP) *
dc.subject.singlekeyword Logistic Regression (LR) *
dc.subject.singlekeyword SUMO *
dc.subject.singlekeyword Semantic Similarity *
dc.subject.singlekeyword Cosine Similarity *
dc.subject.singlekeyword Machine Learning *
dc.title Building a Machine Learning Classifier for Synonyms Validation in Moroccan Darija en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Contributo in convegno::04.01 Contributo in Atti di convegno it
dc.type.miur 273 -
iris.mediafilter.data 2026/03/04 02:52:05 *
iris.orcid.lastModifiedDate 2026/03/03 17:01:47 *
iris.orcid.lastModifiedMillisecond 1772553707435 *
iris.scopus.extIssued 2025 -
iris.scopus.extTitle Building a Machine Learning Classifier for Synonyms Validation in Moroccan Darija -
iris.sitodocente.maxattempts 1 -
iris.unpaywall.doi 10.1109/cist65886.2025.11224302 *
iris.unpaywall.isoa false *
iris.unpaywall.metadataCallLastModified 04/03/2026 04:33:51 -
iris.unpaywall.metadataCallLastModifiedMillisecond 1772595231207 -
iris.unpaywall.oastatus closed *
scopus.category 1711 *
scopus.category 1706 *
scopus.category 1803 *
scopus.category 1802 *
scopus.contributor.affiliation New Technology Trends for Innovation Laboratory -
scopus.contributor.affiliation Istituto di Linguistica Computazionale -
scopus.contributor.affiliation New Technology Trends for Innovation Laboratory -
scopus.contributor.affiliation New Technology Trends for Innovation Laboratory -
scopus.contributor.afid 60025506 -
scopus.contributor.afid 60021199 -
scopus.contributor.afid 60025506 -
scopus.contributor.afid 60025506 -
scopus.contributor.auid 60241555200 -
scopus.contributor.auid 56741333300 -
scopus.contributor.auid 60017115600 -
scopus.contributor.auid 36801152800 -
scopus.contributor.country Morocco -
scopus.contributor.country Italy -
scopus.contributor.country Morocco -
scopus.contributor.country Morocco -
scopus.contributor.dptid 104292580 -
scopus.contributor.dptid -
scopus.contributor.dptid 104292580 -
scopus.contributor.dptid 104292580 -
scopus.contributor.name Said -
scopus.contributor.name Ouafae -
scopus.contributor.name Mohammed -
scopus.contributor.name Mohamed -
scopus.contributor.subaffiliation Faculty of Sciences;Abdelmalek Essaadi University; -
scopus.contributor.subaffiliation Consiglio Nazionale Delle Ricerche; -
scopus.contributor.subaffiliation Faculty of Sciences;Abdelmalek Essaadi University; -
scopus.contributor.subaffiliation Faculty of Sciences;Abdelmalek Essaadi University; -
scopus.contributor.surname Belbachir -
scopus.contributor.surname Nahli -
scopus.contributor.surname El Mohajir -
scopus.contributor.surname Chahhou -
scopus.date.issued 2025 *
scopus.description.allpeopleoriginal Belbachir S.; Nahli O.; El Mohajir M.; Chahhou M. *
scopus.differences scopus.publisher.name *
scopus.differences scopus.subject.keywords *
scopus.differences scopus.relation.conferencedate *
scopus.differences scopus.relation.conferencename *
scopus.differences scopus.identifier.isbn *
scopus.differences scopus.relation.conferenceplace *
scopus.document.type cp *
scopus.document.types cp *
scopus.funding.funders 501100000780 - European Commission; *
scopus.funding.ids GA 101086252; *
scopus.identifier.doi 10.1109/CiSt65886.2025.11224302 *
scopus.identifier.eissn 2327-1884 *
scopus.identifier.isbn 9798331543846 *
scopus.identifier.pui 649555984 *
scopus.identifier.scopus 2-s2.0-105024969055 *
scopus.journal.sourceid 21100400809 *
scopus.language.iso eng *
scopus.publisher.name Institute of Electrical and Electronics Engineers Inc. *
scopus.relation.conferencedate 2025 *
scopus.relation.conferencename 8th IEEE International Congress on Information Science and Technology, CiSt 2025 *
scopus.relation.conferenceplace mar *
scopus.relation.firstpage 80 *
scopus.relation.lastpage 87 *
scopus.subject.keywords Cosine Similarity; Darija; Logistic Regression (LR); Machine Learning; Natural Language Processing (NLP); Princeton WordNet (PWN); Semantic Similarity; SUMO; *
scopus.title Building a Machine Learning Classifier for Synonyms Validation in Moroccan Darija *
scopus.titleeng Building a Machine Learning Classifier for Synonyms Validation in Moroccan Darija *
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/566029