CNR Institutional Research Information System

Building a Machine Learning Classifier for Synonyms Validation in Moroccan Darija

Belbachir S. (Methodology); Nahli O. (Supervision); El Mohajir M. (Supervision); Chahhou M. (Conceptualization)

2025

Abstract

Building lexical resources for low-resource languages, such as Arabic dialects, remains a challenging yet essential endeavor. One major difficulty lies in the reliable identification of appropriate synonyms, which requires both rich lexical data and robust machine learning techniques. This study presents a synset classification framework for Darija (Moroccan Arabic), leveraging contextual embeddings derived from multiple Transformer-based language models to capture the semantic richness of the dialect. In addition to contextual similarity, we automatically extract lexical and ontological similarity features. These features are combined and used as input to supervised classification algorithms. Several classifiers were evaluated, including Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting. The models were trained to predict the most appropriate WordNet synset for each Darija word, with performance assessed through k-fold cross-validation. Experimental results confirm the effectiveness of the proposed approach, with the best-performing model achieving an accuracy of 73.28% and an F1-score of 84.81%, underscoring the potential of Transformer-based embeddings in advancing lexical resource development for under-resourced languages.
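The abstract mentions three ingredients that can be illustrated generically: a cosine-similarity feature between embedding vectors, k-fold cross-validation, and accuracy/F1 scoring. The following is a minimal sketch using only the Python standard library; the function names, the contiguous fold layout, and the toy labels are assumptions for illustration and are not taken from the paper. The final lines show one situation in which F1 can exceed accuracy (a dominant positive class); whether that explains the reported figures is not stated in this record.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and binary F1 (positive class = 1) for two label sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels (hypothetical): three positives, one negative, all predicted positive.
acc, f1 = accuracy_and_f1([1, 1, 1, 0], [1, 1, 1, 1])
print(f"accuracy={acc:.2f} f1={f1:.2f}")
```

In practice a library such as scikit-learn would typically supply the classifiers, fold splitting, and metrics; the hand-written versions above only make the definitions explicit.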

Full record (DC)

DC field	Value	Language
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	en
dc.authority.people	Belbachir S.	en
dc.authority.people	Nahli O.	en
dc.authority.people	El Mohajir M.	en
dc.authority.people	Chahhou M.	en
dc.collection.id.s	71c7200a-7c5f-4e83-8d57-d3d2ba88f40d	*
dc.collection.name	04.01 Contributo in Atti di convegno	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.contributor.area	Non assegn	*
dc.date.accessioned	2026/03/03 17:01:47	-
dc.date.available	2026/03/03 17:01:47	-
dc.date.firstsubmission	2026/02/03 11:37:46	*
dc.date.issued	2025	-
dc.date.submission	2026/02/03 11:37:46	*
dc.description.abstract	Building lexical resources for low-resource languages, such as Arabic dialects, remains a challenging yet essential endeavor. One major difficulty lies in the reliable identification of appropriate synonyms, which requires both rich lexical data and robust machine learning techniques. This study presents a synset classification framework for Darija (Moroccan Arabic), leveraging contextual embeddings derived from multiple Transformer-based language models to capture the semantic richness of the dialect. In addition to contextual similarity, we automatically extract lexical and ontological similarity features. These features are combined and used as input to supervised classification algorithms. Several classifiers were evaluated, including Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting. The models were trained to predict the most appropriate WordNet synset for each Darija word, with performance assessed through k-fold cross-validation. Experimental results confirm the effectiveness of the proposed approach, with the best-performing model achieving an accuracy of 73.28% and an F1-score of 84.81%, underscoring the potential of Transformer-based embeddings in advancing lexical resource development for under-resourced languages.	-
dc.description.allpeople	Belbachir, S.; Nahli, O.; El Mohajir, M.; Chahhou, M.	-
dc.description.allpeopleoriginal	Belbachir S.; Nahli O.; El Mohajir M.; Chahhou M.	en
dc.description.fulltext	restricted	en
dc.description.numberofauthors	4	-
dc.identifier.doi	10.1109/CiSt65886.2025.11224302	en
dc.identifier.scopus	2-s2.0-105024969055	en
dc.identifier.source	scopus	*
dc.identifier.uri	https://hdl.handle.net/20.500.14243/566029	-
dc.language.iso	eng	en
dc.relation.firstpage	80	en
dc.relation.ispartofbook	8th IEEE Congress on Information Science and Technology	en
dc.relation.lastpage	87	en
dc.relation.numberofpages	8	en
dc.subject.keywordseng	Princeton WordNet (PWN);Darija;Natural Language Processing (NLP);Logistic Regression (LR);SUMO;Semantic Similarity;Cosine Similarity;Machine Learning	-
dc.subject.singlekeyword	Princeton WordNet (PWN)	*
dc.subject.singlekeyword	Darija	*
dc.subject.singlekeyword	Natural Language Processing (NLP)	*
dc.subject.singlekeyword	Logistic Regression (LR)	*
dc.subject.singlekeyword	SUMO	*
dc.subject.singlekeyword	Semantic Similarity	*
dc.subject.singlekeyword	Cosine Similarity	*
dc.subject.singlekeyword	Machine Learning	*
dc.title	Building a Machine Learning Classifier for Synonyms Validation in Moroccan Darija	en
dc.type.driver	info:eu-repo/semantics/conferenceObject	-
dc.type.full	04 Contributo in convegno::04.01 Contributo in Atti di convegno	it
dc.type.miur	273	-
iris.mediafilter.data	2026/03/04 02:52:05	*
iris.orcid.lastModifiedDate	2026/03/03 17:01:47	*
iris.orcid.lastModifiedMillisecond	1772553707435	*
iris.scopus.extIssued	2025	-
iris.scopus.extTitle	Building a Machine Learning Classifier for Synonyms Validation in Moroccan Darija	-
iris.sitodocente.maxattempts	1	-
iris.unpaywall.doi	10.1109/cist65886.2025.11224302	*
iris.unpaywall.isoa	false	*
iris.unpaywall.metadataCallLastModified	04/03/2026 04:33:51	-
iris.unpaywall.metadataCallLastModifiedMillisecond	1772595231207	-
iris.unpaywall.oastatus	closed	*
scopus.category	1711	*
scopus.category	1706	*
scopus.category	1803	*
scopus.category	1802	*
scopus.contributor.affiliation	New Technology Trends for Innovation Laboratory	-
scopus.contributor.affiliation	Istituto di Linguistica Computazionale	-
scopus.contributor.affiliation	New Technology Trends for Innovation Laboratory	-
scopus.contributor.affiliation	New Technology Trends for Innovation Laboratory	-
scopus.contributor.afid	60025506	-
scopus.contributor.afid	60021199	-
scopus.contributor.afid	60025506	-
scopus.contributor.afid	60025506	-
scopus.contributor.auid	60241555200	-
scopus.contributor.auid	56741333300	-
scopus.contributor.auid	60017115600	-
scopus.contributor.auid	36801152800	-
scopus.contributor.country	Morocco	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Morocco	-
scopus.contributor.country	Morocco	-
scopus.contributor.dptid	104292580	-
scopus.contributor.dptid		-
scopus.contributor.dptid	104292580	-
scopus.contributor.dptid	104292580	-
scopus.contributor.name	Said	-
scopus.contributor.name	Ouafae	-
scopus.contributor.name	Mohammed	-
scopus.contributor.name	Mohamed	-
scopus.contributor.subaffiliation	Faculty of Sciences;Abdelmalek Essaadi University;	-
scopus.contributor.subaffiliation	Consiglio Nazionale Delle Ricerche;	-
scopus.contributor.subaffiliation	Faculty of Sciences;Abdelmalek Essaadi University;	-
scopus.contributor.subaffiliation	Faculty of Sciences;Abdelmalek Essaadi University;	-
scopus.contributor.surname	Belbachir	-
scopus.contributor.surname	Nahli	-
scopus.contributor.surname	El Mohajir	-
scopus.contributor.surname	Chahhou	-
scopus.date.issued	2025	*
scopus.description.abstract	Building lexical resources for low-resource languages, such as Arabic dialects, remains a challenging yet essential endeavor. One major difficulty lies in the reliable identification of appropriate synonyms, which requires both rich lexical data and robust machine learning techniques. This study presents a synset classification framework for Darija (Moroccan Arabic), leveraging contextual embeddings derived from multiple Transformer-based language models to capture the semantic richness of the dialect. In addition to contextual similarity, we automatically extract lexical and ontological similarity features. These features are combined and used as input to supervised classification algorithms. Several classifiers were evaluated, including Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting. The models were trained to predict the most appropriate WordNet synset for each Darija word, with performance assessed through k-fold cross-validation. Experimental results confirm the effectiveness of the proposed approach, with the best-performing model achieving an accuracy of 73.28% and an F1-score of 84.81%, underscoring the potential of Transformer-based embeddings in advancing lexical resource development for under-resourced languages.	*
scopus.description.allpeopleoriginal	Belbachir S.; Nahli O.; El Mohajir M.; Chahhou M.	*
scopus.differences	scopus.publisher.name	*
scopus.differences	scopus.subject.keywords	*
scopus.differences	scopus.relation.conferencedate	*
scopus.differences	scopus.relation.conferencename	*
scopus.differences	scopus.identifier.isbn	*
scopus.differences	scopus.relation.conferenceplace	*
scopus.document.type	cp	*
scopus.document.types	cp	*
scopus.funding.funders	501100000780 - European Commission;	*
scopus.funding.ids	GA 101086252;	*
scopus.identifier.doi	10.1109/CiSt65886.2025.11224302	*
scopus.identifier.eissn	2327-1884	*
scopus.identifier.isbn	9798331543846	*
scopus.identifier.pui	649555984	*
scopus.identifier.scopus	2-s2.0-105024969055	*
scopus.journal.sourceid	21100400809	*
scopus.language.iso	eng	*
scopus.publisher.name	Institute of Electrical and Electronics Engineers Inc.	*
scopus.relation.conferencedate	2025	*
scopus.relation.conferencename	8th IEEE International Congress on Information Science and Technology, CiSt 2025	*
scopus.relation.conferenceplace	mar	*
scopus.relation.firstpage	80	*
scopus.relation.lastpage	87	*
scopus.subject.keywords	Cosine Similarity; Darija; Logistic Regression (LR); Machine Learning; Natural Language Processing (NLP); Princeton WordNet (PWN); Semantic Similarity; SUMO;	*
scopus.title	Building a Machine Learning Classifier for Synonyms Validation in Moroccan Darija	*
scopus.titleeng	Building a Machine Learning Classifier for Synonyms Validation in Moroccan Darija	*
Appears in types:	04.01 Contributo in Atti di convegno

Files in this item:

File	Size	Format
Building_a_Machine_Learning_Classifier_for_Synonyms_Validation_in_Moroccan_Darija.pdf (authorized users only; license: not public, private/restricted access)	1.16 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/566029
