Building a Machine Learning Classifier for Synonyms Validation in Moroccan Darija
Nahli O. (Supervision)
2025
Abstract
Building lexical resources for low-resource languages, such as Arabic dialects, remains a challenging yet essential endeavor. One major difficulty lies in the reliable identification of appropriate synonyms, which requires both rich lexical data and robust machine learning techniques. This study presents a synset classification framework for Darija (Moroccan Arabic), leveraging contextual embeddings derived from multiple Transformer-based language models to capture the semantic richness of the dialect. In addition to contextual similarity, we automatically extract lexical and ontological similarity features. These features are combined and used as input to supervised classification algorithms. Several classifiers were evaluated, including Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting. The models were trained to predict the most appropriate WordNet synset for each Darija word, with performance assessed through k-fold cross-validation. Experimental results confirm the effectiveness of the proposed approach, with the best-performing model achieving an accuracy of 73.28% and an F1-score of 84.81%, underscoring the potential of Transformer-based embeddings in advancing lexical resource development for under-resourced languages.

| DC field | Value | Language |
|---|---|---|
| dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | en |
| dc.authority.people | Belbachir S. | en |
| dc.authority.people | Nahli O. | en |
| dc.authority.people | El Mohajir M. | en |
| dc.authority.people | Chahhou M. | en |
| dc.collection.id.s | 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d | * |
| dc.collection.name | 04.01 Contributo in Atti di convegno | * |
| dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | * |
| dc.contributor.appartenenza.mi | 918 | * |
| dc.contributor.area | Non assegn | * |
| dc.date.accessioned | 2026/03/03 17:01:47 | - |
| dc.date.available | 2026/03/03 17:01:47 | - |
| dc.date.firstsubmission | 2026/02/03 11:37:46 | * |
| dc.date.issued | 2025 | - |
| dc.date.submission | 2026/02/03 11:37:46 | * |
| dc.description.abstract | Building lexical resources for low-resource languages, such as Arabic dialects, remains a challenging yet essential endeavor. One major difficulty lies in the reliable identification of appropriate synonyms, which requires both rich lexical data and robust machine learning techniques. This study presents a synset classification framework for Darija (Moroccan Arabic), leveraging contextual embeddings derived from multiple Transformer-based language models to capture the semantic richness of the dialect. In addition to contextual similarity, we automatically extract lexical and ontological similarity features. These features are combined and used as input to supervised classification algorithms. Several classifiers were evaluated, including Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting. The models were trained to predict the most appropriate WordNet synset for each Darija word, with performance assessed through k-fold cross-validation. Experimental results confirm the effectiveness of the proposed approach, with the best-performing model achieving an accuracy of 73.28% and an F1-score of 84.81%, underscoring the potential of Transformer-based embeddings in advancing lexical resource development for under-resourced languages. | - |
| dc.description.allpeople | Belbachir, S.; Nahli, O.; El Mohajir, M.; Chahhou, M. | - |
| dc.description.allpeopleoriginal | Belbachir S.; Nahli O.; El Mohajir M.; Chahhou M. | en |
| dc.description.fulltext | restricted | en |
| dc.description.numberofauthors | 4 | - |
| dc.identifier.doi | 10.1109/CiSt65886.2025.11224302 | en |
| dc.identifier.scopus | 2-s2.0-105024969055 | en |
| dc.identifier.source | scopus | * |
| dc.identifier.uri | https://hdl.handle.net/20.500.14243/566029 | - |
| dc.language.iso | eng | en |
| dc.relation.firstpage | 80 | en |
| dc.relation.ispartofbook | 8th IEEE Congress on Information Science and Technology | en |
| dc.relation.lastpage | 87 | en |
| dc.relation.numberofpages | 8 | en |
| dc.subject.keywordseng | Princeton WordNet (PWN); Darija; Natural Language Processing (NLP); Logistic Regression (LR); SUMO; Semantic Similarity; Cosine Similarity; Machine Learning | - |
| dc.subject.singlekeyword | Princeton WordNet (PWN) | * |
| dc.subject.singlekeyword | Darija | * |
| dc.subject.singlekeyword | Natural Language Processing (NLP) | * |
| dc.subject.singlekeyword | Logistic Regression (LR) | * |
| dc.subject.singlekeyword | SUMO | * |
| dc.subject.singlekeyword | Semantic Similarity | * |
| dc.subject.singlekeyword | Cosine Similarity | * |
| dc.subject.singlekeyword | Machine Learning | * |
| dc.title | Building a Machine Learning Classifier for Synonyms Validation in Moroccan Darija | en |
| dc.type.driver | info:eu-repo/semantics/conferenceObject | - |
| dc.type.full | 04 Contributo in convegno::04.01 Contributo in Atti di convegno | it |
| dc.type.miur | 273 | - |
| iris.mediafilter.data | 2026/03/04 02:52:05 | * |
| iris.orcid.lastModifiedDate | 2026/03/03 17:01:47 | * |
| iris.orcid.lastModifiedMillisecond | 1772553707435 | * |
| iris.scopus.extIssued | 2025 | - |
| iris.scopus.extTitle | Building a Machine Learning Classifier for Synonyms Validation in Moroccan Darija | - |
| iris.sitodocente.maxattempts | 1 | - |
| iris.unpaywall.doi | 10.1109/cist65886.2025.11224302 | * |
| iris.unpaywall.isoa | false | * |
| iris.unpaywall.metadataCallLastModified | 04/03/2026 04:33:51 | - |
| iris.unpaywall.metadataCallLastModifiedMillisecond | 1772595231207 | - |
| iris.unpaywall.oastatus | closed | * |
| scopus.category | 1711 | * |
| scopus.category | 1706 | * |
| scopus.category | 1803 | * |
| scopus.category | 1802 | * |
| scopus.contributor.affiliation | New Technology Trends for Innovation Laboratory | - |
| scopus.contributor.affiliation | Istituto di Linguistica Computazionale | - |
| scopus.contributor.affiliation | New Technology Trends for Innovation Laboratory | - |
| scopus.contributor.affiliation | New Technology Trends for Innovation Laboratory | - |
| scopus.contributor.afid | 60025506 | - |
| scopus.contributor.afid | 60021199 | - |
| scopus.contributor.afid | 60025506 | - |
| scopus.contributor.afid | 60025506 | - |
| scopus.contributor.auid | 60241555200 | - |
| scopus.contributor.auid | 56741333300 | - |
| scopus.contributor.auid | 60017115600 | - |
| scopus.contributor.auid | 36801152800 | - |
| scopus.contributor.country | Morocco | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Morocco | - |
| scopus.contributor.country | Morocco | - |
| scopus.contributor.dptid | 104292580 | - |
| scopus.contributor.dptid | | - |
| scopus.contributor.dptid | 104292580 | - |
| scopus.contributor.dptid | 104292580 | - |
| scopus.contributor.name | Said | - |
| scopus.contributor.name | Ouafae | - |
| scopus.contributor.name | Mohammed | - |
| scopus.contributor.name | Mohamed | - |
| scopus.contributor.subaffiliation | Faculty of Sciences;Abdelmalek Essaadi University; | - |
| scopus.contributor.subaffiliation | Consiglio Nazionale Delle Ricerche; | - |
| scopus.contributor.subaffiliation | Faculty of Sciences;Abdelmalek Essaadi University; | - |
| scopus.contributor.subaffiliation | Faculty of Sciences;Abdelmalek Essaadi University; | - |
| scopus.contributor.surname | Belbachir | - |
| scopus.contributor.surname | Nahli | - |
| scopus.contributor.surname | El Mohajir | - |
| scopus.contributor.surname | Chahhou | - |
| scopus.date.issued | 2025 | * |
| scopus.description.abstract | Building lexical resources for low-resource languages, such as Arabic dialects, remains a challenging yet essential endeavor. One major difficulty lies in the reliable identification of appropriate synonyms, which requires both rich lexical data and robust machine learning techniques. This study presents a synset classification framework for Darija (Moroccan Arabic), leveraging contextual embeddings derived from multiple Transformer-based language models to capture the semantic richness of the dialect. In addition to contextual similarity, we automatically extract lexical and ontological similarity features. These features are combined and used as input to supervised classification algorithms. Several classifiers were evaluated, including Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting. The models were trained to predict the most appropriate WordNet synset for each Darija word, with performance assessed through k-fold cross-validation. Experimental results confirm the effectiveness of the proposed approach, with the best-performing model achieving an accuracy of 73.28% and an F1-score of 84.81%, underscoring the potential of Transformer-based embeddings in advancing lexical resource development for under-resourced languages. | * |
| scopus.description.allpeopleoriginal | Belbachir S.; Nahli O.; El Mohajir M.; Chahhou M. | * |
| scopus.differences | scopus.publisher.name | * |
| scopus.differences | scopus.subject.keywords | * |
| scopus.differences | scopus.relation.conferencedate | * |
| scopus.differences | scopus.relation.conferencename | * |
| scopus.differences | scopus.identifier.isbn | * |
| scopus.differences | scopus.relation.conferenceplace | * |
| scopus.document.type | cp | * |
| scopus.document.types | cp | * |
| scopus.funding.funders | 501100000780 - European Commission; | * |
| scopus.funding.ids | GA 101086252; | * |
| scopus.identifier.doi | 10.1109/CiSt65886.2025.11224302 | * |
| scopus.identifier.eissn | 2327-1884 | * |
| scopus.identifier.isbn | 9798331543846 | * |
| scopus.identifier.pui | 649555984 | * |
| scopus.identifier.scopus | 2-s2.0-105024969055 | * |
| scopus.journal.sourceid | 21100400809 | * |
| scopus.language.iso | eng | * |
| scopus.publisher.name | Institute of Electrical and Electronics Engineers Inc. | * |
| scopus.relation.conferencedate | 2025 | * |
| scopus.relation.conferencename | 8th IEEE International Congress on Information Science and Technology, CiSt 2025 | * |
| scopus.relation.conferenceplace | mar | * |
| scopus.relation.firstpage | 80 | * |
| scopus.relation.lastpage | 87 | * |
| scopus.subject.keywords | Cosine Similarity; Darija; Logistic Regression (LR); Machine Learning; Natural Language Processing (NLP); Princeton WordNet (PWN); Semantic Similarity; SUMO; | * |
| scopus.title | Building a Machine Learning Classifier for Synonyms Validation in Moroccan Darija | * |
| scopus.titleeng | Building a Machine Learning Classifier for Synonyms Validation in Moroccan Darija | * |
| Appears in types: | 04.01 Contributo in Atti di convegno | |

| File | Size | Format |
|---|---|---|
| Building_a_Machine_Learning_Classifier_for_Synonyms_Validation_in_Moroccan_Darija.pdf (authorized users only; license: NON PUBBLICO - private/restricted access) | 1.16 MB | Adobe PDF |

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
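
The abstract mentions three generic building blocks: a cosine-similarity feature computed from embeddings, k-fold cross-validation, and evaluation by accuracy and F1-score. The following is a minimal sketch of those pieces in plain Python, not the authors' implementation. All function names are hypothetical, and a one-feature threshold rule stands in for the classifiers named in the abstract (Logistic Regression, Random Forest, Decision Tree, Gradient Boosting), which this record does not describe in detail.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k roughly equal, shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and binary F1 (positive class = 1) for two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def cross_validate_threshold(scores, labels, k=5, seed=0):
    """Mean test accuracy and F1 of a threshold rule over k folds (requires k >= 2)."""
    results = []
    for train, test in k_fold_indices(len(scores), k, seed):
        # "Training" step of the stand-in model: keep the candidate threshold
        # that classifies the most training pairs correctly.
        candidates = sorted({scores[i] for i in train})
        best = max(
            candidates,
            key=lambda t: sum((scores[i] >= t) == bool(labels[i]) for i in train),
        )
        y_true = [labels[i] for i in test]
        y_pred = [1 if scores[i] >= best else 0 for i in test]
        results.append(accuracy_and_f1(y_true, y_pred))
    mean_acc = sum(a for a, _ in results) / k
    mean_f1 = sum(f for _, f in results) / k
    return mean_acc, mean_f1
```

In use, each candidate (Darija word, synset) pair would contribute one `cosine_similarity` score, alongside whatever lexical and ontological features are available, and `cross_validate_threshold(scores, labels, k=5)` would return the fold-averaged metrics. Note that binary F1 can exceed accuracy when the positive class dominates: `accuracy_and_f1([1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1])` has four true positives and two false positives, so accuracy is 4/6 while F1 is 8/10. This record does not say whether that situation applies to the reported figures.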


