Building a semantic resource for the Moroccan dialect: a hybrid approach with LeOnI and semantic similarity

Nahli, Ouafae
Last author
Data Curation
2026

Abstract

Constructing lexical-semantic resources for low-resource languages is a crucial task in natural language processing, because it enables linguistic technologies that are often unavailable for under-represented languages. In this study, we present a methodology for building a structured lexical-semantic resource for Moroccan Arabic (Darija), a low-resource language. Our method consists of two steps. The first step maps a lexical-semantic network for Darija to the Princeton WordNet and the Suggested Upper Merged Ontology (SUMO). We apply the Lexical Ontology Inference (LeOnI) framework to link Darija words to both resources, using a bilingual resource and the existing WordNet-SUMO mapping. Darija words are classified as monosemous or polysemous to guide the mapping process. The second step introduces a similarity-based refinement process that combines semantic similarity components with ontological-lexical adjustment factors in a scoring function, which guides the automatic mapping and disambiguation of synsets and concepts. Our results demonstrate that combining symbolic and distributional semantics yields accurate and interpretable wordnet-like resources for dialects. We also analyze semantic coverage and translation gaps, highlighting concepts that are untranslatable or culturally specific to Darija. The proposed framework can be generalized to other low-resource languages, because its core mapping and refinement stages are language-independent. Once the Darija lexical-semantic resource is finalized, the constructed dataset will be made publicly available to promote reproducibility and to facilitate research into Arabic semantic processing and dialectal natural language processing.
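The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the LeOnI procedure or the scoring formula, so everything here is an assumption. The toy dictionaries, the placeholder word, synset, and concept names, the functions `candidates`, `classify`, `score`, and `map_word`, and the "weighted mean times adjustment factor" form of the score are all hypothetical and chosen only to make the monosemous/polysemous split and the refinement step concrete.

```python
# Hypothetical toy inputs; none of these entries come from the paper.
# Bilingual resource: Darija lemma -> English translations (placeholders).
BILINGUAL = {
    "darija_word_a": ["english_x"],
    "darija_word_b": ["english_y"],
}
# English lemma -> candidate WordNet synset identifiers (placeholders).
SYNSETS = {
    "english_x": ["synset_1"],
    "english_y": ["synset_2", "synset_3"],
}
# Synset -> SUMO concept, standing in for the existing WordNet-SUMO mapping.
SUMO = {"synset_1": "ConceptA", "synset_2": "ConceptB", "synset_3": "ConceptC"}


def candidates(word):
    """Collect candidate synsets for a Darija word through its translations."""
    found = []
    for english in BILINGUAL.get(word, []):
        for synset in SYNSETS.get(english, []):
            if synset not in found:
                found.append(synset)
    return found


def classify(word):
    """Label a word by its number of candidate synsets (assumed criterion)."""
    count = len(candidates(word))
    if count == 0:
        return "unmapped"
    return "monosemous" if count == 1 else "polysemous"


def score(similarities, weights, adjustment):
    """Assumed score: weighted mean of similarity components, scaled by one
    ontological-lexical adjustment factor."""
    base = sum(w * s for w, s in zip(weights, similarities)) / sum(weights)
    return base * adjustment


def map_word(word, similarities, adjustments, weights=(0.5, 0.5)):
    """Return (synset, SUMO concept) for a word, or None if it has no candidate.

    Monosemous words are mapped directly; polysemous words go through the
    similarity-based refinement and keep the highest-scoring candidate.
    """
    cands = candidates(word)
    if not cands:
        return None
    if len(cands) == 1:
        best = cands[0]
    else:
        best = max(
            cands,
            key=lambda s: score(similarities[s], weights, adjustments.get(s, 1.0)),
        )
    return best, SUMO.get(best)
```

In this sketch a word with no candidate synset is returned as `None`, which is one simple way to surface the translation gaps and culturally specific concepts that the abstract mentions; the similarity values passed to `map_word` would come from whatever distributional or lexical measures are actually used.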
DC Field Value Language
dc.authority.ancejournal LANGUAGE RESOURCES AND EVALUATION en
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Belbachir, Said en
dc.authority.people Mohajir, Mohammed El en
dc.authority.people Chahhou, Mohamed en
dc.authority.people Nahli, Ouafae en
dc.collection.id.s b3f88f24-048a-4e43-8ab1-6697b90e068e *
dc.collection.name 01.01 Articolo in rivista *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.area Non assegn *
dc.date.accessioned 2026/05/25 15:56:45 -
dc.date.available 2026/05/25 15:56:45 -
dc.date.firstsubmission 2026/05/07 16:53:11 *
dc.date.issued 2026 -
dc.date.submission 2026/05/07 16:53:11 *
dc.description.allpeople Belbachir, Said; Mohajir, Mohammed El; Chahhou, Mohamed; Nahli, Ouafae -
dc.description.allpeopleoriginal Belbachir, Said; Mohajir, Mohammed El; Chahhou, Mohamed; Nahli, Ouafae en
dc.description.fulltext restricted en
dc.description.numberofauthors 4 -
dc.identifier.doi 10.1007/s10579-026-09920-0 en
dc.identifier.source crossref *
dc.identifier.uri https://hdl.handle.net/20.500.14243/579744 -
dc.language.iso eng en
dc.relation.issue 2 en
dc.relation.medium ELETTRONICO en
dc.relation.volume 60 en
dc.subject.keywordseng Moroccan Arabic (Darija),Lexical-semantic resources,Cross-lingual semantic mapping,Low-resource language processing,Semantic similarity,Word sense disambiguation -
dc.title Building a semantic resource for the Moroccan dialect: a hybrid approach with LeOnI and semantic similarity en
dc.type.circulation Internazionale en
dc.type.driver info:eu-repo/semantics/article -
dc.type.full 01 Contributo su Rivista::01.01 Articolo in rivista it
dc.type.miur 262 -