Building a semantic resource for the Moroccan dialect: a hybrid approach with LeOnI and semantic similarity

Nahli, Ouafae (last author; Data Curation)
2026

Abstract

Building lexical-semantic resources for low-resource languages is a crucial task in natural language processing, because it enables language technologies that are often unavailable for under-represented languages. In this study, we present a methodology for building a structured lexical-semantic resource for the Moroccan Arabic dialect (Darija), a low-resource language. Our method consists of two steps. The first step maps a lexico-semantic network for Darija to the Princeton WordNet and the Suggested Upper Merged Ontology (SUMO). We apply the Lexical Ontology Inference (LeOnI) framework to link Darija words through a bilingual resource and the existing WordNet-SUMO mapping. Darija words are classified as monosemous or polysemous to guide the mapping process. The second step introduces a similarity-based refinement process that combines semantic similarity components with ontological-lexical adjustment factors. A scoring function guides the automatic mapping and disambiguation of synsets and concepts. Our results demonstrate that combining symbolic and distributional semantics yields accurate and interpretable wordnet-like resources for dialects. We also analyze semantic coverage and translation gaps, highlighting concepts that are untranslatable or culturally specific to Darija. The proposed framework can be generalized to other low-resource languages, because its core mapping and refinement stages are language-independent. Once the Darija lexical-semantic resource is finalized, the dataset will be made publicly available to promote reproducibility and facilitate research into Arabic semantic processing and dialectal natural language processing.
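The two steps summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the form of the scoring function, so the weighted sum scaled by one adjustment factor is an assumption, and every word, synset id, concept name, weight, and similarity value below is a made-up placeholder rather than data from the paper.

```python
# Toy data: every identifier below is a hypothetical placeholder, not taken from the paper.
LEXICON = {               # Darija word -> English translations (stand-in bilingual resource)
    "word_a": ["book"],
    "word_b": ["bank"],
    "word_c": ["zzz"],    # translation with no synset: a coverage gap
}
SYNSETS = {               # English lemma -> candidate synset ids
    "book": ["syn_book_1"],
    "bank": ["syn_bank_1", "syn_bank_2"],
}
SYNSET_TO_CONCEPT = {     # stand-in for the WordNet-SUMO mapping
    "syn_book_1": "ConceptBook",
    "syn_bank_1": "ConceptRiverside",
    "syn_bank_2": "ConceptFinancialOrg",
}


def candidates(word):
    """Collect the candidate synsets reachable through the word's translations."""
    return sorted({s for en in LEXICON.get(word, []) for s in SYNSETS.get(en, [])})


def classify(word):
    """Label a word by how many candidate synsets it reaches."""
    n = len(candidates(word))
    if n == 0:
        return "unmapped"
    return "monosemous" if n == 1 else "polysemous"


def score(similarities, weights, adjustment):
    """Weighted sum of similarity components, scaled by an adjustment factor (assumed form)."""
    return adjustment * sum(weights[k] * similarities[k] for k in weights)


def map_word(word, features, weights):
    """Return (synset, concept) for a word, or None when no candidate exists."""
    cands = candidates(word)
    if not cands:
        return None
    if len(cands) == 1:
        best = cands[0]  # monosemous: direct mapping, no disambiguation needed
    else:
        # polysemous: keep the highest-scoring candidate
        best = max(
            cands,
            key=lambda s: score(features[s]["sims"], weights, features[s]["adjust"]),
        )
    return best, SYNSET_TO_CONCEPT[best]


# Hypothetical component weights and per-candidate features.
WEIGHTS = {"gloss": 0.5, "context": 0.5}
FEATURES = {
    "syn_bank_1": {"sims": {"gloss": 0.2, "context": 0.1}, "adjust": 1.0},
    "syn_bank_2": {"sims": {"gloss": 0.7, "context": 0.9}, "adjust": 0.9},
}

for w in ("word_a", "word_b", "word_c"):
    print(w, classify(w), map_word(w, FEATURES, WEIGHTS))
```

In this sketch a monosemous word is mapped directly, a polysemous word is resolved by the highest score, and a word whose translations reach no synset is reported as unmapped, which is one simple way to surface the translation gaps the abstract mentions.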
DC field Value Language
dc.authority.ancejournal LANGUAGE RESOURCES AND EVALUATION en
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Belbachir, Said en
dc.authority.people Mohajir, Mohammed El en
dc.authority.people Chahhou, Mohamed en
dc.authority.people Nahli, Ouafae en
dc.collection.id.s b3f88f24-048a-4e43-8ab1-6697b90e068e *
dc.collection.name 01.01 Articolo in rivista *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.area Non assegn *
dc.date.firstsubmission 2026/05/07 16:53:11 *
dc.date.issued 2026 -
dc.date.submission 2026/05/07 16:53:11 *
dc.description.allpeople Belbachir, Said; Mohajir, Mohammed El; Chahhou, Mohamed; Nahli, Ouafae -
dc.description.allpeopleoriginal Belbachir, Said; Mohajir, Mohammed El; Chahhou, Mohamed; Nahli, Ouafae en
dc.description.fulltext none en
dc.description.numberofauthors 4 -
dc.identifier.doi 10.1007/s10579-026-09920-0 en
dc.identifier.source crossref *
dc.identifier.uri https://hdl.handle.net/20.500.14243/579744 -
dc.language.iso eng en
dc.relation.issue 2 en
dc.relation.medium ELECTRONIC en
dc.relation.volume 60 en
dc.subject.keywordseng Moroccan Arabic (Darija), Lexical-semantic resources, Cross-lingual semantic mapping, Low-resource language processing, Semantic similarity, Word sense disambiguation -
dc.title Building a semantic resource for the Moroccan dialect: a hybrid approach with LeOnI and semantic similarity en
dc.type.circulation International en
dc.type.driver info:eu-repo/semantics/article -
dc.type.full 01 Contributo su Rivista::01.01 Articolo in rivista it
dc.type.miur 262 -
iris.unpaywall.doi 10.1007/s10579-026-09920-0 *
iris.unpaywall.isoa false *
iris.unpaywall.journalisindoaj false *
iris.unpaywall.oastatus closed *