Building a semantic resource for the Moroccan dialect: a hybrid approach with LeOnI and semantic similarity
Nahli, Ouafae
Data Curation
2026
Abstract
Constructing lexical-semantic resources for low-resource languages is a crucial task in natural language processing, because it enables linguistic technologies that are often unavailable for under-represented languages. In this study, we present a methodology for building a structured lexical-semantic resource for Darija, the Moroccan Arabic dialect. The method consists of two steps. The first step maps a lexical-semantic network for Darija to the Princeton WordNet and the Suggested Upper Merged Ontology (SUMO). We apply the Lexical Ontology Inference (LeOnI) framework to link Darija words through a bilingual resource and the existing WordNet-SUMO mapping. Darija words are classified as monosemous or polysemous to guide the mapping process. The second step introduces a similarity-based refinement process that combines semantic similarity components with ontological-lexical adjustment factors. A scoring function reliably guides the automatic mapping and disambiguation of synsets and concepts. Our results demonstrate that combining symbolic and distributional semantics yields accurate and interpretable wordnet-like resources for dialects. We also analyze semantic coverage and translation gaps, highlighting concepts that are untranslatable or culturally specific to Darija. The proposed framework can be generalized to other low-resource languages, because its core mapping and refinement stages are language-independent. Once the Darija lexical-semantic resource is finalized, the dataset will be made publicly available to promote reproducibility and to facilitate research on Arabic semantic processing and dialectal natural language processing.
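The two steps described in the abstract can be illustrated with a minimal sketch. This is not the LeOnI implementation or the paper's scoring function: the toy data, all function names, and the assumed score form (mean of the similarity components multiplied by the product of the adjustment factors) are hypothetical and chosen only to make the idea concrete.

```python
from math import prod
from statistics import mean

# Toy, made-up data (not from the paper): word -> translations, translation -> synset ids.
BILINGUAL = {"w_mono": ["t1"], "w_poly": ["t2", "t3"]}
SYNSET_INDEX = {"t1": ["syn_a"], "t2": ["syn_b", "syn_c"], "t3": ["syn_c"]}


def candidate_synsets(word):
    """Collect synsets reachable through the word's translations, in first-seen order."""
    seen = []
    for translation in BILINGUAL.get(word, []):
        for synset in SYNSET_INDEX.get(translation, []):
            if synset not in seen:
                seen.append(synset)
    return seen


def classify(word):
    """Step 1 (assumed criterion): one candidate synset means monosemous, several polysemous."""
    n = len(candidate_synsets(word))
    if n == 0:
        return "unmapped"
    return "monosemous" if n == 1 else "polysemous"


def score(similarities, adjustments):
    """Step 2 (assumed form): mean similarity scaled by the product of adjustment factors."""
    return mean(similarities) * prod(adjustments)


def disambiguate(word, evidence):
    """Return the highest-scoring candidate synset, or None when there is no candidate.

    `evidence` maps synset id -> (similarity components, adjustment factors).
    """
    candidates = candidate_synsets(word)
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]  # monosemous words need no refinement
    return max(candidates, key=lambda s: score(*evidence[s]))
```

In this sketch a monosemous word is mapped directly, while a polysemous word is resolved by comparing scores across its candidate synsets. A real pipeline would replace the toy dictionaries with the bilingual resource and the WordNet-SUMO mapping named in the abstract.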


