CNR Institutional Research Information System

Building a semantic resource for the Moroccan dialect: a hybrid approach with LeOnI and semantic similarity

Nahli, Ouafae (last author; Data Curation)

2026

Abstract

The construction of lexical-semantic resources for low-resource languages is a crucial task in natural language processing, as it enables the development of linguistic technologies that are often unavailable for under-represented languages. In this study, we present a methodology for building a structured lexical-semantic resource for the Moroccan Arabic dialect (Darija) as a low-resource language. Our method consists of two steps. The first step involves mapping a lexico-semantic network for Darija to the Princeton WordNet and the Suggested Upper Merged Ontology (SUMO). We apply the Lexical Ontology Inference (LeOnI) framework to link Darija words using a bilingual resource and existing WordNet-SUMO mapping. Darija words are classified as monosemous or polysemous to guide the mapping process. The second step introduces a similarity-based refinement process, combining semantic similarity components with ontological-lexical adjustment factors. A scoring function reliably guides the automatic mapping and disambiguation of synsets and concepts. Our results demonstrate that the combination of symbolic and distributional semantics yields accurate and interpretable wordnet-like resources for dialects. We also analyze semantic coverage and translation gaps, highlighting concepts that are untranslatable or culturally specific to Darija. The proposed framework can be generalized to other low-resource languages, as the core mapping and refinement stages in our method are language-independent. Once the Darija lexical-semantic resource is finalized, the constructed dataset will be made publicly available to promote reproducibility and facilitate research into Arabic semantic processing and dialectal natural language processing.
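
The two steps described in the abstract can be illustrated with a minimal sketch. It is not the LeOnI framework or the paper's scoring function, neither of which is specified in this record. It assumes a toy bilingual lexicon, toy synset and concept labels, toy two-dimensional vectors, and a hypothetical linear score that blends cosine similarity with a lexical-overlap adjustment. Every name, weight, and data entry below is an assumption made for illustration.

```python
from math import sqrt

# All data is invented for illustration. None of it comes from the paper,
# from Princeton WordNet, or from the real WordNet-SUMO mapping.
BILINGUAL = {"ktab": ["book"], "bab": ["door", "gate"]}  # Darija -> English
SYNSET_LEMMAS = {                                        # synset id -> lemmas
    "s_book": {"book"},
    "s_door_object": {"door", "gate"},
    "s_door_room": {"door"},
}
SYNSET_TO_CONCEPT = {                                    # synset id -> concept label
    "s_book": "Book",
    "s_door_object": "Door",
    "s_door_room": "Room",
}
WORD_VEC = {"ktab": (1.0, 0.0), "bab": (0.9, 0.1)}       # toy embeddings
SYNSET_VEC = {
    "s_book": (1.0, 0.0),
    "s_door_object": (1.0, 0.0),
    "s_door_room": (0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidates(word):
    """Synsets that share at least one lemma with the word's translations."""
    translations = set(BILINGUAL.get(word, []))
    return sorted(s for s, lemmas in SYNSET_LEMMAS.items() if lemmas & translations)


def classify(word):
    """Step 1: label a word by the number of candidate synsets it reaches."""
    n = len(candidates(word))
    if n == 0:
        return "unmapped"
    return "monosemous" if n == 1 else "polysemous"


def score(word, synset, weight=0.7):
    """Step 2 (hypothetical): blend similarity with a lexical-overlap adjustment."""
    translations = set(BILINGUAL[word])
    similarity = cosine(WORD_VEC[word], SYNSET_VEC[synset])
    adjustment = len(translations & SYNSET_LEMMAS[synset]) / len(translations)
    return weight * similarity + (1 - weight) * adjustment


def map_word(word):
    """Return (class label, best synset, concept label), or None if unmapped."""
    cands = candidates(word)
    if not cands:
        return None
    best = max(cands, key=lambda s: score(word, s))
    return classify(word), best, SYNSET_TO_CONCEPT[best]


if __name__ == "__main__":
    for w in BILINGUAL:
        print(w, map_word(w))
```

In this sketch a monosemous word is linked directly to its single candidate, while a polysemous word is disambiguated by the highest score. The weight of 0.7 is arbitrary and would need tuning against real data.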

Year: 2026

Organizational units: Istituto di linguistica computazionale "Antonio Zampolli" - ILC

Keywords: Moroccan Arabic (Darija), Lexical-semantic resources, Cross-lingual semantic mapping, Low-resource language processing, Semantic similarity, Word sense disambiguation

Appears in types: 01.01 Journal article

Files in this item:

File	Size	Format
s10579-026-09920-0.pdf (authorized users only; license: not public, private/restricted access)	1.66 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/579744
