A Robust Morphological Analysis System for the Moroccan Dialect

Khlif, Nadia; Nahli, Ouafae
2026

Abstract

This work presents DiMorph, a morphological engine for Moroccan Arabic (Darija) that integrates custom pre- and post-processing techniques to address orthographic inconsistency and the lack of standardization. A key feature of DiMorph is its multiword expression (MWE) recognition module, which detects and processes MWEs against a predefined lexicon, leading to more accurate gloss generation. Tested on a Facebook corpus of 11,085 tokens, DiMorph achieved 97.84% in-vocabulary (INV) coverage, with an out-of-vocabulary (OOV) rate of 2.16% consisting mostly of foreign terms, proper names and emerging words. In all, 40.48% of tokens had a single interpretation, while 59.52% exhibited ambiguity, due mainly to homography (89.71%), followed by polysemy (9.31%) and morphological syncretism (0.98%). By providing robust morphological analysis and MWE handling, DiMorph improves Darija text processing. Its linguistic resources will be released as open source to support further work on natural language processing (NLP) for Arabic dialects.
Istituto di linguistica computazionale "Antonio Zampolli" - ILC
9781003671602
Morphological engine, DiMorph, Moroccan dialect, Multiword expressions, Darija, Text processing.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/566042