A Robust Morphological Analysis System for the Moroccan Dialect
Khlif, Nadia; Nahli, Ouafae
2026
Abstract
This work presents DiMorph, a morphological engine for Moroccan Arabic (Darija) that integrates custom pre- and post-processing techniques to address orthographic inconsistency and the lack of standardization. A key feature of DiMorph is its multiword expression (MWE) recognition module, which detects and processes MWEs using a predefined lexicon, leading to more accurate gloss generation. Tested on a Facebook corpus of 11,085 tokens, DiMorph achieved 97.84% in-vocabulary (INV) coverage, with an out-of-vocabulary (OOV) rate of 2.16% consisting mostly of foreign terms, proper names and emerging words. In all, 40.48% of tokens had a single interpretation, while 59.52% exhibited ambiguity, largely due to homography (89.71%), polysemy (9.31%) and morphological syncretism (0.98%). By providing robust morphological analysis and MWE handling, DiMorph improves Darija text processing. Its linguistic resources will be released as open source to support further work on natural language processing (NLP) for Arabic dialects.
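The abstract describes MWE recognition as lookup against a predefined lexicon but does not specify the matching strategy. The sketch below illustrates one common approach, greedy longest-match over token n-grams, purely as an assumption about how such a module could work. The names `MWE_LEXICON` and `tag_mwes` are hypothetical, and the lexicon entries are English stand-ins for illustration, not entries from DiMorph's resources.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon: each key is a token sequence, each value a gloss.
# These entries are illustrative stand-ins, not DiMorph data.
MWE_LEXICON: Dict[Tuple[str, ...], str] = {
    ("kick", "the", "bucket"): "die",
    ("by", "and", "large"): "generally",
}

# Longest entry in the lexicon; bounds the n-gram window.
MAX_LEN = max(len(entry) for entry in MWE_LEXICON)


def tag_mwes(tokens: List[str]) -> List[Tuple[Tuple[str, ...], Optional[str]]]:
    """Group tokens into units, preferring the longest lexicon match.

    Returns (unit, gloss) pairs; single tokens that are not part of a
    recognized MWE get a gloss of None and are left for later analysis.
    """
    units: List[Tuple[Tuple[str, ...], Optional[str]]] = []
    i = 0
    while i < len(tokens):
        # Try candidate spans from the longest possible down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in MWE_LEXICON:
                units.append((candidate, MWE_LEXICON[candidate]))
                i += n
                break
        else:
            # No multiword match starts here: emit the token on its own.
            units.append(((tokens[i],), None))
            i += 1
    return units


if __name__ == "__main__":
    for unit, gloss in tag_mwes("he will kick the bucket".split()):
        print(" ".join(unit), "->", gloss)
```

Matching MWEs before per-token analysis keeps an expression from being glossed word by word, which is one plausible reading of how the module leads to more accurate gloss generation. A real system would also need to apply the orthographic normalization mentioned above before lookup.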


