A Robust Morphological Analysis System for the Moroccan Dialect
Khlif, Nadia; Mazroui, Azzedine; Nahli, Ouafae
2026
Abstract
This work presents DiMorph, a morphological engine for Moroccan Arabic (Darija), integrating custom pre- and post-processing techniques to address orthographic inconsistency and lack of standardization. A key feature of DiMorph is its multiword expression (MWE) recognition module, which enhances analysis by detecting and processing MWEs based on a predefined lexicon, leading to more accurate gloss generation. Tested on a Facebook corpus of 11,085 tokens, DiMorph achieved 97.84% in-vocabulary (INV) coverage, with an out-of-vocabulary (OOV) rate of 2.16%, mostly consisting of foreign terms, proper names and emerging words. In all, 40.48% of tokens had a single interpretation, while 59.52% exhibited ambiguity, largely due to homography (89.71%), polysemy (9.31%) and morphological syncretism (0.98%). By providing robust morphological analysis and MWE handling, DiMorph significantly enhances Darija text processing. Its linguistic resources will be released as open-source, fostering further advancements in Arabic dialect natural language processing (NLP).

| DC Field | Value | Language |
|---|---|---|
| dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | en |
| dc.authority.people | Khlif, Nadia | en |
| dc.authority.people | Mazroui, Azzedine | en |
| dc.authority.people | Nahli, Ouafae | en |
| dc.collection.id.s | 8c50ea44-be95-498f-946e-7bb5bd666b7c | * |
| dc.collection.name | 02.01 Contributo in volume (Capitolo o Saggio) | * |
| dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | * |
| dc.contributor.appartenenza.mi | 918 | * |
| dc.contributor.area | Non assegn | * |
| dc.contributor.area | Non assegn | * |
| dc.date.firstsubmission | 2026/02/03 11:47:22 | * |
| dc.date.issued | 2026 | - |
| dc.date.submission | 2026/05/08 15:56:16 | * |
| dc.description.abstracteng | This work presents DiMorph, a morphological engine for Moroccan Arabic (Darija), integrating custom pre- and post-processing techniques to address orthographic inconsistency and lack of standardization. A key feature of DiMorph is its multiword expression (MWE) recognition module, which enhances analysis by detecting and processing MWEs based on a predefined lexicon, leading to more accurate gloss generation. Tested on a Facebook corpus of 11,085 tokens, DiMorph achieved 97.84% in-vocabulary (INV) coverage, with an out-of-vocabulary (OOV) rate of 2.16%, mostly consisting of foreign terms, proper names and emerging words. In all, 40.48% of tokens had a single interpretation, while 59.52% exhibited ambiguity, largely due to homography (89.71%), polysemy (9.31%) and morphological syncretism (0.98%). By providing robust morphological analysis and MWE handling, DiMorph significantly enhances Darija text processing. Its linguistic resources will be released as open-source, fostering further advancements in Arabic dialect natural language processing (NLP). | - |
| dc.description.allpeople | Khlif, Nadia; Mazroui, Azzedine; Nahli, Ouafae | - |
| dc.description.allpeopleoriginal | Khlif, Nadia; Mazroui, Azzedine; Nahli, Ouafae | en |
| dc.description.fulltext | none | en |
| dc.description.numberofauthors | 3 | - |
| dc.identifier.doi | 10.1201/9781003671602 | en |
| dc.identifier.isbn | 9781003671602 | en |
| dc.identifier.source | manual | * |
| dc.identifier.uri | https://hdl.handle.net/20.500.14243/566042 | - |
| dc.identifier.url | https://doi.org/10.1201/9781003671602 | en |
| dc.language.iso | eng | en |
| dc.publisher.country | USA | en |
| dc.publisher.name | CRC Press – Taylor & Francis Group | en |
| dc.publisher.place | Boca Raton | en |
| dc.relation.allauthors | Azrour, Mourade; Guezzaz, Azidine; Jabbour, Said | en |
| dc.relation.ispartofbook | Smart Technologies for a Sustainable Environment | en |
| dc.subject.keywordseng | Morphological engine, DiMorph, Moroccan dialect, Multiword expressions, Darija, Text processing. | - |
| dc.subject.singlekeyword | Morphological engine | * |
| dc.subject.singlekeyword | DiMorph | * |
| dc.subject.singlekeyword | Moroccan dialect | * |
| dc.subject.singlekeyword | Multiword expressions | * |
| dc.subject.singlekeyword | Darija | * |
| dc.subject.singlekeyword | Text processing | * |
| dc.title | A Robust Morphological Analysis System for the Moroccan Dialect | en |
| dc.type.driver | info:eu-repo/semantics/bookPart | - |
| dc.type.full | 02 Contributo in Volume::02.01 Contributo in volume (Capitolo o Saggio) | it |
| dc.type.miur | 268 | - |
| dc.type.referee | Yes, but type not specified | en |
| iris.orcid.lastModifiedDate | 2026/05/08 15:56:16 | * |
| iris.orcid.lastModifiedMillisecond | 1778248576694 | * |
| iris.sitodocente.maxattempts | 1 | - |
| iris.unpaywall.doi | 10.1201/9781003671602 | * |
| iris.unpaywall.isoa | false | * |
| iris.unpaywall.journalisindoaj | false | * |
| iris.unpaywall.metadataCallLastModified | 09/05/2026 05:29:31 | - |
| iris.unpaywall.metadataCallLastModifiedMillisecond | 1778297371062 | - |
| iris.unpaywall.oastatus | closed | * |


