A Proposed Approach for Extracting Semantic and Lexical Relations for Low-Resource Languages: A Case Study of Darija
Khlif Nadia (Data Curation); Nahli O. (Supervision)
2025
Abstract
Extracting semantic relations between words is crucial for developing and enriching lexical resources, especially for under-resourced languages such as Moroccan Darija. This paper presents an automated methodology for identifying synonyms, antonyms, hypernyms, and hyponyms by leveraging bilingual Darija-English resources, Princeton WordNet (PWN), the Suggested Upper Merged Ontology (SUMO), and the NLTK toolkit. The methodology was evaluated on a dataset of 361 Darija nouns, selected as a preliminary testbed to validate the approach before scaling it to the full lexicon. The results show that 83.10% of the nouns were successfully aligned with PWN synsets, resulting in the extraction of 14,201 semantic relations, of which 5,475 (38.55%) were validated through back-translation. These findings confirm the potential of transferring semantic knowledge from English into Darija, despite cultural and lexical mismatches. The proposed pipeline substantially enriches Darija's lexical coverage and offers a scalable and replicable approach for developing semantic resources in other low-resource dialects. © 2025 IEEE.

| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| A_Proposed_Approach_for_Extracting_Semantic_and_Lexical_Relations_for_Low-Resource_Languages_A_Case_Study_of_Darija.pdf | Authorized users only | Publisher's version (PDF) | Not public: private/restricted access | 1.24 MB | Adobe PDF |
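The figures reported in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check, not part of the authors' pipeline: the variable names are illustrative, and the derived values (the number of aligned nouns and the average number of relations per aligned noun) are not stated in the abstract. They assume that the 83.10% alignment rate is a share of the 361 evaluated nouns.

```python
# Figures as reported in the abstract.
TOTAL_NOUNS = 361              # Darija nouns in the evaluation set
ALIGNED_SHARE_PCT = 83.10      # share of nouns aligned with PWN synsets
EXTRACTED_RELATIONS = 14_201   # semantic relations extracted
VALIDATED_RELATIONS = 5_475    # relations validated through back-translation

# Derived values (assumption: 83.10% is a share of the 361 nouns).
aligned_nouns = round(TOTAL_NOUNS * ALIGNED_SHARE_PCT / 100)
unaligned_nouns = TOTAL_NOUNS - aligned_nouns

# Recompute the validated share from the two reported counts.
validated_share_pct = round(100 * VALIDATED_RELATIONS / EXTRACTED_RELATIONS, 2)

# Average number of extracted relations per aligned noun (derived, not reported).
relations_per_aligned_noun = EXTRACTED_RELATIONS / aligned_nouns

print(f"aligned nouns: {aligned_nouns}, unaligned: {unaligned_nouns}")
print(f"validated share: {validated_share_pct}%")
print(f"relations per aligned noun: {relations_per_aligned_noun:.1f}")
```

Under that assumption, the reported percentages are consistent with the reported counts: the alignment rate corresponds to 300 of the 361 nouns, and 5,475 of 14,201 relations gives the stated 38.55%.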
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


