A Proposed Approach for Extracting Semantic and Lexical Relations for Low-Resource Languages: A Case Study of Darija

Khlif Nadia (Data Curation); Nahli O. (Supervision)
2025

Abstract

Extracting semantic relations between words is crucial for developing and enriching lexical resources, especially for under-resourced languages such as Moroccan Darija. This paper presents an automated methodology for identifying synonyms, antonyms, hypernyms, and hyponyms by leveraging bilingual Darija-English resources, Princeton WordNet (PWN), the Suggested Upper Merged Ontology (SUMO), and the NLTK toolkit. The experimental evaluation was conducted on a dataset of 361 Darija nouns, selected as a preliminary testbed to validate the methodology before scaling it to the full lexicon. The results show that 83.10% of these nouns were successfully aligned with PWN synsets, yielding 14,201 extracted semantic relations, of which 5,475 (38.55%) were validated through back-translation. These findings confirm the potential of transferring semantic knowledge from English into Darija, despite cultural and lexical mismatches. The proposed pipeline substantially enriches Darija's lexical coverage and offers a scalable and replicable approach for developing semantic resources in other low-resource dialects. © 2025 IEEE.
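The back-translation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a relation counts as validated when the English target of a PWN relation maps back to a Darija entry other than the source word; the abstract does not state the exact criterion, so this is an assumption. The lexicon entries, the candidate list, and the function names `back_translate` and `validation_rate` are hypothetical placeholders, not the authors' data or code.

```python
from typing import Dict, List, Set, Tuple

# (darija_source, relation_type, english_target), as might be produced
# after aligning a Darija noun with a PWN synset. Toy data only.
Candidate = Tuple[str, str, str]
Validated = Tuple[str, str, str]

# Hypothetical English -> Darija lexicon (illustrative transliterations).
ENGLISH_TO_DARIJA: Dict[str, Set[str]] = {
    "dog": {"kelb"},
    "animal": {"hayawan"},
    "house": {"dar"},
    "home": {"dar"},
    "book": {"ktab"},
}

CANDIDATES: List[Candidate] = [
    ("kelb", "hypernym", "animal"),      # maps back to "hayawan"
    ("kelb", "hyponym", "puppy"),        # no Darija entry in the toy lexicon
    ("dar", "synonym", "home"),          # maps back to the source word itself
    ("ktab", "hypernym", "publication"), # no Darija entry in the toy lexicon
]


def back_translate(candidates: List[Candidate],
                   en_to_darija: Dict[str, Set[str]]) -> List[Validated]:
    """Keep relations whose English target translates back into Darija."""
    validated: List[Validated] = []
    for source, relation, english_target in candidates:
        for darija_target in sorted(en_to_darija.get(english_target, set())):
            # Assumption: a word is not counted as related to itself.
            if darija_target != source:
                validated.append((source, relation, darija_target))
    return validated


def validation_rate(validated_count: int, total_count: int) -> float:
    """Percentage of extracted relations that were validated, to 2 decimals."""
    return round(100.0 * validated_count / total_count, 2)


validated = back_translate(CANDIDATES, ENGLISH_TO_DARIJA)
toy_rate = validation_rate(len(validated), len(CANDIDATES))
# The same formula applied to the counts reported in the abstract.
reported_rate = validation_rate(5475, 14201)
```

Applied to the counts reported in the abstract (5,475 validated out of 14,201 extracted), the same formula reproduces the stated 38.55%.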
Istituto di linguistica computazionale "Antonio Zampolli" - ILC
Keywords: Darija; NLP; NLTK; Ontology; semantic relations; SUMO; WordNet
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/563028