An Automatically Built Named Entity Lexicon for Arabic

Attia M; Toral A; Tounsi L; Monachini M; Van Genabith J
2010

Abstract

We have successfully adapted and extended the automatic Multilingual, Interoperable Named Entity Lexicon approach to Arabic, using Arabic WordNet (AWN) and Arabic Wikipedia (AWK). First, we extract AWN's instantiable nouns and identify the corresponding categories and hyponym subcategories in AWK. Then, we exploit Wikipedia inter-lingual links to locate correspondences between articles in ten different languages in order to identify Named Entities (NEs). We apply keyword search on AWK abstracts to provide for Arabic articles that do not have a correspondence in any of the other languages. In addition, we perform a post-processing step to fetch further NEs from AWK not reachable through AWN. Finally, we investigate diacritization using matching with geonames databases, MADA-TOKAN tools and different heuristics for restoring vowel marks of Arabic NEs. Using this methodology, we have extracted approximately 45,000 Arabic NEs and built, to the best of our knowledge, the largest Named Entity repository available.
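The diacritization step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a gazetteer (for example, entries taken from a geonames database) supplies fully diacritized Arabic names, and that an undiacritized NE is matched to them after removing vowel marks. The function names `strip_diacritics`, `build_index` and `restore`, and the sample entry, are hypothetical and not taken from the paper; the MADA-TOKAN tools and the other heuristics are not covered.

```python
import re
from typing import Dict, Iterable, Optional

# Arabic vowel marks from fathatan to sukun (U+064B to U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic vowel marks so names can be compared letter by letter."""
    return DIACRITICS.sub("", text)


def build_index(diacritized_names: Iterable[str]) -> Dict[str, str]:
    """Map each undiacritized form to the first diacritized form seen.

    Assumption: the gazetteer entries are already fully diacritized.
    """
    index: Dict[str, str] = {}
    for name in diacritized_names:
        index.setdefault(strip_diacritics(name), name)
    return index


def restore(name: str, index: Dict[str, str]) -> Optional[str]:
    """Return the diacritized form of `name`, or None if there is no match."""
    return index.get(strip_diacritics(name))


# Hypothetical one-entry gazetteer: "Misr" (Egypt) written with kasra and sukun.
gazetteer = ["\u0645\u0650\u0635\u0652\u0631"]
index = build_index(gazetteer)
restored = restore("\u0645\u0635\u0631", index)
```

Exact matching on the stripped form is the simplest possible design; names that have no gazetteer entry return `None` and would need another method.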
DC field Value Language
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC -
dc.authority.people Attia M it
dc.authority.people Toral A it
dc.authority.people Tounsi L it
dc.authority.people Monachini M it
dc.authority.people Van Genabith J it
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Conference proceedings contribution *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.date.accessioned 2024/02/19 20:04:04 -
dc.date.available 2024/02/19 20:04:04 -
dc.date.issued 2010 -
dc.description.affiliations NCLT, School of Computing, Dublin City University, Ireland, ILC-CNR, Pisa -
dc.description.allpeople Attia, M; Toral, A; Tounsi, L; Monachini, M; Van Genabith, J -
dc.description.allpeopleoriginal Attia M.; Toral A.; Tounsi L.; Monachini M.; Van Genabith J. -
dc.description.fulltext none en
dc.description.numberofauthors 5 -
dc.identifier.isbn 2-9517408-6-7 -
dc.identifier.uri https://hdl.handle.net/20.500.14243/65154 -
dc.relation.conferencename Seventh International Conference on Language Resources and Evaluation -
dc.relation.conferenceplace Valletta, Malta -
dc.subject.keywords Acquisition -
dc.subject.keywords Lexicon -
dc.subject.keywords database -
dc.subject.keywords Named Entity recognition -
dc.subject.singlekeyword Acquisition *
dc.subject.singlekeyword Lexicon *
dc.subject.singlekeyword database *
dc.subject.singlekeyword Named Entity recognition *
dc.title An Automatically Built Named Entity Lexicon for Arabic en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Conference contribution::04.01 Conference proceedings contribution it
dc.type.miur 273 -
dc.type.referee Yes, but type not specified -
dc.ugov.descaux1 84787 -
iris.orcid.lastModifiedDate 2024/04/04 11:56:55 *
iris.orcid.lastModifiedMillisecond 1712224615861 *
iris.sitodocente.maxattempts 1 -
Appears in types: 04.01 Conference proceedings contribution
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/65154