An Automatically Built Named Entity Lexicon for Arabic
Attia M; Toral A; Tounsi L; Monachini M; Van Genabith J
2010
Abstract
We have adapted and extended the automatic Multilingual, Interoperable Named Entity Lexicon approach to Arabic, using Arabic WordNet (AWN) and Arabic Wikipedia (AWK). First, we extract AWN's instantiable nouns and identify the corresponding categories and hyponym subcategories in AWK. Then, we exploit Wikipedia inter-lingual links to locate correspondences between articles in ten different languages in order to identify Named Entities (NEs). We apply keyword search on AWK abstracts to cover Arabic articles that have no corresponding article in any of the other languages. In addition, we perform a post-processing step to fetch further NEs from AWK that are not reachable through AWN. Finally, we investigate diacritization, using matching against geonames databases, the MADA-TOKAN tools, and various heuristics to restore the vowel marks of Arabic NEs. Using this methodology, we have extracted approximately 45,000 Arabic NEs and built, to the best of our knowledge, the largest Named Entity repository available.

| DC field | Value | Language |
|---|---|---|
| dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | - |
| dc.authority.people | Attia M | it |
| dc.authority.people | Toral A | it |
| dc.authority.people | Tounsi L | it |
| dc.authority.people | Monachini M | it |
| dc.authority.people | Van Genabith J | it |
| dc.collection.id.s | 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d | * |
| dc.collection.name | 04.01 Contributo in Atti di convegno | * |
| dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | * |
| dc.contributor.appartenenza.mi | 918 | * |
| dc.date.accessioned | 2024/02/19 20:04:04 | - |
| dc.date.available | 2024/02/19 20:04:04 | - |
| dc.date.issued | 2010 | - |
| dc.description.affiliations | NCLT, School of Computing, Dublin City University, Ireland, ILC-CNR, Pisa | - |
| dc.description.allpeople | Attia, M; Toral, A; Tounsi, L; Monachini, M; Van Genabith, J | - |
| dc.description.allpeopleoriginal | Attia M.; Toral A.; Tounsi L.; Monachini M.; Van Genabith J. | - |
| dc.description.fulltext | none | en |
| dc.description.numberofauthors | 5 | - |
| dc.identifier.isbn | 2-9517408-6-7 | - |
| dc.identifier.uri | https://hdl.handle.net/20.500.14243/65154 | - |
| dc.relation.conferencename | Seventh International Conference on Language Resources and Evaluation | - |
| dc.relation.conferenceplace | Valletta, Malta | - |
| dc.subject.keywords | Acquisition | - |
| dc.subject.keywords | Lexicon | - |
| dc.subject.keywords | database | - |
| dc.subject.keywords | Named Entity recognition | - |
| dc.subject.singlekeyword | Acquisition | * |
| dc.subject.singlekeyword | Lexicon | * |
| dc.subject.singlekeyword | database | * |
| dc.subject.singlekeyword | Named Entity recognition | * |
| dc.title | An Automatically Built Named Entity Lexicon for Arabic | en |
| dc.type.driver | info:eu-repo/semantics/conferenceObject | - |
| dc.type.full | 04 Contributo in convegno::04.01 Contributo in Atti di convegno | it |
| dc.type.miur | 273 | - |
| dc.type.referee | Yes, but type not specified | - |
| dc.ugov.descaux1 | 84787 | - |
| iris.orcid.lastModifiedDate | 2024/04/04 11:56:55 | * |
| iris.orcid.lastModifiedMillisecond | 1712224615861 | * |
| iris.sitodocente.maxattempts | 1 | - |
| Appears in types: | 04.01 Contributo in Atti di convegno | |
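
The abstract mentions restoring vowel marks by matching against geonames databases. The record gives no implementation details, so the following is only a minimal sketch of one way such matching could work: strip the Arabic diacritics from vocalized gazetteer entries, index them by their bare form, and look up an unvocalized NE in that index. The function names and the two sample gazetteer entries are hypothetical and are not taken from the paper.

```python
import re

# Arabic tanwin, short vowels, shadda and sukun (U+064B to U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Return the text with Arabic vowel marks removed."""
    return DIACRITICS.sub("", text)


def build_index(vocalized_names):
    """Map each bare (undiacritized) form to its vocalized variants."""
    index = {}
    for name in vocalized_names:
        index.setdefault(strip_diacritics(name), []).append(name)
    return index


def restore(name: str, index):
    """Return the vocalized form if exactly one candidate matches, else None."""
    candidates = index.get(strip_diacritics(name), [])
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical gazetteer sample: vocalized spellings of "Misr" and "Lubnan".
SAMPLE_GAZETTEER = [
    "\u0645\u0650\u0635\u0652\u0631",
    "\u0644\u064F\u0628\u0652\u0646\u064E\u0627\u0646",
]
SAMPLE_INDEX = build_index(SAMPLE_GAZETTEER)
```

Returning `None` when several vocalized forms share one bare form is a deliberate simplification: in that case a matching step alone cannot decide, and a morphological tool or further heuristics would be needed.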


