CNR Institutional Research Information System

An Automatically Built Named Entity Lexicon for Arabic

Attia M;Toral A;Tounsi L;Monachini M;Van Genabith J

2010

Abstract

We have successfully adapted and extended the automatic Multilingual, Interoperable Named Entity Lexicon approach to Arabic, using Arabic WordNet (AWN) and Arabic Wikipedia (AWK). First, we extract AWN's instantiable nouns and identify the corresponding categories and hyponym subcategories in AWK. Then, we exploit Wikipedia inter-lingual links to locate correspondences between articles in ten different languages in order to identify Named Entities (NEs). We apply keyword search on AWK abstracts to provide for Arabic articles that do not have a correspondence in any of the other languages. In addition, we perform a post-processing step to fetch further NEs from AWK not reachable through AWN. Finally, we investigate diacritization using matching with geonames databases, MADA-TOKAN tools and different heuristics for restoring vowel marks of Arabic NEs. Using this methodology, we have extracted approximately 45,000 Arabic NEs and built, to the best of our knowledge, the largest Named Entity repository available.
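
The abstract names two identification steps (inter-lingual links, then keyword search on abstracts as a fallback) but does not give the decision rule. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes that NE status is voted on by how often an article's title is capitalised in running text in the linked languages, and that unlinked Arabic articles are accepted when their abstract contains a trigger keyword. The function names, the 0.75 threshold and the keyword list are all hypothetical.

```python
from typing import Dict, List

# Hypothetical values for illustration; neither comes from the paper.
CAPITALISED_SHARE = 0.75
FALLBACK_KEYWORDS = ("مدينة", "عاصمة")  # "city", "capital"


def capitalised_share(mentions: List[str]) -> float:
    """Fraction of in-text mentions that start with an uppercase letter."""
    if not mentions:
        return 0.0
    return sum(m[:1].isupper() for m in mentions) / len(mentions)


def looks_like_named_entity(
    mentions_by_lang: Dict[str, List[str]], arabic_abstract: str
) -> bool:
    """Majority vote over linked languages; keyword fallback when no links exist."""
    if mentions_by_lang:
        votes = [
            capitalised_share(mentions) >= CAPITALISED_SHARE
            for mentions in mentions_by_lang.values()
        ]
        return sum(votes) > len(votes) / 2
    # Assumed fallback: the Arabic article has no inter-lingual correspondence.
    return any(keyword in arabic_abstract for keyword in FALLBACK_KEYWORDS)
```

For example, `looks_like_named_entity({"en": ["Cairo", "Cairo"]}, "")` accepts the article through the vote, while an article with no linked languages is judged only by its Arabic abstract.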

DC field	Value	Language
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	-
dc.authority.people	Attia M	it
dc.authority.people	Toral A	it
dc.authority.people	Tounsi L	it
dc.authority.people	Monachini M	it
dc.authority.people	Van Genabith J	it
dc.collection.id.s	71c7200a-7c5f-4e83-8d57-d3d2ba88f40d	*
dc.collection.name	04.01 Contributo in Atti di convegno	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.date.accessioned	2024/02/19 20:04:04	-
dc.date.available	2024/02/19 20:04:04	-
dc.date.issued	2010	-
dc.description.abstract	We have successfully adapted and extended the automatic Multilingual, Interoperable Named Entity Lexicon approach to Arabic, using Arabic WordNet (AWN) and Arabic Wikipedia (AWK). First, we extract AWN's instantiable nouns and identify the corresponding categories and hyponym subcategories in AWK. Then, we exploit Wikipedia inter-lingual links to locate correspondences between articles in ten different languages in order to identify Named Entities (NEs). We apply keyword search on AWK abstracts to provide for Arabic articles that do not have a correspondence in any of the other languages. In addition, we perform a post-processing step to fetch further NEs from AWK not reachable through AWN. Finally, we investigate diacritization using matching with geonames databases, MADA-TOKAN tools and different heuristics for restoring vowel marks of Arabic NEs. Using this methodology, we have extracted approximately 45,000 Arabic NEs and built, to the best of our knowledge, the largest Named Entity repository available.	-
dc.description.affiliations	NCLT, School of Computing, Dublin City University, Ireland, ILC-CNR, Pisa	-
dc.description.allpeople	Attia, M; Toral, A; Tounsi, L; Monachini, M; Van Genabith, J	-
dc.description.allpeopleoriginal	Attia M.; Toral A.; Tounsi L.; Monachini M.; Van Genabith J.	-
dc.description.fulltext	none	en
dc.description.numberofauthors	5	-
dc.identifier.isbn	2-9517408-6-7	-
dc.identifier.uri	https://hdl.handle.net/20.500.14243/65154	-
dc.relation.conferencename	Seventh International Conference on Language Resources and Evaluation	-
dc.relation.conferenceplace	Valletta, Malta	-
dc.subject.keywords	Acquisition	-
dc.subject.keywords	Lexicon	-
dc.subject.keywords	database	-
dc.subject.keywords	Named Entity recognition	-
dc.subject.singlekeyword	Acquisition	*
dc.subject.singlekeyword	Lexicon	*
dc.subject.singlekeyword	database	*
dc.subject.singlekeyword	Named Entity recognition	*
dc.title	An Automatically Built Named Entity Lexicon for Arabic	en
dc.type.driver	info:eu-repo/semantics/conferenceObject	-
dc.type.full	04 Contributo in convegno::04.01 Contributo in Atti di convegno	it
dc.type.miur	273	-
dc.type.referee	Yes, but type not specified	-
dc.ugov.descaux1	84787	-
iris.orcid.lastModifiedDate	2024/04/04 11:56:55	*
iris.orcid.lastModifiedMillisecond	1712224615861	*
iris.sitodocente.maxattempts	1	-
Appears in types:	04.01 Contributo in Atti di convegno

Files in this record:

There are no files associated with this record.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/65154
