Cultural Heritage: Knowledge Extraction from Web Documents

Sassolini E; Cinini A
2010

Abstract

This article presents the use of NLP techniques (text mining, text analysis) to develop tools for creating linguistic resources related to the cultural heritage domain. The aim of our approach is to create tools for building an online "knowledge network", automatically extracted from text materials concerning this domain. We tested a methodology that divides the automatic acquisition of texts, and consequently the creation of the reference corpus, into two phases. In the first phase, online documents were extracted from lists of links provided by human experts. All documents extracted from the web by an automatic spider were stored in a repository of text materials. From these documents, automatic parsers create the reference corpus for the cultural heritage domain. Relevant information and semantic concepts are then extracted from this corpus. In the second phase, all these semantically relevant elements (such as proper names, names of institutions, names of places, and other relevant terms) were used as the basis for a new search strategy for text materials from heterogeneous sources. In this phase, specialized crawlers (TP-crawler) were also used to work on the large body of text materials available online.
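The two-phase acquisition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the in-memory `SEED_PAGES` dictionary stands in for documents fetched by the first-phase spider, the capitalized-word regular expression is a crude substitute for real named entity recognition, and the function names (`build_corpus`, `extract_terms`, `build_queries`) are hypothetical. The TP-crawler itself is not modeled; the sketch stops at the search queries that a second-phase crawler might consume.

```python
import re
from collections import Counter

# Hypothetical stand-in for pages collected from expert-provided links (phase 1).
SEED_PAGES = {
    "http://example.org/a": "Works by Sandro Botticelli hang in the Uffizi Gallery in Florence.",
    "http://example.org/b": "Sandro Botticelli painted in Florence; the Uffizi Gallery preserves his paintings.",
}

# Runs of capitalized words: a rough proxy for proper names, not real NER.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def build_corpus(pages):
    """Phase 1: split the stored documents into plain-text segments."""
    corpus = []
    for text in pages.values():
        corpus.extend(s.strip() for s in re.split(r"[.;]", text) if s.strip())
    return corpus


def extract_terms(corpus, min_count=2):
    """Keep capitalized runs that occur at least min_count times in the corpus."""
    counts = Counter()
    for segment in corpus:
        counts.update(CAPITALIZED_RUN.findall(segment))
    return sorted(term for term, n in counts.items() if n >= min_count)


def build_queries(terms):
    """Phase 2: turn extracted terms into exact-phrase search queries."""
    return [f'"{term}"' for term in terms]


if __name__ == "__main__":
    corpus = build_corpus(SEED_PAGES)
    for query in build_queries(extract_terms(corpus)):
        print(query)
```

The frequency threshold is one simple way to discard sentence-initial words that are capitalized only by position; a real system would rely on a proper NER component and domain term extraction instead.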
DC Field Value Language
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC -
dc.authority.people Sassolini E it
dc.authority.people Cinini A it
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Contribution in conference proceedings *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.date.accessioned 2024/02/19 19:58:50 -
dc.date.available 2024/02/19 19:58:50 -
dc.date.issued 2010 -
dc.description.affiliations ILC-CNR, Pisa -
dc.description.allpeople Sassolini, E; Cinini, A -
dc.description.allpeopleoriginal Sassolini E.; Cinini A. -
dc.description.fulltext none en
dc.description.note Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10). Electronic only. WOS 000356879508023 -
dc.description.numberofauthors 2 -
dc.identifier.isbn 978-2-9517408-6-0 -
dc.identifier.uri https://hdl.handle.net/20.500.14243/65138 -
dc.language.iso eng -
dc.relation.conferencedate 17-23/05/2010 -
dc.relation.conferencename Seventh International Conference on Language Resources and Evaluation -
dc.relation.conferenceplace Valletta, Malta -
dc.relation.firstpage 3363 -
dc.relation.lastpage 3368 -
dc.relation.numberofpages 6 -
dc.subject.keywords Information Extraction -
dc.subject.keywords Information Retrieval -
dc.subject.keywords Text mining -
dc.subject.keywords Named Entity recognition -
dc.title Cultural Heritage: Knowledge Extraction from Web Documents en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Conference contribution::04.01 Contribution in conference proceedings it
dc.type.miur 273 -
dc.type.referee Yes, but type not specified -
dc.ugov.descaux1 84768 -
Appears in types: 04.01 Contribution in conference proceedings
Files in this product:
There are no files associated with this product.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/65138