Automatic Creation of Quality Multi-Word Lexica from Noisy Text Data

Francesca Frontini; Valeria Quochi; Francesco Rubino
2012

Abstract

This paper describes the design of a tool for the automatic creation of multi-word lexica, deployed as a web service and run on automatically web-crawled data within the PANACEA platform. The main goal is to provide a computationally "light" tool that creates a complete, high-quality lexical resource of multi-word items. Within the platform, the tool is typically inserted in a workflow whose first step is automatic web crawling, so the input to the lexical extractor is intrinsically noisy. The paper evaluates how well the tool copes with noisy data, in particular with texts containing a significant number of duplicated paragraphs: the accuracy of multi-word expression extraction from the original crawled corpus is compared with the accuracy obtained on a later "de-duplicated" version of the same corpus. The results show that the method extracts multi-word expressions with sufficiently good precision even from the original, noisy crawled data. The output of the tool is a multi-word lexicon formatted and encoded in XML according to the Lexical Markup Framework (LMF).
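To make the evaluation setup concrete, the following is a minimal Python sketch of why duplicated paragraphs matter for frequency-based multi-word extraction. It is an illustration only: the abstract does not describe the extraction algorithm, so the plain bigram counting, the exact-match de-duplication, and all function names (`paragraphs`, `deduplicate`, `bigram_counts`, `precision`) are assumptions made here, not the authors' method or the PANACEA API.

```python
import re
from collections import Counter


def paragraphs(text):
    """Split raw text into non-empty paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def deduplicate(paras):
    """Keep the first occurrence of each paragraph.

    Paragraphs are compared after lowercasing and collapsing whitespace
    (an assumed, deliberately simple notion of "duplicate").
    """
    seen = set()
    unique = []
    for p in paras:
        key = " ".join(p.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique


def bigram_counts(paras):
    """Count adjacent word pairs inside each paragraph (toy candidate list)."""
    counts = Counter()
    for p in paras:
        tokens = re.findall(r"[a-z]+", p.lower())
        counts.update(zip(tokens, tokens[1:]))
    return counts


def precision(candidates, gold):
    """Fraction of extracted candidates that appear in a gold set."""
    if not candidates:
        return 0.0
    return sum(1 for c in candidates if c in gold) / len(candidates)


# Hypothetical crawled text: one boilerplate paragraph is repeated three times.
crawled = (
    "Machine translation is hard.\n\n"
    "Cookie policy applies here.\n\n"
    "Cookie policy applies here.\n\n"
    "Cookie policy applies here.\n\n"
    "Statistical machine translation works."
)

paras = paragraphs(crawled)
unique = deduplicate(paras)
noisy = bigram_counts(paras)
clean = bigram_counts(unique)
```

In this toy corpus the repeated boilerplate paragraph pushes the bigram `("cookie", "policy")` above `("machine", "translation")` in the noisy counts, while de-duplication restores the expected ranking. Comparing `precision` over the top-ranked candidates from `noisy` and from `clean` against a hand-checked gold set mirrors, in miniature, the comparison between the original and the de-duplicated corpus described above.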
DC Field Value Language
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC -
dc.authority.people Francesca Frontini it
dc.authority.people Valeria Quochi it
dc.authority.people Francesco Rubino it
dc.authority.project Platform for Automatic, Normalized Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies -
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Contribution in conference proceedings *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.date.accessioned 2024/02/16 15:54:38 -
dc.date.available 2024/02/16 15:54:38 -
dc.date.issued 2012 -
dc.description.affiliations CNR-ILC, Pisa -
dc.description.allpeople Frontini, Francesca; Quochi, Valeria; Rubino, Francesco -
dc.description.allpeopleoriginal Francesca Frontini, Valeria Quochi, Francesco Rubino -
dc.description.fulltext none en
dc.description.note ID_PUMA: /cnr.ilc/2012-A3-008 -
dc.description.numberofauthors 3 -
dc.identifier.isbn 978-1-4503-1919-5 -
dc.identifier.uri https://hdl.handle.net/20.500.14243/128272 -
dc.identifier.url http://www.kde.cs.tut.ac.jp/~aono/pdf/COLING2012/AND/pdf/AND04.pdf -
dc.language.iso eng -
dc.publisher.country USA -
dc.publisher.name ACM, Association for computing machinery -
dc.publisher.place New York -
dc.relation.conferencedate December 9, 2012 -
dc.relation.conferencename AND 2012 -
dc.relation.conferenceplace Mumbai, India -
dc.relation.ispartofbook Proceedings of the Sixth Workshop on Analytics for Noisy Unstructured Text Data -
dc.relation.projectAcronym PANACEA -
dc.relation.projectAwardNumber 248064 -
dc.relation.projectAwardTitle Platform for Automatic, Normalized Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies -
dc.relation.projectFunderName - en
dc.relation.projectFundingStream FP7 -
dc.subject.keywords Lexical induction -
dc.subject.keywords multi-word extraction -
dc.subject.keywords web-based distributed platform -
dc.subject.keywords noisy data -
dc.subject.singlekeyword Lexical induction *
dc.subject.singlekeyword multi-word extraction *
dc.subject.singlekeyword web-based distributed platform *
dc.subject.singlekeyword noisy data *
dc.title Automatic Creation of Quality Multi-Word Lexica from Noisy Text Data en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Conference contribution::04.01 Contribution in conference proceedings it
dc.type.miur 273 -
dc.type.referee Yes, but type not specified -
dc.ugov.descaux1 220785 -
iris.orcid.lastModifiedDate 2024/04/05 00:03:33 *
iris.orcid.lastModifiedMillisecond 1712268213934 *
iris.sitodocente.maxattempts 1 -