Automatic Creation of Quality Multi-Word Lexica from Noisy Text Data
Francesca Frontini; Valeria Quochi; Francesco Rubino
2012
Abstract
This paper describes the design of a tool for the automatic creation of multi-word lexica. The tool is deployed as a web service and runs on automatically web-crawled data within the PANACEA platform. Its main purpose is to provide a computationally "light" way to create a complete, high-quality lexical resource of multi-word items. Within the platform, the tool is typically inserted in a workflow whose first step is automatic web crawling, so the input data of the lexical extractor is intrinsically noisy. The paper evaluates the capacity of the tool to deal with noisy data, in particular texts containing a significant amount of duplicated paragraphs. The accuracy of multi-word expression extraction from the original crawled corpus is compared with the accuracy of extraction from a later "de-duplicated" version of the corpus. The results show that the method extracts with sufficiently good precision even from the original, noisy crawled data. The output of the tool is a multi-word lexicon formatted and encoded in XML according to the Lexical Markup Framework.

| DC field | Value | Language |
|---|---|---|
| dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | - |
| dc.authority.people | Francesca Frontini | it |
| dc.authority.people | Valeria Quochi | it |
| dc.authority.people | Francesco Rubino | it |
| dc.authority.project | Platform for Automatic, Normalized Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies | - |
| dc.collection.id.s | 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d | * |
| dc.collection.name | 04.01 Contributo in Atti di convegno | * |
| dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | * |
| dc.contributor.appartenenza.mi | 918 | * |
| dc.date.accessioned | 2024/02/16 15:54:38 | - |
| dc.date.available | 2024/02/16 15:54:38 | - |
| dc.date.issued | 2012 | - |
| dc.description.abstracteng | This paper describes the design of a tool for the automatic creation of multi-word lexica that is deployed as a web service and runs on automatically web-crawled data within the framework of the PANACEA platform. The main purpose of our task is to provide a (computationally "light") tool that creates a full high quality lexical resource of multi-word items. Within the platform, this tool is typically inserted in a work flow whose first step is automatic web-crawling. Therefore, the input data of our lexical extractor is intrinsically noisy. The paper evaluates the capacity of the tool to deal with noisy data, and in particular with texts containing a significant amount of duplicated paragraphs. The accuracy of the extraction of multi-word expressions from the original crawled corpus is compared to the accuracy of the extraction from a later "de-duplicated" version of the corpus. The paper shows how our method can extract with sufficiently good precision also from the original, noisy crawled data. The output of our tool is a multi-word lexicon formatted and encoded in XML according to the Lexical Mark-up Framework. | - |
| dc.description.affiliations | CNR-ILC, Pisa | - |
| dc.description.allpeople | Frontini, Francesca; Quochi, Valeria; Rubino, Francesco | - |
| dc.description.allpeopleoriginal | Francesca Frontini, Valeria Quochi, Francesco Rubino | - |
| dc.description.fulltext | none | en |
| dc.description.note | ID_PUMA: /cnr.ilc/2012-A3-008 | - |
| dc.description.numberofauthors | 3 | - |
| dc.identifier.isbn | 978-1-4503-1919-5 | - |
| dc.identifier.uri | https://hdl.handle.net/20.500.14243/128272 | - |
| dc.identifier.url | http://www.kde.cs.tut.ac.jp/~aono/pdf/COLING2012/AND/pdf/AND04.pdf | - |
| dc.language.iso | eng | - |
| dc.publisher.country | USA | - |
| dc.publisher.name | ACM, Association for computing machinery | - |
| dc.publisher.place | New York | - |
| dc.relation.conferencedate | December 9, 2012 | - |
| dc.relation.conferencename | AND 2012 | - |
| dc.relation.conferenceplace | Mumbai, India | - |
| dc.relation.ispartofbook | Proceedings of the Sixth Workshop on Analytics for Noisy Unstructured Text Data | - |
| dc.relation.projectAcronym | PANACEA | - |
| dc.relation.projectAwardNumber | 248064 | - |
| dc.relation.projectAwardTitle | Platform for Automatic, Normalized Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies | - |
| dc.relation.projectFunderName | - | en |
| dc.relation.projectFundingStream | FP7 | - |
| dc.subject.keywords | Lexical induction | - |
| dc.subject.keywords | multi-word extraction | - |
| dc.subject.keywords | web-based distributed platform | - |
| dc.subject.keywords | noisy data | - |
| dc.subject.singlekeyword | Lexical induction | * |
| dc.subject.singlekeyword | multi-word extraction | * |
| dc.subject.singlekeyword | web-based distributed platform | * |
| dc.subject.singlekeyword | noisy data | * |
| dc.title | Automatic Creation of Quality Multi-Word Lexica from Noisy Text Data | en |
| dc.type.driver | info:eu-repo/semantics/conferenceObject | - |
| dc.type.full | 04 Contributo in convegno::04.01 Contributo in Atti di convegno | it |
| dc.type.miur | 273 | - |
| dc.type.referee | Yes, but type not specified | - |
| dc.ugov.descaux1 | 220785 | - |
| iris.orcid.lastModifiedDate | 2024/04/05 00:03:33 | * |
| iris.orcid.lastModifiedMillisecond | 1712268213934 | * |
| iris.sitodocente.maxattempts | 1 | - |
| Appears in categories: | 04.01 Contributo in Atti di convegno | |


