CNR Institutional Research Information System

Automatic Creation of Quality Multi-Word Lexica from Noisy Text Data

Francesca Frontini;Valeria Quochi;Francesco Rubino

2012

Abstract

This paper describes the design of a tool for the automatic creation of multi-word lexica. The tool is deployed as a web service and runs on automatically web-crawled data within the PANACEA platform. Its main purpose is to provide a computationally "light" means of creating a complete, high-quality lexical resource of multi-word items. Within the platform, the tool is typically inserted in a workflow whose first step is automatic web crawling, so its input data are intrinsically noisy. The paper evaluates the tool's capacity to deal with noisy data, in particular texts containing a significant number of duplicated paragraphs. The accuracy of multi-word expression extraction from the original crawled corpus is compared with the accuracy of extraction from a later "de-duplicated" version of the corpus. The results show that the method extracts multi-word expressions with sufficiently good precision even from the original, noisy crawled data. The output of the tool is a multi-word lexicon encoded in XML according to the Lexical Markup Framework (LMF).

Full record (DC)

DC field	Value	Language
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	-
dc.authority.people	Francesca Frontini	it
dc.authority.people	Valeria Quochi	it
dc.authority.people	Francesco Rubino	it
dc.authority.project	Platform for Automatic, Normalized Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies	-
dc.collection.id.s	71c7200a-7c5f-4e83-8d57-d3d2ba88f40d	*
dc.collection.name	04.01 Contributo in Atti di convegno	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.date.accessioned	2024/02/16 15:54:38	-
dc.date.available	2024/02/16 15:54:38	-
dc.date.issued	2012	-
dc.description.abstracteng	This paper describes the design of a tool for the automatic creation of multi-word lexica. The tool is deployed as a web service and runs on automatically web-crawled data within the PANACEA platform. Its main purpose is to provide a computationally "light" means of creating a complete, high-quality lexical resource of multi-word items. Within the platform, the tool is typically inserted in a workflow whose first step is automatic web crawling, so its input data are intrinsically noisy. The paper evaluates the tool's capacity to deal with noisy data, in particular texts containing a significant number of duplicated paragraphs. The accuracy of multi-word expression extraction from the original crawled corpus is compared with the accuracy of extraction from a later "de-duplicated" version of the corpus. The results show that the method extracts multi-word expressions with sufficiently good precision even from the original, noisy crawled data. The output of the tool is a multi-word lexicon encoded in XML according to the Lexical Markup Framework (LMF).	-
dc.description.affiliations	CNR-ILC, Pisa	-
dc.description.allpeople	Frontini, Francesca; Quochi, Valeria; Rubino, Francesco	-
dc.description.allpeopleoriginal	Francesca Frontini, Valeria Quochi, Francesco Rubino	-
dc.description.fulltext	none	en
dc.description.note	ID_PUMA: /cnr.ilc/2012-A3-008	-
dc.description.numberofauthors	3	-
dc.identifier.isbn	978-1-4503-1919-5	-
dc.identifier.uri	https://hdl.handle.net/20.500.14243/128272	-
dc.identifier.url	http://www.kde.cs.tut.ac.jp/~aono/pdf/COLING2012/AND/pdf/AND04.pdf	-
dc.language.iso	eng	-
dc.publisher.country	USA	-
dc.publisher.name	ACM, Association for computing machinery	-
dc.publisher.place	New York	-
dc.relation.conferencedate	December 9, 2012	-
dc.relation.conferencename	AND 2012	-
dc.relation.conferenceplace	Mumbai, India	-
dc.relation.ispartofbook	Proceedings of the Sixth Workshop on Analytics for Noisy Unstructured Text Data	-
dc.relation.projectAcronym	PANACEA	-
dc.relation.projectAwardNumber	248064	-
dc.relation.projectAwardTitle	Platform for Automatic, Normalized Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies	-
dc.relation.projectFunderName	-	en
dc.relation.projectFundingStream	FP7	-
dc.subject.keywords	Lexical induction	-
dc.subject.keywords	multi-word extraction	-
dc.subject.keywords	web-based distributed platform	-
dc.subject.keywords	noisy data	-
dc.subject.singlekeyword	Lexical induction	*
dc.subject.singlekeyword	multi-word extraction	*
dc.subject.singlekeyword	web-based distributed platform	*
dc.subject.singlekeyword	noisy data	*
dc.title	Automatic Creation of Quality Multi-Word Lexica from Noisy Text Data	en
dc.type.driver	info:eu-repo/semantics/conferenceObject	-
dc.type.full	04 Contributo in convegno::04.01 Contributo in Atti di convegno	it
dc.type.miur	273	-
dc.type.referee	Yes, but type not specified	-
dc.ugov.descaux1	220785	-
iris.orcid.lastModifiedDate	2024/04/05 00:03:33	*
iris.orcid.lastModifiedMillisecond	1712268213934	*
iris.sitodocente.maxattempts	1	-
Appears in types:	04.01 Contributo in Atti di convegno

Files in this item:

There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/128272
