<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="static/CINECAstyle.xsl"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2026-05-10T23:23:57Z</responseDate><request verb="GetRecord" identifier="oai:iris.cnr.it:20.500.14243/128272" metadataPrefix="oai_dc">https://iris.cnr.it/oai/request</request><GetRecord><record><header><identifier>oai:iris.cnr.it:20.500.14243/128272</identifier><datestamp>2024-06-08T13:40:34Z</datestamp><setSpec>com_20.500.14243_46</setSpec><setSpec>com_20.500.14243_21</setSpec><setSpec>col_20.500.14243_47</setSpec><setSpec>ou_ou239</setSpec></header><metadata><oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Automatic Creation of Quality Multi-Word Lexica from Noisy Text Data</dc:title>
<dc:creator>Francesca Frontini</dc:creator>
<dc:creator>Valeria Quochi</dc:creator>
<dc:creator>Francesco Rubino</dc:creator>
<dc:contributor>Frontini, Francesca</dc:contributor>
<dc:contributor>Quochi, Valeria</dc:contributor>
<dc:contributor>Rubino, Francesco</dc:contributor>
<dc:subject>Lexical induction</dc:subject>
<dc:subject>multi-word extraction</dc:subject>
<dc:subject>web-based distributed platform</dc:subject>
<dc:subject>noisy data</dc:subject>
<dc:description>This paper describes the design of a tool for the automatic creation of multi-word lexica that is deployed as a web service and runs on automatically web-crawled data within the framework of the PANACEA platform. The main purpose of our task is to provide a computationally "light" tool that creates a full, high-quality lexical resource of multi-word items. Within the platform, this tool is typically inserted in a workflow whose first step is automatic web crawling; the input data of our lexical extractor is therefore intrinsically noisy. The paper evaluates the capacity of the tool to deal with noisy data, and in particular with texts containing a significant amount of duplicated paragraphs. The accuracy of the extraction of multi-word expressions from the original crawled corpus is compared to the accuracy of the extraction from a later "de-duplicated" version of the corpus. The paper shows that our method can extract multi-word expressions with sufficiently good precision even from the original, noisy crawled data. The output of our tool is a multi-word lexicon formatted and encoded in XML according to the Lexical Markup Framework.</dc:description>
<dc:date>2012</dc:date>
<dc:type>info:eu-repo/semantics/conferenceObject</dc:type>
<dc:identifier>https://hdl.handle.net/20.500.14243/128272</dc:identifier>
<dc:relation>info:eu-repo/semantics/altIdentifier/isbn/978-1-4503-1919-5</dc:relation>
<dc:identifier>http://www.kde.cs.tut.ac.jp/~aono/pdf/COLING2012/AND/pdf/AND04.pdf</dc:identifier>
<dc:language>eng</dc:language>
<dc:relation>ispartofbook:Proceedings of the Sixth Workshop on Analytics for Noisy Unstructured Text Data</dc:relation>
<dc:relation>AND 2012</dc:relation>
<dc:relation>info:eu-repo/grantAgreement/EC/FP7/Platform for Automatic, Normalized Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies</dc:relation>
<dc:publisher>ACM, Association for Computing Machinery</dc:publisher>
<dc:publisher>country:USA</dc:publisher>
<dc:publisher>place:New York</dc:publisher>
</oai_dc:dc></metadata></record></GetRecord></OAI-PMH>