D4.5 Final Report on the Corpus Acquisition & Annotation subsystem and its components

Frontini Francesca
2012

Abstract

PANACEA WP4 targets the creation of a Corpus Acquisition and Annotation (CAA) subsystem for the acquisition and processing of monolingual and bilingual language resources (LRs). The CAA subsystem consists of tools that have been integrated as web services in the PANACEA platform for LR production. D4.2, "Initial functional prototype and documentation" (T13), and D4.4, "Report on the revised Corpus Acquisition & Annotation subsystem and its components" (T23), provided initial and updated documentation on this subsystem; this deliverable presents the final documentation of the subsystem as it evolved after the third development cycle of the project. The deliverable is structured as follows. Section 2 describes the Corpus Acquisition Component, i.e. the Focused Monolingual and Bilingual Crawlers (FMC/FBC). Section 3 details the final list of tools for corpus normalization (cleaning and de-duplication). Section 4 documents all NLP tools included in the subsystem. By its nature, this deliverable aggregates considerable parts of all previous WP4 deliverables. The main new additions include a) new functionalities for, among others, crawling strategy, de-duplication, and detection of parallel document pairs; and b) new NLP tools for syntactic analysis, named entity recognition, tweet processing and anonymization.
DC Field Value Language
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC -
dc.authority.people Prokopidis Prokopis it
dc.authority.people Papavassiliou Vassilis it
dc.authority.people Toral Antonio it
dc.authority.people Poch Riera Marc it
dc.authority.people Frontini Francesca it
dc.authority.people Rubino Francesco it
dc.authority.people Thurmair Gregor it
dc.collection.id.s f3ccd2f0-452a-4e09-bfb4-66369d480d48 *
dc.collection.name 08.02 Rapporto di ricerca, Relazione scientifica *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.date.accessioned 2024/02/16 17:59:20 -
dc.date.available 2024/02/16 17:59:20 -
dc.date.issued 2012 -
dc.description.affiliations [1] ILSP "Athena" R.C., Greece; [2] Dublin City University, Ireland; [3] CNR-ILC, Pisa -
dc.description.allpeople Prokopidis, Prokopis; Papavassiliou, Vassilis; Toral, Antonio; Poch Riera, Marc; Frontini, Francesca; Rubino, Francesco; Thurmair, Gregor -
dc.description.allpeopleoriginal Prokopidis, Prokopis [1]; Papavassiliou, Vassilis [1]; Toral, Antonio [2]; Poch Riera, Marc; Frontini, Francesca [3]; Rubino, Francesco [3]; Thurmair, Gregor -
dc.description.fulltext none en
dc.description.note ID_PUMA: /cnr.ilc/2012-EC-002 -
dc.description.numberofauthors 7 -
dc.identifier.uri https://hdl.handle.net/20.500.14243/129408 -
dc.identifier.url http://www.jotform.com/uploads/fabioaffeilc/30222975566357/225350067351490116/PANACEA -
dc.language.iso eng -
dc.subject.keywords Corpus Acquisition -
dc.subject.singlekeyword Corpus Acquisition *
dc.title D4.5 Final Report on the Corpus Acquisition & Annotation subsystem and its components en
dc.type.driver info:eu-repo/semantics/other -
dc.type.full 08 Report e Working Paper::08.02 Rapporto di ricerca, Relazione scientifica it
dc.type.miur -2.0 -
dc.ugov.descaux1 221582 -
iris.orcid.lastModifiedDate 2024/04/04 10:00:18 *
iris.orcid.lastModifiedMillisecond 1712217618441 *
iris.sitodocente.maxattempts 1 -
Appears in types: 08.02 Rapporto di ricerca, Relazione scientifica
Files in this product:
There are no files associated with this product.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/129408