Making Italian Parliamentary Records Machine-Actionable: the Construction of the ParlaMint-IT corpus

Tommaso Agnoloni; Roberto Bartolini; Francesca Frontini; Simonetta Montemagni; Valeria Quochi; Giulia Venturi
2022

Abstract

This paper describes the acquisition, cleaning, interpretation, encoding and linguistic annotation of a collection of parliamentary debates from the Senate of the Italian Republic, carried out according to the CLARIN ParlaMint prescriptions. The collection covers the COVID-19 pandemic emergency period and an earlier period included for reference and comparison. The corpus contains 1199 sessions and 79,373 speeches, for a total of about 31 million words, and is encoded in the ParlaCLARIN TEI XML format. It includes extensive metadata about speakers, sessions, political parties and parliamentary groups. As required by the ParlaMint initiative, the corpus is also linguistically annotated for sentences, tokens, POS tags, lemmas and dependency syntax according to the Universal Dependencies guidelines. Named entity annotation and classification are also included. All linguistic annotation was performed automatically using state-of-the-art NLP technology, with no manual revision. The Italian dataset is freely available as part of the larger ParlaMint 2.1 corpus, deposited and archived in the CLARIN repository together with all the other national corpora. It is also available for direct analysis and inspection via various CLARIN services and has already been used for both research and educational purposes.
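The corpus figures quoted in the abstract (sessions, speeches, words) are the kind of counts that can be derived directly from the TEI encoding. The following is a minimal sketch, assuming utterances are encoded as `u` elements with a `who` attribute and tokens as `w` elements inside `s` sentences in the TEI namespace. The sample fragment, the speaker identifiers and the function name `corpus_counts` are hypothetical and written for illustration only; they are not taken from the ParlaMint-IT files or from the authors' pipeline.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace used for all element lookups.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical hand-written fragment in a Parla-CLARIN-like style.
# It is NOT copied from the ParlaMint-IT corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div type="debateSection">
    <u who="#SpeakerA" xml:id="u1">
      <seg xml:id="u1.seg1">
        <s xml:id="u1.seg1.s1">
          <w lemma="il" msd="UPosTag=DET">La</w>
          <w lemma="seduta" msd="UPosTag=NOUN">seduta</w>
          <w lemma="essere" msd="UPosTag=AUX">è</w>
          <w lemma="aperto" msd="UPosTag=ADJ">aperta</w>
          <pc msd="UPosTag=PUNCT">.</pc>
        </s>
      </seg>
    </u>
    <u who="#SpeakerB" xml:id="u2">
      <seg xml:id="u2.seg1">
        <s xml:id="u2.seg1.s1">
          <w lemma="grazie" msd="UPosTag=INTJ">Grazie</w>
          <pc msd="UPosTag=PUNCT">,</pc>
          <w lemma="presidente" msd="UPosTag=NOUN">Presidente</w>
          <pc msd="UPosTag=PUNCT">.</pc>
        </s>
      </seg>
    </u>
  </div></body></text>
</TEI>"""


def corpus_counts(tei_xml: str) -> dict:
    """Count speeches, sentences and words in one TEI document string."""
    root = ET.fromstring(tei_xml)
    speeches = root.findall(".//tei:u", NS)
    sentences = root.findall(".//tei:s", NS)
    words = root.findall(".//tei:w", NS)  # punctuation (pc) is not counted

    # Words per speaker, keyed by the value of the 'who' attribute.
    per_speaker = Counter()
    for u in speeches:
        per_speaker[u.get("who")] += len(u.findall(".//tei:w", NS))

    return {
        "speeches": len(speeches),
        "sentences": len(sentences),
        "words": len(words),
        "words_per_speaker": dict(per_speaker),
    }


if __name__ == "__main__":
    print(corpus_counts(SAMPLE))
```

Counting only `w` elements and skipping `pc` is a design choice of this sketch; whether the published word total excludes punctuation is not stated in the abstract.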
DC field Value Language
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.orgunit Istituto di Informatica Giuridica e Sistemi Giudiziari - IGSG en
dc.authority.people Tommaso Agnoloni en
dc.authority.people Roberto Bartolini en
dc.authority.people Francesca Frontini en
dc.authority.people Simonetta Montemagni en
dc.authority.people Carlo Marchetti en
dc.authority.people Valeria Quochi en
dc.authority.people Manuela Ruisi en
dc.authority.people Giulia Venturi en
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Contributo in Atti di convegno *
dc.contributor.appartenenza Istituto di Informatica Giuridica e Sistemi Giudiziari - IGSG *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.appartenenza.mi 1108 *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.date.accessioned 2024/02/19 12:54:54 -
dc.date.available 2024/02/19 12:54:54 -
dc.date.firstsubmission 2024/12/19 16:56:36 *
dc.date.issued 2022 -
dc.date.submission 2025/02/24 19:07:24 *
dc.description.affiliations CNR-IGSG, Firenze Italy; CNR-ILC, Pisa Italy; Senato della Repubblica, Roma Italy -
dc.description.allpeople Agnoloni, Tommaso; Bartolini, Roberto; Frontini, Francesca; Montemagni, Simonetta; Marchetti, Carlo; Quochi, Valeria; Ruisi, Manuela; Venturi, Giulia -
dc.description.allpeopleoriginal Tommaso Agnoloni, Roberto Bartolini, Francesca Frontini, Simonetta Montemagni, Carlo Marchetti, Valeria Quochi, Manuela Ruisi, Giulia Venturi en
dc.description.fulltext open en
dc.description.numberofauthors 8 -
dc.identifier.isbn 979-10-95546-85-6 en
dc.identifier.scopus 2-s2.0-85145875643 -
dc.identifier.uri https://hdl.handle.net/20.500.14243/446358 -
dc.identifier.url https://aclanthology.org/2022.parlaclarin-1.17/ en
dc.language.iso eng en
dc.miur.last.status.update 2025-05-21T17:13:01Z *
dc.publisher.country FRA en
dc.publisher.name European Language Resources Association ELRA en
dc.publisher.place Paris en
dc.relation.conferencedate 20/06/2022 en
dc.relation.conferencename Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference en
dc.relation.conferenceplace Marseille, France en
dc.relation.firstpage 117 en
dc.relation.ispartofbook Proceedings of The Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference en
dc.relation.lastpage 124 en
dc.relation.numberofpages 8 en
dc.subject.keywords parliamentary debates -
dc.subject.keywords CLARIN ParlaMint -
dc.subject.keywords corpus creation -
dc.subject.keywords corpus annotation -
dc.subject.singlekeyword parliamentary debates *
dc.subject.singlekeyword CLARIN ParlaMint *
dc.subject.singlekeyword corpus creation *
dc.subject.singlekeyword corpus annotation *
dc.title Making Italian Parliamentary Records Machine-Actionable: the Construction of the ParlaMint-IT corpus en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Contributo in convegno::04.01 Contributo in Atti di convegno it
dc.type.miur 273 -
dc.type.referee Yes, but type not specified en
dc.ugov.descaux1 472294 -
iris.mediafilter.data 2025/04/06 02:36:52 *
iris.orcid.lastModifiedDate 2025/02/28 11:22:06 *
iris.orcid.lastModifiedMillisecond 1740738126346 *
iris.scopus.extIssued 2022 -
iris.scopus.extTitle Making Italian Parliamentary Records Machine-Actionable: The Construction of the ParlaMint-IT Corpus -
iris.sitodocente.maxattempts 1 -
scopus.category 1203 *
scopus.category 3304 *
scopus.category 3310 *
scopus.category 3309 *
scopus.contributor.affiliation CNR-IGSG -
scopus.contributor.affiliation CNR-ILC -
scopus.contributor.affiliation CNR-ILC -
scopus.contributor.affiliation Senato della Repubblica -
scopus.contributor.affiliation CNR-ILC -
scopus.contributor.affiliation CNR-ILC -
scopus.contributor.affiliation Senato della Repubblica -
scopus.contributor.affiliation CNR-ILC -
scopus.contributor.afid 60021199 -
scopus.contributor.afid 60021199 -
scopus.contributor.afid 60021199 -
scopus.contributor.afid 100729777 -
scopus.contributor.afid 60021199 -
scopus.contributor.afid 60021199 -
scopus.contributor.afid 100729777 -
scopus.contributor.afid 60021199 -
scopus.contributor.auid 57199421725 -
scopus.contributor.auid 22333654100 -
scopus.contributor.auid 55162070400 -
scopus.contributor.auid 7101710550 -
scopus.contributor.auid 15056781100 -
scopus.contributor.auid 34977412400 -
scopus.contributor.auid 58046145600 -
scopus.contributor.auid 27568199800 -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.country Italy -
scopus.contributor.name Tommaso -
scopus.contributor.name Roberto -
scopus.contributor.name Francesca -
scopus.contributor.name Carlo -
scopus.contributor.name Simonetta -
scopus.contributor.name Valeria -
scopus.contributor.name Manuela -
scopus.contributor.name Giulia -
scopus.contributor.surname Agnoloni -
scopus.contributor.surname Bartolini -
scopus.contributor.surname Frontini -
scopus.contributor.surname Marchetti -
scopus.contributor.surname Montemagni -
scopus.contributor.surname Quochi -
scopus.contributor.surname Ruisi -
scopus.contributor.surname Venturi -
scopus.date.issued 2022 *
scopus.description.allpeopleoriginal Agnoloni T.; Bartolini R.; Frontini F.; Marchetti C.; Montemagni S.; Quochi V.; Ruisi M.; Venturi G. *
scopus.differences scopus.relation.conferencename *
scopus.differences scopus.publisher.name *
scopus.differences scopus.subject.keywords *
scopus.differences scopus.relation.conferencedate *
scopus.differences scopus.identifier.isbn *
scopus.differences scopus.description.allpeopleoriginal *
scopus.differences scopus.relation.conferenceplace *
scopus.document.type cp *
scopus.document.types cp *
scopus.identifier.isbn 9791095546856 *
scopus.identifier.pui 639991106 *
scopus.identifier.scopus 2-s2.0-85145875643 *
scopus.journal.sourceid 21101130979 *
scopus.language.iso eng *
scopus.publisher.name European Language Resources Association (ELRA) *
scopus.relation.conferencedate 2022 *
scopus.relation.conferencename 2022 Workshop on Creating, Enriching and Using Parliamentary Corpora, ParlaCLARIN III 2022 *
scopus.relation.conferenceplace fra *
scopus.relation.firstpage 117 *
scopus.relation.lastpage 124 *
scopus.subject.keywords CLARIN ParlaMint; corpus annotation; corpus creation; parliamentary debates; *
scopus.title Making Italian Parliamentary Records Machine-Actionable: The Construction of the ParlaMint-IT Corpus *
scopus.titleeng Making Italian Parliamentary Records Machine-Actionable: The Construction of the ParlaMint-IT Corpus *
Appears in collections: 04.01 Contributo in Atti di convegno
Files in this item:
File: prod_472294-doc_192197.pdf
Access: open access
Description: Paper
Type: Publisher's version (PDF)
License: Creative Commons
Size: 233.27 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/446358