CNR Institutional Research Information System

Making Italian Parliamentary Records Machine-Actionable: the Construction of the ParlaMint-IT corpus

Tommaso Agnoloni; Roberto Bartolini; Francesca Frontini; Simonetta Montemagni; Carlo Marchetti; Valeria Quochi; Manuela Ruisi; Giulia Venturi

2022

Abstract

This paper describes the acquisition, cleaning, interpretation, encoding and linguistic annotation of a collection of parliamentary debates from the Senate of the Italian Republic, carried out according to the CLARIN ParlaMint prescriptions. The collection covers the COVID-19 pandemic emergency period and an earlier period for reference and comparison. The corpus contains 1,199 sessions and 79,373 speeches, for a total of about 31 million words, and was encoded according to the ParlaCLARIN TEI XML format. It includes extensive metadata about the speakers, sessions, political parties and parliamentary groups. As required by the ParlaMint initiative, the corpus was also linguistically annotated for sentences, tokens, POS tags, lemmas and dependency syntax according to the Universal Dependencies guidelines. Named entity annotation and classification are also included. All linguistic annotation was performed automatically using state-of-the-art NLP technology, with no manual revision. The Italian dataset is freely available as part of the larger ParlaMint 2.1 corpus, deposited and archived in the CLARIN repository together with all other national corpora. It is also available for direct analysis and inspection via various CLARIN services and has already been used for both research and educational purposes.

Full record (DC)

DC field	Value	Language
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	en
dc.authority.orgunit	Istituto di Informatica Giuridica e Sistemi Giudiziari - IGSG	en
dc.authority.people	Tommaso Agnoloni	en
dc.authority.people	Roberto Bartolini	en
dc.authority.people	Francesca Frontini	en
dc.authority.people	Simonetta Montemagni	en
dc.authority.people	Carlo Marchetti	en
dc.authority.people	Valeria Quochi	en
dc.authority.people	Manuela Ruisi	en
dc.authority.people	Giulia Venturi	en
dc.collection.id.s	71c7200a-7c5f-4e83-8d57-d3d2ba88f40d	*
dc.collection.name	04.01 Contributo in Atti di convegno	*
dc.contributor.appartenenza	Istituto di Informatica Giuridica e Sistemi Giudiziari - IGSG	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.contributor.appartenenza.mi	1108	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.date.accessioned	2024/02/19 12:54:54	-
dc.date.available	2024/02/19 12:54:54	-
dc.date.firstsubmission	2024/12/19 16:56:36	*
dc.date.issued	2022	-
dc.date.submission	2025/02/24 19:07:24	*
dc.description.affiliations	CNR-IGSG, Firenze Italy; CNR-ILC, Pisa Italy; Senato della Repubblica, Roma Italy	-
dc.description.allpeople	Agnoloni, Tommaso; Bartolini, Roberto; Frontini, Francesca; Montemagni, Simonetta; Marchetti, Carlo; Quochi, Valeria; Ruisi, Manuela; Venturi, Giulia	-
dc.description.allpeopleoriginal	Tommaso Agnoloni, Roberto Bartolini, Francesca Frontini, Simonetta Montemagni, Carlo Marchetti, Valeria Quochi, Manuela Ruisi, Giulia Venturi	en
dc.description.fulltext	open	en
dc.description.numberofauthors	8	-
dc.identifier.isbn	979-10-95546-85-6	en
dc.identifier.scopus	2-s2.0-85145875643	-
dc.identifier.uri	https://hdl.handle.net/20.500.14243/446358	-
dc.identifier.url	https://aclanthology.org/2022.parlaclarin-1.17/	en
dc.language.iso	eng	en
dc.miur.last.status.update	2025-05-21T17:13:01Z	*
dc.publisher.country	FRA	en
dc.publisher.name	European Language Resources Association ELRA	en
dc.publisher.place	Paris	en
dc.relation.conferencedate	20/06/2022	en
dc.relation.conferencename	Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference	en
dc.relation.conferenceplace	Marseille, France	en
dc.relation.firstpage	117	en
dc.relation.ispartofbook	Proceedings of The Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference	en
dc.relation.lastpage	124	en
dc.relation.numberofpages	8	en
dc.subject.keywords	parliamentary debates	-
dc.subject.keywords	CLARIN ParlaMint	-
dc.subject.keywords	corpus creation	-
dc.subject.keywords	corpus annotation	-
dc.subject.singlekeyword	parliamentary debates	*
dc.subject.singlekeyword	CLARIN ParlaMint	*
dc.subject.singlekeyword	corpus creation	*
dc.subject.singlekeyword	corpus annotation	*
dc.title	Making Italian Parliamentary Records Machine-Actionable: the Construction of the ParlaMint-IT corpus	en
dc.type.driver	info:eu-repo/semantics/conferenceObject	-
dc.type.full	04 Contributo in convegno::04.01 Contributo in Atti di convegno	it
dc.type.miur	273	-
dc.type.referee	Yes, but type not specified	en
dc.ugov.descaux1	472294	-
iris.mediafilter.data	2025/04/06 02:36:52	*
iris.orcid.lastModifiedDate	2025/02/28 11:22:06	*
iris.orcid.lastModifiedMillisecond	1740738126346	*
iris.scopus.extIssued	2022	-
iris.scopus.extTitle	Making Italian Parliamentary Records Machine-Actionable: The Construction of the ParlaMint-IT Corpus	-
iris.sitodocente.maxattempts	1	-
scopus.category	1203	*
scopus.category	3304	*
scopus.category	3310	*
scopus.category	3309	*
scopus.contributor.affiliation	CNR-IGSG	-
scopus.contributor.affiliation	CNR-ILC	-
scopus.contributor.affiliation	CNR-ILC	-
scopus.contributor.affiliation	Senato della Repubblica	-
scopus.contributor.affiliation	CNR-ILC	-
scopus.contributor.affiliation	CNR-ILC	-
scopus.contributor.affiliation	Senato della Repubblica	-
scopus.contributor.affiliation	CNR-ILC	-
scopus.contributor.afid	60021199	-
scopus.contributor.afid	60021199	-
scopus.contributor.afid	60021199	-
scopus.contributor.afid	100729777	-
scopus.contributor.afid	60021199	-
scopus.contributor.afid	60021199	-
scopus.contributor.afid	100729777	-
scopus.contributor.afid	60021199	-
scopus.contributor.auid	57199421725	-
scopus.contributor.auid	22333654100	-
scopus.contributor.auid	55162070400	-
scopus.contributor.auid	7101710550	-
scopus.contributor.auid	15056781100	-
scopus.contributor.auid	34977412400	-
scopus.contributor.auid	58046145600	-
scopus.contributor.auid	27568199800	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.country	Italy	-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.dptid		-
scopus.contributor.name	Tommaso	-
scopus.contributor.name	Roberto	-
scopus.contributor.name	Francesca	-
scopus.contributor.name	Carlo	-
scopus.contributor.name	Simonetta	-
scopus.contributor.name	Valeria	-
scopus.contributor.name	Manuela	-
scopus.contributor.name	Giulia	-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.subaffiliation		-
scopus.contributor.surname	Agnoloni	-
scopus.contributor.surname	Bartolini	-
scopus.contributor.surname	Frontini	-
scopus.contributor.surname	Marchetti	-
scopus.contributor.surname	Montemagni	-
scopus.contributor.surname	Quochi	-
scopus.contributor.surname	Ruisi	-
scopus.contributor.surname	Venturi	-
scopus.date.issued	2022	*
scopus.description.allpeopleoriginal	Agnoloni T.; Bartolini R.; Frontini F.; Marchetti C.; Montemagni S.; Quochi V.; Ruisi M.; Venturi G.	*
scopus.differences	scopus.relation.conferencename	*
scopus.differences	scopus.publisher.name	*
scopus.differences	scopus.subject.keywords	*
scopus.differences	scopus.relation.conferencedate	*
scopus.differences	scopus.identifier.isbn	*
scopus.differences	scopus.description.allpeopleoriginal	*
scopus.differences	scopus.relation.conferenceplace	*
scopus.document.type	cp	*
scopus.document.types	cp	*
scopus.identifier.isbn	9791095546856	*
scopus.identifier.pui	639991106	*
scopus.identifier.scopus	2-s2.0-85145875643	*
scopus.journal.sourceid	21101130979	*
scopus.language.iso	eng	*
scopus.publisher.name	European Language Resources Association (ELRA)	*
scopus.relation.conferencedate	2022	*
scopus.relation.conferencename	2022 Workshop on Creating, Enriching and Using Parliamentary Corpora, ParlaCLARIN III 2022	*
scopus.relation.conferenceplace	fra	*
scopus.relation.firstpage	117	*
scopus.relation.lastpage	124	*
scopus.subject.keywords	CLARIN ParlaMint; corpus annotation; corpus creation; parliamentary debates;	*
scopus.title	Making Italian Parliamentary Records Machine-Actionable: The Construction of the ParlaMint-IT Corpus	*
scopus.titleeng	Making Italian Parliamentary Records Machine-Actionable: The Construction of the ParlaMint-IT Corpus	*
Appears in types:	04.01 Contributo in Atti di convegno (contribution in conference proceedings)

Files in this item:

File	Size	Format
prod_472294-doc_192197.pdf (open access; description: Paper; type: publisher's version (PDF); license: Creative Commons)	233.27 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/446358
