Making Italian Parliamentary Records Machine-Actionable: the Construction of the ParlaMint-IT corpus
Tommaso Agnoloni; Roberto Bartolini; Francesca Frontini; Simonetta Montemagni; Valeria Quochi; Giulia Venturi
2022
Abstract
This paper describes the acquisition, cleaning, interpretation, coding and linguistic annotation, according to the CLARIN ParlaMint prescriptions, of a collection of parliamentary debates from the Senate of the Italian Republic. The collection covers the COVID-19 pandemic emergency period and an earlier period for reference and comparison. The corpus contains 1199 sessions and 79,373 speeches for a total of about 31 million words, and was encoded according to the ParlaCLARIN TEI XML format. It includes extensive metadata about the speakers, sessions, political parties and parliamentary groups. As required by the ParlaMint initiative, the corpus was also linguistically annotated for sentences, tokens, POS tags, lemmas and dependency syntax according to the Universal Dependencies guidelines. Named entity annotation and classification is also included. All linguistic annotation was performed automatically using state-of-the-art NLP technology with no manual revision. The Italian dataset is freely available as part of the larger ParlaMint 2.1 corpus, deposited and archived in the CLARIN repository together with all other national corpora. It is also available for direct analysis and inspection via various CLARIN services and has already been used for both research and educational purposes.

| DC field | Value | Language |
|---|---|---|
| dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | en |
| dc.authority.orgunit | Istituto di Informatica Giuridica e Sistemi Giudiziari - IGSG | en |
| dc.authority.people | Tommaso Agnoloni | en |
| dc.authority.people | Roberto Bartolini | en |
| dc.authority.people | Francesca Frontini | en |
| dc.authority.people | Simonetta Montemagni | en |
| dc.authority.people | Carlo Marchetti | en |
| dc.authority.people | Valeria Quochi | en |
| dc.authority.people | Manuela Ruisi | en |
| dc.authority.people | Giulia Venturi | en |
| dc.collection.id.s | 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d | * |
| dc.collection.name | 04.01 Conference proceedings contribution | * |
| dc.contributor.appartenenza | Istituto di Informatica Giuridica e Sistemi Giudiziari - IGSG | * |
| dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | * |
| dc.contributor.appartenenza.mi | 918 | * |
| dc.contributor.appartenenza.mi | 1108 | * |
| dc.contributor.area | Not assigned | * |
| dc.contributor.area | Not assigned | * |
| dc.contributor.area | Not assigned | * |
| dc.contributor.area | Not assigned | * |
| dc.contributor.area | Not assigned | * |
| dc.contributor.area | Not assigned | * |
| dc.date.accessioned | 2024/02/19 12:54:54 | - |
| dc.date.available | 2024/02/19 12:54:54 | - |
| dc.date.firstsubmission | 2024/12/19 16:56:36 | * |
| dc.date.issued | 2022 | - |
| dc.date.submission | 2025/02/24 19:07:24 | * |
| dc.description.affiliations | CNR-IGSG, Firenze Italy; CNR-ILC, Pisa Italy; Senato della Repubblica, Roma Italy | - |
| dc.description.allpeople | Agnoloni, Tommaso; Bartolini, Roberto; Frontini, Francesca; Montemagni, Simonetta; Marchetti, Carlo; Quochi, Valeria; Ruisi, Manuela; Venturi, Giulia | - |
| dc.description.allpeopleoriginal | Tommaso Agnoloni, Roberto Bartolini, Francesca Frontini, Simonetta Montemagni, Carlo Marchetti, Valeria Quochi, Manuela Ruisi, Giulia Venturi | en |
| dc.description.fulltext | open | en |
| dc.description.numberofauthors | 8 | - |
| dc.identifier.isbn | 979-10-95546-85-6 | en |
| dc.identifier.scopus | 2-s2.0-85145875643 | - |
| dc.identifier.uri | https://hdl.handle.net/20.500.14243/446358 | - |
| dc.identifier.url | https://aclanthology.org/2022.parlaclarin-1.17/ | en |
| dc.language.iso | eng | en |
| dc.miur.last.status.update | 2025-05-21T17:13:01Z | * |
| dc.publisher.country | FRA | en |
| dc.publisher.name | European Language Resources Association ELRA | en |
| dc.publisher.place | Paris | en |
| dc.relation.conferencedate | 20/06/2022 | en |
| dc.relation.conferencename | Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference | en |
| dc.relation.conferenceplace | Marseille, France | en |
| dc.relation.firstpage | 117 | en |
| dc.relation.ispartofbook | Proceedings of The Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference | en |
| dc.relation.lastpage | 124 | en |
| dc.relation.numberofpages | 8 | en |
| dc.subject.keywords | parliamentary debates | - |
| dc.subject.keywords | CLARIN ParlaMint | - |
| dc.subject.keywords | corpus creation | - |
| dc.subject.keywords | corpus annotation | - |
| dc.subject.singlekeyword | parliamentary debates | * |
| dc.subject.singlekeyword | CLARIN ParlaMint | * |
| dc.subject.singlekeyword | corpus creation | * |
| dc.subject.singlekeyword | corpus annotation | * |
| dc.title | Making Italian Parliamentary Records Machine-Actionable: the Construction of the ParlaMint-IT corpus | en |
| dc.type.driver | info:eu-repo/semantics/conferenceObject | - |
| dc.type.full | 04 Conference contribution::04.01 Conference proceedings contribution | it |
| dc.type.miur | 273 | - |
| dc.type.referee | Yes, but type not specified | en |
| dc.ugov.descaux1 | 472294 | - |
| iris.mediafilter.data | 2025/04/06 02:36:52 | * |
| iris.orcid.lastModifiedDate | 2025/02/28 11:22:06 | * |
| iris.orcid.lastModifiedMillisecond | 1740738126346 | * |
| iris.scopus.extIssued | 2022 | - |
| iris.scopus.extTitle | Making Italian Parliamentary Records Machine-Actionable: The Construction of the ParlaMint-IT Corpus | - |
| iris.sitodocente.maxattempts | 1 | - |
| scopus.category | 1203 | * |
| scopus.category | 3304 | * |
| scopus.category | 3310 | * |
| scopus.category | 3309 | * |
| scopus.contributor.affiliation | CNR-IGSG | - |
| scopus.contributor.affiliation | CNR-ILC | - |
| scopus.contributor.affiliation | CNR-ILC | - |
| scopus.contributor.affiliation | Senato della Repubblica | - |
| scopus.contributor.affiliation | CNR-ILC | - |
| scopus.contributor.affiliation | CNR-ILC | - |
| scopus.contributor.affiliation | Senato della Repubblica | - |
| scopus.contributor.affiliation | CNR-ILC | - |
| scopus.contributor.afid | 60021199 | - |
| scopus.contributor.afid | 60021199 | - |
| scopus.contributor.afid | 60021199 | - |
| scopus.contributor.afid | 100729777 | - |
| scopus.contributor.afid | 60021199 | - |
| scopus.contributor.afid | 60021199 | - |
| scopus.contributor.afid | 100729777 | - |
| scopus.contributor.afid | 60021199 | - |
| scopus.contributor.auid | 57199421725 | - |
| scopus.contributor.auid | 22333654100 | - |
| scopus.contributor.auid | 55162070400 | - |
| scopus.contributor.auid | 7101710550 | - |
| scopus.contributor.auid | 15056781100 | - |
| scopus.contributor.auid | 34977412400 | - |
| scopus.contributor.auid | 58046145600 | - |
| scopus.contributor.auid | 27568199800 | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.country | Italy | - |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.dptid | - | |
| scopus.contributor.name | Tommaso | - |
| scopus.contributor.name | Roberto | - |
| scopus.contributor.name | Francesca | - |
| scopus.contributor.name | Carlo | - |
| scopus.contributor.name | Simonetta | - |
| scopus.contributor.name | Valeria | - |
| scopus.contributor.name | Manuela | - |
| scopus.contributor.name | Giulia | - |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.subaffiliation | - | |
| scopus.contributor.surname | Agnoloni | - |
| scopus.contributor.surname | Bartolini | - |
| scopus.contributor.surname | Frontini | - |
| scopus.contributor.surname | Marchetti | - |
| scopus.contributor.surname | Montemagni | - |
| scopus.contributor.surname | Quochi | - |
| scopus.contributor.surname | Ruisi | - |
| scopus.contributor.surname | Venturi | - |
| scopus.date.issued | 2022 | * |
| scopus.description.allpeopleoriginal | Agnoloni T.; Bartolini R.; Frontini F.; Marchetti C.; Montemagni S.; Quochi V.; Ruisi M.; Venturi G. | * |
| scopus.differences | scopus.relation.conferencename | * |
| scopus.differences | scopus.publisher.name | * |
| scopus.differences | scopus.subject.keywords | * |
| scopus.differences | scopus.relation.conferencedate | * |
| scopus.differences | scopus.identifier.isbn | * |
| scopus.differences | scopus.description.allpeopleoriginal | * |
| scopus.differences | scopus.relation.conferenceplace | * |
| scopus.document.type | cp | * |
| scopus.document.types | cp | * |
| scopus.identifier.isbn | 9791095546856 | * |
| scopus.identifier.pui | 639991106 | * |
| scopus.identifier.scopus | 2-s2.0-85145875643 | * |
| scopus.journal.sourceid | 21101130979 | * |
| scopus.language.iso | eng | * |
| scopus.publisher.name | European Language Resources Association (ELRA) | * |
| scopus.relation.conferencedate | 2022 | * |
| scopus.relation.conferencename | 2022 Workshop on Creating, Enriching and Using Parliamentary Corpora, ParlaCLARIN III 2022 | * |
| scopus.relation.conferenceplace | fra | * |
| scopus.relation.firstpage | 117 | * |
| scopus.relation.lastpage | 124 | * |
| scopus.subject.keywords | CLARIN ParlaMint; corpus annotation; corpus creation; parliamentary debates; | * |
| scopus.title | Making Italian Parliamentary Records Machine-Actionable: The Construction of the ParlaMint-IT Corpus | * |
| scopus.titleeng | Making Italian Parliamentary Records Machine-Actionable: The Construction of the ParlaMint-IT Corpus | * |

Appears in types: 04.01 Conference proceedings contribution

| File | Size | Format |
|---|---|---|
| prod_472294-doc_192197.pdf (open access). Description: Paper. Type: Publisher's version (PDF). License: Creative Commons | 233.27 kB | Adobe PDF |


