Multilingual comparable corpora of parliamentary debates ParlaMint 2.1

Tommaso Agnoloni; Francesca Frontini; Simonetta Montemagni; Valeria Quochi; Giulia Venturi; Roberto Bartolini
2021

Abstract

ParlaMint 2.1 is a multilingual set of 17 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20 million words in size. The sessions in the corpora are marked as belonging to the COVID-19 period (after November 1st 2019), or being "reference" (before that date). The corpora have extensive metadata, including aspects of the parliament; the speakers (name, gender, MP status, party affiliation, party coalition/opposition); are structured into time-stamped terms, sessions and meetings; with speeches being marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. Note that some corpora have further information, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The corpora are encoded according to the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/), but have been validated against the compatible, but much stricter ParlaMint schemas. This entry contains the ParlaMint TEI-encoded corpora with the derived plain text version of the corpus along with TSV metadata on the speeches. Also included is the 2.0 release of the data and scripts available at the GitHub repository of the ParlaMint project. Note that there also exists the linguistically marked-up version of the corpus, which is available at http://hdl.handle.net/11356/1431.
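The COVID-19 versus "reference" split described in the abstract can be illustrated with a minimal Python sketch. The function name and the labels `covid` and `reference` are hypothetical and not taken from the corpus. The abstract says "after" and "before" November 1st 2019 without stating which side the cut-off date itself falls on, so placing it in the COVID-19 period here is an assumption.

```python
from datetime import date

# Cut-off date taken from the abstract. Treating 2019-11-01 itself as part of
# the COVID-19 period is an assumption; the record does not state it.
COVID_CUTOFF = date(2019, 11, 1)


def subcorpus(session_date: date) -> str:
    """Return a (hypothetical) subcorpus label for a session date."""
    return "covid" if session_date >= COVID_CUTOFF else "reference"


if __name__ == "__main__":
    for d in (date(2018, 5, 14), date(2020, 3, 20)):
        print(d.isoformat(), subcorpus(d))
```

To apply this to real data, the session date would first have to be read from the TEI files or the TSV speech metadata. The field that holds the date is not named in this record, so it is left out of the sketch.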
DC field Value Language
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Tomaž Erjavec en
dc.authority.people Maciej Ogrodniczuk en
dc.authority.people Petya Osenova en
dc.authority.people Nikola Ljubešić en
dc.authority.people Kiril Simov en
dc.authority.people Vladislava Grigorova en
dc.authority.people Michał Rudolf en
dc.authority.people Andrej Pančur en
dc.authority.people Matyáš Kopp en
dc.authority.people Starkaður Barkarson en
dc.authority.people Steinþor Steingrímsson en
dc.authority.people Henk van der Pol en
dc.authority.people Griet Depoorter en
dc.authority.people Jesse de Does en
dc.authority.people Bart Jongejan en
dc.authority.people Dorte Haltrup Hansen en
dc.authority.people Costanza Navarretta en
dc.authority.people María Calzada Pérez en
dc.authority.people Luciana D. de Macedo en
dc.authority.people Ruben van Heusden en
dc.authority.people Maarten Marx en
dc.authority.people Çağrı Çöltekin en
dc.authority.people Matthew Coole en
dc.authority.people Tommaso Agnoloni en
dc.authority.people Francesca Frontini en
dc.authority.people Simonetta Montemagni en
dc.authority.people Valeria Quochi en
dc.authority.people Giulia Venturi en
dc.authority.people Manuela Ruisi en
dc.authority.people Carlo Marchetti en
dc.authority.people Roberto Battistoni en
dc.authority.people Miklós Sebők en
dc.authority.people Orsolya Ring en
dc.authority.people Roberts Darģis en
dc.authority.people Andrius Utka en
dc.authority.people Mindaugas Petkevičius en
dc.authority.people Monika Briediené en
dc.authority.people Tomas Krilavičius en
dc.authority.people Vaidas Morkevičius en
dc.authority.people Roberto Bartolini en
dc.authority.people Andrea Cimino en
dc.authority.people Sascha Diwersy en
dc.authority.people Giancarlo Luxardo en
dc.authority.people Paul Rayson en
dc.authority.project ParlaMint en
dc.collection.id.s aa7ef5cb-003d-421c-b2c8-870fc44d02e5 *
dc.collection.name 05.10 Dataset *
dc.contributor.appartenenza Istituto di Informatica Giuridica e Sistemi Giudiziari - IGSG *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.appartenenza.mi 1108 *
dc.contributor.area Not assigned *
dc.contributor.area Not assigned *
dc.contributor.area Not assigned *
dc.contributor.area Not assigned *
dc.contributor.area Not assigned *
dc.contributor.area Not assigned *
dc.date.accessioned 2024/02/19 12:00:45 -
dc.date.available 2024/02/19 12:00:45 -
dc.date.firstsubmission 2025/03/05 10:25:03 *
dc.date.issued 2021 -
dc.date.submission 2025/03/06 11:46:05 *
dc.description.abstracteng ParlaMint 2.1 is a multilingual set of 17 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20 million words in size. The sessions in the corpora are marked as belonging to the COVID-19 period (after November 1st 2019), or being "reference" (before that date). The corpora have extensive metadata, including aspects of the parliament; the speakers (name, gender, MP status, party affiliation, party coalition/opposition); are structured into time-stamped terms, sessions and meetings; with speeches being marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. Note that some corpora have further information, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The corpora are encoded according to the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/), but have been validated against the compatible, but much stricter ParlaMint schemas. This entry contains the ParlaMint TEI-encoded corpora with the derived plain text version of the corpus along with TSV metadata on the speeches. Also included is the 2.0 release of the data and scripts available at the GitHub repository of the ParlaMint project. Note that there also exists the linguistically marked-up version of the corpus, which is available at http://hdl.handle.net/11356/1431. -
dc.description.affiliations n.d. -
dc.description.allpeople Erjavec, Tomaž; Ogrodniczuk, Maciej; Osenova, Petya; Ljubešić, Nikola; Simov, Kiril; Grigorova, Vladislava; Rudolf, Michał; Pančur, Andrej; Kopp, Matyáš; Barkarson, Starkaður; Steingrímsson, Steinþor; van der Pol, Henk; Depoorter, Griet; de Does, Jesse; Jongejan, Bart; Haltrup Hansen, Dorte; Navarretta, Costanza; Calzada Pérez, María; D. de Macedo, Luciana; van Heusden, Ruben; Marx, Maarten; Çöltekin, Çağrı; Coole, Matthew; Agnoloni, Tommaso; Frontini, Francesca; Montemagni, Simonetta; Quochi, Valeria; Venturi, Giulia; Ruisi, Manuela; Marchetti, Carlo; Battistoni, Roberto; Sebők, Miklós; Ring, Orsolya; Darģis, Roberts; Utka, Andrius; Petkevičius, Mindaugas; Briediené, Monika; Krilavičius, Tomas; Morkevičius, Vaidas; Bartolini, Roberto; Cimino, Andrea; Diwersy, Sascha; Luxardo, Giancarlo; Rayson, Paul -
dc.description.allpeopleoriginal Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Nikola Ljubešić, Kiril Simov, Vladislava Grigorova, Michał Rudolf, Andrej Pančur, Matyáš Kopp, Starkaður Barkarson, Steinþor Steingrímsson, Henk van der Pol, Griet Depoorter, Jesse de Does, Bart Jongejan, Dorte Haltrup Hansen, Costanza Navarretta, María Calzada Pérez, Luciana D. de Macedo, Ruben van Heusden, Maarten Marx, Çağrı Çöltekin, Matthew Coole, Tommaso Agnoloni, Francesca Frontini, Simonetta Montemagni, Valeria Quochi, Giulia Venturi, Manuela Ruisi, Carlo Marchetti, Roberto Battistoni, Miklós Sebők, Orsolya Ring, Roberts Darģis, Andrius Utka, Mindaugas Petkevičius, Monika Briediené, Tomas Krilavičius, Vaidas Morkevičius, Roberto Bartolini, Andrea Cimino, Sascha Diwersy, Giancarlo Luxardo, Paul Rayson en
dc.description.fulltext open en
dc.description.international yes en
dc.description.note The dataset fully complies with the FAIR data principles. en
dc.description.numberofauthors 44 -
dc.identifier.uri https://hdl.handle.net/20.500.14243/446080 -
dc.identifier.url http://hdl.handle.net/11356/1432 en
dc.language.iso ita en
dc.language.iso bul en
dc.language.iso cze en
dc.language.iso dan en
dc.language.iso fre en
dc.language.iso ice en
dc.language.iso lav en
dc.language.iso lit en
dc.language.iso dut en
dc.language.iso pol en
dc.language.iso slv en
dc.language.iso spa en
dc.language.iso tur en
dc.language.iso hun en
dc.relation.medium ELECTRONIC en
dc.relation.projectAcronym ParlaMint en
dc.relation.projectAwardNumber - en
dc.relation.projectAwardTitle ParlaMint: Comparable and Interoperable Parliamentary Corpora en
dc.relation.projectFunderName CLARIN-ERIC en
dc.relation.projectFundingStream - en
dc.subject.keywordsita ParlaMint -
dc.subject.keywordsita ParlaCLARIN -
dc.subject.keywordsita dibattiti parlamentari -
dc.subject.keywordsita covid-19 -
dc.subject.keywordsita discorso politico -
dc.subject.keywordsita CLARIN -
dc.subject.singlekeyword ParlaMint *
dc.subject.singlekeyword ParlaCLARIN *
dc.subject.singlekeyword parliamentary debates *
dc.subject.singlekeyword covid-19 *
dc.subject.singlekeyword political discourse *
dc.subject.singlekeyword CLARIN *
dc.title Multilingual comparable corpora of parliamentary debates ParlaMint 2.1 en
dc.type.driver info:eu-repo/semantics/other -
dc.type.full 05 Altro::05.10 Dataset it
dc.type.miur 295 -
dc.ugov.descaux1 463865 -
iris.mediafilter.data 2025/04/03 03:50:06 *
iris.orcid.lastModifiedDate 2025/03/06 11:47:34 *
iris.orcid.lastModifiedMillisecond 1741258054291 *
iris.sitodocente.maxattempts 1 -
Appears in types: 05.10 Dataset
Files in this item:
File: ParlaMint_MultilingualPlaintxt.pdf
Access: open access
Description: Metadata descriptors of the dataset deposited in the CLARIN.SI repository
Type: Other attached material
License: Creative Commons
Size: 745.57 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/446080