Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 4.1

Agnoloni, Tommaso; Bartolini, Roberto; Frontini, Francesca; Montemagni, Simonetta; Quochi, Valeria; Venturi, Giulia
2024

Abstract

ParlaMint 4.1 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending to mid-2022. The individual corpora comprise between 9 and 126 million words and the complete set contains over 1.2 billion words. The transcriptions are divided by days with information on the term, session and meeting, and contain speeches marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. The corpora have extensive metadata, most importantly on speakers (name, gender, MP and minister status, party affiliation) and on their political parties and parliamentary groups (name, coalition/opposition status, Wikipedia-sourced left-to-right political orientation, and CHES variables, https://www.chesdata.eu/). Some corpora have further metadata, e.g. the year of birth of the speakers, links to their Wikipedia articles, and their membership in various committees. The transcriptions are also marked with the subcorpora they belong to ("reference", until 2020-01-30; "covid", from 2020-01-31; and "war", from 2022-02-24). An overview of the statistics of the corpora is available on GitHub in the folder Build/Metadata, in particular for release 4.1 at https://github.com/clarin-eric/ParlaMint/tree/v4.1/Build/Metadata. The corpora are encoded according to the ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas (included in the distribution). The ParlaMint.ana linguistic annotation includes tokenisation; sentence segmentation; lemmatisation; Universal Dependencies part-of-speech, morphological features, and syntactic dependencies; and the 4-class CoNLL-2003 named entities.
Some corpora also have further linguistic annotations, in particular PoS tagging according to a language-specific scheme, with their corpus TEI headers giving further details on the annotation vocabularies and tools used. This entry contains the ParlaMint.ana TEI-encoded linguistically annotated corpora; the derived CoNLL-U files along with TSV metadata of the speeches; and the derived vertical files (with their registry file), suitable for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText. Also included are the 4.1 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint and the log files produced in the process of building the corpora for this release. The log files show, e.g., known errors in the corpora, while more information about known problems is available in the open issues at the GitHub repository of the project. This entry contains the linguistically marked-up version of the corpus, while the text version, i.e. without the linguistic annotation, is also available at http://hdl.handle.net/11356/1912. Another related resource, namely the ParlaMint corpora machine translated to English, ParlaMint-en.ana 4.1, can be found at http://hdl.handle.net/11356/1910. Compared with the previous version 4.0, this version fixes a number of bugs and restructures the ParlaMint GitHub repository. The DK corpus has been linguistically re-annotated to remove bugs, while its speeches are now also marked with topics. The PT corpus has been extended to 2024-03 and the UA corpus to 2023-11; the UA corpus also has improved language marking (uk vs. ru) on segments.
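The subcorpus boundaries given in the abstract can be illustrated with a minimal Python sketch. It is not part of the ParlaMint distribution: the function name `subcorpora_for` is hypothetical, and the sketch reads the stated boundaries literally, so a date on or after 2022-02-24 is assigned to both "covid" and "war". Whether the corpora themselves mark such texts with both labels is an assumption to be checked against the encoding guidelines.

```python
from datetime import date

# Boundary dates as stated in the abstract.
COVID_START = date(2020, 1, 31)  # "covid" subcorpus starts here
WAR_START = date(2022, 2, 24)    # "war" subcorpus starts here


def subcorpora_for(day: date) -> list[str]:
    """Return the subcorpus labels for a sitting date (hypothetical helper).

    Assumption: "covid" has no stated end date, so it is treated as
    overlapping with "war" from 2022-02-24 onwards.
    """
    if day < COVID_START:
        return ["reference"]
    labels = ["covid"]
    if day >= WAR_START:
        labels.append("war")
    return labels


if __name__ == "__main__":
    for d in (date(2019, 6, 1), date(2021, 3, 15), date(2022, 5, 2)):
        print(d.isoformat(), subcorpora_for(d))
```

A helper of this kind could be used, for example, to split the TSV speech metadata by period when the subcorpus label is not read directly from the files.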
DC field Value Language
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.orgunit Istituto di Informatica Giuridica e Sistemi Giudiziari - IGSG en
dc.authority.people Erjavec, Tomaž en
dc.authority.people Kopp, Matyáš en
dc.authority.people Ogrodniczuk, Maciej en
dc.authority.people Osenova, Petya en
dc.authority.people Agerri, Rodrigo en
dc.authority.people Agirrezabal, Manex en
dc.authority.people Agnoloni, Tommaso en
dc.authority.people Aires, José en
dc.authority.people Albini, Monica en
dc.authority.people Alkorta, Jon en
dc.authority.people Antiba-Cartazo, Iván en
dc.authority.people Arrieta, Ekain en
dc.authority.people Barcala, Mario en
dc.authority.people Bardanca, Daniel en
dc.authority.people Barkarson, Starkaður en
dc.authority.people Bartolini, Roberto en
dc.authority.people Battistoni, Roberto en
dc.authority.people Bel, Nuria en
dc.authority.people Bonet Ramos, Maria del Mar en
dc.authority.people Calzada Pérez, María en
dc.authority.people Cardoso, Aida en
dc.authority.people Çöltekin, Çağrı en
dc.authority.people Coole, Matthew en
dc.authority.people Darģis, Roberts en
dc.authority.people de Does, Jesse en
dc.authority.people de Libano, Ruben en
dc.authority.people Depoorter, Griet en
dc.authority.people Depuydt, Katrien en
dc.authority.people Diwersy, Sascha en
dc.authority.people Dodé, Réka en
dc.authority.people Fernandez, Kike en
dc.authority.people Fernández Rei, Elisa en
dc.authority.people Frontini, Francesca en
dc.authority.people Garcia, Marcos en
dc.authority.people García Díaz, Noelia en
dc.authority.people García Louzao, Pedro en
dc.authority.people Gavriilidou, Maria en
dc.authority.people Gkoumas, Dimitris en
dc.authority.people Grigorov, Ilko en
dc.authority.people Grigorova, Vladislava en
dc.authority.people Haltrup Hansen, Dorte en
dc.authority.people Iruskieta, Mikel en
dc.authority.people Jarlbrink, Johan en
dc.authority.people Jelencsik-Mátyus, Kinga en
dc.authority.people Jongejan, Bart en
dc.authority.people Kahusk, Neeme en
dc.authority.people Kirnbauer, Martin en
dc.authority.people Kryvenko, Anna en
dc.authority.people Ligeti-Nagy, Noémi en
dc.authority.people Ljubešić, Nikola en
dc.authority.people Luxardo, Giancarlo en
dc.authority.people Magariños, Carmen en
dc.authority.people Magnusson, Måns en
dc.authority.people Marchetti, Carlo en
dc.authority.people Marx, Maarten en
dc.authority.people Meden, Katja en
dc.authority.people Mendes, Amália en
dc.authority.people Mochtak, Michal en
dc.authority.people Mölder, Martin en
dc.authority.people Montemagni, Simonetta en
dc.authority.people Navarretta, Costanza en
dc.authority.people Nitoń, Bartłomiej en
dc.authority.people Norén, Fredrik Mohammadi en
dc.authority.people Nwadukwe, Amanda en
dc.authority.people Ojsteršek, Mihael en
dc.authority.people Pančur, Andrej en
dc.authority.people Papavassiliou, Vassilis en
dc.authority.people Pereira, Rui en
dc.authority.people Pérez Lago, María en
dc.authority.people Piperidis, Stelios en
dc.authority.people Pirker, Hannes en
dc.authority.people Pisani, Marilina en
dc.authority.people Pol, Henk van der en
dc.authority.people Prokopidis, Prokopis en
dc.authority.people Quochi, Valeria en
dc.authority.people Rayson, Paul en
dc.authority.people Regueira, Xosé Luís en
dc.authority.people Rii, Andriana en
dc.authority.people Rudolf, Michał en
dc.authority.people Ruisi, Manuela en
dc.authority.people Rupnik, Peter en
dc.authority.people Schopper, Daniel en
dc.authority.people Simov, Kiril en
dc.authority.people Sinikallio, Laura en
dc.authority.people Skubic, Jure en
dc.authority.people Tamper, Minna en
dc.authority.people Tungland, Lars Magne en
dc.authority.people Tuominen, Jouni en
dc.authority.people van Heusden, Ruben en
dc.authority.people Varga, Zsófia en
dc.authority.people Vázquez Abuín, Marta en
dc.authority.people Venturi, Giulia en
dc.authority.people Vidal Miguéns, Adrián en
dc.authority.people Vider, Kadri en
dc.authority.people Vivel Couso, Ainhoa en
dc.authority.people Vladu, Adina Ioana en
dc.authority.people Wissik, Tanja en
dc.authority.people Yrjänäinen, Väinö en
dc.authority.people Zevallos, Rodolfo en
dc.authority.people Fišer, Darja en
dc.authority.project ParlaMint: Comparable and Interoperable Parliamentary Corpora en
dc.collection.id.s aa7ef5cb-003d-421c-b2c8-870fc44d02e5 *
dc.collection.name 05.10 Dataset *
dc.contributor.appartenenza Istituto di Informatica Giuridica e Sistemi Giudiziari - IGSG *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.appartenenza.mi 1108 *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.date.accessioned 2024/12/20 17:16:13 -
dc.date.available 2024/12/20 17:16:13 -
dc.date.firstsubmission 2024/07/05 17:32:46 *
dc.date.issued 2024 -
dc.date.submission 2025/03/05 17:50:45 *
dc.description.allpeople Erjavec, Tomaž; Kopp, Matyáš; Ogrodniczuk, Maciej; Osenova, Petya; Agerri, Rodrigo; Agirrezabal, Manex; Agnoloni, Tommaso; Aires, José; Albini, Monica; Alkorta, Jon; Antiba-Cartazo, Iván; Arrieta, Ekain; Barcala, Mario; Bardanca, Daniel; Barkarson, Starkaður; Bartolini, Roberto; Battistoni, Roberto; Bel, Nuria; Bonet Ramos, Maria del Mar; Calzada Pérez, María; Cardoso, Aida; Çöltekin, Çağrı; Coole, Matthew; Darģis, Roberts; de Does, Jesse; de Libano, Ruben; Depoorter, Griet; Depuydt, Katrien; Diwersy, Sascha; Dodé, Réka; Fernandez, Kike; Fernández Rei, Elisa; Frontini, Francesca; Garcia, Marcos; García Díaz, Noelia; García Louzao, Pedro; Gavriilidou, Maria; Gkoumas, Dimitris; Grigorov, Ilko; Grigorova, Vladislava; Haltrup Hansen, Dorte; Iruskieta, Mikel; Jarlbrink, Johan; Jelencsik-Mátyus, Kinga; Jongejan, Bart; Kahusk, Neeme; Kirnbauer, Martin; Kryvenko, Anna; Ligeti-Nagy, Noémi; Ljubešić, Nikola; Luxardo, Giancarlo; Magariños, Carmen; Magnusson, Måns; Marchetti, Carlo; Marx, Maarten; Meden, Katja; Mendes, Amália; Mochtak, Michal; Mölder, Martin; Montemagni, Simonetta; Navarretta, Costanza; Nitoń, Bartłomiej; Norén, Fredrik Mohammadi; Nwadukwe, Amanda; Ojsteršek, Mihael; Pančur, Andrej; Papavassiliou, Vassilis; Pereira, Rui; Pérez Lago, María; Piperidis, Stelios; Pirker, Hannes; Pisani, Marilina; Pol, Henk van der; Prokopidis, Prokopis; Quochi, Valeria; Rayson, Paul; Regueira, Xosé Luís; Rii, Andriana; Rudolf, Michał; Ruisi, Manuela; Rupnik, Peter; Schopper, Daniel; Simov, Kiril; Sinikallio, Laura; Skubic, Jure; Tamper, Minna; Tungland, Lars Magne; Tuominen, Jouni; van Heusden, Ruben; Varga, Zsófia; Vázquez Abuín, Marta; Venturi, Giulia; Vidal Miguéns, Adrián; Vider, Kadri; Vivel Couso, Ainhoa; Vladu, Adina Ioana; Wissik, Tanja; Yrjänäinen, Väinö; Zevallos, Rodolfo; Fišer, Darja -
dc.description.allpeopleoriginal Erjavec, Tomaž ; Kopp, Matyáš ; Ogrodniczuk, Maciej ; Osenova, Petya ; Agerri, Rodrigo ; Agirrezabal, Manex ; Agnoloni, Tommaso ; Aires, José ; Albini, Monica ; Alkorta, Jon ; Antiba-Cartazo, Iván ; Arrieta, Ekain ; Barcala, Mario ; Bardanca, Daniel ; Barkarson, Starkaður ; Bartolini, Roberto ; Battistoni, Roberto ; Bel, Nuria ; Bonet Ramos, Maria del Mar ; Calzada Pérez, María ; Cardoso, Aida ; Çöltekin, Çağrı ; Coole, Matthew ; Darģis, Roberts ; de Does, Jesse ; de Libano, Ruben ; Depoorter, Griet ; Depuydt, Katrien ; Diwersy, Sascha ; Dodé, Réka ; Fernandez, Kike ; Fernández Rei, Elisa ; Frontini, Francesca ; Garcia, Marcos ; García Díaz, Noelia ; García Louzao, Pedro ; Gavriilidou, Maria ; Gkoumas, Dimitris ; Grigorov, Ilko ; Grigorova, Vladislava ; Haltrup Hansen, Dorte ; Iruskieta, Mikel ; Jarlbrink, Johan ; Jelencsik-Mátyus, Kinga ; Jongejan, Bart ; Kahusk, Neeme ; Kirnbauer, Martin ; Kryvenko, Anna ; Ligeti-Nagy, Noémi ; Ljubešić, Nikola ; Luxardo, Giancarlo ; Magariños, Carmen ; Magnusson, Måns ; Marchetti, Carlo ; Marx, Maarten ; Meden, Katja ; Mendes, Amália ; Mochtak, Michal ; Mölder, Martin ; Montemagni, Simonetta ; Navarretta, Costanza ; Nitoń, Bartłomiej ; Norén, Fredrik Mohammadi ; Nwadukwe, Amanda ; Ojsteršek, Mihael ; Pančur, Andrej ; Papavassiliou, Vassilis ; Pereira, Rui ; Pérez Lago, María ; Piperidis, Stelios ; Pirker, Hannes ; Pisani, Marilina ; Pol, Henk van der ; Prokopidis, Prokopis ; Quochi, Valeria ; Rayson, Paul ; Regueira, Xosé Luís ; Rii, Andriana ; Rudolf, Michał ; Ruisi, Manuela ; Rupnik, Peter ; Schopper, Daniel ; Simov, Kiril ; Sinikallio, Laura ; Skubic, Jure ; Tamper, Minna ; Tungland, Lars Magne ; Tuominen, Jouni ; van Heusden, Ruben ; Varga, Zsófia ; Vázquez Abuín, Marta ; Venturi, Giulia ; Vidal Miguéns, Adrián ; Vider, Kadri ; Vivel Couso, Ainhoa ; Vladu, Adina Ioana ; Wissik, Tanja ; Yrjänäinen, Väinö ; Zevallos, Rodolfo ; Fišer, Darja en
dc.description.fulltext open en
dc.description.international si en
dc.description.numberofauthors 100 -
dc.identifier.source manual *
dc.identifier.uri https://hdl.handle.net/20.500.14243/483001 -
dc.identifier.url http://hdl.handle.net/11356/1911 en
dc.language.iso eng en
dc.language.iso ita en
dc.language.iso baq en
dc.language.iso bos en
dc.language.iso bul en
dc.language.iso cat en
dc.language.iso cze en
dc.language.iso hrv en
dc.language.iso dan en
dc.language.iso est en
dc.language.iso fin en
dc.language.iso fre en
dc.language.iso glg en
dc.language.iso gre en
dc.language.iso lav en
dc.language.iso nor en
dc.language.iso pol en
dc.language.iso por en
dc.language.iso rus en
dc.language.iso srp en
dc.language.iso slv en
dc.language.iso spa en
dc.language.iso swe en
dc.language.iso ger en
dc.language.iso tur en
dc.language.iso ukr en
dc.language.iso hun en
dc.relation.medium ELETTRONICO en
dc.relation.projectAcronym ParlaMint en
dc.relation.projectAwardNumber - en
dc.relation.projectAwardTitle ParlaMint: Comparable and Interoperable Parliamentary Corpora en
dc.relation.projectFunderName CLARIN-ERIC en
dc.relation.projectFundingStream - en
dc.subject.keywordseng ParlaCLARIN, linguistic annotation, pos-tagging, Named Entity Recognition, linguistic dependency annotation, UD -
dc.subject.singlekeyword ParlaCLARIN *
dc.subject.singlekeyword linguistic annotation *
dc.subject.singlekeyword pos-tagging *
dc.subject.singlekeyword Named Entity Recognition *
dc.subject.singlekeyword linguistic dependency annotation *
dc.subject.singlekeyword UD *
dc.title Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 4.1 en
dc.type.driver info:eu-repo/semantics/other -
dc.type.full 05 Altro::05.10 Dataset it
dc.type.miur 295 -
iris.mediafilter.data 2025/04/03 04:06:51 *
iris.orcid.lastModifiedDate 2025/03/06 11:42:04 *
iris.orcid.lastModifiedMillisecond 1741257724509 *
iris.sitodocente.maxattempts 1 -
Appears in types: 05.10 Dataset
Files in this item:
File Size Format
link_dataset_ParlaMint_4.1.pdf

open access

Description: Link to the dataset
Type: Other attached material
License: Creative Commons
Size: 485.94 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/483001