Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1

Tommaso Agnoloni; Francesca Frontini; Simonetta Montemagni; Valeria Quochi; Giulia Venturi; Roberto Bartolini; Andrea Cimino
2021

Abstract

ParlaMint 2.1 is a multilingual set of 17 comparable corpora of parliamentary debates, most of them starting in 2015 and extending to mid-2020, with each corpus about 20 million words in size. Sessions in the corpora are marked as belonging either to the COVID-19 period (from November 1st 2019) or to the "reference" period (before that date). The corpora have extensive metadata, including aspects of the parliament and of the speakers (name, gender, MP status, party affiliation, party coalition/opposition). They are structured into time-stamped terms, sessions and meetings, and each speech is marked with its speaker and the speaker's role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. Some corpora have further information, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The corpora are encoded according to the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/) and have been validated against the compatible, but much stricter, ParlaMint schemas. This entry contains the linguistically marked-up version of the corpus, while the text version is available at http://hdl.handle.net/11356/1432. The ParlaMint.ana linguistic annotation includes tokenization, sentence segmentation, lemmatisation, Universal Dependencies part-of-speech tags, morphological features and syntactic dependencies, and 4-class CoNLL-2003 named entities. Some corpora also have further linguistic annotations, such as PoS tagging or named entities according to language-specific schemes; their corpus TEI headers give further details on the annotation vocabularies and tools.
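
To illustrate how the annotation layers described in the abstract might be read, here is a minimal Python sketch. It is not part of the dataset or its tooling. The fragment in SAMPLE is hand-written and hypothetical, and the element and attribute names it relies on (`u`, `s`, `w`, `pc`, `name`, `who`, `ana`, `lemma`, and an `msd` value of the form `UPosTag=...|Feature=Value`) are assumptions about the encoding; the corpus TEI headers and the ParlaMint schemas are the authority on the actual vocabulary.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment written for illustration only; not taken from the corpus.
SAMPLE = """<u xmlns="http://www.tei-c.org/ns/1.0" who="#SpeakerA" ana="#chair">
  <seg>
    <s>
      <name type="PER"><w lemma="Ana" msd="UPosTag=PROPN">Ana</w></name>
      <w lemma="speak" msd="UPosTag=VERB|Tense=Pres">speaks</w>
      <pc msd="UPosTag=PUNCT">.</pc>
    </s>
  </seg>
</u>"""


def parse_msd(msd):
    """Split an assumed 'UPosTag=VERB|Tense=Pres' string into (upos, features)."""
    feats = dict(item.split("=", 1) for item in msd.split("|") if "=" in item)
    upos = feats.pop("UPosTag", None)
    return upos, feats


def tokens(utterance):
    """Return (form, lemma, upos, features) for every word or punctuation token."""
    rows = []
    for sent in utterance.iter(f"{{{TEI_NS}}}s"):
        for el in sent.iter():
            # Strip the namespace to compare local element names.
            if el.tag.split("}")[-1] in ("w", "pc"):
                upos, feats = parse_msd(el.get("msd", ""))
                rows.append((el.text, el.get("lemma"), upos, feats))
    return rows


def entities(utterance):
    """Return (type, text) for every named-entity span in the utterance."""
    found = []
    for name in utterance.iter(f"{{{TEI_NS}}}name"):
        words = [w.text for w in name.iter(f"{{{TEI_NS}}}w")]
        found.append((name.get("type"), " ".join(words)))
    return found


if __name__ == "__main__":
    u = ET.fromstring(SAMPLE)
    print("speaker:", u.get("who"), "role:", u.get("ana"))
    for row in tokens(u):
        print(row)
    print(entities(u))
```

The sketch uses only the standard library so that it runs without dependencies; for full corpus files, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit given the corpus sizes mentioned above.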
2021
Istituto di linguistica computazionale "Antonio Zampolli" - ILC
English
Italian
Bulgarian
Czech
Danish
French (Other)
Icelandic
Latvian
Lithuanian
Dutch
Polish
Slovenian
Spanish
Turkish
Hungarian
http://hdl.handle.net/11356/1431
covid-19
ParlaCLARIN
CLARIN
linguistic annotation
pos-tagging
Named Entity Recognition
linguistic dependency annotation
UD
parliamentary debates
parliaments
political discourse
The dataset fully complies with the FAIR data principles.
Electronic
44
Erjavec, Tomaž; Ogrodniczuk, Maciej; Osenova, Petya; Ljubešić, Nikola; Simov, Kiril; Grigorova, Vladislava; Rudolf, Michał; Pančur, Andrej; Kopp, Matyáš; Ba...
05 Other::05.10 Dataset
info:eu-repo/semantics/other
open
295
   ParlaMint: Comparable and Interoperable Parliamentary Corpora
   ParlaMint
   CLARIN-ERIC
Files in this item:
ParlaMint_LinguisticallyAnnotated2.1.pdf

open access

Description: Metadata descriptors of the dataset deposited in the CLARIN.SI repository
Type: Other attached material
License: Creative Commons
Size: 915.54 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/446076