Parallel sense-annotated corpus ELEXIS-WSD 1.0

Valeria Quochi; Monica Monachini; Francesca Frontini
2022

Abstract

ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, we filtered out sentences with fewer than 5 words and fewer than 2 polysemous words. Subsequently, to obtain datasets in the other nine target languages, the corresponding WikiMatrix translation of each selected English sentence into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language.
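The length and polysemy thresholds described above can be sketched as a simple sentence filter. This is only an illustration under stated assumptions, not the authors' actual pipeline: the function name `keep_sentence`, the `sense_counts` lookup (a hypothetical mapping from a word form to its number of senses in some sense inventory), and the reading that a sentence must meet both minimums to be retained are all assumptions introduced here.

```python
from typing import Dict, List


def keep_sentence(
    tokens: List[str],
    sense_counts: Dict[str, int],
    min_words: int = 5,
    min_polysemous: int = 2,
) -> bool:
    """Return True if a tokenized sentence passes both thresholds.

    Assumption: a word counts as polysemous when the (hypothetical)
    sense inventory lists more than one sense for it.
    """
    polysemous = sum(1 for t in tokens if sense_counts.get(t.lower(), 0) > 1)
    return len(tokens) >= min_words and polysemous >= min_polysemous


# Toy sense inventory, invented purely for this example.
toy_senses = {"bank": 3, "run": 5, "the": 1}

long_enough = ["The", "bank", "will", "run", "today"]
too_short = ["Bank", "run"]
too_few_polysemous = ["The", "bank", "is", "closed", "today"]

kept = [s for s in (long_enough, too_short, too_few_polysemous)
        if keep_sentence(s, toy_senses)]
```

In this toy run only the first sentence is retained; the second is too short and the third contains a single polysemous word.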
DC field Value Language
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC en
dc.authority.people Federico Martelli en
dc.authority.people Roberto Navigli en
dc.authority.people Simon Krek en
dc.authority.people Jelena Kallas en
dc.authority.people Polona Gantar en
dc.authority.people Svetla Koeva en
dc.authority.people Sanni Nimb en
dc.authority.people Bolette Sandford Pedersen en
dc.authority.people Sussi Olsen en
dc.authority.people Margit Langemets en
dc.authority.people Kristina Koppel en
dc.authority.people Tiiu Üksik en
dc.authority.people Kaja Dobrovoljc en
dc.authority.people Rafael Ureña-Ruiz en
dc.authority.people José-Luis Sancho-Sánchez en
dc.authority.people Veronika Lipp en
dc.authority.people Tamás Váradi en
dc.authority.people András Győrffy en
dc.authority.people Simon László en
dc.authority.people Valeria Quochi en
dc.authority.people Monica Monachini en
dc.authority.people Francesca Frontini en
dc.authority.people Carole Tiberius en
dc.authority.people Rob Tempelaars en
dc.authority.people Rute Costa en
dc.authority.people Ana Salgado en
dc.authority.people Jaka Čibej en
dc.authority.people Tina Munda en
dc.authority.project European Lexicographic Infrastructure en
dc.collection.id.s aa7ef5cb-003d-421c-b2c8-870fc44d02e5 *
dc.collection.name 05.10 Dataset *
dc.contributor.appartenenza Istituto di linguistica computazionale "Antonio Zampolli" - ILC *
dc.contributor.appartenenza.mi 918 *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.contributor.area Non assegn *
dc.date.accessioned 2024/02/19 12:55:04 -
dc.date.available 2024/02/19 12:55:04 -
dc.date.firstsubmission 2025/03/03 13:02:07 *
dc.date.issued 2022 -
dc.date.submission 2025/03/03 13:02:07 *
dc.description.abstracteng ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, we filtered out sentences with fewer than 5 words and fewer than 2 polysemous words. Subsequently, to obtain datasets in the other nine target languages, the corresponding WikiMatrix translation of each selected English sentence into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language. -
dc.description.abstractita The dataset is open and can be downloaded from the link provided -
dc.description.affiliations Università la Sapienza, Rome, Italy; Jožef Stefan Institute, Ljubljana, Slovenia; Institute of the Estonian Language, Estonia; University of Ljubljana, Slovenia; Bulgarian Academy of Sciences, Bulgaria; Society for Danish Language and Literature, Denmark; Centre for Language Technology, Denmark; Centro de estudios de la Real Academia Española, Spain; Research Institute for Linguistics, Hungary; Istituto di Linguistica Computazionale "A. Zampolli", Consiglio Nazionale delle Ricerche, Italy; Dutch Language Institute, the Netherlands; Universidade Nova de Lisboa, Portugal -
dc.description.allpeople Martelli, Federico; Navigli, Roberto; Krek, Simon; Kallas, Jelena; Gantar, Polona; Koeva, Svetla; Nimb, Sanni; Sandford Pedersen, Bolette; Olsen, Sussi; Langemets, Margit; Koppel, Kristina; Üksik, Tiiu; Dobrovoljc, Kaja; Ureña-Ruiz, Rafael; Sancho-Sánchez, José-Luis; Lipp, Veronika; Váradi, Tamás; Győrffy, András; László, Simon; Quochi, Valeria; Monachini, Monica; Frontini, Francesca; Tiberius, Carole; Tempelaars, Rob; Costa, Rute; Salgado, Ana; Čibej, Jaka; Munda, Tina -
dc.description.allpeopleoriginal Federico Martelli, Roberto Navigli, Simon Krek, Jelena Kallas, Polona Gantar, Svetla Koeva, Sanni Nimb, Bolette Sandford Pedersen, Sussi Olsen, Margit Langemets, Kristina Koppel, Tiiu Üksik, Kaja Dobrovoljc, Rafael Ureña-Ruiz, José-Luis Sancho-Sánchez, Veronika Lipp, Tamás Váradi, András Győrffy, Simon László, Valeria Quochi, Monica Monachini, Francesca Frontini, Carole Tiberius, Rob Tempelaars, Rute Costa, Ana Salgado, Jaka Čibej, Tina Munda en
dc.description.fulltext open en
dc.description.international yes en
dc.description.numberofauthors 28 -
dc.identifier.uri https://hdl.handle.net/20.500.14243/446359 -
dc.identifier.url http://hdl.handle.net/11356/1674 en
dc.language.iso ita en
dc.language.iso bul en
dc.language.iso dan en
dc.language.iso est en
dc.language.iso dut en
dc.language.iso por en
dc.language.iso slv en
dc.language.iso spa en
dc.language.iso hun en
dc.miur.last.status.update 2025-03-03T12:03:15Z *
dc.relation.medium ELECTRONIC en
dc.relation.projectAcronym ELEXIS en
dc.relation.projectAwardNumber 731015 en
dc.relation.projectAwardTitle European Lexicographic Infrastructure en
dc.relation.projectFunderName European Commission en
dc.relation.projectFundingStream H2020 en
dc.subject.keywords Word Sense Disambiguation -
dc.subject.keywords parallel corpus -
dc.subject.keywords automatic sense disambiguation -
dc.subject.keywords multilingual semantic annotation -
dc.subject.singlekeyword Word Sense Disambiguation *
dc.subject.singlekeyword parallel corpus *
dc.subject.singlekeyword automatic sense disambiguation *
dc.subject.singlekeyword multilingual semantic annotation *
dc.title Parallel sense-annotated corpus ELEXIS-WSD 1.0 en
dc.type.driver info:eu-repo/semantics/other -
dc.type.full 05 Other::05.10 Dataset it
dc.type.miur 295 -
dc.ugov.descaux1 472295 -
iris.mediafilter.data 2025/04/04 04:13:17 *
iris.orcid.lastModifiedDate 2025/03/05 11:41:22 *
iris.orcid.lastModifiedMillisecond 1741171282645 *
iris.sitodocente.maxattempts 1 -
Appears in types: 05.10 Dataset
Files in this item:
Parallel sense-annotated corpus ELEXIS-WSD 1.0 - Scheda catalogo.pdf

open access

Description: CLARIN.SI catalogue metadata descriptions of the deposited dataset
Type: Publisher's version (PDF)
License: Creative Commons
Size: 741.17 kB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/446359