CNR Institutional Research Information System

ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and sentences with fewer than 2 polysemous words were filtered out. Subsequently, to obtain datasets in the other nine target languages, the corresponding WikiMatrix translation into each of those languages was retrieved for each selected English sentence. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language.
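The length and polysemy filter mentioned in the description can be illustrated with a minimal Python sketch. It is not the procedure used by the corpus authors: the whitespace tokenization, the toy sense inventory, and all names (`keep_sentence`, `toy_sense_count`, `TOY_SENSES`) are assumptions made for illustration only.

```python
from typing import Callable, List

MIN_WORDS = 5        # threshold stated in the description
MIN_POLYSEMOUS = 2   # threshold stated in the description


def keep_sentence(tokens: List[str], sense_count: Callable[[str], int]) -> bool:
    """Return True if a sentence has at least MIN_WORDS tokens and
    at least MIN_POLYSEMOUS tokens with more than one sense."""
    if len(tokens) < MIN_WORDS:
        return False
    polysemous = sum(1 for tok in tokens if sense_count(tok.lower()) > 1)
    return polysemous >= MIN_POLYSEMOUS


# Hypothetical toy sense inventory; a real setup would query a lexical resource.
TOY_SENSES = {"bank": 2, "old": 2, "plant": 3, "river": 1, "stood": 1, "near": 1}


def toy_sense_count(word: str) -> int:
    """Number of senses for a word in the toy inventory (0 if unknown)."""
    return TOY_SENSES.get(word, 0)


if __name__ == "__main__":
    for text in ("The old bank stood near the river", "Old bank plant"):
        # Whitespace splitting is an assumed, simplistic tokenization.
        print(text, "->", keep_sentence(text.split(), toy_sense_count))
```

In this sketch a sentence is kept only when it passes both thresholds, which is one reading of the filtering step described above.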

The dataset is open and can be downloaded from the link indicated.

Parallel sense-annotated corpus ELEXIS-WSD 1.0

Federico Martelli; Roberto Navigli; Simon Krek; Jelena Kallas; Polona Gantar; Svetla Koeva; Sanni Nimb; Bolette Sandford Pedersen; Sussi Olsen; Margit Langemets; Kristina Koppel; Tiiu Üksik; Kaja Dobrovoljc; Rafael Ureña-Ruiz; José-Luis Sancho-Sánchez; Veronika Lipp; Tamás Váradi; András Győrffy; Simon László; Valeria Quochi; Monica Monachini; Francesca Frontini; Carole Tiberius; Rob Tempelaars; Rute Costa; Ana Salgado; Jaka Čibej; Tina Munda

2022


Full record (DC)

DC field	Value	Language
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	en
dc.authority.people	Federico Martelli	en
dc.authority.people	Roberto Navigli	en
dc.authority.people	Simon Krek	en
dc.authority.people	Jelena Kallas	en
dc.authority.people	Polona Gantar	en
dc.authority.people	Svetla Koeva	en
dc.authority.people	Sanni Nimb	en
dc.authority.people	Bolette Sandford Pedersen	en
dc.authority.people	Sussi Olsen	en
dc.authority.people	Margit Langemets	en
dc.authority.people	Kristina Koppel	en
dc.authority.people	Tiiu Üksik	en
dc.authority.people	Kaja Dobrovoljc	en
dc.authority.people	Rafael Ureña-Ruiz	en
dc.authority.people	José-Luis Sancho-Sánchez	en
dc.authority.people	Veronika Lipp	en
dc.authority.people	Tamás Váradi	en
dc.authority.people	András Győrffy	en
dc.authority.people	Simon László	en
dc.authority.people	Valeria Quochi	en
dc.authority.people	Monica Monachini	en
dc.authority.people	Francesca Frontini	en
dc.authority.people	Carole Tiberius	en
dc.authority.people	Rob Tempelaars	en
dc.authority.people	Rute Costa	en
dc.authority.people	Ana Salgado	en
dc.authority.people	Jaka Čibej	en
dc.authority.people	Tina Munda	en
dc.authority.project	European Lexicographic Infrastructure	en
dc.collection.id.s	aa7ef5cb-003d-421c-b2c8-870fc44d02e5	*
dc.collection.name	05.10 Dataset	*
dc.contributor.appartenenza	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	*
dc.contributor.appartenenza.mi	918	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.contributor.area	Non assegn	*
dc.date.accessioned	2024/02/19 12:55:04	-
dc.date.available	2024/02/19 12:55:04	-
dc.date.firstsubmission	2025/03/03 13:02:07	*
dc.date.issued	2022	-
dc.date.submission	2025/03/03 13:02:07	*
dc.description.abstracteng	ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and sentences with fewer than 2 polysemous words were filtered out. Subsequently, to obtain datasets in the other nine target languages, the corresponding WikiMatrix translation into each of those languages was retrieved for each selected English sentence. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language.	-
dc.description.abstractita	The dataset is open and can be downloaded from the link indicated.	-
dc.description.affiliations	Università la Sapienza, Rome, Italy; Jožef Stefan Institute, Ljubljana, Slovenia; Institute of the Estonian Language, Estonia; University of Ljubljana, Slovenia; Bulgarian Academy of Sciences, Bulgaria; Society for Danish Language and Literature, Denmark; Centre for Language Technology, Denmark; Centro de estudios de la Real Academia Española, Spain; Research Institute for Linguistics, Hungary; Istituto di Linguistica Computazionale "A. Zampolli", Consiglio Nazionale delle Ricerche, Italy; Dutch Language Institute, the Netherlands; Universidade Nova de Lisboa, Portugal	-
dc.description.allpeople	Martelli, Federico; Navigli, Roberto; Krek, Simon; Kallas, Jelena; Gantar, Polona; Koeva, Svetla; Nimb, Sanni; Sandford Pedersen, Bolette; Olsen, Sussi; Langemets, Margit; Koppel, Kristina; Üksik, Tiiu; Dobrovoljc, Kaja; Ureña-Ruiz, Rafael; Sancho-Sánchez, José-Luis; Lipp, Veronika; Váradi, Tamás; Győrffy, András; László, Simon; Quochi, Valeria; Monachini, Monica; Frontini, Francesca; Tiberius, Carole; Tempelaars, Rob; Costa, Rute; Salgado, Ana; Čibej, Jaka; Munda, Tina	-
dc.description.allpeopleoriginal	Federico Martelli, Roberto Navigli, Simon Krek, Jelena Kallas, Polona Gantar, Svetla Koeva, Sanni Nimb, Bolette Sandford Pedersen, Sussi Olsen, Margit Langemets, Kristina Koppel, Tiiu Üksik, Kaja Dobrovoljc, Rafael Ureña-Ruiz, José-Luis Sancho-Sánchez, Veronika Lipp, Tamás Váradi, András Győrffy, Simon László, Valeria Quochi, Monica Monachini, Francesca Frontini, Carole Tiberius, Rob Tempelaars, Rute Costa, Ana Salgado, Jaka Čibej, Tina Munda	en
dc.description.fulltext	open	en
dc.description.international	si	en
dc.description.numberofauthors	28	-
dc.identifier.uri	https://hdl.handle.net/20.500.14243/446359	-
dc.identifier.url	http://hdl.handle.net/11356/1674	en
dc.language.iso	ita	en
dc.language.iso	bul	en
dc.language.iso	dan	en
dc.language.iso	est	en
dc.language.iso	dut	en
dc.language.iso	por	en
dc.language.iso	slv	en
dc.language.iso	spa	en
dc.language.iso	hun	en
dc.miur.last.status.update	2025-03-03T12:03:15Z	*
dc.relation.medium	ELETTRONICO	en
dc.relation.projectAcronym	ELEXIS	en
dc.relation.projectAwardNumber	731015	en
dc.relation.projectAwardTitle	European Lexicographic Infrastructure	en
dc.relation.projectFunderName	European Commission	en
dc.relation.projectFundingStream	H2020	en
dc.subject.keywords	Word Sense Disambiguation	-
dc.subject.keywords	parallel corpus	-
dc.subject.keywords	automatic word sense disambiguation	-
dc.subject.keywords	multilingual semantic annotation	-
dc.subject.singlekeyword	Word Sense Disambiguation	*
dc.subject.singlekeyword	parallel corpus	*
dc.subject.singlekeyword	automatic word sense disambiguation	*
dc.subject.singlekeyword	multilingual semantic annotation	*
dc.title	Parallel sense-annotated corpus ELEXIS-WSD 1.0	en
dc.type.driver	info:eu-repo/semantics/other	-
dc.type.full	05 Altro::05.10 Dataset	it
dc.type.miur	295	-
dc.ugov.descaux1	472295	-
iris.mediafilter.data	2025/04/04 04:13:17	*
iris.orcid.lastModifiedDate	2025/03/05 11:41:22	*
iris.orcid.lastModifiedMillisecond	1741171282645	*
iris.sitodocente.maxattempts	1	-
Appears in types:	05.10 Dataset

Files in this item:

File	Description	Type	License	Access	Size	Format
Parallel sense-annotated corpus ELEXIS-WSD 1.0 - Scheda catalogo.pdf	CLARIN.SI catalogue metadata descriptions of the deposited dataset	Publisher's version (PDF)	Creative Commons	Open access	741.17 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/446359
