ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and fewer than 2 polysemous words were filtered out. Subsequently, to obtain datasets in the other nine target languages, the corresponding WikiMatrix translation into each of those languages was retrieved for each selected English sentence. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language.
The dataset is open and can be downloaded at the link indicated.
Parallel sense-annotated corpus ELEXIS-WSD 1.0
Valeria Quochi; Monica Monachini; Francesca Frontini
2022

| DC field | Value | Language |
|---|---|---|
| dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | en |
| dc.authority.people | Federico Martelli | en |
| dc.authority.people | Roberto Navigli | en |
| dc.authority.people | Simon Krek | en |
| dc.authority.people | Jelena Kallas | en |
| dc.authority.people | Polona Gantar | en |
| dc.authority.people | Svetla Koeva | en |
| dc.authority.people | Sanni Nimb | en |
| dc.authority.people | Bolette Sandford Pedersen | en |
| dc.authority.people | Sussi Olsen | en |
| dc.authority.people | Margit Langemets | en |
| dc.authority.people | Kristina Koppel | en |
| dc.authority.people | Tiiu Üksik | en |
| dc.authority.people | Kaja Dobrovoljc | en |
| dc.authority.people | Rafael Ureña-Ruiz | en |
| dc.authority.people | José-Luis Sancho-Sánchez | en |
| dc.authority.people | Veronika Lipp | en |
| dc.authority.people | Tamás Váradi | en |
| dc.authority.people | András Győrffy | en |
| dc.authority.people | Simon László | en |
| dc.authority.people | Valeria Quochi | en |
| dc.authority.people | Monica Monachini | en |
| dc.authority.people | Francesca Frontini | en |
| dc.authority.people | Carole Tiberius | en |
| dc.authority.people | Rob Tempelaars | en |
| dc.authority.people | Rute Costa | en |
| dc.authority.people | Ana Salgado | en |
| dc.authority.people | Jaka Čibej | en |
| dc.authority.people | Tina Munda | en |
| dc.authority.project | European Lexicographic Infrastructure | en |
| dc.collection.id.s | aa7ef5cb-003d-421c-b2c8-870fc44d02e5 | * |
| dc.collection.name | 05.10 Dataset | * |
| dc.contributor.appartenenza | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | * |
| dc.contributor.appartenenza.mi | 918 | * |
| dc.contributor.area | Not assigned | * |
| dc.contributor.area | Not assigned | * |
| dc.contributor.area | Not assigned | * |
| dc.date.accessioned | 2024/02/19 12:55:04 | - |
| dc.date.available | 2024/02/19 12:55:04 | - |
| dc.date.firstsubmission | 2025/03/03 13:02:07 | * |
| dc.date.issued | 2022 | - |
| dc.date.submission | 2025/03/03 13:02:07 | * |
| dc.description.abstracteng | ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and fewer than 2 polysemous words were filtered out. Subsequently, to obtain datasets in the other nine target languages, the corresponding WikiMatrix translation into each of those languages was retrieved for each selected English sentence. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language. | - |
| dc.description.abstractita | The dataset is open and can be downloaded at the link indicated. | - |
| dc.description.affiliations | Università La Sapienza, Rome, Italy; Jožef Stefan Institute, Ljubljana, Slovenia; Institute of the Estonian Language, Estonia; University of Ljubljana, Slovenia; Bulgarian Academy of Sciences, Bulgaria; Society for Danish Language and Literature, Denmark; Centre for Language Technology, Denmark; Centro de Estudios de la Real Academia Española, Spain; Research Institute for Linguistics, Hungary; Istituto di Linguistica Computazionale "A. Zampolli", Consiglio Nazionale delle Ricerche, Italy; Dutch Language Institute, the Netherlands; Universidade Nova de Lisboa, Portugal | - |
| dc.description.allpeople | Martelli, Federico; Navigli, Roberto; Krek, Simon; Kallas, Jelena; Gantar, Polona; Koeva, Svetla; Nimb, Sanni; Sandford Pedersen, Bolette; Olsen, Sussi; Langemets, Margit; Koppel, Kristina; Üksik, Tiiu; Dobrovoljc, Kaja; Ureña-Ruiz, Rafael; Sancho-Sánchez, José-Luis; Lipp, Veronika; Váradi, Tamás; Győrffy, András; László, Simon; Quochi, Valeria; Monachini, Monica; Frontini, Francesca; Tiberius, Carole; Tempelaars, Rob; Costa, Rute; Salgado, Ana; Čibej, Jaka; Munda, Tina | - |
| dc.description.allpeopleoriginal | Federico Martelli, Roberto Navigli, Simon Krek, Jelena Kallas, Polona Gantar, Svetla Koeva, Sanni Nimb, Bolette Sandford Pedersen, Sussi Olsen, Margit Langemets, Kristina Koppel, Tiiu Üksik, Kaja Dobrovoljc, Rafael Ureña-Ruiz, José-Luis Sancho-Sánchez, Veronika Lipp, Tamás Váradi, András Győrffy, Simon László, Valeria Quochi, Monica Monachini, Francesca Frontini, Carole Tiberius, Rob Tempelaars, Rute Costa, Ana Salgado, Jaka Čibej, Tina Munda | en |
| dc.description.fulltext | open | en |
| dc.description.international | yes | en |
| dc.description.numberofauthors | 28 | - |
| dc.identifier.uri | https://hdl.handle.net/20.500.14243/446359 | - |
| dc.identifier.url | http://hdl.handle.net/11356/1674 | en |
| dc.language.iso | ita | en |
| dc.language.iso | bul | en |
| dc.language.iso | dan | en |
| dc.language.iso | est | en |
| dc.language.iso | dut | en |
| dc.language.iso | por | en |
| dc.language.iso | slv | en |
| dc.language.iso | spa | en |
| dc.language.iso | hun | en |
| dc.miur.last.status.update | 2025-03-03T12:03:15Z | * |
| dc.relation.medium | ELECTRONIC | en |
| dc.relation.projectAcronym | ELEXIS | en |
| dc.relation.projectAwardNumber | 731015 | en |
| dc.relation.projectAwardTitle | European Lexicographic Infrastructure | en |
| dc.relation.projectFunderName | European Commission | en |
| dc.relation.projectFundingStream | H2020 | en |
| dc.subject.keywords | Word Sense Disambiguation | - |
| dc.subject.keywords | parallel corpus | - |
| dc.subject.keywords | automatic word sense disambiguation | - |
| dc.subject.keywords | multilingual semantic annotation | - |
| dc.subject.singlekeyword | Word Sense Disambiguation | * |
| dc.subject.singlekeyword | parallel corpus | * |
| dc.subject.singlekeyword | automatic word sense disambiguation | * |
| dc.subject.singlekeyword | multilingual semantic annotation | * |
| dc.title | Parallel sense-annotated corpus ELEXIS-WSD 1.0 | en |
| dc.type.driver | info:eu-repo/semantics/other | - |
| dc.type.full | 05 Other::05.10 Dataset | it |
| dc.type.miur | 295 | - |
| dc.ugov.descaux1 | 472295 | - |
| iris.mediafilter.data | 2025/04/04 04:13:17 | * |
| iris.orcid.lastModifiedDate | 2025/03/05 11:41:22 | * |
| iris.orcid.lastModifiedMillisecond | 1741171282645 | * |
| iris.sitodocente.maxattempts | 1 | - |
| Appears in types: | 05.10 Dataset | |

| File | Size | Format |
|---|---|---|
| Parallel sense-annotated corpus ELEXIS-WSD 1.0 - Scheda catalogo.pdf (open access); Description: CLARIN.SI catalogue metadata descriptions of the deposited dataset; Type: Publisher's version (PDF); License: Creative Commons | 741.17 kB | Adobe PDF |

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


