OpenAIRE OpenOrgs database

Artini M.; De Bonis M.; Manghi P.; Atzori C.; Bardi A.; Baglioni M.
2021

Abstract

In the OpenAIRE context, research organizations are aggregated from several datasources. This often leads to duplication, because the same organization can be provided by multiple datasources. Deduplication is the fundamental task that solves this problem. Deduplication in OpenAIRE follows three main stages:
1. clustering of entities;
2. pairwise comparison of entities in the same cluster to draw similarity relations;
3. identification of connected components to create representative entities that group all the duplicates of each organization.
Because the pairwise comparison stage is fully automatic, it can produce many false positives and false negatives. The software available in this release provides the OpenOrgs web application: a web interface for collecting user feedback on the deduplication of organizations. A user can edit an organization's metadata and approve or reject the similarity relations suggested by the deduplication algorithm. The deduplication algorithm takes advantage of this feedback to increase the precision and recall of its results. The organizations resulting from the feedback-enhanced deduplication are indexed and subsequently exposed by the OpenAIRE portal. The application is distributed as part of the dnet-applications module, which contains web applications developed within the OpenAIRE-Connect and OpenAIRE-Advance projects.
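The three stages and the role of user feedback can be illustrated with a minimal Python sketch. This is not the OpenAIRE implementation: the clustering key (`blocking_key`), the name comparator (`similar`), its 0.85 threshold, and the way approved and rejected relations are applied are all assumptions chosen only to make the pipeline concrete.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(name):
    # Hypothetical clustering key: first three alphanumeric characters, lowercased.
    return "".join(ch for ch in name.lower() if ch.isalnum())[:3]


def similar(a, b, threshold=0.85):
    # Hypothetical comparator: plain string similarity on lowercased names.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def deduplicate(orgs, approved=frozenset(), rejected=frozenset()):
    """Group duplicate organizations.

    orgs maps an organization id to its name; approved and rejected are
    collections of frozenset({id1, id2}) pairs standing in for user feedback.
    """
    # Stage 1: clustering of entities by key.
    clusters = defaultdict(list)
    for org_id, name in orgs.items():
        clusters[blocking_key(name)].append(org_id)

    # Stage 2: pairwise comparisons inside each cluster yield similarity relations.
    relations = set()
    for members in clusters.values():
        for a, b in combinations(members, 2):
            if similar(orgs[a], orgs[b]):
                relations.add(frozenset((a, b)))

    # Assumed feedback policy: approved pairs are added, rejected pairs removed.
    relations = (relations | set(approved)) - set(rejected)

    # Stage 3: connected components over the relations, via union-find.
    parent = {org_id: org_id for org_id in orgs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pair in relations:
        a, b = tuple(pair)
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for org_id in orgs:
        groups[find(org_id)].add(org_id)
    return sorted(sorted(group) for group in groups.values())
```

In this sketch an approved pair can merge organizations that the automatic comparison never saw together (a false negative), while a rejected pair splits a group that was merged by mistake (a false positive). Each returned group would correspond to one representative organization.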
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Persistent identifiers
Deduplication
Files in this product:
File: dnet-applications-3.1.8.zip
Access: open access
Description: https://openpolicyfinder.jisc.ac.uk/
Type: Other attached material
License: Creative Commons
Size: 33.3 MB
Format: Zip File

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/574027