GDup: De-duplication of Scholarly Communication Big Graphs

Atzori C; Manghi P; Bardi A
2018

Abstract

Today, several online services provide access to information from big scholarly communication graphs, which interlink entities such as publications, authors, datasets, and organizations. Such graphs are often populated over time by aggregating multiple sources and therefore suffer from entity duplication. Although graph deduplication is a well-known and current problem, existing solutions tend to be special-purpose and address only a few of the underlying challenges. In this paper, we propose GDup, an integrated, scalable, general-purpose system for entity deduplication over big information graphs. GDup gives practitioners the functionality needed to realize a full entity deduplication workflow over a generic input graph, including Ground Truth support, end-user feedback, and strategies for identifying and merging duplicates to obtain a disambiguated output graph. GDup is today one of the core components of the OpenAIRE infrastructure production system, which monitors Open Science trends on behalf of the European Commission.
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
English
2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT)
142
151
10
978-1-5386-5502-3
https://ieeexplore.ieee.org/document/8606645
IEEE
New York
United States of America
Yes, but type not specified
17-20/12/2018
Zurich
deduplication
information graphs
big data
scholarly communication
3
partially_open
273
info:eu-repo/semantics/conferenceObject
04 Conference contribution::04.01 Contribution in conference proceedings
   OpenAIRE Advancing Open Scholarship
   OpenAIRE-Advance
   H2020
   777541
Files in this item:
prod_401241-doc_139414.pdf: authorized users only; Publisher's version (PDF); 291.96 kB; Adobe PDF
prod_401241-doc_139819.pdf: open access; Publisher's version (PDF); 390.29 kB; Adobe PDF
prod_401241-doc_141095.pdf: open access; Publisher's version (PDF); 390.29 kB; Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/359280
Citations
  • PMC N/A
  • Scopus 5
  • ISI 4