GDup: De-duplication of Scholarly Communication Big Graphs
Atzori C; Manghi P; Bardi A
2018
Abstract
Today, several online services offer access to information from big scholarly communication graphs, which interlink entities such as publications, authors, datasets, and organizations. Such graphs are often populated over time by aggregating multiple sources and therefore suffer from entity duplication. Although graph deduplication is a well-known and current problem, existing solutions tend to be application-specific and address only a few of the underlying challenges. In this paper, we propose GDup, an integrated, scalable, general-purpose system for entity deduplication over big information graphs. GDup supports practitioners with the functionality needed to realize a full entity deduplication workflow over a generic input graph, including Ground Truth support, end-user feedback, and strategies for identifying and merging duplicates to obtain a disambiguated output graph. GDup is today one of the core components of the OpenAIRE infrastructure production system, which monitors Open Science trends on behalf of the European Commission.