De-duplication of aggregation authority files

Manghi P; Mikulicic M; Atzori C
2012

Abstract

This paper presents PACE (Programmable Authority Control Engine), an authority control tool conceived to maintain 'aggregation authority files'. These are obtained as continuous aggregations of records originating from a variable set of information systems with heterogeneous and duplicated content. To facilitate record deduplication in the presence of such heterogeneity and dynamicity, PACE user interfaces enable an iterative curation process, where data curators can: (i) configure algorithms for the identification of record duplicates; (ii) open work sessions where algorithm configurations can be run and evaluated; (iii) merge the identified record duplicates to disambiguate the authority file and (iv) repeat this cycle several times. PACE supports a tunable probabilistic similarity measure and performs record matching with a customisable variation of the sorted neighbourhood heuristic. Finally, it addresses the underlying performance and scalability issues by exploiting multi-core parallel processing and Cassandra's storage systems, to support I/O performances that scale up linearly with the number of records.
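As a rough illustration of the sorted neighbourhood heuristic mentioned in the abstract, the following minimal sketch sorts records by a key, slides a fixed-size window over the sorted list, and compares only records that fall within the same window. It is not PACE's implementation. The function names, the weighted per-field similarity (built on Python's `difflib.SequenceMatcher` as a stand-in for the paper's probabilistic measure), the sample records, and the `window` and `threshold` parameters are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b, weights):
    """Weighted average of per-field string similarities, in [0, 1].

    Hypothetical stand-in for a tunable similarity measure: each field
    contributes according to its weight.
    """
    total = sum(weights.values())
    score = 0.0
    for field, weight in weights.items():
        left = a.get(field, "").lower()
        right = b.get(field, "").lower()
        score += weight * SequenceMatcher(None, left, right).ratio()
    return score / total


def sorted_neighbourhood(records, sort_key, weights, window=3, threshold=0.9):
    """Return candidate duplicate pairs as (id, id) tuples.

    Records are sorted by `sort_key`; each record is compared only with
    the `window - 1` records that follow it in sorted order.
    """
    ordered = sorted(records, key=sort_key)
    pairs = []
    for i, record in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            if similarity(record, other, weights) >= threshold:
                pairs.append((record["id"], other["id"]))
    return pairs


# Illustrative records (invented for this sketch).
records = [
    {"id": "a", "name": "Smith, John"},
    {"id": "b", "name": "Zhang, Wei"},
    {"id": "c", "name": "Smith, Jon"},
]

pairs = sorted_neighbourhood(
    records,
    sort_key=lambda r: r["name"].lower(),
    weights={"name": 1.0},
    window=2,
    threshold=0.9,
)
print(pairs)
```

Restricting comparisons to the window keeps the number of similarity evaluations roughly proportional to the number of records times the window size, rather than quadratic in the number of records. The trade-off is that duplicates whose sort keys place them far apart are missed.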
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Authority control
Record deduplication
Record merge
Record aggregations
Sorted neighbourhood

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/4698