What's all the Data about? - Creating Structured Profiles of Linked Data on the Web

Davide Taibi
2014

Abstract

The emergence of the Web of Data, in particular Linked Open Data (LOD) [1], has led to an abundance of data available on the Web. Data is shared as part of datasets, which often contain inter-dataset links [6]; these links are mostly concentrated on established datasets such as DBpedia. Datasets vary significantly with respect to represented resource types, currentness, coverage of topics and domains, size, languages used, coherence, accessibility [3] and general quality aspects. The challenges arising from such diversity are underlined by the limited reuse of datasets from the LOD Cloud, where reuse and linking often focus on well-known datasets like DBpedia. Descriptive and reliable metadata are therefore paramount to enable targeted search, assessment and reuse of datasets. To address these issues, and building on earlier work [4], we propose an automated approach for creating structured profiles that describe the topic coverage of individual datasets. The approach combines sampling, topic extraction and topic ranking techniques. The sampling process is used to determine the best trade-off between scalability and profiling accuracy. Topic ranking adapts the graph-based ranking models PageRank, K-Step Markov and HITS so that prior knowledge is introduced into the computation of vertex importance [7]. Finally, the generated profiles are exposed as part of a public dataset based on the Vocabulary of Interlinked Datasets (VoID) and the newly introduced Vocabulary of Links (VoL), which describes the degree of relatedness between datasets and topics.
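The idea of ranking topics with prior knowledge can be illustrated with a minimal sketch of PageRank with priors on a small dataset-topic graph. This is not the authors' implementation: the function name `pagerank_with_priors`, the parameter `beta`, the toy edges and the choice to place the entire prior on one dataset are assumptions made purely for illustration.

```python
from collections import defaultdict


def pagerank_with_priors(edges, priors, beta=0.15, iterations=100):
    """Power iteration for PageRank with priors on an undirected graph.

    edges:  iterable of (u, v) pairs, e.g. (dataset, topic)
    priors: dict mapping vertex -> non-negative prior weight
    beta:   probability of jumping back to the prior distribution
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    vertices = sorted(neighbours)

    # Normalise the priors so they form a probability distribution.
    total = sum(priors.get(v, 0.0) for v in vertices)
    if total <= 0:
        raise ValueError("at least one vertex needs a positive prior")
    prior = {v: priors.get(v, 0.0) / total for v in vertices}

    rank = dict(prior)
    for _ in range(iterations):
        # With probability beta, restart according to the priors.
        nxt = {v: beta * prior[v] for v in vertices}
        # Otherwise, spread each vertex's score evenly over its neighbours.
        for u in vertices:
            share = (1.0 - beta) * rank[u] / len(neighbours[u])
            for v in neighbours[u]:
                nxt[v] += share
        rank = nxt
    return rank


# Hypothetical toy graph: two datasets linked to the topics they mention.
edges = [
    ("dataset_a", "Biology"),
    ("dataset_a", "Medicine"),
    ("dataset_b", "Medicine"),
    ("dataset_b", "Economics"),
]
# All prior mass on dataset_a, so scores express relatedness to dataset_a.
scores = pagerank_with_priors(edges, priors={"dataset_a": 1.0})
topics = ["Biology", "Medicine", "Economics"]
ranked_topics = sorted(topics, key=lambda t: scores[t], reverse=True)
```

In this sketch the restart distribution plays the role of the prior knowledge: topics reachable from `dataset_a` in few steps receive higher scores than topics connected only through another dataset, so `ranked_topics` places "Economics" last.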
Istituto per le Tecnologie Didattiche - ITD - Sede Genova
978-1-4503-2745-9
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/229999