Enhancing author name disambiguation workflows in big data scholarly knowledge graphs

De Bonis, M.

Open Science, defined by its commitment to transparency, collaboration, openness, and accessibility, has deeply affected scientific research. Following this paradigm, scientists produce and publish research data and software alongside research publications to enable reproducibility, monitoring, and assessment of science. In this context, Scholarly Knowledge Graphs (SKGs) are “big data” metadata collections that play a crucial role in research discovery and assessment. They aggregate bibliographic metadata records describing research products, together with the semantic relationships among those products (e.g., citations, versions) and between them and other entities such as organizations, authors, and funders. Examples of SKGs are the OpenAIRE Graph, Google Scholar, OpenAlex, Semantic Scholar, OpenCitations, and ResearchGraph.org. However, constructing and maintaining SKGs demands innovative solutions to the scalability, heterogeneity, duplication, inconsistency, and incompleteness challenges introduced by the metadata sources to be aggregated. Motivated by the needs of Open Science and the challenges posed by SKG construction, this Ph.D. thesis contributes to the field of Author Name Disambiguation (AND), the long-standing problem of identifying and removing duplicate author nodes that represent the same author in the SKG. Acknowledging the pivotal role of AND, the thesis identifies two intertwined challenges in duplicate resolution: the efficiency challenge, which stems from the inherent quadratic complexity of comparing hundreds of millions of author nodes; and the effectiveness challenge, which is introduced by the efficiency optimization strategies, since they forgo part of the matches, and is aggravated by the scarcity of metadata available to compare author nodes, often limited to the name string.
To address the efficiency challenge, the thesis introduces FDup, a framework designed to rethink and enhance the traditional disambiguation workflow. At its core, FDup optimizes the similarity match phase by incorporating a decision tree-based comparison technique. This approach makes the disambiguation workflow customizable and efficient and enables parallelization, which is crucial for handling the large datasets typical of Scholarly Knowledge Graphs. To address the effectiveness challenge, the thesis leverages Graph Neural Networks (GNNs), which have recently been applied successfully to node classification, graph classification, and link prediction. The proposed contributions are two dedicated GNN architectures that enhance the effectiveness of Author Name Disambiguation by evaluating the outputs of a disambiguation algorithm: the first technique evaluates similarity relationships with an attentive neural network integrating GraphSAGE models; the second technique evaluates groups of duplicates with a combination of Graph Attention Network (GAT) and Long Short-Term Memory (LSTM) components. In summary, this thesis is a forward-looking contribution within the landscape of Open Science and Scholarly Knowledge Graphs. By introducing novel frameworks and applying advanced techniques such as Graph Neural Networks, the thesis not only addresses the current challenges but also lays the groundwork for the continued evolution of Open Science practices and the effective use of Scholarly Knowledge Graphs as scientific knowledge keeps expanding.
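
As an illustration of the decision tree-based comparison idea described above, the following minimal Python sketch evaluates pairs of author records with a small tree of comparators, checking cheap or decisive evidence first and stopping as soon as a decision is reached. It is not FDup's actual API: every name in it (`Node`, `TREE`, `is_match`, `blocking_key`, `find_duplicates`), the record fields, the thresholds, and the surname-plus-initial blocking key are assumptions made for the example. Blocking is used here as one common way to avoid the quadratic number of comparisons, at the cost of missing matches that fall into different blocks.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable

MATCH, NO_MATCH = "MATCH", "NO_MATCH"  # terminal decisions of the tree


@dataclass
class Node:
    """One comparison step (hypothetical structure, not FDup's)."""
    compare: Callable[[dict, dict], float]  # returns a score in [0, 1]
    threshold: float
    if_true: str    # next node name, or a terminal decision
    if_false: str


def same_orcid(a: dict, b: dict) -> float:
    # 1.0 only when both records carry the same non-empty identifier.
    return 1.0 if a.get("orcid") and a.get("orcid") == b.get("orcid") else 0.0


def name_similarity(a: dict, b: dict) -> float:
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def coauthor_overlap(a: dict, b: dict) -> float:
    # Jaccard overlap of co-author sets; 0.0 when both sets are empty.
    ca, cb = set(a.get("coauthors", [])), set(b.get("coauthors", []))
    return len(ca & cb) / len(ca | cb) if ca | cb else 0.0


# Illustrative tree with arbitrary thresholds: decisive evidence first.
TREE = {
    "orcid": Node(same_orcid, 1.0, MATCH, "name"),
    "name": Node(name_similarity, 0.8, "coauthors", NO_MATCH),
    "coauthors": Node(coauthor_overlap, 0.3, MATCH, NO_MATCH),
}


def is_match(a: dict, b: dict, tree=TREE, start: str = "orcid") -> bool:
    """Walk the tree until a terminal decision is reached (early exit)."""
    current = start
    while current not in (MATCH, NO_MATCH):
        node = tree[current]
        passed = node.compare(a, b) >= node.threshold
        current = node.if_true if passed else node.if_false
    return current == MATCH


def blocking_key(author: dict) -> str:
    # Assumed key: surname plus first initial, e.g. "rossi_m".
    parts = author["name"].lower().split()
    return parts[-1] + "_" + parts[0][0]


def find_duplicates(authors: list) -> list:
    """Compare only records sharing a blocking key, not all pairs."""
    blocks = {}
    for author in authors:
        blocks.setdefault(blocking_key(author), []).append(author)
    pairs = []
    for block in blocks.values():  # blocks are independent of each other
        for a, b in combinations(block, 2):
            if is_match(a, b):
                pairs.append((a["id"], b["id"]))
    return pairs


authors = [
    {"id": "a1", "name": "Maria Rossi", "orcid": "0000-0001", "coauthors": ["x", "y"]},
    {"id": "a2", "name": "M. Rossi", "orcid": "0000-0001"},
    {"id": "a3", "name": "Maria Rossi", "coauthors": ["x", "z"]},
    {"id": "a4", "name": "Marco Rossi", "coauthors": ["q"]},
    {"id": "a5", "name": "Luca Bianchi"},
]
pairs = find_duplicates(authors)
```

In this toy data, `a1` is paired with `a2` through the shared identifier and with `a3` through name similarity plus co-author overlap, while `a4` is rejected at the co-author step. Because each block is processed independently, the outer loop over blocks is the natural place to distribute work across workers in a parallel setting.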

Enhancing author name disambiguation workflows in big data scholarly knowledge graphs / De Bonis, M. - ELECTRONIC. - (2024).