Structure-oriented techniques for XML document partitioning
Gianni Costa;Riccardo Ortale
2016
Abstract
Focusing on only one type of structural component when clustering XML documents may produce clusters with some degree of internal structural inhomogeneity, due either to uncaught differences in the overall logical structures of the available XML documents or to an inappropriate choice of the targeted structural component. To overcome these limitations, two approaches to clustering XML documents by multiple heterogeneous structures are proposed. One approach looks at the simultaneous occurrences of such structures across the individual XML documents. The other approach instead combines multiple clusterings of the XML documents, each performed separately with respect to one individual type of structure in isolation. A comparative evaluation over both real and synthetic XML data shows that the effectiveness of the devised approaches is at least on a par with, and even superior to, that of state-of-the-art competitors. Additionally, the empirical evidence reveals that the proposed approaches outperform such competitors in terms of time efficiency.
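To make the second approach more concrete, the following is a minimal sketch of one generic way to combine several clusterings of the same documents, namely a co-association (evidence accumulation) consensus. It is an illustration under stated assumptions and not the method devised in the paper: the function names, the 0.5 threshold, and the three example labelings (imagined as coming from tag paths, subtrees, and tags) are all hypothetical.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of base clusterings that place each pair of documents together."""
    n = len(labelings[0])
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0
    for i, j in combinations(range(n), 2):
        agree = sum(1 for labels in labelings if labels[i] == labels[j])
        matrix[i][j] = matrix[j][i] = agree / len(labelings)
    return matrix


def consensus_clusters(labelings, threshold=0.5):
    """Merge documents whose co-association exceeds the (assumed) threshold."""
    n = len(labelings[0])
    matrix = co_association(labelings)
    parent = list(range(n))  # union-find forest over document indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if matrix[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel the union-find roots as consecutive cluster ids.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Hypothetical labelings of five XML documents, one per structure type.
by_paths = [0, 0, 0, 1, 1]
by_subtrees = [0, 0, 1, 1, 1]
by_tags = [1, 1, 1, 0, 0]
consensus = consensus_clusters([by_paths, by_subtrees, by_tags])
```

In this toy input, the third document is grouped with the first two by two of the three labelings, so the majority decides its placement. The cluster ids used by each base clustering do not need to match, since only pairwise agreement within each labeling is counted.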


