Conventional approaches to XML clustering by content and structure are generally affected by a limitation due to the adoption of the bag-of-word model for the representation of their textual contents. This choice may lead to consider structure-constrained textual items of separate XML documents as related, even though the actual meaning of such items in their respective contexts is different. To overcome such a limitation, we propose XML clustering by structure-constrained phrases. The latter is a previously unexplored method relying on the more accurate bag-of-phrase model of the XML textual content, with which to better preserve the meaning of the structure-constrained content items for improved clustering effectiveness. In order to conduct an in-depth and systematic study of the effectiveness of the proposed method, we develop a parameter-free prototypical approach to XML partitioning, which projects the XML documents into a space of XML features representing fixed-length sequences of adjacent textual items in the context of root-to-leaf paths. Feature selection without any tunable threshold is used to choose a subset of the XML features on the basis of their relevance to clustering, which is assessed through a new scoring scheme. A comparative experimentation on real-world benchmark XML corpora reveals a higher effectiveness than several state-of-the-art competitors.
Fully-Automatic XML Clustering by Structure-Constrained Phrases
Gianni Costa;Riccardo Ortale
2015
Abstract
Conventional approaches to XML clustering by content and structure are generally affected by a limitation due to the adoption of the bag-of-word model for the representation of their textual contents. This choice may lead to consider structure-constrained textual items of separate XML documents as related, even though the actual meaning of such items in their respective contexts is different. To overcome such a limitation, we propose XML clustering by structure-constrained phrases. The latter is a previously unexplored method relying on the more accurate bag-of-phrase model of the XML textual content, with which to better preserve the meaning of the structure-constrained content items for improved clustering effectiveness. In order to conduct an in-depth and systematic study of the effectiveness of the proposed method, we develop a parameter-free prototypical approach to XML partitioning, which projects the XML documents into a space of XML features representing fixed-length sequences of adjacent textual items in the context of root-to-leaf paths. Feature selection without any tunable threshold is used to choose a subset of the XML features on the basis of their relevance to clustering, which is assessed through a new scoring scheme. A comparative experimentation on real-world benchmark XML corpora reveals a higher effectiveness than several state-of-the-art competitors.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.