XML Clustering by Structure-Constrained Phrases: A Fully-Automatic Approach Using Contextualized N-Grams
Riccardo Ortale
2017
Abstract
Using the bag-of-words model to represent the textual data of XML documents may not be beneficial in XML clustering by content and structure. Indeed, the occurrence of similar structure-constrained textual items across distinct XML documents may suggest relatedness, even though the order in which the items occur in their respective contexts may imply different meanings. We propose XML clustering by structure-constrained phrases, a new method that better captures the meaning of the structure-constrained textual items of XML documents by resorting to the more accurate bag-of-phrases model for improved clustering effectiveness. To conduct an in-depth and systematic study of the validity of the proposed method, we develop a parameter-free approach that projects the XML documents into a space of XML features corresponding to sequences of textual items in the context of root-to-leaf paths. Automatic feature selection chooses a subset of XML features, whose relevance is assessed through an innovative scoring scheme. The devised approach can operate with representations of the XML documents over both fixed- and mixed-length sequences of contextualized textual items. A novel criterion is presented to combine XML features with mixed lengths. A comparative experimentation on real-world benchmark XML corpora reveals the superior effectiveness of our approach. This highlights the potential of XML clustering by structure-constrained phrases and encourages further efforts. The scalability of the devised approach is also investigated.
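The idea of contextualized n-grams can be illustrated with a minimal sketch. It is not the approach developed in the paper: the function name `contextualized_ngrams`, the lowercase whitespace tokenization, and the two toy documents are assumptions made only for illustration, and feature selection, scoring, and clustering are left out entirely. The sketch pairs each n-gram of textual items with the root-to-leaf path under which it occurs.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def contextualized_ngrams(xml_text, n=2):
    """Count (root-to-leaf path, n-gram) pairs in one XML document.

    Hypothetical helper for illustration; tokenization is a naive
    lowercase whitespace split.
    """
    root = ET.fromstring(xml_text)
    features = Counter()

    def visit(node, path):
        path = path + (node.tag,)
        if len(node) == 0:
            # Leaf element: its text is constrained by the full path.
            tokens = (node.text or "").lower().split()
            for i in range(len(tokens) - n + 1):
                features[("/".join(path), tuple(tokens[i:i + n]))] += 1
        for child in node:
            visit(child, path)

    visit(root, ())
    return features


# Two toy documents with the same words under the same path, in a different order.
doc_a = "<paper><title>data mining methods</title></paper>"
doc_b = "<paper><title>mining data methods</title></paper>"

# Contextualized unigrams (bag of words) cannot tell the documents apart.
same_unigrams = contextualized_ngrams(doc_a, 1) == contextualized_ngrams(doc_b, 1)

# Contextualized bigrams (phrases) share no feature at all.
shared_bigrams = set(contextualized_ngrams(doc_a, 2)) & set(contextualized_ngrams(doc_b, 2))
```

In this toy case `same_unigrams` is true while `shared_bigrams` is empty, which mirrors the motivation stated above: word order within a structural context can separate documents that a bag-of-words representation treats as identical.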