XML Clustering by Structure-Constrained Phrases: A Fully-Automatic Approach Using Contextualized N-Grams

Riccardo Ortale
2017

Abstract

Representing the textual data of XML documents with the bag-of-words model may not be beneficial when clustering XML documents by content and structure. Indeed, the occurrence of similar structure-constrained textual items across distinct XML documents may make the documents appear related, even when the order in which the items occur in their respective contexts implies different meanings. We propose XML clustering by structure-constrained phrases, a new method that better captures the meaning of the structure-constrained textual items of XML documents by resorting to the more accurate bag-of-phrases model for improved clustering effectiveness. To conduct an in-depth and systematic study of the validity of the proposed method, we develop a parameter-free approach that projects the XML documents into a space of XML features corresponding to sequences of textual items in the context of root-to-leaf paths. Automatic feature selection chooses a subset of XML features, whose relevance is assessed through an innovative scoring scheme. The approach can operate on representations of the XML documents over both fixed- and mixed-length sequences of contextualized textual items, and a novel criterion is presented for combining XML features of mixed lengths. A comparative evaluation on real-world benchmark XML corpora reveals the superior effectiveness of our approach. This highlights the potential of XML clustering by structure-constrained phrases and encourages further research. The scalability of the approach is also investigated.
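To make the notion of XML features more concrete, the following is a minimal illustrative sketch, not the implementation from the paper. It assumes that a feature is a pair made of a root-to-leaf path and a word n-gram taken from the text of that leaf. The function name `contextualized_ngrams`, the regular-expression tokenizer, and the two toy documents are hypothetical choices made for illustration; the feature selection, scoring scheme, and mixed-length combination criterion mentioned in the abstract are not reproduced.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def contextualized_ngrams(xml_text, n_values=(1, 2)):
    """Count (root-to-leaf path, word n-gram) pairs in one XML document.

    Hypothetical helper: only the text of leaf elements is considered,
    and tokens are lowercased runs of word characters.
    """
    root = ET.fromstring(xml_text)
    features = Counter()

    def visit(node, path):
        path = path + (node.tag,)
        children = list(node)
        if not children:
            # Leaf element: tokenize its text and emit n-grams in path context.
            tokens = re.findall(r"\w+", (node.text or "").lower())
            for n in n_values:
                for i in range(len(tokens) - n + 1):
                    features[("/".join(path), tuple(tokens[i:i + n]))] += 1
        for child in children:
            visit(child, path)

    visit(root, ())
    return features


# Two toy documents whose titles share the same words in a different order.
doc_a = "<paper><title>data mining</title><body>mining data streams</body></paper>"
doc_b = "<paper><title>mining data</title><body>data streams</body></paper>"

feats_a = contextualized_ngrams(doc_a)
feats_b = contextualized_ngrams(doc_b)
```

In this toy case the unigram features under `paper/title` are identical for both documents, whereas the bigram features differ (`("data", "mining")` versus `("mining", "data")`), which is the kind of word-order distinction a bag-of-words representation cannot express.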
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
Semi-structured data analysis
XML clustering by structure and nested text
Structure-constrained phrases
Contextualized word n-grams of fixed and mixed length

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/334012