This paper describes Armil, a meta-search engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Strik- ing the right balance between running time and cluster well- formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the °y by processing only the snippets provided by the auxil- iary search engines, and use no external sources of knowl- edge. Clustering is performed by means of a fast version of the furthest-point-¯rst algorithm for metric k-center cluster- ing. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering ef- fectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Di- rectory Project hierarchy. According to two widely accepted \external' metrics of clustering quality, Armil achieves bet- ter performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms. On a standard 1GHz ma- chine, Armil performs clustering and labelling altogether in less than one second.

Cluster Generation and Cluster Labelling for Web Snippets:A Fast and Accurate Hierarchical Solution

Geraci F;Pellegrini M;
2007

Abstract

This paper describes Armil, a meta-search engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Strik- ing the right balance between running time and cluster well- formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the °y by processing only the snippets provided by the auxil- iary search engines, and use no external sources of knowl- edge. Clustering is performed by means of a fast version of the furthest-point-¯rst algorithm for metric k-center cluster- ing. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering ef- fectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Di- rectory Project hierarchy. According to two widely accepted \external' metrics of clustering quality, Armil achieves bet- ter performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms. On a standard 1GHz ma- chine, Armil performs clustering and labelling altogether in less than one second.
2007
Istituto di informatica e telematica - IIT
Web snippets
Clustering
Meta-search engines
Metric spaces
Information gain
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/24781
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact