
Approximating Multi-Class Text Classification via Automatic Generation of Training Examples

F Geraci
2017

Abstract

Text classification is among the most widely used machine learning tools in computational linguistics. Web information retrieval is one of the sectors that has benefited most from this technique. Applications range from page classification, used by search engines, to URL classification, used for focused crawling and online time-sensitive applications. Due to the pressing need for the highest possible accuracy, a supervised learning approach is usually preferred when an adequately large set of training examples is available. Nonetheless, since building such an accurate and representative training set often becomes impractical when the number of classes grows beyond a few units, alternative unsupervised or semi-supervised approaches have emerged. Using standard web directories as a source of examples can be prone to undesired effects due, for example, to the presence of maliciously misclassified web pages. In addition, this option is subject to the existence of all the desired classes in the directory hierarchy. In this paper, taking as input a textual description of each class and a set of URLs, we propose a new framework to automatically build a representative training set able to reasonably approximate the classification accuracy obtained with a manually curated training set. Our approach leverages the observation that a non-negligible fraction of website names is the juxtaposition of a few keywords, so the entire URL can often be converted into a meaningful text snippet. When this happens, we can label the URL by measuring its degree of similarity with each class description. The text contained in the pages corresponding to labelled URLs can then be used as a training set for any subsequent classification task (not necessarily on the web). Experiments on a set of 20 thousand web pages belonging to 9 categories show that our auto-labelling framework attains over 88% of the accuracy of a purely supervised classifier trained with manually curated examples.
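The two steps sketched in the abstract (splitting a website name into keywords to obtain a text snippet, then labelling the URL by its similarity to each class description) can be illustrated with a minimal sketch. The vocabulary, class descriptions, similarity threshold, and helper names below are hypothetical choices made for illustration only; the record does not include the paper's actual implementation, which may use different segmentation and similarity techniques.

```python
# Minimal sketch (not the paper's implementation) of URL auto-labelling:
# (1) segment a host name into known keywords, (2) assign the URL to the
# most similar class description. Vocabulary and classes are illustrative.

from urllib.parse import urlparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical keyword vocabulary used for host-name segmentation.
VOCAB = {"credit", "card", "bank", "daily", "news", "sport", "match", "shop"}

def segment(token, vocab=VOCAB):
    """Dynamic-programming segmentation of a concatenated host-name token."""
    n = len(token)
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and token[j:i] in vocab:
                best[i] = best[j] + [token[j:i]]
                break
    return best[n]  # None if the token cannot be fully segmented

def url_to_snippet(url):
    """Convert a URL into a text snippet from its host and path components."""
    parts = urlparse(url)
    host = parts.netloc.lower().removeprefix("www.").rsplit(".", 1)[0]
    words = segment(host) or []
    # Path components are often already separated by hyphens or slashes.
    words += [w for w in parts.path.lower().replace("-", "/").split("/") if w]
    return " ".join(words)

def label_urls(urls, class_descriptions, threshold=0.1):
    """Label each convertible URL with its most similar class description."""
    snippets = {u: s for u in urls if (s := url_to_snippet(u))}
    vec = TfidfVectorizer()
    class_matrix = vec.fit_transform(class_descriptions.values())
    labels = {}
    for url, snippet in snippets.items():
        sims = cosine_similarity(vec.transform([snippet]), class_matrix)[0]
        if sims.max() >= threshold:  # keep only confidently labelled URLs
            labels[url] = list(class_descriptions)[sims.argmax()]
    return labels

if __name__ == "__main__":
    classes = {
        "finance": "bank credit card loan money finance",
        "news":    "daily news press headlines journalism",
        "sport":   "sport football match team score",
    }
    urls = ["http://www.creditcardbank.com", "https://dailysportnews.net/match"]
    print(label_urls(urls, classes))  # e.g. maps the first URL to "finance"
```

Dictionary-based segmentation and TF-IDF cosine similarity are stand-ins here; any segmentation strategy and text-similarity measure could fill the same roles in the framework described by the abstract.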
Istituto di informatica e telematica - IIT
text classification


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/332775