Automatic Detection of Words Associations in Texts based on Joint Distribution of Words Occurrences

Daniele Santoni;Elaheh Pourabbas
2016

Abstract

In this paper, we propose a novel approach for measuring word association based on the joint distribution of word occurrences in a text. Our approach computes the sum of distances between neighboring occurrences of a given word pair and compares it to the same quantity for a vector of randomly generated occurrences. The underlying assumption is that if the distribution of co-occurrences is close to random, or if the two words appear together less frequently than they would by chance, then the words are not semantically related. We devise a distance function S that evaluates the word association rate. Using S, we build a concept tree, which provides a visual and comprehensive representation of keyword associations in a text. To illustrate the effectiveness of our algorithm, we apply it to three different texts and show that the results are consistent and significant with respect to the semantics of the documents. Finally, we compare the results of our algorithm with those obtained by human experts and by the co-occurrence correlation method. We show that our method is consistent with the experts' evaluation and outperforms the co-occurrence correlation method.
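The abstract does not define the function S, so the following Python sketch is only an illustration of the general idea (distances between neighboring occurrences of a word pair, compared against randomly generated occurrences), not the authors' method. Everything in it is an assumption made for illustration: the hypothetical names `positions`, `nearest_distance_sum` and `association_ratio`, the use of token indices as positions, the nearest-neighbor distance, and the choice to report the observed sum divided by the mean random sum.

```python
import random
from bisect import bisect_left


def positions(tokens, word):
    """Return the sorted token indices at which `word` occurs."""
    return [i for i, t in enumerate(tokens) if t == word]


def nearest_distance_sum(pos_a, pos_b):
    """Sum, over each position in pos_a, of the distance to the nearest
    position in pos_b. pos_b must be sorted and non-empty."""
    total = 0
    for p in pos_a:
        k = bisect_left(pos_b, p)
        candidates = []
        if k < len(pos_b):
            candidates.append(pos_b[k] - p)      # nearest occurrence to the right
        if k > 0:
            candidates.append(p - pos_b[k - 1])  # nearest occurrence to the left
        total += min(candidates)
    return total


def association_ratio(tokens, word_a, word_b, trials=200, seed=0):
    """Observed distance sum divided by the mean distance sum obtained when
    the occurrences of word_b are placed at random (assumed baseline).
    Values well below 1 suggest the words occur closer than chance."""
    if word_a == word_b:
        raise ValueError("the two words must differ")
    pos_a = positions(tokens, word_a)
    pos_b = positions(tokens, word_b)
    if not pos_a or not pos_b:
        raise ValueError("both words must occur in the text")

    observed = nearest_distance_sum(pos_a, pos_b)

    # Random occurrences are drawn from every slot not occupied by word_a,
    # so a random distance can never be zero.
    free_slots = [i for i, t in enumerate(tokens) if t != word_a]
    rng = random.Random(seed)
    baseline = 0.0
    for _ in range(trials):
        random_b = sorted(rng.sample(free_slots, len(pos_b)))
        baseline += nearest_distance_sum(pos_a, random_b)
    baseline /= trials

    return observed / baseline
```

Under these assumptions, calling `association_ratio(text.lower().split(), "word1", "word2")` on a tokenized text gives a rough indication of whether the pair occurs closer together than chance; a ratio near or above 1 corresponds to the "close to random" case that the abstract treats as no semantic relation.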
Istituto di Analisi dei Sistemi ed Informatica ''Antonio Ruberti'' - IASI
Natural Language Processing
Words association
Co-occurrences distribution
Concept tree

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/227067