Lexicon-Grammar based open information extraction from natural language sentences in Italian

Guarasci, Raffaele; Damiano, Emanuele; Minutolo, Aniello; Esposito, Massimo; De Pietro, Giuseppe

doi:10.1016/j.eswa.2019.112954

In the last decade, the quantity of readily accessible text has grown rapidly and enormously, far exceeding the capacity of humans to read and understand it. One of the most interesting strategies proposed to address this problem is known as Open Information Extraction (OIE). It is essentially devised to read sentences and rapidly extract one or more domain-independent, coherent propositions, each represented by a verb relation and its arguments. Even though many OIE approaches exist for English, no significant research has been conducted on OIE for Italian texts. Because they rely on language-specific features, OIE systems operating in other languages are not directly applicable to Italian. Therefore, this paper proposes, as its first contribution, a novel approach to perform OIE for the Italian language, based on standard linguistic structures to analyze sentences and on a set of verbal behavior patterns to extract information from them. These patterns are built by combining a solid linguistic theoretical framework, i.e. Lexicon-Grammar (LG), with distributional profiles extracted from a contemporary Italian corpus, i.e. itWaC. Starting from simple sentences, the approach first determines elementary tuples, then all their permutations, obtained by adding complements and adverbials, and finally n-ary propositions. In doing so, it grants syntactic invariance, preserves the overall grammaticality and respects some syntactic constraints and selection preferences, thus approximating a first level of semantic acceptability. As the second contribution of this work, a gold standard dataset for the Italian language has been built from the itWaC corpus, intended to be widely used to enable the experimental validation of OIE solutions. It has been manually and independently labeled by four native Italian speakers with all the n-ary propositions that can be extracted, following the criteria of grammaticality and acceptability, i.e. granting syntactic well-formedness and meaningfulness in the context. Finally, the proposed approach has been tested and quantitatively validated on this gold standard dataset, in comparison both with an indirect approach, which translates input sentences from Italian to English and output propositions back to Italian while embedding an OIE approach for English, and with an OIE system for Italian previously presented by the authors. The results obtained show the effectiveness of the proposed approach in generating propositions that satisfy these criteria of grammaticality and acceptability. Although the approach has been evaluated on the Italian language, it essentially relies on linguistic resources produced by LG, which exist for many languages besides Italian, and on a representative corpus for the language under consideration. Given these premises, it has a general methodological basis and can be profitably extended to other languages.
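
As an illustration only, the progression from an elementary tuple to n-ary propositions can be sketched as follows. This is not the authors' implementation: the `Proposition` class, the `expand` function and the example sentence are hypothetical, the split into required and optional arguments is assumed to be given, and "permutations" is read here as the order-preserving combinations of optional complements and adverbials attached to the elementary tuple. The LG-based pattern matching and the checks on syntactic constraints and selection preferences are omitted.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Proposition:
    """A verb relation with its subject and remaining arguments (hypothetical type)."""
    subject: str
    relation: str
    arguments: Tuple[str, ...]

    def text(self) -> str:
        # Render the proposition as a flat string, keeping the original word order.
        return " ".join((self.subject, self.relation) + self.arguments)


def expand(subject: str, relation: str,
           required: Sequence[str], optional: Sequence[str]) -> List[Proposition]:
    """Return the elementary tuple, then every variant obtained by adding
    optional complements/adverbials, ending with the full n-ary proposition."""
    propositions = []
    for k in range(len(optional) + 1):
        # combinations() keeps the optional arguments in their sentence order.
        for chosen in combinations(optional, k):
            propositions.append(
                Proposition(subject, relation, tuple(required) + chosen)
            )
    return propositions


# Hypothetical sentence: "Maria legge un libro in biblioteca ogni sera"
props = expand("Maria", "legge", ["un libro"], ["in biblioteca", "ogni sera"])
for p in props:
    print(p.text())
```

The first element of `props` is the elementary tuple and the last is the full n-ary proposition; in the approach described above, each candidate would additionally have to satisfy the grammaticality and acceptability criteria before being kept.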