Lexicon-Grammar based open information extraction from natural language sentences in Italian

Raffaele Guarasci;Emanuele Damiano;Aniello Minutolo;Massimo Esposito;Giuseppe De Pietro
2020

Abstract

Over the last decade, the quantity of readily accessible text has grown rapidly and enormously, far exceeding the capacity of humans to read and understand it. One of the most interesting strategies proposed to address this problem is Open Information Extraction (OIE). It is designed to read sentences and rapidly extract one or more domain-independent, coherent propositions, each represented by a verb relation and its arguments. Although many OIE approaches exist for English, no significant research has been conducted on OIE for Italian texts. Because they rely on language-specific features, OIE systems built for other languages are not directly applicable to Italian. Therefore, as its first contribution, this paper proposes a novel approach to OIE for the Italian language, based on standard linguistic structures to analyze sentences and on a set of verbal behavior patterns to extract information from them. These patterns are built by combining a solid theoretical linguistic framework, i.e. Lexicon-Grammar (LG), with distributional profiles extracted from a contemporary Italian corpus, i.e. itWaC. Starting from simple sentences, the approach first determines elementary tuples, then all their permutations obtained by adding complements and adverbials, and finally n-ary propositions. It does so by granting syntactic invariance, preserving overall grammaticality, and respecting some syntactic constraints and selection preferences, thus approximating a first level of semantic acceptability. As its second contribution, this work builds a gold standard dataset for the Italian language from the itWaC corpus, intended for wide use in the experimental validation of OIE solutions. It has been manually and independently labeled by four native Italian speakers with all the n-ary propositions that can be extracted, following the criteria of grammaticality and acceptability, i.e. granting syntactic well-formedness and meaningfulness in context. Finally, the proposed approach has been tested and quantitatively validated on this gold standard dataset. It has also been compared with an indirect approach that translates input sentences from Italian to English and output propositions back into Italian, embedding an OIE approach for English, as well as with an OIE system for Italian previously presented by the authors. The results show the effectiveness of the proposed approach in generating propositions that meet these criteria of grammaticality and acceptability. Although the approach has been evaluated for Italian, it essentially relies on linguistic resources produced by LG, which exist for many languages besides Italian, and on a representative corpus for the language under consideration. Given these premises, it is general from a methodological perspective and can be profitably extended to other languages.
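
As a rough illustration of the progression from elementary tuples to n-ary propositions described in the abstract, the following Python sketch expands a subject-verb-object tuple with every subset of its optional complements and adverbials. It is not the authors' implementation: the function name `expand_proposition`, the tuple layout, and the example sentence are all hypothetical, and the sketch ignores the Lexicon-Grammar patterns, syntactic constraints, and selection preferences that the actual approach uses to filter candidates.

```python
from itertools import combinations


def expand_proposition(subject, verb, obj, optional_args):
    """Return the elementary tuple plus every variant obtained by adding
    a subset of the optional complements/adverbials, kept in sentence order.

    Hypothetical helper for illustration only; no linguistic filtering is applied.
    """
    propositions = []
    for size in range(len(optional_args) + 1):
        # combinations() preserves the input order of the optional arguments
        for subset in combinations(optional_args, size):
            propositions.append((subject, verb, obj) + subset)
    return propositions


# Invented example sentence: "Maria legge un libro in biblioteca ogni sera"
props = expand_proposition(
    "Maria", "legge", "un libro", ["in biblioteca", "ogni sera"]
)
for p in props:
    print(p)
```

In this sketch the first result is the elementary tuple and the last is the full n-ary proposition; a real system would additionally discard any intermediate variant that is ungrammatical or not acceptable in context.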
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
open information extraction
Lexicon-Grammar
n-ary propositions
Natural language processing
Italian language

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/373674