Using Apps and Rules in Contextual Workflows to Semantically Extract Data from Documents
Ermelinda Oro;Massimo Ruffolo
2015
Abstract
If used intelligently, Big Data locked in unstructured sources, such as PDF documents, can yield unprecedented insights for solving tough business problems, optimizing business processes, and improving customer relations. The challenge addressed in this paper is to unlock the value of data buried in unstructured documents. We describe how a contextual, workflow-based approach addresses, in a semantic and flexible way, various problems that arise when processing data contained in documents. We present the MANTRA Smart Data Platform, which turns Big Data into Smart Data by means of contextual workflows composed of smart-cloud applications (APPs for short). Among these, the MANTRA Language APP executes MANTRA rules that extract and annotate information contained in heterogeneous sources (raw text, PDF, HTML, or other presentation-oriented document formats). Such rules exploit syntactic and semantic expressions, visual and spatial features, and natural language capabilities. Real application cases show that the proposed approach can process large volumes of heterogeneous input documents and can extract and consolidate the information of interest.


