Using Apps and Rules in Contextual Workflows to Semantically Extract Data from Documents

Ermelinda Oro;Massimo Ruffolo
2015

Abstract

If used intelligently, Big Data locked in unstructured sources, such as PDF documents, can yield unprecedented insights for solving tough business issues, optimizing business processes, and improving customer relations. The challenge addressed in this paper is to unlock the value of data buried in unstructured documents. We describe how a contextual workflow-based approach addresses, in a semantic and flexible way, various problems that arise in processing data contained in documents. We present the MANTRA Smart Data Platform, which turns Big Data into Smart Data by means of contextual workflows composed of smart-cloud applications (APPs for short). Among these, the MANTRA Language APP executes MANTRA rules that extract and annotate information contained in heterogeneous sources (raw text, PDF, HTML, or other presentation-oriented document formats). Such rules exploit syntactic and semantic expressions, visual and spatial features, and natural language capabilities. Real-world applications show that the proposed approach can process a large number of heterogeneous input documents and extract and consolidate the information of interest.
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
Web Services Orchestration and Composition
Entity Extraction
Semantic Annotation
Data Extraction
Contextual Workflow
Language Rules


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/311039