The PDF format represents the de facto standard for print-oriented documents. In this paper, we address the problem of wrapping PDF documents, which raises new challenges in several contexts of text data management. Our proposal is based on a novel bottom-up hierarchical wrapping approach that exploits fuzzy logic to handle the uncertainty which is intrinsic to the structure and presentation of PDF documents. A PDF wrapper is defined by specifying a set of group type definitions that impose a target structure to groups of tokens containing the required information. Constraints on token groupings are formulated as fuzzy conditions, which are defined on spatial and content predicates of tokens. We define a formal semantics for PDF wrappers and propose an algorithm for wrapper evaluation working in polynomial time with respect to the size of a PDF document. The proposed approach has been implemented in a wrapper generation system that offers visual capabilities to assist the designer in specifying and evaluating a PDF wrapper. Experimental results have shown good accuracy and applicability of our system to PDF documents of various domains. © 2006 IEEE.

A fuzzy logic approach to wrapping PDF documents

Masciari Elio;
2011

Abstract

The PDF format represents the de facto standard for print-oriented documents. In this paper, we address the problem of wrapping PDF documents, which raises new challenges in several contexts of text data management. Our proposal is based on a novel bottom-up hierarchical wrapping approach that exploits fuzzy logic to handle the uncertainty which is intrinsic to the structure and presentation of PDF documents. A PDF wrapper is defined by specifying a set of group type definitions that impose a target structure to groups of tokens containing the required information. Constraints on token groupings are formulated as fuzzy conditions, which are defined on spatial and content predicates of tokens. We define a formal semantics for PDF wrappers and propose an algorithm for wrapper evaluation working in polynomial time with respect to the size of a PDF document. The proposed approach has been implemented in a wrapper generation system that offers visual capabilities to assist the designer in specifying and evaluating a PDF wrapper. Experimental results have shown good accuracy and applicability of our system to PDF documents of various domains. © 2006 IEEE.
2011
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
Adobe PDF
fuzzy logic
Information extraction
PDFWrap system
print-oriented documents
wrapping
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/261658
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 3
  • ???jsp.display-item.citation.isi??? ND
social impact