Independent component analysis for document restoration

Tonazzini A; Salerno E
2004

Abstract

We propose a novel approach to restoring digital document images, with the aim of improving text legibility and OCR performance. These are often compromised by the presence of artifacts in the background, derived from many kinds of degradations, such as spots, underwritings, and show-through or bleed-through effects. So far, background removal techniques have been based on local, adaptive filters and morphological-structural operators to cope with frequent low-contrast situations. For the specific problem of bleed-through/show-through, most work has been based on the comparison between the front and back pages. This, however, requires a preliminary registration of the two images. Our approach is based on viewing the problem as one of separating overlapped texts and then reformulating it as a blind source separation problem, approached through independent component analysis techniques. These methods have the advantage that no models are required for the background. In addition, we use the spectral components of the image at different bands, so that there is no need for registration. Examples of bleed-through cancellation and recovery of underwriting from palimpsests are provided.
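
The separation problem described in the abstract can be sketched as follows. This is an illustration only, assuming a linear, instantaneous mixing model with at least as many spectral bands as overlapped patterns; the abstract does not state the model, and the symbols $x_i$, $s_j$, $a_{ij}$, $A$, $W$, $M$, and $N$ are introduced here for the sketch.

```latex
% Assumed model: each of the M spectral bands observes, at pixel t,
% a weighted sum of the N overlapped patterns (e.g. main text,
% bleed-through or underwriting, background).
\begin{align}
  x_i(t) &= \sum_{j=1}^{N} a_{ij}\, s_j(t), \qquad i = 1, \dots, M, \\
  \mathbf{x}(t) &= A\, \mathbf{s}(t), \qquad A = [a_{ij}] \in \mathbb{R}^{M \times N}.
\end{align}
% Blind separation: both A and s are unknown. ICA looks for an unmixing
% matrix W whose outputs are as statistically independent as possible.
\begin{equation}
  \mathbf{y}(t) = W\, \mathbf{x}(t) \approx P D\, \mathbf{s}(t),
\end{equation}
% where P is a permutation matrix and D a diagonal scaling matrix:
% the sources are recovered only up to order and scale.
```

Under this assumption, all the $x_i$ are bands of the same image, so they share one pixel grid, which is consistent with the abstract's remark that no registration is needed.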
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
Degraded documents
Blind source separation
Independent component analysis
Document processing


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/79604