Preprocessing of recto-verso printed documents based on neural networks for text analysis

Savino P;Tonazzini A
2024

Abstract

Among the many and varied damages affecting ancient documents, the penetration of ink from one side of the page to the other is one of the most frequent and invasive. In this work, we are interested in binarizing such degraded documents, for the application of OCR or other automatic text analysis tools, which can help philologists and palaeographers in text transcription. We previously proposed a data model that roughly describes this damage for front-to-back documents, and used it to generate an artificial training set that can teach a shallow neural network how to classify pixels on both sides into clean or corrupt. We show that this joint processing of the two sides of the document can significantly improve binarization and therefore OCR and other text analysis tasks, compared to the separate processing of the single sides, using the same information.
Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo" - ISTI
978-1-6654-6133-7
Ancient document text analysis
Degraded document binarization
Optical character recognition
Recto-verso documents
Shallow multilayer neural networks
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/452078