Preprocessing of recto-verso printed documents based on neural networks for text analysis
Savino P; Tonazzini A
2023
Abstract
Among the many kinds of damage that affect ancient documents, the penetration of ink from one side of the page to the other is one of the most frequent and invasive. In this work, we aim to binarize documents degraded in this way so that OCR and other automatic text analysis tools can be applied, helping philologists and palaeographers with text transcription. We previously proposed a data model that approximately describes this damage in recto-verso documents, and used it to generate an artificial training set that teaches a shallow neural network to classify the pixels of both sides as clean or corrupted. We show that this joint processing of the two sides can significantly improve binarization, and therefore OCR and other text analysis tasks, compared with processing each side separately using the same information.
File: prod_490208-doc_204218.pdf (preprint, Adobe PDF, 22.78 MB; authorized users only)
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.