The use of OCR software to convert printed characters to digital text is a fundamental tool within diachronic approaches to Corpusassisted discourse Studies because allow researchers to expand their interest by making many texts available and analysable through a computer. However, OCR software are not totally accurate, and the resulting error rate compromises their effectiveness. This paper proposes a mixed qualitative-quantitative approach to OCR error detection and correction in order to develop a methodology for compiling historical corpora. The proposed approach consists of three main steps: corpus creation, OCR detection and correction, and application of the automatic rules. The rules are implemented in R using a "tidyverse" approach for a better reproducibility of the experiments.
A Quantitative/Qualitative Approach to {OCR} Error Detection and Correction in Old Newspapers for Corpus-assisted Discourse Studies
Dario Del Fante;
2021
Abstract
The use of OCR software to convert printed characters to digital text is a fundamental tool within diachronic approaches to Corpusassisted discourse Studies because allow researchers to expand their interest by making many texts available and analysable through a computer. However, OCR software are not totally accurate, and the resulting error rate compromises their effectiveness. This paper proposes a mixed qualitative-quantitative approach to OCR error detection and correction in order to develop a methodology for compiling historical corpora. The proposed approach consists of three main steps: corpus creation, OCR detection and correction, and application of the automatic rules. The rules are implemented in R using a "tidyverse" approach for a better reproducibility of the experiments.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.