A Quantitative/Qualitative Approach to OCR Error Detection and Correction
in Old Newspapers for Corpus-assisted Discourse Studies

Dario Del Fante; Giorgio Maria Di Nunzio

The use of OCR software to convert printed characters to digital text is a fundamental tool within diachronic approaches to Corpus-assisted Discourse Studies because it allows researchers to expand their interest by making many texts available and analysable through a computer. However, OCR software is not totally accurate, and the resulting error rate compromises its effectiveness. This paper proposes a mixed qualitative-quantitative approach to OCR error detection and correction in order to develop a methodology for compiling historical corpora. The proposed approach consists of three main steps: corpus creation, OCR error detection and correction, and application of the automatic rules. The rules are implemented in R using a "tidyverse" approach for better reproducibility of the experiments.

Year: 2021
Organizational units: Istituto di linguistica computazionale "Antonio Zampolli" - ILC
Language(s): English
Conference title: Proceedings of the 17th Italian Research Conference on Digital Libraries, Padua, Italy (virtual event due to the Covid-19 pandemic), February 18-19, 2021
URL: http://ceur-ws.org/Vol-2816/paper5.pdf
Peer reviewed: Yes, but type not specified
Conference dates: 18-19/02/2021
Conference venue: Università degli Studi di Padova
Keywords: OCR; OCR Post-Processing Correction; Historical Newspapers
Number of authors: 2
Full text: none
All authors: Del Fante, Dario; Di Nunzio, Giorgio Maria
MIUR login type: 273
Type: info:eu-repo/semantics/conferenceObject
Type: 04 Conference contribution::04.01 Contribution in conference proceedings
Appears in types: 04.01 Contribution in conference proceedings

Files in this record:

There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/417365

CNR Institutional Research Information System
