A Quantitative/Qualitative Approach to OCR Error Detection and Correction
in Old Newspapers for Corpus-assisted Discourse Studies

Dario Del Fante; Giorgio Maria Di Nunzio

The use of OCR software to convert printed characters to digital text is a fundamental tool within diachronic approaches to Corpus-assisted Discourse Studies because it allows researchers to expand their interests by making many texts available and analysable through a computer. However, OCR software is not totally accurate, and the resulting error rate compromises its effectiveness. This paper proposes a mixed qualitative-quantitative approach to OCR error detection and correction in order to develop a methodology for compiling historical corpora. The proposed approach consists of three main steps: corpus creation, OCR error detection and correction, and application of the automatic rules. The rules are implemented in R using a "tidyverse" approach for better reproducibility of the experiments.
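The detection and rule-application steps summarised above can be illustrated with a minimal sketch. The paper reports an R "tidyverse" implementation; the Python below is not the authors' code, and the rule list, the lexicon, and the function names `detect_candidates` and `apply_rules` are all hypothetical, chosen only to show the general shape of a detect-then-correct workflow.

```python
import re
from collections import Counter

# Hypothetical correction rules as (regex pattern, replacement) pairs.
# These examples are illustrative and are not taken from the paper.
RULES = [
    (r"\btbe\b", "the"),
    (r"\bgovernrnent\b", "government"),
]


def detect_candidates(text, lexicon):
    """Count lowercase alphabetic tokens that are absent from a reference lexicon.

    Tokens missing from the lexicon are only *candidate* OCR errors:
    a human still has to inspect them before writing a rule.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in lexicon)


def apply_rules(text, rules=RULES):
    """Apply each (pattern, replacement) rule to the text, in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


if __name__ == "__main__":
    sample = "tbe governrnent met"
    lexicon = {"the", "government", "met"}  # assumed toy lexicon
    print(detect_candidates(sample, lexicon))
    print(apply_rules(sample))
```

In this sketch the frequency count stands in for the quantitative side (which unknown tokens occur most often), while deciding which candidates become rules remains a manual, qualitative step.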


2021



DC field	Value	Language
dc.authority.orgunit	Istituto di linguistica computazionale "Antonio Zampolli" - ILC	-
dc.authority.people	Dario Del Fante	it
dc.authority.people	Giorgio Maria Di Nunzio	it
dc.collection.id.s	71c7200a-7c5f-4e83-8d57-d3d2ba88f40d	*
dc.collection.name	04.01 Contributo in Atti di convegno	*
dc.date.accessioned	2024/02/19 19:37:58	-
dc.date.available	2024/02/19 19:37:58	-
dc.date.issued	2021	-
dc.description.abstracteng	The use of OCR software to convert printed characters to digital text is a fundamental tool within diachronic approaches to Corpus-assisted Discourse Studies because it allows researchers to expand their interests by making many texts available and analysable through a computer. However, OCR software is not totally accurate, and the resulting error rate compromises its effectiveness. This paper proposes a mixed qualitative-quantitative approach to OCR error detection and correction in order to develop a methodology for compiling historical corpora. The proposed approach consists of three main steps: corpus creation, OCR error detection and correction, and application of the automatic rules. The rules are implemented in R using a "tidyverse" approach for better reproducibility of the experiments.	-
dc.description.affiliations	Università degli Studi di Padova; Istituto di Linguistica Computazionale "A. Zampolli"	-
dc.description.allpeople	Del Fante, Dario; Di Nunzio, Giorgio Maria	-
dc.description.allpeopleoriginal	Dario Del Fante & Giorgio Maria Di Nunzio	-
dc.description.fulltext	none	en
dc.description.numberofauthors	2	-
dc.identifier.uri	https://hdl.handle.net/20.500.14243/417365	-
dc.identifier.url	http://ceur-ws.org/Vol-2816/paper5.pdf	-
dc.language.iso	eng	-
dc.relation.conferencedate	18-19/02/2021	-
dc.relation.conferencename	Proceedings of the 17th Italian Research Conference on Digital Libraries, Padua, Italy (virtual event due to the Covid-19 pandemic), February 18-19, 2021	-
dc.relation.conferenceplace	Università degli Studi di Padova	-
dc.subject.keywords	OCR	-
dc.subject.keywords	OCR POST-PROCESSING CORRECTION	-
dc.subject.keywords	Historical Newspapers	-
dc.subject.singlekeyword	OCR	*
dc.subject.singlekeyword	OCR POST-PROCESSING CORRECTION	*
dc.subject.singlekeyword	Historical Newspapers	*
dc.title	A Quantitative/Qualitative Approach to OCR Error Detection and Correction in Old Newspapers for Corpus-assisted Discourse Studies	en
dc.type.driver	info:eu-repo/semantics/conferenceObject	-
dc.type.full	04 Contributo in convegno::04.01 Contributo in Atti di convegno	it
dc.type.miur	273	-
dc.type.referee	Yes, but type not specified	-
dc.ugov.descaux1	468772	-
iris.orcid.lastModifiedDate	2025/02/07 07:15:04	*
iris.orcid.lastModifiedMillisecond	1738908904051	*
iris.scopus.extIssued	2021	-
iris.scopus.extTitle	A quantitative/qualitative approach to OCR error detection and correction in old newspapers for corpus-assisted discourse studies	-
iris.sitodocente.maxattempts	2	-
Appears in types:	04.01 Contributo in Atti di convegno

Files in this item:

There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/417365


CNR Institutional Research Information System
