A Quantitative/Qualitative Approach to OCR Error Detection and Correction in Old Newspapers for Corpus-assisted Discourse Studies

Dario Del Fante; Giorgio Maria Di Nunzio
2021

Abstract

The use of OCR software to convert printed characters to digital text is fundamental to diachronic approaches in Corpus-assisted Discourse Studies because it allows researchers to broaden their scope by making many texts available and analysable by computer. However, OCR software is not fully accurate, and the resulting error rate compromises its effectiveness. This paper proposes a mixed qualitative-quantitative approach to OCR error detection and correction in order to develop a methodology for compiling historical corpora. The proposed approach consists of three main steps: corpus creation, OCR error detection and correction, and application of the automatic rules. The rules are implemented in R using a "tidyverse" approach to improve the reproducibility of the experiments.
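As a rough illustration of the detection and rule-application steps named in the abstract, the following minimal sketch flags tokens missing from a word list and then applies replacement rules. It is written in Python rather than the R/tidyverse implementation the paper mentions, and the rules, the lexicon, and the function names `detect_candidates` and `apply_rules` are all hypothetical; none of them come from the paper.

```python
import re
from collections import Counter

# Hypothetical correction rules (pattern, replacement); illustrative only,
# not the rules used in the paper.
RULES = [
    (re.compile(r"\btbe\b"), "the"),
    (re.compile(r"\bwbich\b"), "which"),
    # Rejoin words that were hyphenated across a line break.
    (re.compile(r"(?<=[a-z])-\s*\n\s*(?=[a-z])"), ""),
]


def detect_candidates(text, lexicon):
    """Count alphabetic tokens (lowercased) that are absent from the lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in lexicon)


def apply_rules(text, rules=RULES):
    """Apply each (pattern, replacement) rule in order and return the result."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


# Assumed toy input and lexicon, for demonstration only.
sample = "tbe emi-\ngrants wbich arrived"
lexicon = {"the", "emigrants", "which", "arrived"}

candidates = detect_candidates(sample, lexicon)  # tokens to inspect manually
corrected = apply_rules(sample)
```

In a workflow of this kind, the frequency list from the detection step would be reviewed by hand (the qualitative part) to decide which rules are worth writing, and the rules would then be applied automatically to the whole corpus.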
DC Field Value Language
dc.authority.orgunit Istituto di linguistica computazionale "Antonio Zampolli" - ILC -
dc.authority.people Dario Del Fante it
dc.authority.people Giorgio Maria Di Nunzio it
dc.collection.id.s 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d *
dc.collection.name 04.01 Contribution in conference proceedings *
dc.date.accessioned 2024/02/19 19:37:58 -
dc.date.available 2024/02/19 19:37:58 -
dc.date.issued 2021 -
dc.description.affiliations Università degli Studi di Padova; Istituto di Linguistica Computazionale "A. Zampolli" -
dc.description.allpeople Del Fante, Dario; Di Nunzio, Giorgio Maria -
dc.description.allpeopleoriginal Dario Del Fante & Giorgio Maria Di Nunzio -
dc.description.fulltext none en
dc.description.numberofauthors 2 -
dc.identifier.uri https://hdl.handle.net/20.500.14243/417365 -
dc.identifier.url http://ceur-ws.org/Vol-2816/paper5.pdf -
dc.language.iso eng -
dc.relation.conferencedate 18-19/02/2021 -
dc.relation.conferencename Proceedings of the 17th Italian Research Conference on Digital Libraries, Padua, Italy (virtual event due to the Covid-19 pandemic), February 18-19, 2021 -
dc.relation.conferenceplace Università degli Studi di Padova -
dc.subject.keywords OCR -
dc.subject.keywords OCR POST-PROCESSING CORRECTION -
dc.subject.keywords Historical Newspapers -
dc.subject.singlekeyword OCR *
dc.subject.singlekeyword OCR POST-PROCESSING CORRECTION *
dc.subject.singlekeyword Historical Newspapers *
dc.title A Quantitative/Qualitative Approach to OCR Error Detection and Correction in Old Newspapers for Corpus-assisted Discourse Studies en
dc.type.driver info:eu-repo/semantics/conferenceObject -
dc.type.full 04 Conference contribution::04.01 Contribution in conference proceedings it
dc.type.miur 273 -
dc.type.referee Yes, but type not specified -
dc.ugov.descaux1 468772 -