A Quantitative/Qualitative Approach to OCR Error Detection and Correction in Old Newspapers for Corpus-assisted Discourse Studies
Dario Del Fante; Giorgio Maria Di Nunzio
2021
Abstract
The use of OCR software to convert printed characters into digital text is a fundamental tool within diachronic approaches to Corpus-assisted Discourse Studies because it allows researchers to expand their scope by making many texts available and analysable through a computer. However, OCR software is not fully accurate, and the resulting error rate compromises its effectiveness. This paper proposes a mixed qualitative-quantitative approach to OCR error detection and correction in order to develop a methodology for compiling historical corpora. The proposed approach consists of three main steps: corpus creation, OCR error detection and correction, and application of the automatic rules. The rules are implemented in R using a "tidyverse" approach for better reproducibility of the experiments.

| DC Field | Value | Language |
|---|---|---|
| dc.authority.orgunit | Istituto di linguistica computazionale "Antonio Zampolli" - ILC | - |
| dc.authority.people | Dario Del Fante | it |
| dc.authority.people | Giorgio Maria Di Nunzio | it |
| dc.collection.id.s | 71c7200a-7c5f-4e83-8d57-d3d2ba88f40d | * |
| dc.collection.name | 04.01 Contribution in conference proceedings | * |
| dc.date.accessioned | 2024/02/19 19:37:58 | - |
| dc.date.available | 2024/02/19 19:37:58 | - |
| dc.date.issued | 2021 | - |
| dc.description.abstracteng | The use of OCR software to convert printed characters into digital text is a fundamental tool within diachronic approaches to Corpus-assisted Discourse Studies because it allows researchers to expand their scope by making many texts available and analysable through a computer. However, OCR software is not fully accurate, and the resulting error rate compromises its effectiveness. This paper proposes a mixed qualitative-quantitative approach to OCR error detection and correction in order to develop a methodology for compiling historical corpora. The proposed approach consists of three main steps: corpus creation, OCR error detection and correction, and application of the automatic rules. The rules are implemented in R using a "tidyverse" approach for better reproducibility of the experiments. | - |
| dc.description.affiliations | Università degli Studi di Padova; Istituto di Linguistica Computazionale "A. Zampolli" | - |
| dc.description.allpeople | Del Fante, Dario; Di Nunzio, Giorgio Maria | - |
| dc.description.allpeopleoriginal | Dario Del Fante & Giorgio Maria Di Nunzio | - |
| dc.description.fulltext | none | en |
| dc.description.numberofauthors | 2 | - |
| dc.identifier.uri | https://hdl.handle.net/20.500.14243/417365 | - |
| dc.identifier.url | http://ceur-ws.org/Vol-2816/paper5.pdf | - |
| dc.language.iso | eng | - |
| dc.relation.conferencedate | 18-19/02/2021 | - |
| dc.relation.conferencename | Proceedings of the 17th Italian Research Conference on Digital Libraries, Padua, Italy (virtual event due to the Covid-19 pandemic), February 18-19, 2021 | - |
| dc.relation.conferenceplace | Università degli Studi di Padova | - |
| dc.subject.keywords | OCR | - |
| dc.subject.keywords | OCR POST-PROCESSING CORRECTION | - |
| dc.subject.keywords | Historical Newspapers | - |
| dc.subject.singlekeyword | OCR | * |
| dc.subject.singlekeyword | OCR POST-PROCESSING CORRECTION | * |
| dc.subject.singlekeyword | Historical Newspapers | * |
| dc.title | A Quantitative/Qualitative Approach to OCR Error Detection and Correction in Old Newspapers for Corpus-assisted Discourse Studies | en |
| dc.type.driver | info:eu-repo/semantics/conferenceObject | - |
| dc.type.full | 04 Conference contribution::04.01 Contribution in conference proceedings | it |
| dc.type.miur | 273 | - |
| dc.type.referee | Yes, but type not specified | - |
| dc.ugov.descaux1 | 468772 | - |
| iris.orcid.lastModifiedDate | 2025/02/07 07:15:04 | * |
| iris.orcid.lastModifiedMillisecond | 1738908904051 | * |
| iris.scopus.extIssued | 2021 | - |
| iris.scopus.extTitle | A quantitative/qualitative approach to OCR error detection and correction in old newspapers for corpus-assisted discourse studies | - |
| iris.sitodocente.maxattempts | 2 | - |
| Appears in categories: | 04.01 Contribution in conference proceedings | |


